Change files

The scripts in this section perform simple transformations of entire files or lines in files.

To use a script, copy and paste the code into a terminal window, change the parameters and file names as needed (for example, $sep or file.csv), and press Enter.

See More Information for notes on using these tools.

Change columns in each line of files

Change tabular data with a given separator to tab-separated values (change_any_separator_to_tab)

This tool is important because most of the Scriptome tools require tab-separated data.

Warning: a few unusual separators (like ' or ") might not work. (Putting a backslash before the separator might help.)

perl -e ' $sep=","; while(<>) { s/\Q$sep\E/\t/g; print $_; } warn "Changed $sep to tab on $. lines\n" ' file.csv > file.tab

Example: Change comma-separated file.csv to tab-separated file.tab by running the above script.

Input file (`file.csv`):

 Fly,7
 Human,14
 Worm,28
 Yeast,35

Output file (`file.tab`):

 Fly	7
 Human	14
 Worm	28
 Yeast	35

Screen Output:

 Changed , to tab on 4 lines

Example 2: Given a list of Swiss-Prot identifiers, separate the protein name and species abbreviation into two separate columns. Run the above script using $sep="_"
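For readers who prefer to see the one-liner spelled out, here is a minimal sketch of the same substitution as a standalone Perl script. The subroutine name `sep_to_tab` and the sample identifiers are illustrative assumptions, not part of the Scriptome. The sketch also shows why the one-liner wraps `$sep` in `\Q...\E`: without that quoting, a separator such as `|` or `.` would be treated as a regular-expression metacharacter.

```perl
use strict;
use warnings;

# Hypothetical helper: replace every occurrence of $sep in $line with a tab.
# \Q...\E quotes regex metacharacters, so separators like "|" or "."
# are matched literally.
sub sep_to_tab {
    my ($line, $sep) = @_;
    $line =~ s/\Q$sep\E/\t/g;
    return $line;
}

# Example 2 from the text: split identifiers at "_".
# (The identifiers below are sample values chosen for illustration.)
for my $id ("CYC_HUMAN", "INS_MOUSE") {
    print sep_to_tab($id, "_"), "\n";
}

# A separator that is also a regex metacharacter still works.
print sep_to_tab("Fly|7", "|"), "\n";
```

Each identifier comes out as two tab-separated columns, the protein name and the species abbreviation.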

Change tab-separated data to use a different delimiter (change_tab_to_any_separator)

Replace the tabs in tab-separated data with some other separator. The separator does not have to be one character: "---" would work, for example, or even "" to merge all columns.

This tool is important because most of the Scriptome tools require tab-separated data. After running one or more Scriptome tools, use this script to export data back to other programs which expect comma-separated data, for example.

Warning: a few unusual separators (like ' or ") might not work. Also, if there's a comma in your data and you change to a comma separator, you'll get too many columns.

perl -e ' $sep=","; while(<>) { s/\t/$sep/g; print $_; } warn "Changed tab to $sep on $. lines\n" ' file.tab > file.csv

Example: Change tab-separated file.tab to comma-separated file.csv by running the above script.

Input file (`file.tab`):

 Fly	7
 Human	14
 Worm	28
 Yeast	35

Output file (`file.csv`):

 Fly,7
 Human,14
 Worm,28
 Yeast,35

Screen Output:

 Changed tab to , on 4 lines
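The warning about separators that already occur in the data can be checked before exporting. The following is a minimal sketch, not part of the Scriptome: the helper name `tab_to_sep` and the sample lines are assumptions made for illustration. It performs the same tab replacement as the one-liner and also reports whether a line already contained the new separator.

```perl
use strict;
use warnings;

# Hypothetical helper: replace tabs with $sep, and report whether the line
# already contained $sep (which would create extra columns on export).
sub tab_to_sep {
    my ($line, $sep) = @_;
    my $clash = ($sep ne "" && index($line, $sep) >= 0) ? 1 : 0;
    $line =~ s/\t/$sep/g;
    return ($line, $clash);
}

# Sample lines (assumed data): the second one has a comma inside a field.
for my $line ("Fly\t7", "Smith, J.\t42") {
    my ($out, $clash) = tab_to_sep($line, ",");
    print "$out\n";
    warn "Warning: a field already contains the separator: $out\n" if $clash;
}
```

The second sample line ends up with three comma-separated fields instead of two, which is exactly the problem the warning describes.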

Reorder columns

Use choose_cols. Choose some or all of the columns, in whatever order you want. To switch the order of the first two columns of a ten-column file, you could set @cols to be 1, 0, 2..9.
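The choose_cols script itself is not reproduced in this section, so the following is only a sketch of the idea, under the assumption that column numbers are zero-based (as the 1, 0, 2..9 example suggests). The helper name `reorder_cols` is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper (not the actual choose_cols script): return the
# chosen columns of a tab-separated line, in the requested order.
# Column numbers are zero-based; missing columns become empty strings.
sub reorder_cols {
    my ($line, @cols) = @_;
    my @fields = split /\t/, $line, -1;   # -1 keeps trailing empty columns
    return join("\t", map { defined $_ ? $_ : "" } @fields[@cols]);
}

# A ten-column sample line: c0, c1, ..., c9.
my $line = join("\t", map { "c$_" } 0 .. 9);

# Switch the first two columns and keep the rest in place.
print reorder_cols($line, 1, 0, 2 .. 9), "\n";
```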

Change entire lines in files

Remove spaces in a line (change_remove_spaces)

Remove all spaces (but not tabs) from a line.

perl -e ' while(<>) { s/ //g; print $_; } warn "Removed all spaces from $. lines\n" ' file.spaces > file.nospace

Example: TODO
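As a small illustration of the "spaces but not tabs" behavior, here is a sketch with a made-up input line. The helper name `remove_spaces` and the sample data are assumptions, not part of the Scriptome.

```perl
use strict;
use warnings;

# Hypothetical helper: delete space characters only. Tabs are left alone,
# so the column structure of tab-separated data is preserved.
sub remove_spaces {
    my ($line) = @_;
    $line =~ s/ //g;
    return $line;
}

# The tab between the two columns survives; the spaces do not.
print remove_spaces("ACGT TGCA\tsample 1"), "\n";
```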

See Also: remove empty lines (Choose)

Change all characters to upper case (change_upper_case)

Change all characters in each line to upper case. Numbers and punctuation will not be changed. (Change "uc" in the script to "lc" to get lower case.)

perl -e ' while(<>) { print uc($_); } warn "Changed $. lines to upper case\n" ' file.mixed > file.uc

Example: Change a list of gene names to upper-case, to compare with another list.
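The comparison mentioned in the example could look roughly like the sketch below. The gene names are sample values chosen for illustration, and the approach (a hash keyed on upper-cased names) is one possible way to do the comparison, not a Scriptome tool.

```perl
use strict;
use warnings;

# Illustrative gene names (assumed sample data) in inconsistent case.
my @list_a = ("Brca1", "tp53", "MYC");
my @list_b = ("BRCA1", "TP53", "egfr");

# Index the second list by its upper-cased names.
my %in_b;
$in_b{ uc($_) } = 1 for @list_b;

# Names from the first list that also occur in the second, ignoring case.
my @shared = grep { $in_b{ uc($_) } } @list_a;
print join(", ", @shared), "\n";
```

Without the call to `uc`, none of the names in the first list would match the second list exactly.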

Change between different biological data formats

Change a FASTA file into tabular format (change_fasta_to_tab)

Change each FASTA sequence in a file into one line of three, tab-separated columns: the ID (not including the '>'); the rest of the description line (or an empty column if the description line contains only an ID); and the sequence itself.

Once you have run this script, you can use the many Scriptome tools that work on tab-separated data.

Note: translating FASTA to tabular format and back will generate a file with the same information, but the files may not be identical. This tool replaces any tabs with single spaces (otherwise the tabular output file would have too many columns) and removes any spaces from the amino acid or nucleic acid sequences.

perl -e ' $count=0; $len=0; while(<>) { s/\r?\n//; s/\t/ /g; if (s/^>//) { if ($. != 1) { print "\n" } s/ |$/\t/; $count++; $_ .= "\t"; } else { s/ //g; $len += length($_) } print $_; } print "\n"; warn "\nConverted $count FASTA records in $. lines to tabular format\nTotal sequence length: $len\n\n"; ' seqs.fna > seqs.tab

Example: Run the above script on seqs.fna to get seqs.tab.

Input file (`seqs.fna`):

 >CG123 A small sequence
 ACGTTGCA
 GTTACCAG
 >EG12
 ACCGGA
 >DG124 A smaller sequence
 GTTACCAG

Output file (`seqs.tab`):

 CG123	A small sequence	ACGTTGCAGTTACCAG
 EG12		ACCGGA
 DG124	A smaller sequence	GTTACCAG

Screen Output:

 Converted 3 FASTA records in 7 lines to tabular format
 Total sequence length: 30
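The one-liner is dense, so here is a minimal sketch of the same conversion written as a subroutine. The name `fasta_to_tab` is hypothetical, and the sketch makes one simplifying assumption that differs from the one-liner: sequence lines that appear before the first header are ignored.

```perl
use strict;
use warnings;

# Hypothetical helper: turn FASTA lines into "ID\tdescription\tsequence" rows.
sub fasta_to_tab {
    my @lines = @_;
    my @rows;
    for my $line (@lines) {
        $line =~ s/\r?\n//;        # strip Unix or Windows line endings
        $line =~ s/\t/ /g;         # tabs would create extra columns
        if ($line =~ s/^>//) {     # header line: start a new record
            my ($id, $desc) = split / /, $line, 2;
            $id   = "" unless defined $id;
            $desc = "" unless defined $desc;   # description may be missing
            push @rows, [$id, $desc, ""];
        }
        elsif (@rows) {            # sequence line: append to current record
            $line =~ s/ //g;
            $rows[-1][2] .= $line;
        }
    }
    return map { join("\t", @$_) } @rows;
}

my @fasta = (
    ">CG123 A small sequence\n", "ACGTTGCA\n", "GTTACCAG\n",
    ">EG12\n", "ACCGGA\n",
    ">DG124 A smaller sequence\n", "GTTACCAG\n",
);
print "$_\n" for fasta_to_tab(@fasta);
```

Splitting the header at the first space only (the limit of 2 in `split`) keeps the rest of the description line together in the second column.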

Change a tabular file into FASTA format (change_tab_to_fasta)

Change each line in a three-column, tab-separated file (containing ID, description, and sequence; e.g., a file created by the above change_fasta_to_tab tool) to a FASTA sequence.

Note: translating to FASTA format and back will generate a file with the same information, but the files may not be identical. This tool will put a single space between the ID and the description, and will put 60 characters per line in the sequence portion.

perl -e ' $len=0; while(<>) { s/\r?\n//; @F=split /\t/, $_; print ">$F[0]"; if (length($F[1])) { print " $F[1]" } print "\n"; $s=$F[2]; $len+= length($s); $s=~s/.{60}(?=.)/$&\n/g; print "$s\n"; } warn "\nConverted $. tab-delimited lines to FASTA format\nTotal sequence length: $len\n\n"; ' seqs.tab >seqs.fasta

Example: Run the above script on seqs.tab to get seqs.fasta.

Input file (`seqs.tab`):

 CG123	A small sequence	ACGTTGCAGTTACCAG
 EG12		ACCGGA
 DG124	A smaller sequence	GTTACCAG

Output file (`seqs.fasta`):

 >CG123 A small sequence
 ACGTTGCAGTTACCAG
 >EG12
 ACCGGA
 >DG124 A smaller sequence
 GTTACCAG

Screen Output:

 Converted 3 tab-delimited lines to FASTA format
 Total sequence length: 30
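The line-wrapping step is the least obvious part of the one-liner, so the sketch below isolates it. The helper name `tab_to_fasta` and its optional width argument are assumptions added for illustration; the one-liner always wraps at 60 characters.

```perl
use strict;
use warnings;

# Hypothetical helper: format one "ID\tdescription\tsequence" row as FASTA,
# wrapping the sequence at $width characters (60 by default, as in the
# one-liner).
sub tab_to_fasta {
    my ($row, $width) = @_;
    $width = 60 unless defined $width;
    $row =~ s/\r?\n//;
    my ($id, $desc, $seq) = split /\t/, $row, 3;
    $id   = "" unless defined $id;
    $desc = "" unless defined $desc;
    $seq  = "" unless defined $seq;
    my $header = ">$id";
    $header .= " $desc" if length $desc;
    # Add a newline after each full chunk, but only if more sequence follows;
    # the lookahead (?=.) prevents a blank line after the final chunk.
    $seq =~ s/(.{$width})(?=.)/$1\n/g;
    return "$header\n$seq\n";
}

# A width of 8 is used here only to make the wrapping visible.
print tab_to_fasta("CG123\tA small sequence\tACGTTGCAGTTACCAG", 8);
print tab_to_fasta("EG12\t\tACCGGA");
```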

Change from one biological format to another (change_bio_format_to_bio_format)

Change files with one or more sequences into a different format. The input and output formats can be embl, fasta, gcg, genbank, swiss, or a whole bunch of other formats: see the Bioperl SeqIO HOWTO for details.

Warning: Converting from genbank to FASTA (for example) will necessarily lose some annotation information.

This script requires Bioperl to be installed on whichever machine the script runs on. Many biology computers will have it installed. If the script fails with an error like "Can't locate Bio/SeqIO.pm", Bioperl is not installed; you can download it from bioperl.org.

perl -MBio::SeqIO -e ' $informat="genbank"; $outformat="fasta"; $count = 0; for $infile (@ARGV) { $in = Bio::SeqIO->newFh(-file => $infile , -format => $informat); $out = Bio::SeqIO->newFh(-format => $outformat); while (<$in>) { print $out $_; $count++; } } warn "Translated $count sequences from $informat to $outformat format\n" ' myseqs.genbank > myseqs.fasta

Example: TODO

Change entire files

Transpose a table (change_transpose_table)

Change rows to columns and vice versa, for a tab-separated file. Data should have the same number of columns in every row.

perl -e ' $unequal=0; $_=<>; s/\r?\n//; @out_rows = split /\t/, $_; $num_out_rows = $#out_rows+1; while(<>) { s/\r?\n//; @F = split /\t/, $_; foreach $i (0 .. $#F) { $out_rows[$i] .= "\t$F[$i]"; } if ($num_out_rows != $#F+1) { $unequal=1; } } END { foreach $row (@out_rows) { print "$row\n" } warn "\nWARNING! Rows in input had different numbers of columns\n" if $unequal; warn "\nTransposed table: result has $. columns and $num_out_rows rows\n\n" } ' original.tab > transposed.tab

Example: Transpose the table original.tab to get transposed.tab.

Input file (`original.tab`):

 Top	Col2	Col3
 Row2	r2c2	r2c3
 Row3	r3c2	r3c3
 Row4	r4c2	r4c3
 Row5	r5c2	r5c3

Output file (`transposed.tab`):

 Top	Row2	Row3	Row4	Row5
 Col2	r2c2	r3c2	r4c2	r5c2
 Col3	r2c3	r3c3	r4c3	r5c3

Screen Output:

 Transposed table: result has 5 columns and 3 rows
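Here is a minimal sketch of the transposition as a subroutine, for readers who want to follow the logic step by step. The name `transpose` is hypothetical. Unlike the one-liner, which only warns about uneven rows, this sketch pads short rows with empty cells so that the output stays rectangular; that padding is an assumption of the sketch.

```perl
use strict;
use warnings;

# Hypothetical helper: transpose a list of tab-separated lines.
sub transpose {
    my @table;
    for my $line (@_) {
        (my $copy = $line) =~ s/\r?\n//;       # strip line endings
        push @table, [ split /\t/, $copy, -1 ]; # -1 keeps empty trailing cells
    }

    # The widest input row determines the number of output rows.
    my $ncols = 0;
    for my $row (@table) {
        $ncols = @$row if @$row > $ncols;
    }

    # Output row $c collects cell $c from every input row.
    my @out;
    for my $c (0 .. $ncols - 1) {
        push @out, join("\t", map { defined $_->[$c] ? $_->[$c] : "" } @table);
    }
    return @out;
}

my @original = ("Top\tCol2\tCol3", "Row2\tr2c2\tr2c3", "Row3\tr3c2\tr3c3");
print "$_\n" for transpose(@original);
```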

Split big FASTA file into smaller files (change_split_fasta)

Split one big FASTA file into multiple smaller ones. If the output filename template is small_NUMBER.fasta, the output files will be called small_1.fasta, small_2.fasta, etc.

perl -e ' $split_seqs=3; $out_template="small_NUMBER.fasta"; $count=0; $filenum=0; $len=0; while (<>) { s/\r?\n//; if (/^>/) { if ($count % $split_seqs == 0) { $filenum++; $filename = $out_template; $filename =~ s/NUMBER/$filenum/g; if ($filenum > 1) { close SHORT } open (SHORT, ">$filename") or die $!; } $count++; } else { $len += length($_) } print SHORT "$_\n"; } close(SHORT); warn "\nSplit $count FASTA records in $. lines, with total sequence length $len\nCreated $filenum files like $filename\n\n"; ' big.fasta

Example: Split big.fasta, with five sequences, into two files, small_1.fasta and small_2.fasta. (Since there are only five sequences, the second file has only two sequences in it.)

Input file (`big.fasta`):

 >seq1
 ACCTTGTCGCA
 >seq2
 ACCTTGTCGCAAAGC
 >seq3
 ACCTTGTCGCACCGGAACGA
 >seq4
 ACCTTGTCGCACCGGAACGACCGGAACGA
 >seq5
 GTCGCA

Output file 1 (`small_1.fasta`):

 >seq1
 ACCTTGTCGCA
 >seq2
 ACCTTGTCGCAAAGC
 >seq3
 ACCTTGTCGCACCGGAACGA

Output file 2 (`small_2.fasta`):

 >seq4
 ACCTTGTCGCACCGGAACGACCGGAACGA
 >seq5
 GTCGCA

Screen Output:

 Split 5 FASTA records in 10 lines, with total sequence length 81
 Created 2 files like small_2.fasta

$split_seqs		Number of sequences per output file
$out_template		Template for output file name
Input file(s)
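To see how $split_seqs and $out_template interact without creating any files, the following sketch computes which output file each line would go to. The helper name `plan_split` is hypothetical, and returning a plan instead of writing files is a choice made here for illustration only.

```perl
use strict;
use warnings;

# Hypothetical helper: decide which output file each FASTA line belongs to,
# without writing anything. Returns a list of [filename, line] pairs.
sub plan_split {
    my ($split_seqs, $out_template, @lines) = @_;
    my ($count, $filenum, $filename) = (0, 0, undef);
    my @plan;
    for my $line (@lines) {
        if ($line =~ /^>/) {
            # Every $split_seqs-th record starts a new output file.
            if ($count % $split_seqs == 0) {
                $filenum++;
                ($filename = $out_template) =~ s/NUMBER/$filenum/g;
            }
            $count++;
        }
        next unless defined $filename;   # skip lines before the first header
        push @plan, [$filename, $line];
    }
    return @plan;
}

my @fasta = (
    ">seq1", "ACCTTGTCGCA",
    ">seq2", "ACCTTGTCGCAAAGC",
    ">seq3", "ACCTTGTCGCACCGGAACGA",
    ">seq4", "ACCTTGTCGCACCGGAACGACCGGAACGA",
    ">seq5", "GTCGCA",
);
print "$_->[0]\t$_->[1]\n" for plan_split(3, "small_NUMBER.fasta", @fasta);
```

With three sequences per file, the first six lines are assigned to small_1.fasta and the remaining four to small_2.fasta, matching the example above.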

evolution & bioinformatics

Tuesday, November 13, 2012