Tuesday, November 13, 2012



Sequence Analysis Protocols

Protocols for data manipulation related to sequence analysis.
To use a protocol, just cut and paste the scripts in the colored/dashed-line boxes onto a UNIX/Mac or Windows command line. You can copy and run the individual tools one by one, or all at once if you're feeling lucky.
If you're happy with the filenames given in the protocol, just make sure that your starting file(s) have the right name. Otherwise, you can copy the whole protocol into an editor (vi, Notepad) and change the filenames, then cut and paste onto the command line.

Analyzing BLAST Results

Reciprocal Best Hit BLAST

A method for finding potential orthologs (evolutionarily related sequences).
Input: files A_B and B_A contain all-against-all blasts of genome A against genome B and vice versa. The blasts should be done with the -m8 option for tabular output.
Tool NameInput filesOutput
choose_lines_with_max_per_nameA_BA_B.best
choose_lines_with_max_per_nameB_AB_A.best
merge_lines_based_on_shared_columnA_B.best B_A.bestA_B_A
choose_lines_col_m_equals_col_n_alphaA_B_AA_B_A.recip
choose_colsA_B_A.recipRBHB.out
Note: if there are multiple hits with the same score for a given query, you may get multiple "best hits". In that case, you may want to use a tool to remove duplicates in the query columns.
perl -e '$name_col=0; $score_col=11; while(<>) {s/\r?\n//; @F=split /\t/, $_; ($n, $s) = @F[$name_col, $score_col]; if (! exists($max{$n})) {push @names, $n}; if (! exists($max{$n}) || $s > $max{$n}) {$max{$n} = $s; $best{$n} = ()}; if ($s == $max{$n}) {$best{$n} .= "$_\n"};} for $n (@names) {print $best{$n}}' A_B > A_B.best
perl -e '$name_col=0; $score_col=11; while(<>) {s/\r?\n//; @F=split /\t/, $_; ($n, $s) = @F[$name_col, $score_col]; if (! exists($max{$n})) {push @names, $n}; if (! exists($max{$n}) || $s > $max{$n}) {$max{$n} = $s; $best{$n} = ()}; if ($s == $max{$n}) {$best{$n} .= "$_\n"};} for $n (@names) {print $best{$n}}' B_A > B_A.best
perl -e '$col1=1; $col2=0;' -e '($f1,$f2)=@ARGV; open(F1,$f1); while (<F1>) {s/\r?\n//; @F=split /\t/, $_; $line1{$F[$col1]} .= "$_\n"} open(F2,$f2); while (<F2>) {s/\r?\n//;@F=split /\t/, $_; if ($x = $line1{$F[$col2]}) {$x =~ s/\n/\t$_\n/g; print $x}}' A_B.best B_A.best > A_B_A
perl -e '$colm=0; $coln=13; $count=0; while(<>) {s/\r?\n//; @F=split /\t/, $_; if ($F[$colm] eq $F[$coln]) {print "$_\n"; $count++}} warn "\nChose $count lines out of $. where column $colm had same text as column $coln\n\n";' A_B_A > A_B_A.recip
perl -e '@cols=(0, 1, 11); while(<>) {s/\r?\n//; @F=split /\t/, $_; print join("\t", @F[@cols]), "\n"} warn "\nJoined columns ", join(", ", @cols), " for $. lines\n\n"' A_B_A.recip > RBHB.out

Formatting and Filtering FASTA Files

If you change FASTA files to tabular format, you can use all of the tools that filter and format tabular data, and change back to FASTA format at the end if desired.

Remove duplicate sequences in FASTA

Some FASTA files (e.g., ESTs) have sequences with different IDs that nonetheless have the same sequence. This protocol removes duplicate sequences based on the nucleotide / amino acid sequence, rather than ID.
Tool NameInput filesOutput
change_fasta_to_tabdup.fastadup.tab
choose_unique_lines_by_coldup.tabunique.tab
change_tab_to_fastaunique.tabunique.fasta
perl -e '$count=0; $len=0; while(<>) {s/\r?\n//; s/\t/ /g; if (s/^>//) { if ($. != 1) {print "\n"} s/ |$/\t/; $count++; $_ .= "\t";} else {s/ //g; $len += length($_)} print $_;} print "\n"; warn "\nConverted $count FASTA records in $. lines to tabular format\nTotal sequence length: $len\n\n";'dup.fasta > dup.tab
perl -e '$column = 2; $unique=0; while(<>) {s/\r?\n//; @F=split /\t/, $_; if (! ($save{$F[$column]}++)) {print "$_\n"; $unique++}} warn "\nChose $unique unique lines out of $. total lines.\nRemoved duplicates in column $column.\n\n"' dup.tab > unique.tab
perl -e '$len=0; while(<>) {s/\r?\n//; @F=split /\t/, $_; print ">$F[0]"; if (length($F[1])) {print " $F[1]"} print "\n"; $s=$F[2]; $len+= length($s); $s=~s/.{60}(?=.)/$&\n/g; print "$s\n";} warn "\nConverted $. tab-delimited lines to FASTA format\nTotal sequence length: $len\n\n";' unique.tab > unique.fasta

Remove short sequences in FASTA

Some FASTA files have very short sequences (e.g., with N's or repeat elements removed), that will not provide meaningful results when running BLAST or other analyses. This protocol removes sequences whose lengths are less than 30 nucleotides / amino acids. (In order to change the length, cut and paste the whole protocol into an editor and change the $limit parameter in the second tool.)
Tool NameInput filesOutput
change_fasta_to_tablong_short.fastalong_short.tab
calc_col_lengthlong_short.tabls_length.tab
choose_lines_col_more_than_limitls_length.tabonly_long.tab
choose_colsonly_long.tabonly_long_three_cols.tab
change_tab_to_fastaonly_long_three_cols.tabonly_long.fasta
perl -e '$count=0; $len=0; while(<>) {s/\r?\n//; s/\t/ /g; if (s/^>//) { if ($. != 1) {print "\n"} s/ |$/\t/; $count++; $_ .= "\t";} else {s/ //g; $len += length($_)} print $_;} print "\n"; warn "\nConverted $count FASTA records in $. lines to tabular format\nTotal sequence length: $len\n\n";'long_short.fasta > long_short.tab
perl -e '$col = 2;' -e 'while (<>) { s/\r?\n//; @F = split /\t/, $_; $len = length($F[$col]); print "$_\t$len\n" } warn "\nAdded column with length of column $col for $. lines.\n\n";' long_short.tab > ls_length.tab
perl -e '$col=3; $limit=30; while(<>) {BEGIN {$count=0} s/\r?\n//; @F=split /\t/, $_; if ($F[$col] > $limit) {$count++; print "$_\n"}} warn "\nChose $count lines out of $..\n\n"' ls_length.tab > only_long.tab
perl -e '@cols=(0, 1, 2); while(<>) {s/\r?\n//; @F=split /\t/, $_; print join("\t", @F[@cols]), "\n"} warn "\nJoined columns ", join(", ", @cols), " for $. lines\n\n"' only_long.tab > only_long_three_cols.tab
perl -e '$len=0; while(<>) {s/\r?\n//; @F=split /\t/, $_; print ">$F[0]"; if (length($F[1])) {print " $F[1]"} print "\n"; $s=$F[2]; $len+= length($s); $s=~s/.{60}(?=.)/$&\n/g; print "$s\n";} warn "\nConverted $. tab-delimited lines to FASTA format\nTotal sequence length: $len\n\n";'only_long_three_cols.tab > only_long.fasta

Change FASTA files to have sixty characters per line of sequence

FASTA files may have entire sequences (hundreds or thousands of characters) in a single line. A few programs cannot except these long sequence lines. To change a FASTA file to have only 60 characters per line, all you need to do is translate the FASTA file to tabular format and back.
Tool NameInput filesOutput
change_fasta_to_tablong_seq.fastalong_seq.tab
change_tab_to_fastalong_seq.tabshort_seq.fasta
perl -e '$count=0; $len=0; while(<>) {s/\r?\n//; s/\t/ /g; if (s/^>//) { if ($. != 1) {print "\n"} s/ |$/\t/; $count++; $_ .= "\t";} else {s/ //g; $len += length($_)} print $_;} print "\n"; warn "\nConverted $count FASTA records in $. lines to tabular format\nTotal sequence length: $len\n\n";'long_seq.fasta > long_seq.tab
perl -e '$len=0; while(<>) {s/\r?\n//; @F=split /\t/, $_; print ">$F[0]"; if (length($F[1])) {print " $F[1]"} print "\n"; $s=$F[2]; $len+= length($s); $s=~s/.{60}(?=.)/$&\n/g; print "$s\n";} warn "\nConverted $. tab-delimited lines to FASTA format\nTotal sequence length: $len\n\n";' long_seq.tab > short_seq.fasta

TODO: Choose FASTA sequences containing an exact subsequence

Note: If the ID or description column happen to contain the sequence, then choose_lines_matching_text will find them. This shouldn't happen very often. It would be nicer if you could search within a column, though

Change files

The scripts in this section perform simple transformations of entire files or lines in files.
To use a script, cut and paste the code from the light green or blue box into a terminal window, change the bold, red text as needed, and hit Enter.
See More Information for notes on using these tools.

Change columns in each line of files

Change tabular data with a given separator to tab-separated values (change_any_separator_to_tab)

This tool is important because most of the Scriptome tools require tab-separated data.
Warning: a few weird separators (like ' or ``) might not work. (It might help to put a backslash before it.)
$sepCharacter(s) to change to tab
Input file(s)
Output file
perl -e ' $sep=","; while(<>) { s/\Q$sep\E/\t/g; print $_; } warn "Changed $sep to tab on $. lines\n" ' file.csv > file.tab
Example: Change comma-separated file.csv to tab-separated file.tab by running the above script.
Input file (file.csv)Output file (file.tab)Screen Output
 Fly,7
 Human,14
 Worm,28
 Yeast,35
 Fly    7
 Human  14
 Worm   28
 Yeast  35
 Changed , to tab on 4 lines
Example 2: Given a list of Swiss-Prot identifiers, separate the protein name and species abbreviation into two separate columns. Run the above script using $sep="_"

Change tab-separated data to use a different delimiter (change_tab_to_any_separator)

Replace the tabs in tab-separated data with some other separator. The separator does not have to be one character: "---" would work, for example, or even "" to merge all columns.
This tool is important because most of the Scriptome tools require tab-separated data. After running one or more Scriptome tools, use this script to export data back to other programs which expect comma-separated data, for example.
Warning: a few weird separators (like ' or ``) might not work. Also, if there's a comma in your data, and you change to a comma separator, you'll get too many columns.
$sepCharacter(s) to change tab to
Input file(s)
Output file
perl -e ' $sep=","; while(<>) { s/\t/$sep/g; print $_; } warn "Changed tab to $sep on $. lines\n" ' file.tab > file.csv
Example: Change tab-separated file.tab to comma-separated file.csv by running the above script.
Input file (file.tab)Output file (file.csv)Screen Output
 Fly    7
 Human  14
 Worm   28
 Yeast  35
 Fly,7
 Human,14
 Worm,28
 Yeast,35
 Changed tab to , on 4 lines

Reorder columns

Use choose_cols. Choose some or all of the columns, in whatever order you want. To switch the order of the first two columns of a ten-column file, you could set @cols to be 1, 0, 2..9.

Change entire lines in files

Remove spaces in a line (change_remove_spaces)

Remove all spaces (but not tabs) from a line.
Input file(s)
Output file
perl -e ' while(<>) { s/ //g; print $_; } warn "Removed all spaces from $. lines\n" ' file.spaces > file.nospace
Example: TODO
See Also: remove empty lines (Choose)

Change all characters to upper case (change_upper_case)

Change all characters in each line to upper case. Numbers and punctuation will not be changed. (Change "uc" in the script to "lc" to get lower case.)
Input file(s)
Output file
perl -e ' while(<>) { print uc($_); } warn "Changed $. lines to upper case\n" ' file.mixed > file.uc
Example: Change a list of gene names to upper-case, to compare with another list.

Change between different biological data formats

Change a FASTA file into tabular format (change_fasta_to_tab)

Change each FASTA sequence in a file into one line of three, tab-separated columns: the ID (not including the '>'); the rest of the description line (or an empty column if the description line contains only an ID); and the sequence itself.
Once you have run this script, you can use the many Scriptome tools that work on tab-separated data.
Note: translating to FASTA format and back will generate a file with the same information, but the files may not be identical. This tool will replace any tabs with single spaces (otherwise the tabular output file will have too many columns) and removes any spaces from the amino acid or nucleic acid sequences.
Input file(s)
Output file
perl -e ' $count=0; $len=0; while(<>) { s/\r?\n//; s/\t/ /g; if (s/^>//) { if ($. != 1) { print "\n" } s/ |$/\t/; $count++; $_ .= "\t"; } else { s/ //g; $len += length($_) } print $_; } print "\n"; warn "\nConverted $count FASTA records in $. lines to tabular format\nTotal sequence length: $len\n\n"; ' seqs.fna > seqs.tab
Example: Run the above script on seqs.fna to get seqs.tab.
Input file (seqs.fna)Output file (seqs.tab)Screen Output
 >CG123 A small sequence
 ACGTTGCA
 GTTACCAG
 >EG12
 ACCGGA
 >DG124  A smaller sequence
 GTTACCAG
 CG123  A small sequence        ACGTTGCAGTTACCAG
 EG12   ACCGGA
 DG124   A smaller sequence     GTTACCAG
 
 Converted 3 FASTA records in 7 lines to tabular format
 Total sequence length: 30

Change a tabular file into FASTA format (change_tab_to_fasta)

Change each line in a three-column, tab-separated file (containing ID, description and sequence - e.g., a file created by the above change_tab_to_fasta tool) to FASTA sequence.
Note: translating to FASTA format and back will generate a file with the same information, but the files may not be identical. This tool will put a single space between the ID and the description, and will put 60 characters per line in the sequence portion.
Input file(s)
Output file
perl -e ' $len=0; while(<>) { s/\r?\n//; @F=split /\t/, $_; print ">$F[0]"; if (length($F[1])) { print " $F[1]" } print "\n"; $s=$F[2]; $len+= length($s); $s=~s/.{60}(?=.)/$&\n/g; print "$s\n"; } warn "\nConverted $. tab-delimited lines to FASTA format\nTotal sequence length: $len\n\n"; ' seqs.tab >seqs.fasta
Example: Run the above script on seqs.tab to get seqs.fasta.
Input file (seqs.tab)Output file (seqs.fasta)Screen Output
 CG123  A small sequence        ACGTTGCAGTTACCAG
 EG12           ACCGGA
 DG124   A smaller sequence     GTTACCAG
 >CG123 A small sequence
 ACGTTGCAGTTACCAG
 >EG12
 ACCGGA
 >DG124  A smaller sequence
 GTTACCAG
 
 Converted 3 tab-delimited lines to FASTA format
 Total sequence length: 30

Change from one biological format to another (change_bio_format_to_bio_format)

Change files with one or more sequences into a different format. The input and output formats can be embl, fasta, gcg, genbank, swiss, or a whole bunch of other formats: see
The Bioperl SeqIO HOWTO
for details.
Warning: Converting from genbank to FASTA (for example) will necessarily lose some annotation information.
This script requires Bioperl to be installed (on whichever machine the script runs on). Many biology computers will have it installed. If the script breaks because it "can't locate Bio/Perl.pm", you can download Bioperl from bioperl.org.
$informatInput file format
$outformatOutput file format
Input file(s)
Output file
perl -MBio::SeqIO -e ' $informat="genbank"; $outformat="fasta"; $count = 0; for $infile (@ARGV) { $in = Bio::SeqIO->newFh(-file => $infile , -format => $informat); $out = Bio::SeqIO->newFh(-format => $outformat); while (<$in>) { print $out $_; $count++; } } warn "Translated $count sequences from $informat to $outformat format\n" ' myseqs.genbank > myseqs.fasta
Example: TODO

Change entire files

Transpose a table (change_transpose_table)

Change rows to columns and vice versa, for a tab-separated file. Data should have the same number of columns in every row.
Input file(s)
Output file
perl -e ' $unequal=0; $_=<>; s/\r?\n//; @out_rows = split /\t/, $_; $num_out_rows = $#out_rows+1; while(<>) { s/\r?\n//; @F = split /\t/, $_; foreach $i (0 .. $#F) { $out_rows[$i] .= "\t$F[$i]"; } if ($num_out_rows != $#F+1) { $unequal=1; } } END { foreach $row (@out_rows) { print "$row\n" } warn "\nWARNING! Rows in input had different numbers of columns\n" if $unequal; warn "\nTransposed table: result has $. columns and $num_out_rows rows\n\n" } ' original.tab > transposed.tab
Example: Transpose the table original.tab to get transposed.tab.
Input file
(original.tab)
 Top    Col2    Col3
 Row2   r2c2    r2c3
 Row3   r3c2    r3c3
 Row4   r4c2    r4c3
 Row5   r5c2    r5c3
Output file
(transposed.tab)
 Top    Row2    Row3    Row4    Row5
 Col2   r2c2    r3c2    r4c2    r5c2
 Col3   r2c3    r3c3    r4c3    r5c3
Screen Output
 
 Transposed table: result has 5 columns and 3 rows

Split big FASTA file into smaller files (change_split_fasta)

Split one big FASTA file into multiple smaller ones. If the output filename template is small_NUMBER.fasta, the output files will be called small_1.fastasmall_2.fasta, etc.
$split_seqsNumber of sequences per output file
$out_templateTemplate for output file name
Input file(s)
perl -e ' $split_seqs=3$out_template="small_NUMBER.fasta"; $count=0; $filenum=0; $len=0; while (<>) { s/\r?\n//; if (/^>/) { if ($count % $split_seqs == 0) { $filenum++; $filename = $out_template; $filename =~ s/NUMBER/$filenum/g; if ($filenum > 1) { close SHORT } open (SHORT, ">$filename") or die $!; } $count++; } else { $len += length($_) } print SHORT "$_\n"; } close(SHORT); warn "\nSplit $count FASTA records in $. lines, with total sequence length $len\nCreated $filenum files like $filename\n\n"; ' big.fasta
Example: Split big.fasta, with five sequences, into two files, small_1.fasta and small_2.fasta. (Since there are only five sequences, the second file has only two sequences in it.)
Input file
(big.fasta)
 >seq1
 ACCTTGTCGCA
 >seq2
 ACCTTGTCGCAAAGC
 >seq3
 ACCTTGTCGCACCGGAACGA
 >seq4
 ACCTTGTCGCACCGGAACGACCGGAACGA
 >seq5
 GTCGCA
Output file 1
(small_1.fasta)
 >seq1
 ACCTTGTCGCA
 >seq2
 ACCTTGTCGCAAAGC
 >seq3
 ACCTTGTCGCACCGGAACGA
Output file 2
(small_2.fasta)
 >seq4
 ACCTTGTCGCACCGGAACGACCGGAACGA
 >seq5
 GTCGCA
Screen Output
 Split 5 FASTA records in 10 lines, with total sequence length 81
 Created 2 files like small_2.fasta

Choose data from a file

The scripts in this section select certain information from an input file. Some scripts choose lines that meet a certain criterion; others choose certain columns from tabular data.
To use a script, cut and paste the code from the light green or blue box into a terminal window, change the bold, red text as needed, and hit Enter.
See More Information for notes on using these tools.

Choose columns

Print all lines from a file, but print only certain columns from each line.

Choose columns. Optionally reorder them (choose_cols)

Print one or more columns from tab-separated data. The column numbers can be given in any order, so this tool can also be used to rearrange the column order.
@colsColumn(s) to choose (in desired order)
Input file(s)
Output file
perl -e ' @cols=(1, -1, 2); while(<>) { s/\r?\n//; @F=split /\t/, $_; print join("\t", @F[@cols]), "\n" } warn "\nChose columns ", join(", ", @cols), " for $. lines\n\n" ' all_cols > some_cols_chosen
Example: Get second, last, and third column (subject sequence, score and percent identity) from blast tabular output file all_cols to get some_cols_chosen. (Note: some columns removed from input rows for simplicity.)
Input file
(all_cols)
 NP_438174.1    NP_110118.1    43.41   26       331     31      331     1e-61    228
 NP_438174.1    NP_110197.1    33.33   251      319     157     214     0.51    26.6
 NP_438174.1    NP_110131.1    31.75   202      257     33      94      1.9     24.6
 NP_438174.1    NP_110177.1    21.67   207      326     321     433     2.5     24.3
Output file
(some_cols_chosen)
 NP_110118.1    228     43.41
 NP_110197.1    26.6    33.33
 NP_110131.1    24.6    31.75
 NP_110177.1    24.3    21.67
Screen Output
 Chose columns 1, -1, 2 for 4 lines

Delete columns (choose_cols_to_delete)

@del_colColumn(s) to delete
Input file(s)
Output file
perl -e ' @del_col=(0, 3..6, -1); while(<>) { s/\r?\n//; @F=split /\t/, $_; foreach $col (sort { $b <=> $a } @del_col) { splice @F, $col, 1 }; print join("\t", @F),"\n"; } warn "\nDeleted columns ", join(", ", @del_col), " for $. lines\n\n" ' all_cols > some_cols_deleted
Example: Delete first, fourth through seventh column and last column from a blast tabular output. Retain just subject sequence ID, percent identity, and e-value. (Note: some columns removed from input rows for clarity.) Run the following script on all_cols to get some_cols_deleted.
Input file
(all_cols)
 NP_438174.1    NP_110118.1     43.41   26      331     31      331     1e-61   228
 NP_438174.1    NP_110197.1     33.33   251     319     157     214     0.51    26.6
 NP_438174.1    NP_110131.1     31.75   202     257     33      94      1.9     24.6
 NP_438174.1    NP_110177.1     21.67   207     326     321     433     2.5     24.3
Output file
(some_cols_deleted)
 NP_110118.1    43.41   1e-61
 NP_110197.1    33.33   0.51
 NP_110131.1    31.75   1.9
 NP_110177.1    21.67   2.5
Screen Output
 Deleted columns 0, 3, 4, 5, 6, -1 for 4 lines
To choose lines containing certain values in certain columns, see Choose lines matching a specific text string or pattern.

Choose lines matching a specific text string or pattern

Choose lines containing a specific string, like "blah" or "23" (The latter will also match 123456), or lines matching a pattern, like "line begins with the letters 'CG'".

Choose lines containing a given text string (choose_lines_matching_text)

Choose lines containing a text string (which should be placed inside the q{} in the tool. If you input "bc" , "abcd" will match. If you input "23", "1.234" will match. Warning: Add a backslash to match ' or }. I.e., to match "Joe's", type "Joe\'s".
$stringText to match
Input file(s)
Output file
perl -e ' $string=q{>CG}$count=0; while(<>) { if (/\Q$string\E/) { print $_; $count++ } } warn "\nChose $count lines with string [$string] out of $. total lines.\n"; ' a.fsa > descs
Example: Get certain description lines from a FASTA file. Run the above script on a file called a.fsa to get descs, which will have all lines containing a '>CG'.
Input file (a.fsa)Output file (descs)Screen Output
 >CG123
 ACGTTGCA
 GTTACCAG
 >DG124
 GTTACCAG
 >CG123
 Chose 1 lines with string [>CG] out of 5 total lines.

TODO: Choose lines NOT containing a given text string (choose_lines_not_matching_text)

Choose lines not containing an exact text string.

TODO: Choose lines where a given column is a given text string (choose_lines_col_equals_text)

Choose lines where the string in a given column equals a given text string. (If 'bc' is given, 'abcd' will NOT match.)

TODO: Choose lines where a given column is a given number (choose_lines_col_equals_number)

Choose lines where the value in a given column equals a certain number. (If '23' is given, '1234' will NOT match.)

Remove empty lines or lines containing just spaces (choose_nonempty_lines)

Remove any lines in a file that are completely empty, as well as lines that have spaces or tabs in them, but no other text.
Input file(s)
Output file
perl -e ' $count=0; while(<>) { if (/^\s*$/) { $count++ } else { print $_ } } warn "\nRemoved $count blank/whitespace lines out of $. total lines.\n\n" ' a.fasta > noblanks.fasta
Example: Remove empty lines from a FASTA file. Run the above script on a file called a.fsa to get noblanks, which will have all non-blank lines.
Input file (a.fasta)Output file (noblanks.fasta)Screen Output
 >CG123
 ACGTTGCA
         
 GTTACCAG
 >DG124
 
 GTTACCAG
 >CG123
 ACGTTGCA
 GTTACCAG
 >DG124
 GTTACCAG
 Removed 2 blank/whitespace lines out of 7 total lines.

Choose lines where a column matches some mathematical criterion

Choose lines where a given column is greater than a given limit (choose_lines_col_more_than_limit)

Choose all lines in a tab-separated file where the value in a given column is numerically greater than a given limit.
$colColumn to limit
$limitMinimum value in column
Input file(s)
Output file
perl -e ' $col=-1$limit=80$count=0; while(<>) { s/\r?\n//; @F=split /\t/, $_; if ($F[$col] > $limit) { $count++; print "$_\n" } } warn "\nChose $count lines out of $..\n\n" ' all.blast > only_big.blast
Example: Filter a previously-run BLAST tabular output all.blast. Pick only hits with bit score greater than 80 and put them in only_big.blast. Score is in the last (-1) column. (Note: some columns removed from input rows for simplicity.)
Input file
(all.blast)
 NP_438174.1    NP_110118.1     43.41   26      331     31      331     1e-61   228
 NP_438174.1    NP_110197.1     33.33   251     319     157     214     0.51    26.6
 NP_438174.1    NP_110131.1     31.75   202     257     33      94      1.9     24.6
 NP_438174.1    NP_110177.1     21.67   207     326     321     433     2.5     24.3
Output file
(only_big.blast)
 NP_438174.1    NP_110118.1     43.41   26      331     31      331     1e-61   228
Screen Output
 Chose 1 lines out of 4.

Choose lines where a given column is less than a given limit (choose_lines_col_less_than_limit)

Choose all lines in a tab-separated file where the value in a given column is numerically less than a given limit.
$colColumn to limit
$limitMaximum value in column
Input file(s)
Output file
perl -e ' $col=-2$limit=1e-10$count=0; while(<>) { s/\r?\n//; @F=split /\t/, $_; if ($F[$col] < $limit) { $count++; print "$_\n" } } warn "\nChose $count lines out of $..\n\n" ' all.blast > only_small.blast
Example: Filter a previously-run BLAST tabular output all.blast. Pick only hits with bit e-value less than 1e-10 and put them in only_small.blast. E-value is in the second to last (-2) column. (Note: some columns removed from input rows for simplicity.)
Input file
(all.blast)
 NP_438174.1    NP_110118.1     43.41   26      331     31      331     1e-61   228
 NP_438174.1    NP_110197.1     33.33   251     319     157     214     0.51    26.6
 NP_438174.1    NP_110131.1     31.75   202     257     33      94      1.9     24.6
 NP_438174.1    NP_110177.1     21.67   207     326     321     433     2.5     24.3
Output file
(only_small.blast)
 NP_438174.1    NP_110118.1     43.41   26      331     31      331     1e-61   228
Screen Output
 Chose 1 lines out of 4.

Choose lines with highest value in a column (choose_lines_with_col_max)

Find the maximum value in a given column, and print all lines that have that value in that column. (There may be more than one line with the maximum value.)
$colColumn to find maximum in
Input file(s)
Output file
perl -e ' $col=1while(<>) { s/\r?\n//; @F=split /\t/, $_; $s = $F[$col]; if (! defined($max) || $s > $max) { $max = $s; @best = () }; if ($s == $max) { push @best, "$_\n" }; } print @best; warn "\nChose ", scalar(@best), " lines out of $.\nMaximum value: $max\n\n"; ' fold_change.tab > col_max.tab
Example: Given a list of genes and fold changes in fold_change.tab, print the genes with the greatest fold change to col_max.tab. In this case, two genes have the same high fold change.
Input file
(fold_change.tab)
 COX2   3.1
 HSP90  1.7
 FGFR2  1.1
 AGFA   3.1
 PERU   1.05
Output file
(col_max.tab)
 COX2   3.1
 AGFA   3.1
Screen Output
 Chose 2 lines out of 5
 Maximum value: 3.1
Example: Set $col=0 to find the maximum in a simple list of numbers.

Choose lines with lowest value in a column (choose_lines_with_col_min)

Find the minimum value in a given column, and print all lines that have that value in that column. (There may be more than one line with the minimum value.)
$colColumn to find minimum in
Input file(s)
Output file
perl -e ' $col=1while(<>) { s/\r?\n//; @F=split /\t/, $_; $s = $F[$col]; if (! defined($min) || $s < $min) { $min = $s; @best = () }; if ($s == $min) { push @best, "$_\n" }; } print @best; warn "Chose ", scalar(@best), " lines out of $.\nMinimum value: $min\n\n"; ' blast.tab > col_min.tab
Example: TODO
Example: Set $col=0 to find the minimum in a simple list of numbers.

Choose lines with highest score for each name (choose_lines_with_max_per_name)

For each "name" (values in column m), find the maxium "score" (value in column n). Print all lines with each name, highest-score pair. (There may be more than one line for a given name that has the highest score.) The names will be printed in order, based on when they first appear in the file. For each name, the highest-scoring lines will be printed in the order they appear in the file.
$name_colColumn with identifiers / names
$score_colColumn with scores / values
Input file(s)
Output file
perl -e ' $name_col=0$score_col=2while(<>) { s/\r?\n//; @F=split /\t/, $_; ($n, $s) = @F[$name_col, $score_col]; if (! exists($max{$n})) { push @names, $n }; if (! exists($max{$n}) || $s > $max{$n}) { $max{$n} = $s; $best{$n} = () }; if ($s == $max{$n}) { $best{$n} .= "$_\n" }; } for $n (@names) { print $best{$n} } ' blast.tab > best_hits.tab
Example: Run the script on a BLAST hit table to find the BLAST hit for each query (query names in the first column) with the highest percent identity (in the third column).
TODO

Choose lines with lowest score for each name (choose_lines_with_min_per_name)

For each "name" (values in column m), find the minium "score" (value in column n). Print all lines with each name, lowest-score pair. (There may be more than one line for a given name that has the lowest score.) The names will be printed in order, based on when they first appear in the file. For each name, the lowest-scoring lines will be printed in the order they appear in the file.
$name_colColumn with identifiers / names
$score_colColumn with scores / values
Input file(s)
Output file
perl -e ' $name_col=0$score_col=-2while (<>) { s/\r?\n//; @F=split /\t/, $_; ($n, $s) = @F[$name_col, $score_col]; if (! exists($min{$n})) { push @names, $n }; if (! exists($min{$n}) || $s < $min{$n}) { $min{$n} = $s; $best{$n} = () }; if ($s == $min{$n}) { $best{$n} .= "$_\n"; } } for $n (@names) { print $best{$n}; } ' blast.tab > best_hits.tab
Example: Run the script on a BLAST hit table to find the BLAST hit for each query (query names in the first column) with the lowest e-value (in the second to last column).
TODO

Choose lines by comparing values in two columns

Choose lines where column m has the same text string as column n (choose_lines_col_m_equals_col_n_alpha)

Print all lines for which column m equals column n. ".1" and "0.1" are not considered equal. (See choose_lines_col_m_equals_col_n_num.)
$colmText column to compare
$colnOther text column to compare
Input file(s)
Output file
perl -e ' $colm=0$coln=1$count=0; while(<>) { s/\r?\n//; @F=split /\t/, $_; if ($F[$colm] eq $F[$coln]) { print "$_\n"; $count++ } } warn "\nChose $count lines out of $. where column $colm had same text as column $coln\n\n"; ' infile > outfile
Example: TODO

Choose lines where column m numerically equals column n (choose_lines_col_m_equals_col_n_num)

Print all lines for which column m equals column n. ".1" and "0.1" are considered equal. (See choose_lines_col_m_equals_col_n_alpha.)
$colmNumerical column to compare
$colnOther numerical column to compare
Input file(s)
Output file
perl -e ' $colm=0$coln=1$count=0; while(<>) { s/\r?\n//; @F=split /\t/, $_; if ($F[$colm] == $F[$coln]) { print "$_\n"; $count++ } } warn "\nChose $count lines out of $. where column $colm was numerically equal to column $coln\n\n"; ' infile > outfile
Example: TODO

Choose lines where column m is greater than column n (choose_lines_col_m_more_than_col_n)

Print all lines for which column m has a higher value than column n.
$colmNumerical column to compare
$colnOther numerical column to compare
Input file(s)
Output file
perl -e ' $colm=0$coln=1$count=0; while(<>) { s/\r?\n//; @F=split /\t/, $_; if ($F[$colm] > $F[$coln]) { print "$_\n"; $count++ } } warn "\nChose $count lines out of $. where column $colm was greater than column $coln\n\n"; ' infile > outfile
Example: TODO

Choose numbered lines or records from a file

Choose first N lines (choose_first_n_lines)

Print the first n lines from a file. (Or just use the UNIX command "head".)
$num_linesNumber of lines to choose
Input file(s)
Output file
perl -e ' $num_lines=2while(<>) { if ($. <= $num_lines) { print $_ } else { exit } } ' infile > outfile
Example: TODO

Choose last N lines (choose_last_n_lines)

Print the last n lines from a file. (Or just use the UNIX command "tail".)
$num_linesNumber of lines to choose
Input file(s)
Output file
perl -e ' $num_lines=2while(<>) { push @save, $_; if (@save > $num_lines) { shift @save } } print @save ' infile > outfile
Example: TODO

Remove first N lines (choose_all_but_first_n_lines)

Remove the first n lines from a file.
$del_linesNumber of lines to delete
Input file(s)
Output file
perl -e ' $del_lines=1while(<>) { if ($. > $del_lines) { print $_ } } ' infile > outfile
Example: Remove headers from tabular data. TODO

TODO: Choose first N FASTA records

TODO: Choose Nth FASTA record

Remove duplicate lines or records from a file

Remove duplicate lines from a file (choose_unique_lines)

Print lines from a file, removing duplicates. Only print the first occurrence of any given line. Note: having the same first column in tabular data is not the same as being a duplicate.
Input file(s)
Output file
perl -e ' $unique=0; while(<>) { if (!($save{$_}++)) { print $_; $unique++ } } warn "\nChose $unique unique lines out of $. total lines.\n\n" ' genes > unique
Example: Given a gene, remove any duplicates. Run the above script on file genes to get a file called unique:
Input file (genes)Output file (unique)Screen Output
 ap23
 ap23
 CG2500
 cxb7
 ap23
 CG2500
 cxb7
 Chose 3 unique lines out of 4 total lines.

Remove lines where values are repeated in a given column (choose_unique_lines_by_col)

Print lines from a tab-separated file, removing duplicates, where a duplicate is defined as a repeated value in a given column, even if other parts of the line are different. Only print the first line where each value appears in that given column. Note: 0.1 and .1 are not considered duplicates.
$columnColumn in which to look for unique values
Input file(s)
Output file
perl -e ' $column=0$unique=0; while(<>) { s/\r?\n//; @F=split /\t/, $_; if (! ($save{$F[$column]}++)) { print "$_\n"; $unique++ } } warn "\nChose $unique unique lines out of $. total lines.\nRemoved duplicates in column $column.\n\n" ' hits > unique_hits
Example: Given a list containing target genes, hit genes, and scores, take only one hit per target gene. (If you only want the best hit for each target, use Choose lines with highest score for each name (choose_lines_with_max_per_name)). Run the above script on file hits to get a file called unique_hits:
Input file (hits)Output file (unique_hits)Screen Output
 ap23   ap24    25
 ap23   mmm     50
 CG2500 mmm     10
 cxb7   gly1    5
 ap23   ap24    25
 CG2500 mmm     10
 cxb7   gly1    5
 Chose 3 unique lines out of 4 total lines.
 Removed duplicates in column 0.

Remove duplicate FASTA records from a file (choose_unique_FASTA)

Print FASTA records (descriptions and sequences) from a file. Print only the first record for a given FASTA ID (everything on the description line up to the first space). So if a record has an already-printed ID with a different text description, it will still not be printed.
Input file(s)
Output file
perl -e ' $unique=0; $total=0; while(<>) { if (/^>\S+/) { $total++; if (! ($seen{$&}++)) { $unique++; $print_it = 1 } else { $print_it = 0 } }; if ($print_it) { print $_ }; } warn "\nChose $unique unique FASTA records out of $total total.\n\n"; ' a.fsa > unique.fsa
Example: Remove duplicate FASTAs. Run the above script on file a.fsa to get a file called unique.fsa:
Input file (a.fsa)Output file (unique.fsa)Screen Output
 >CG123
 ACGTTGCA
 GTTACCAG
 >DG124
 GTTACCAG
 >CG123 second time
 ACGTTGCA
 GTTACCAG
 >CG123
 ACGTTGCA
 GTTACCAG
 >DG124
 GTTACCAG
 Chose 2 unique FASTA records out of 3 total.

Choose lines or records from a file, using a list in another file

See the Merge page for more examples of this kind of choosing.

Choose lines from a table based on a list (choose_table_rows_from_list)

Given a table of tab-separated lines, choose only the rows where a given column contains values from a list.
This is just a special case of merge_lines_based_on_shared_column. Use column 0 for the list - i.e., treat the list as a table with just one column.

Choose a set of FASTA sequences from a file (choose_FASTAs_from_list)

Given a list of FASTA IDs and a FASTA file, print out the FASTA sequences for the given IDs. Sequences will be printed out only once, even if the ID/sequence appears more than once in either file. It is possible that some or all of the IDs in the list won't be found in the FASTA file.
A '>' and/or description in the ID list will be ignored; only the ID is read.
Input file(s)File with FASTA identifiers (no > signs)
Input file(s)FASTA file
Output file
perl -e ' ($id,$fasta)=@ARGV; open(ID,$id); while (<ID>) { s/\r?\n//; /^>?(\S+)/; $ids{$1}++; } $num_ids = keys %ids; open(F, $fasta); $s_read = $s_wrote = $print_it = 0; while (<F>) { if (/^>(\S+)/) { $s_read++; if ($ids{$1}) { $s_wrote++; $print_it = 1; delete $ids{$1} } else { $print_it = 0 } }; if ($print_it) { print $_ } }; END { warn "Searched $s_read FASTA records.\nFound $s_wrote IDs out of $num_ids in the ID list.\n" } ' id_list a.fsa > found.fsa
Example: Run the above script on the list id_list and the FASTA file a.fsa to get found.fsa.
Input list file (id_list)input FASTA file (a.fsa)Output file (found.fsa)Screen Output
 CG123 A sequence I want
 CGnot_here
 >CG123 First time
 ACGTTGCA
 GTTACCAG
 >DG124
 GTTACCAG
 >CG123 second time
 ACGTTGCA
 GTTACCAG