Sequence Analysis Protocols
Protocols for data manipulation related to sequence analysis.
To use a protocol, just cut and paste the scripts in the colored/dashed-line boxes onto a UNIX/Mac or Windows command line. You can copy and run the individual tools one by one, or all at once if you're feeling lucky.
If you're happy with the filenames given in the protocol, just make sure that your starting
file(s)
have the right name. Otherwise, you can copy the whole protocol into an editor (vi, Notepad) and change the filenames, then cut and paste onto the command line.Analyzing BLAST Results
Reciprocal Best Hit BLAST
A method for finding potential orthologs (evolutionarily related sequences).
Input: files A_B and B_A contain all-against-all blasts of genome A against genome B and vice versa. The blasts should be done with the -m8 option for tabular output.
Tool Name | Input files | Output |
---|---|---|
choose_lines_with_max_per_name | A_B | A_B.best |
choose_lines_with_max_per_name | B_A | B_A.best |
merge_lines_based_on_shared_column | A_B.best B_A.best | A_B_A |
choose_lines_col_m_equals_col_n_alpha | A_B_A | A_B_A.recip |
choose_cols | A_B_A.recip | RBHB.out |
Note: if there are multiple hits with the same score for a given query, you may get multiple "best hits". In that case, you may want to use a tool to remove duplicates in the query columns.
perl -e '$name_col=0; $score_col=11; while(<>) {s/\r?\n//; @F=split /\t/, $_; ($n, $s) = @F[$name_col, $score_col]; if (! exists($max{$n})) {push @names, $n}; if (! exists($max{$n}) || $s > $max{$n}) {$max{$n} = $s; $best{$n} = ()}; if ($s == $max{$n}) {$best{$n} .= "$_\n"};} for $n (@names) {print $best{$n}}' A_B > A_B.best |
perl -e '$name_col=0; $score_col=11; while(<>) {s/\r?\n//; @F=split /\t/, $_; ($n, $s) = @F[$name_col, $score_col]; if (! exists($max{$n})) {push @names, $n}; if (! exists($max{$n}) || $s > $max{$n}) {$max{$n} = $s; $best{$n} = ()}; if ($s == $max{$n}) {$best{$n} .= "$_\n"};} for $n (@names) {print $best{$n}}' B_A > B_A.best |
perl -e '$col1=1; $col2=0;' -e '($f1,$f2)=@ARGV; open(F1,$f1); while (<F1>) {s/\r?\n//; @F=split /\t/, $_; $line1{$F[$col1]} .= "$_\n"} open(F2,$f2); while (<F2>) {s/\r?\n//;@F=split /\t/, $_; if ($x = $line1{$F[$col2]}) {$x =~ s/\n/\t$_\n/g; print $x}}' A_B.best B_A.best > A_B_A |
perl -e '$colm=0; $coln=13; $count=0; while(<>) {s/\r?\n//; @F=split /\t/, $_; if ($F[$colm] eq $F[$coln]) {print "$_\n"; $count++}} warn "\nChose $count lines out of $. where column $colm had same text as column $coln\n\n";' A_B_A > A_B_A.recip |
perl -e '@cols=(0, 1, 11); while(<>) {s/\r?\n//; @F=split /\t/, $_; print join("\t", @F[@cols]), "\n"} warn "\nJoined columns ", join(", ", @cols), " for $. lines\n\n"' A_B_A.recip > RBHB.out |
Formatting and Filtering FASTA Files
If you change FASTA files to tabular format, you can use all of the tools that filter and format tabular data, and change back to FASTA format at the end if desired.
Remove duplicate sequences in FASTA
Some FASTA files (e.g., ESTs) have sequences with different IDs that nonetheless have the same sequence. This protocol removes duplicate sequences based on the nucleotide / amino acid sequence, rather than ID.
Tool Name | Input files | Output |
---|---|---|
change_fasta_to_tab | dup.fasta | dup.tab |
choose_unique_lines_by_col | dup.tab | unique.tab |
change_tab_to_fasta | unique.tab | unique.fasta |
perl -e '$count=0; $len=0; while(<>) {s/\r?\n//; s/\t/ /g; if (s/^>//) { if ($. != 1) {print "\n"} s/ |$/\t/; $count++; $_ .= "\t";} else {s/ //g; $len += length($_)} print $_;} print "\n"; warn "\nConverted $count FASTA records in $. lines to tabular format\nTotal sequence length: $len\n\n";'dup.fasta > dup.tab |
perl -e '$column = 2; $unique=0; while(<>) {s/\r?\n//; @F=split /\t/, $_; if (! ($save{$F[$column]}++)) {print "$_\n"; $unique++}} warn "\nChose $unique unique lines out of $. total lines.\nRemoved duplicates in column $column.\n\n"' dup.tab > unique.tab |
perl -e '$len=0; while(<>) {s/\r?\n//; @F=split /\t/, $_; print ">$F[0]"; if (length($F[1])) {print " $F[1]"} print "\n"; $s=$F[2]; $len+= length($s); $s=~s/.{60}(?=.)/$&\n/g; print "$s\n";} warn "\nConverted $. tab-delimited lines to FASTA format\nTotal sequence length: $len\n\n";' unique.tab > unique.fasta |
Remove short sequences in FASTA
Some FASTA files have very short sequences (e.g., with N's or repeat elements removed), that will not provide meaningful results when running BLAST or other analyses. This protocol removes sequences whose lengths are less than 30 nucleotides / amino acids. (In order to change the length, cut and paste the whole protocol into an editor and change the $limit parameter in the second tool.)
Tool Name | Input files | Output |
---|---|---|
change_fasta_to_tab | long_short.fasta | long_short.tab |
calc_col_length | long_short.tab | ls_length.tab |
choose_lines_col_more_than_limit | ls_length.tab | only_long.tab |
choose_cols | only_long.tab | only_long_three_cols.tab |
change_tab_to_fasta | only_long_three_cols.tab | only_long.fasta |
perl -e '$count=0; $len=0; while(<>) {s/\r?\n//; s/\t/ /g; if (s/^>//) { if ($. != 1) {print "\n"} s/ |$/\t/; $count++; $_ .= "\t";} else {s/ //g; $len += length($_)} print $_;} print "\n"; warn "\nConverted $count FASTA records in $. lines to tabular format\nTotal sequence length: $len\n\n";'long_short.fasta > long_short.tab |
perl -e '$col = 2;' -e 'while (<>) { s/\r?\n//; @F = split /\t/, $_; $len = length($F[$col]); print "$_\t$len\n" } warn "\nAdded column with length of column $col for $. lines.\n\n";' long_short.tab > ls_length.tab |
perl -e '$col=3; $limit=30; while(<>) {BEGIN {$count=0} s/\r?\n//; @F=split /\t/, $_; if ($F[$col] > $limit) {$count++; print "$_\n"}} warn "\nChose $count lines out of $..\n\n"' ls_length.tab > only_long.tab |
perl -e '@cols=(0, 1, 2); while(<>) {s/\r?\n//; @F=split /\t/, $_; print join("\t", @F[@cols]), "\n"} warn "\nJoined columns ", join(", ", @cols), " for $. lines\n\n"' only_long.tab > only_long_three_cols.tab |
perl -e '$len=0; while(<>) {s/\r?\n//; @F=split /\t/, $_; print ">$F[0]"; if (length($F[1])) {print " $F[1]"} print "\n"; $s=$F[2]; $len+= length($s); $s=~s/.{60}(?=.)/$&\n/g; print "$s\n";} warn "\nConverted $. tab-delimited lines to FASTA format\nTotal sequence length: $len\n\n";'only_long_three_cols.tab > only_long.fasta |
Change FASTA files to have sixty characters per line of sequence
FASTA files may have entire sequences (hundreds or thousands of characters) in a single line. A few programs cannot except these long sequence lines. To change a FASTA file to have only 60 characters per line, all you need to do is translate the FASTA file to tabular format and back.
Tool Name | Input files | Output |
---|---|---|
change_fasta_to_tab | long_seq.fasta | long_seq.tab |
change_tab_to_fasta | long_seq.tab | short_seq.fasta |
perl -e '$count=0; $len=0; while(<>) {s/\r?\n//; s/\t/ /g; if (s/^>//) { if ($. != 1) {print "\n"} s/ |$/\t/; $count++; $_ .= "\t";} else {s/ //g; $len += length($_)} print $_;} print "\n"; warn "\nConverted $count FASTA records in $. lines to tabular format\nTotal sequence length: $len\n\n";'long_seq.fasta > long_seq.tab |
perl -e '$len=0; while(<>) {s/\r?\n//; @F=split /\t/, $_; print ">$F[0]"; if (length($F[1])) {print " $F[1]"} print "\n"; $s=$F[2]; $len+= length($s); $s=~s/.{60}(?=.)/$&\n/g; print "$s\n";} warn "\nConverted $. tab-delimited lines to FASTA format\nTotal sequence length: $len\n\n";' long_seq.tab > short_seq.fasta |
TODO: Choose FASTA sequences containing an exact subsequence
Note: If the ID or description column happen to contain the sequence, then choose_lines_matching_text will find them. This shouldn't happen very often. It would be nicer if you could search within a column, though