Tuesday, November 13, 2012



Sequence Analysis Protocols

Protocols for data manipulation related to sequence analysis.
To use a protocol, just cut and paste the scripts in the colored/dashed-line boxes onto a UNIX/Mac or Windows command line. You can copy and run the individual tools one by one, or all at once if you're feeling lucky.
If you're happy with the filenames given in the protocol, just make sure that your starting file(s) have the right name. Otherwise, you can copy the whole protocol into an editor (vi, Notepad) and change the filenames, then cut and paste onto the command line.

Analyzing BLAST Results

Reciprocal Best Hit BLAST

A method for finding potential orthologs (evolutionarily related sequences).
Input: files A_B and B_A contain all-against-all blasts of genome A against genome B and vice versa. The blasts should be done with the -m8 option for tabular output.
Tool NameInput filesOutput
choose_lines_with_max_per_nameA_BA_B.best
choose_lines_with_max_per_nameB_AB_A.best
merge_lines_based_on_shared_columnA_B.best B_A.bestA_B_A
choose_lines_col_m_equals_col_n_alphaA_B_AA_B_A.recip
choose_colsA_B_A.recipRBHB.out
Note: if there are multiple hits with the same score for a given query, you may get multiple "best hits". In that case, you may want to use a tool to remove duplicates in the query columns.
perl -e '$name_col=0; $score_col=11; while(<>) {s/\r?\n//; @F=split /\t/, $_; ($n, $s) = @F[$name_col, $score_col]; if (! exists($max{$n})) {push @names, $n}; if (! exists($max{$n}) || $s > $max{$n}) {$max{$n} = $s; $best{$n} = ()}; if ($s == $max{$n}) {$best{$n} .= "$_\n"};} for $n (@names) {print $best{$n}}' A_B > A_B.best
perl -e '$name_col=0; $score_col=11; while(<>) {s/\r?\n//; @F=split /\t/, $_; ($n, $s) = @F[$name_col, $score_col]; if (! exists($max{$n})) {push @names, $n}; if (! exists($max{$n}) || $s > $max{$n}) {$max{$n} = $s; $best{$n} = ()}; if ($s == $max{$n}) {$best{$n} .= "$_\n"};} for $n (@names) {print $best{$n}}' B_A > B_A.best
perl -e '$col1=1; $col2=0;' -e '($f1,$f2)=@ARGV; open(F1,$f1); while (<F1>) {s/\r?\n//; @F=split /\t/, $_; $line1{$F[$col1]} .= "$_\n"} open(F2,$f2); while (<F2>) {s/\r?\n//;@F=split /\t/, $_; if ($x = $line1{$F[$col2]}) {$x =~ s/\n/\t$_\n/g; print $x}}' A_B.best B_A.best > A_B_A
perl -e '$colm=0; $coln=13; $count=0; while(<>) {s/\r?\n//; @F=split /\t/, $_; if ($F[$colm] eq $F[$coln]) {print "$_\n"; $count++}} warn "\nChose $count lines out of $. where column $colm had same text as column $coln\n\n";' A_B_A > A_B_A.recip
perl -e '@cols=(0, 1, 11); while(<>) {s/\r?\n//; @F=split /\t/, $_; print join("\t", @F[@cols]), "\n"} warn "\nJoined columns ", join(", ", @cols), " for $. lines\n\n"' A_B_A.recip > RBHB.out

Formatting and Filtering FASTA Files

If you change FASTA files to tabular format, you can use all of the tools that filter and format tabular data, and change back to FASTA format at the end if desired.

Remove duplicate sequences in FASTA

Some FASTA files (e.g., ESTs) have sequences with different IDs that nonetheless have the same sequence. This protocol removes duplicate sequences based on the nucleotide / amino acid sequence, rather than ID.
Tool NameInput filesOutput
change_fasta_to_tabdup.fastadup.tab
choose_unique_lines_by_coldup.tabunique.tab
change_tab_to_fastaunique.tabunique.fasta
perl -e '$count=0; $len=0; while(<>) {s/\r?\n//; s/\t/ /g; if (s/^>//) { if ($. != 1) {print "\n"} s/ |$/\t/; $count++; $_ .= "\t";} else {s/ //g; $len += length($_)} print $_;} print "\n"; warn "\nConverted $count FASTA records in $. lines to tabular format\nTotal sequence length: $len\n\n";'dup.fasta > dup.tab
perl -e '$column = 2; $unique=0; while(<>) {s/\r?\n//; @F=split /\t/, $_; if (! ($save{$F[$column]}++)) {print "$_\n"; $unique++}} warn "\nChose $unique unique lines out of $. total lines.\nRemoved duplicates in column $column.\n\n"' dup.tab > unique.tab
perl -e '$len=0; while(<>) {s/\r?\n//; @F=split /\t/, $_; print ">$F[0]"; if (length($F[1])) {print " $F[1]"} print "\n"; $s=$F[2]; $len+= length($s); $s=~s/.{60}(?=.)/$&\n/g; print "$s\n";} warn "\nConverted $. tab-delimited lines to FASTA format\nTotal sequence length: $len\n\n";' unique.tab > unique.fasta

Remove short sequences in FASTA

Some FASTA files have very short sequences (e.g., with N's or repeat elements removed), that will not provide meaningful results when running BLAST or other analyses. This protocol removes sequences whose lengths are less than 30 nucleotides / amino acids. (In order to change the length, cut and paste the whole protocol into an editor and change the $limit parameter in the second tool.)
Tool NameInput filesOutput
change_fasta_to_tablong_short.fastalong_short.tab
calc_col_lengthlong_short.tabls_length.tab
choose_lines_col_more_than_limitls_length.tabonly_long.tab
choose_colsonly_long.tabonly_long_three_cols.tab
change_tab_to_fastaonly_long_three_cols.tabonly_long.fasta
perl -e '$count=0; $len=0; while(<>) {s/\r?\n//; s/\t/ /g; if (s/^>//) { if ($. != 1) {print "\n"} s/ |$/\t/; $count++; $_ .= "\t";} else {s/ //g; $len += length($_)} print $_;} print "\n"; warn "\nConverted $count FASTA records in $. lines to tabular format\nTotal sequence length: $len\n\n";'long_short.fasta > long_short.tab
perl -e '$col = 2;' -e 'while (<>) { s/\r?\n//; @F = split /\t/, $_; $len = length($F[$col]); print "$_\t$len\n" } warn "\nAdded column with length of column $col for $. lines.\n\n";' long_short.tab > ls_length.tab
perl -e '$col=3; $limit=30; while(<>) {BEGIN {$count=0} s/\r?\n//; @F=split /\t/, $_; if ($F[$col] > $limit) {$count++; print "$_\n"}} warn "\nChose $count lines out of $..\n\n"' ls_length.tab > only_long.tab
perl -e '@cols=(0, 1, 2); while(<>) {s/\r?\n//; @F=split /\t/, $_; print join("\t", @F[@cols]), "\n"} warn "\nJoined columns ", join(", ", @cols), " for $. lines\n\n"' only_long.tab > only_long_three_cols.tab
perl -e '$len=0; while(<>) {s/\r?\n//; @F=split /\t/, $_; print ">$F[0]"; if (length($F[1])) {print " $F[1]"} print "\n"; $s=$F[2]; $len+= length($s); $s=~s/.{60}(?=.)/$&\n/g; print "$s\n";} warn "\nConverted $. tab-delimited lines to FASTA format\nTotal sequence length: $len\n\n";'only_long_three_cols.tab > only_long.fasta

Change FASTA files to have sixty characters per line of sequence

FASTA files may have entire sequences (hundreds or thousands of characters) in a single line. A few programs cannot except these long sequence lines. To change a FASTA file to have only 60 characters per line, all you need to do is translate the FASTA file to tabular format and back.
Tool NameInput filesOutput
change_fasta_to_tablong_seq.fastalong_seq.tab
change_tab_to_fastalong_seq.tabshort_seq.fasta
perl -e '$count=0; $len=0; while(<>) {s/\r?\n//; s/\t/ /g; if (s/^>//) { if ($. != 1) {print "\n"} s/ |$/\t/; $count++; $_ .= "\t";} else {s/ //g; $len += length($_)} print $_;} print "\n"; warn "\nConverted $count FASTA records in $. lines to tabular format\nTotal sequence length: $len\n\n";'long_seq.fasta > long_seq.tab
perl -e '$len=0; while(<>) {s/\r?\n//; @F=split /\t/, $_; print ">$F[0]"; if (length($F[1])) {print " $F[1]"} print "\n"; $s=$F[2]; $len+= length($s); $s=~s/.{60}(?=.)/$&\n/g; print "$s\n";} warn "\nConverted $. tab-delimited lines to FASTA format\nTotal sequence length: $len\n\n";' long_seq.tab > short_seq.fasta

TODO: Choose FASTA sequences containing an exact subsequence

Note: If the ID or description column happen to contain the sequence, then choose_lines_matching_text will find them. This shouldn't happen very often. It would be nicer if you could search within a column, though

No comments:

Post a Comment