Tuesday, November 13, 2012

  Perl -e


Calculate simple statistics

The tools in this section calculate very simple statistics about the data (often tabular data) in given input files.
To use a script, cut and paste the code from the light green or blue box into a terminal window, change the bold, red text as needed, and hit Enter.
See More Information for notes on using these tools.

Calculate statistics about a column

Calculate sum of the nth column of tabular data (calc_col_sum)

(This tool should not be confused with calc_row_sum, which calculates the sum of two or more chosen columns for each row.)
$col: Column to sum (counting from 0, so 1 is the second column)
Input file(s)
perl -e ' $col=1; while(<>) { s/\r?\n//; @F=split /\t/, $_; $sum += $F[$col]; } warn "\nSum of column $col for $. lines\n\n"; print "$sum\n" ' file.tab
Example: Sum second column of file.tab by running the above script.
Input file
(file.tab)
 Fly    7
 Human  14
 Worm   28
 Yeast  35
Screen Output
 Sum of column 1 for 4 lines
 
 84

Calculate length of a given column on each line (calc_col_length)

For a given column of a tab-separated file, calculate the length of the text in that column. For each line, add a column at the end of the line with the length of the chosen column.
$col: Column to measure length of (counting from 0)
Input file(s)
Output file
perl -e ' $col=2; while (<>) { s/\r?\n//; @F = split /\t/, $_; $len = length($F[$col]); print "$_\t$len\n" } warn "\nAdded column with length of column $col for $. lines.\n\n"; ' seqs.tab > seqs_length.tab
Example: Take a FASTA file seqs.tab that we've converted to tabular format using change_fasta_to_tab. Calculate the length of column 2 (the third column), which has the sequence in it, by running the above script. Create a new, 4-column, file seqs_length.tab.
Input file
(seqs.tab)
 SEQ1   First seq       ACTGACTG
 SEQ2   Second seq      ACTG
 SEQ3   Third seq       
 SEQ4   Third seq       ACTGACTGACTG
Output file
(seqs_length.tab)
 SEQ1   First seq       ACTGACTG        8
 SEQ2   Second seq      ACTG    4
 SEQ3   Third seq               0
 SEQ4   Third seq       ACTGACTGACTG    12
Screen Output
 Added column with length of column 2 for 4 lines.

Calculate statistics about each row/line

Insert line numbers (calc_line_numbers)

For each line in a file, print the line number followed by a separator (by default, a tab, represented by \t), and then the rest of the line.
$separator: What to print between the line number and the rest of the line (\t means tab)
Input file(s)
Output file
perl -e ' $separator="\t"; while (<>) { print "$.$separator$_" } warn "\nInserted line numbers for $. lines, with separator $separator.\n\n" ' gene_list.txt > numbered_gene_list.txt
Example: Add line numbers to a list of genes gene_list.txt to generate a numbered list numbered_gene_list.txt by running the above script.
Input file
(gene_list.txt)
 Hsp90  Heat shock protein
 apo1   apoptosis-related protein
 glu7   Glucose metabolism
Output file
(numbered_gene_list.txt)
 1      Hsp90   Heat shock protein
 2      apo1    apoptosis-related protein
 3      glu7    Glucose metabolism
Screen Output
 Inserted line numbers for 3 lines, with separator 


Calculate sum of two or more columns for each row (calc_row_sum)

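For each line of a tab-separated file, add up the values in the chosen columns and add a column at the end of the line with the sum.
@cols: Which column(s) to add (counting from 0)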
Input file(s)
Output file
perl -e ' @cols=(1, 2, 3); while(<>) { s/\r?\n//; @F=split /\t/, $_; $sum = 0; foreach $col (@cols) { $sum += $F[$col] }; print "$_\t$sum\n"; } warn "\nSum of columns @cols for each line ($. lines)\n\n" ' in.tab > out.tab

Calculate how many times each value appears in a given column (calc_repeats_for_each_value_in_col)

For a given column of a tab-separated file, count how many times each value appears in that column. Each line of the output will have a value, a tab, and the number of times it appears. Values will be printed in the order of their first appearance.
$col: Column to count repeats in (counting from 0)
Input file(s)
Output file
perl -e ' $col=1; while (<>) { s/\r?\n//; @F = split /\t/, $_; $val = $F[$col]; if (! exists $count{$val}) { push @order, $val } $count{$val}++; } foreach $val (@order) { print "$val\t$count{$val}\n" } warn "\nPrinted number of occurrences for ", scalar(@order), " values in $. lines.\n\n"; ' gene_go.txt > go_repeats.txt
Example: Given a list of genes with associated GO terms gene_go.txt, make a new file go_repeats.txt showing how many times each GO term is found by running the above script. This could be used to find whether a certain biological process is over-represented in a list of genes, for example. (Note: changing the column to 0 would calculate how many GO terms each gene was associated with.)
Input file
(gene_go.txt)
 Hsp90  GO:00171
 apo1   GO:00012
 apo1   GO:00233
 apo1   GO:01234
 glu7   GO:00012
 glu7   GO:56785
Output file
(go_repeats.txt)
 GO:00171       1
 GO:00012       2
 GO:00233       1
 GO:01234       1
 GO:56785       1
Screen Output
 Printed number of occurrences for 5 values in 6 lines.

Calculate sum of values in a column for each value in another column (calc_sum_of_col_for_groups_of_lines)

Find sets of rows that have the same value in column m. Then get the sum of column n for those sets of rows. Each line of the output will have a value, a tab, and the sum for that value. Values will be printed in the order of their first appearance.
$value_col: Column to determine grouping of lines (counting from 0)
$sum_col: Column to sum for sets of similar lines (counting from 0)
Input file(s)
Output file
perl -e ' $value_col=0; $sum_col=2; while (<>) { s/\r?\n//; @F = split /\t/, $_; $val = $F[$value_col]; if (! exists $sum{$val}) { push @order, $val } $sum{$val} += $F[$sum_col]; } foreach $val (@order) { print "$val\t$sum{$val}\n" } warn "\nPrinted sum of column $sum_col for each value in column $value_col\nFound ", scalar(@order), " values in $. lines\n\n"; ' exon_length.tab > gene_length.tab
Example: Calculate the sum of the length of the exons for each gene.
Input file
(exon_length.tab)
 Hsp90  exon1   300
 Hsp90  exon2   100
 Hsp90  exon3   250
 apo1   exon1   100
 apo1   exon2   350
Output file
(gene_length.tab)
 Hsp90  650
 apo1   450
Screen Output
 Printed sum of column 2 for each value in column 0
 Found 2 values in 5 lines

Count lines or records in a file

Count lines in a file (calc_num_lines)

Simply give a count of the number of lines in a file. (The result is printed to an output file as well as the screen.)
Input file(s)
Output file
perl -e ' while (<>) { } print "Counted $. lines\n"; warn "\nCounted $. lines\n\n" ' gene_list.txt > gene_count.txt
Example: Count how many genes are in file gene_list.txt by running the above script.
Input file
(gene_list.txt)
 Hsp90  Heat shock protein
 apo1   apoptosis-related protein
 glu7   Glucose metabolism
Output file
(gene_count.txt)
 Counted 3 lines
Screen Output
 Counted 3 lines
UNIX/Mac users: also check out the wc command.

Count records in a FASTA file (calc_num_fasta_records)

Counts the number of records (and, for convenience, total sequence length) in a FASTA file. (The result is printed to an output file as well as the screen.)

Input file(s)
Output file


perl -e ' $count=0; $len=0; while(<>) { s/\r?\n//; if (/^>/) { $count++; } else { $len += length($_) } } print "Read $count FASTA records in $. lines. Total sequence length: $len\n"; warn "\nRead $count FASTA records in $. lines. Total sequence length: $len\n\n"; ' seqs.fasta > seqs_count.txt
Example: See how many sequences are in seqs.fasta.
Input file
(seqs.fasta)
 >CG123 A small sequence
 ACGTTGCA
 GTTACCAG
 >EG12
 ACCGGA
 >DG124  A smaller sequence
 GTTACCAG
Output file
(seqs_count.txt)
 Read 3 FASTA records in 7 lines. Total sequence length: 30
Screen Output
 Read 3 FASTA records in 7 lines. Total sequence length: 30
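
UNIX/Mac users who only need the number of records could also try grep, in the same spirit as wc for line counts. This is an alternative sketch rather than part of the tool above: it counts lines starting with ">" and does not report the total sequence length. The commands overwrite any seqs.fasta in the current directory.

```shell
# Recreate the sample FASTA file from the example above.
printf '>CG123 A small sequence\nACGTTGCA\nGTTACCAG\n>EG12\nACCGGA\n>DG124  A smaller sequence\nGTTACCAG\n' > seqs.fasta

# -c prints the number of matching lines; ^> matches lines that start with ">".
grep -c '^>' seqs.fasta
```

For the sample file this should agree with the record count reported by the Perl one-liner.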








