Perl -e

Calculate simple statistics

The tools in this section calculate very simple statistics about the data (often tabular data) in given input files.

To use a script, cut and paste the code from the light green or blue box into a terminal window, change the bold, red text as needed, and hit Enter.

See More Information for notes on using these tools.

Calculate statistics about a column

Calculate sum of the nth column of tabular data (calc_col_sum)

(This tool should not be confused with calc_row_sum, which calculates the sum of all columns for each row.)

Example: Sum the second column of file.tab (column 1, since columns are numbered from 0) by running the above script.

Input file
(file.tab)

 Fly     7
 Human   14
 Worm    28
 Yeast   35

Screen Output

 Sum of column 1 for 4 lines 84

Calculate length of a given column on each line (calc_col_length)

For a given column of a tab-separated file, calculate the length of the text in that column. For each line, add a column at the end of the line with the length of the chosen column.

Example: Take a FASTA file that we've converted to tabular format, seqs.tab, using change_fasta_to_tab. Calculate the length of column 2 (the third column), which holds the sequence, by running the above script. This creates a new 4-column file, seqs_length.tab.

Input file
(seqs.tab)

 SEQ1   First seq       ACTGACTG
 SEQ2   Second seq      ACTG
 SEQ3   Third seq       
 SEQ4   Third seq       ACTGACTGACTG

Output file
(seqs_length.tab)

 SEQ1   First seq       ACTGACTG        8
 SEQ2   Second seq      ACTG    4
 SEQ3   Third seq               0
 SEQ4   Third seq       ACTGACTGACTG    12

Screen Output

 Added column with length of column 2 for 4 lines.

Calculate statistics about each row/line

Insert line numbers (calc_line_numbers)

For each line in a file, print the line number followed by a separator (by default, a tab, represented by \t), and then the rest of the line.

Example: Add line numbers to a list of genes gene_list.txt to generate a numbered list numbered_gene_list.txt by running the above script.

Input file
(gene_list.txt)

 Hsp90  Heat shock protein
 apo1   apoptosis-related protein
 glu7   Glucose metabolism

Output file
(numbered_gene_list.txt)

 1      Hsp90   Heat shock protein
 2      apo1    apoptosis-related protein
 3      glu7    Glucose metabolism

Screen Output

 Inserted line numbers for 3 lines, with separator

Calculate sum of two or more columns for each row (calc_row_sum)

This tool takes tab-separated data. For each row, it calculates the sum of two or more columns. It adds a new, last column containing the sums.

Calculate how many times each value appears in a given column (calc_repeats_for_each_value_in_col)

For a given column of a tab-separated file, count how many times each value appears in that column. Each line of the output will have a value, a tab, and the number of times it appears. Values will be printed in the order of their first appearance.

Example: Given a list of genes with associated GO terms gene_go.txt, make a new file go_repeats.txt showing how many times each GO term is found by running the above script. This could be used to find whether a certain biological process is over-represented in a list of genes, for example. (Note: changing the column to 0 would calculate how many GO terms each gene was associated with.)

Input file
(gene_go.txt)

 Hsp90  GO:00171
 apo1   GO:00012
 apo1   GO:00233
 apo1   GO:01234
 glu7   GO:00012
 glu7   GO:56785

Output file
(go_repeats.txt)

 GO:00171       1
 GO:00012       2
 GO:00233       1
 GO:01234       1
 GO:56785       1

Screen Output

 Printed number of occurrences for 5 values in 6 lines.

Calculate sum of values in a column for each value in another column (calc_sum_of_col_for_groups_of_lines)

Find sets of rows that have the same value in column m. Then get the sum of column n for those sets of rows. Each line of the output will have a value, a tab, and the sum for that value. Values will be printed in the order of their first appearance.

Example: Calculate the sum of the lengths of the exons for each gene in exon_length.tab, writing the per-gene totals to gene_length.tab.

Input file
(exon_length.tab)

 Hsp90  exon1   300
 Hsp90  exon2   100
 Hsp90  exon3   250
 apo1   exon1   100
 apo1   exon2   350

Output file
(gene_length.tab)

 Hsp90  650
 apo1   450

Screen Output

 Printed sum of column 2 for each value in column 0
 Found 2 values in 5 lines

Count lines or records in a file

Count lines in a file (calc_num_lines)

Simply give a count of the number of lines in a file. (The result is printed to an output file as well as the screen.)

Example: Count how many genes are in file gene_list.txt by running the above script.

Input file
(gene_list.txt)

 Hsp90  Heat shock protein
 apo1   apoptosis-related protein
 glu7   Glucose metabolism

Output file
(gene_count.txt)

 Counted 3 lines

Screen Output

 Counted 3 lines

UNIX/Mac users: also check out the wc command.

Count records in a FASTA file (calc_num_fasta_records)

Counts the number of records (and, for convenience, total sequence length) in a FASTA file. (The result is printed to an output file as well as the screen.)

Example: See how many sequences are in seqs.fasta.

Input file
(seqs.fasta)

 >CG123 A small sequence
 ACGTTGCA
 GTTACCAG
 >EG12
 ACCGGA
 >DG124  A smaller sequence
 GTTACCAG

Output file
(seqs_count.txt)

 Read 3 FASTA records in 7 lines. Total sequence length: 30

Screen Output

 Read 3 FASTA records in 7 lines. Total sequence length: 30

Calculate line numbers for each line

Example (calc_row_sum): Sum the second, third, and fourth columns of file.tab (columns 1, 2, and 3, counting from 0) by running the calc_row_sum script. The results go in a new, sixth column.

To calculate the sum of all columns in the row, set @cols=(0 .. $#F).

TODO

Script parameters

calc_col_length

 $col           Column to measure length of
 Input file(s)
 Output file

calc_line_numbers

 $separator     What to print between line number and rest of line (\t means tab)
 Input file(s)
 Output file

calc_repeats_for_each_value_in_col

 $col           Column to count repeats in
 Input file(s)
 Output file

calc_sum_of_col_for_groups_of_lines

 $value_col     Column to determine grouping of lines
 $sum_col       Column to sum for sets of similar lines
 Input file(s)
 Output file

evolution & bioinformatics

Tuesday, November 13, 2012
