Calculate simple statistics
The tools in this section calculate very simple statistics about the data (often tabular data) in given input files.
To use a script, cut and paste the code from the light green or blue box into a terminal window, change the bold, red text as needed, and hit Enter.
See More Information for notes on using these tools.
Calculate statistics about a column
Calculate sum of the nth column of tabular data (calc_col_sum)
(This tool should not be confused with calc_row_sum, which calculates the sum of all columns for each row.)
Example: Sum the second column of file.tab by running the above script.

| Input file (file.tab) | Screen output |
|---|---|
| Fly 7<br>Human 14<br>Worm 28<br>Yeast 35 | Sum of column 1 for 4 lines 84 |
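
For readers who want to see the logic spelled out, here is a minimal Python sketch of the same calculation. It is not the tool's own code; the function name `col_sum` and its parameters are hypothetical, and it assumes tab-separated input with zero-based column numbers (so column 1 is the second column, as in the example above).

```python
def col_sum(lines, col=1, sep="\t"):
    """Return (total, line_count) for one zero-based column of tabular lines."""
    total = 0.0
    count = 0
    for line in lines:
        fields = line.rstrip("\n").split(sep)
        total += float(fields[col])  # assumes every line has a numeric value here
        count += 1
    return total, count


# Same data as file.tab in the example, with tabs between columns.
rows = ["Fly\t7", "Human\t14", "Worm\t28", "Yeast\t35"]
total, n = col_sum(rows, col=1)
print(f"Sum of column 1 for {n} lines {total:g}")
```

A line with a missing or non-numeric value in the chosen column would raise an error in this sketch; a production script would need to decide whether to skip or report such lines.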
Calculate length of a given column on each line (calc_col_length)
For a given column of a tab-separated file, calculate the length of the text in that column. For each line, add a column at the end of the line with the length of the chosen column.
Example: Take a FASTA file seqs.tab that we've converted to tabular format using change_fasta_to_tab. Calculate the length of column 2 (the third column), which has the sequence in it, by running the above script. Create a new, 4-column file, seqs_length.tab.

| Input file (seqs.tab) | SEQ1 First seq ACTGACTG<br>SEQ2 Second seq ACTG<br>SEQ3 Third seq<br>SEQ4 Third seq ACTGACTGACTG |
|---|---|
| Output file (seqs_length.tab) | SEQ1 First seq ACTGACTG 8<br>SEQ2 Second seq ACTG 4<br>SEQ3 Third seq 0<br>SEQ4 Third seq ACTGACTGACTG 12 |
| Screen Output | Added column with length of column 2 for 4 lines. |
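
The column-length step can be sketched in Python as follows. This is an illustrative equivalent rather than the script itself: `add_col_length` is a hypothetical name, columns are assumed to be zero-based and tab-separated, and a missing column is treated as empty text of length 0.

```python
def add_col_length(lines, col=2, sep="\t"):
    """Append the text length of one zero-based column to each line."""
    out = []
    for line in lines:
        # Strip only the newline so that an empty trailing field is preserved.
        fields = line.rstrip("\n").split(sep)
        text = fields[col] if col < len(fields) else ""
        out.append(sep.join(fields + [str(len(text))]))
    return out


# Same data as seqs.tab in the example; SEQ3 has an empty sequence field.
rows = [
    "SEQ1\tFirst seq\tACTGACTG",
    "SEQ2\tSecond seq\tACTG",
    "SEQ3\tThird seq\t",
    "SEQ4\tThird seq\tACTGACTGACTG",
]
result = add_col_length(rows, col=2)
for line in result:
    print(line)
print(f"Added column with length of column 2 for {len(result)} lines.")
```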
Calculate statistics about each row/line
Insert line numbers (calc_line_numbers)
For each line in a file, print the line number followed by a separator (by default, a tab, represented by \t), and then the rest of the line.
Example: Add line numbers to a list of genes gene_list.txt to generate a numbered list numbered_gene_list.txt by running the above script.

| Input file (gene_list.txt) | Hsp90 Heat shock protein<br>apo1 apoptosis-related protein<br>glu7 Glucose metabolism |
|---|---|
| Output file (numbered_gene_list.txt) | 1 Hsp90 Heat shock protein<br>2 apo1 apoptosis-related protein<br>3 glu7 Glucose metabolism |
| Screen Output | Inserted line numbers for 3 lines, with separator |
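
As a rough Python illustration of the numbering described above (not the tool's own code), the hypothetical function `number_lines` below prefixes each line with its 1-based line number and a separator, defaulting to a tab.

```python
def number_lines(lines, sep="\t"):
    """Prefix each line with its 1-based line number and a separator."""
    return [f"{i}{sep}{line}" for i, line in enumerate(lines, start=1)]


# Same data as gene_list.txt in the example.
genes = [
    "Hsp90\tHeat shock protein",
    "apo1\tapoptosis-related protein",
    "glu7\tGlucose metabolism",
]
numbered = number_lines(genes)
for line in numbered:
    print(line)
```

Passing a different `sep`, such as `": "`, changes only the text placed between the number and the rest of the line.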
Calculate sum of two or more columns for each row (calc_row_sum)
Calculate how many times each value appears in a given column (calc_repeats_for_each_value_in_col)
For a given column of a tab-separated file, count how many times each value appears in that column. Each line of the output will have a value, a tab, and the number of times it appears. Values will be printed in the order of their first appearance.
Example: Given a list of genes with associated GO terms gene_go.txt, make a new file go_repeats.txt showing how many times each GO term is found by running the above script. This could be used to find whether a certain biological process is over-represented in a list of genes, for example. (Note: changing the column to 0 would calculate how many GO terms each gene was associated with.)

| Input file (gene_go.txt) | Hsp90 GO:00171<br>apo1 GO:00012<br>apo1 GO:00233<br>apo1 GO:01234<br>glu7 GO:00012<br>glu7 GO:56785 |
|---|---|
| Output file (go_repeats.txt) | GO:00171 1<br>GO:00012 2<br>GO:00233 1<br>GO:01234 1<br>GO:56785 1 |
| Screen Output | Printed number of occurrences for 5 values in 6 lines. |
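
The counting logic can be sketched in Python with a plain dictionary, which keeps keys in insertion order (Python 3.7 and later) and so reproduces the "order of first appearance" behavior. This is an illustrative sketch, not the script itself; `count_values` is a hypothetical name and the input is assumed to be tab-separated with zero-based columns.

```python
def count_values(lines, col=1, sep="\t"):
    """Count occurrences of each value in one zero-based column.

    The returned dict lists values in order of first appearance.
    """
    counts = {}
    for line in lines:
        value = line.rstrip("\n").split(sep)[col]
        counts[value] = counts.get(value, 0) + 1
    return counts


# Same data as gene_go.txt in the example.
rows = [
    "Hsp90\tGO:00171",
    "apo1\tGO:00012",
    "apo1\tGO:00233",
    "apo1\tGO:01234",
    "glu7\tGO:00012",
    "glu7\tGO:56785",
]
go_counts = count_values(rows, col=1)
for value, n in go_counts.items():
    print(f"{value}\t{n}")
print(f"Printed number of occurrences for {len(go_counts)} values in {len(rows)} lines.")

# Column 0 instead gives the number of GO terms per gene.
gene_counts = count_values(rows, col=0)
```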
Calculate sum of values in a column for each value in another column (calc_sum_of_col_for_groups_of_lines)
Find sets of rows that have the same value in column m. Then get the sum of column n for those sets of rows. Each line of the output will have a value, a tab, and the sum for that value. Values will be printed in the order of their first appearance.
Example: Calculate the sum of the length of the exons for each gene.

| Input file (exon_length.tab) | Hsp90 exon1 300<br>Hsp90 exon2 100<br>Hsp90 exon3 250<br>apo1 exon1 100<br>apo1 exon2 350 |
|---|---|
| Output file (gene_length.tab) | Hsp90 650<br>apo1 450 |
| Screen Output | Printed sum of column 2 for each value in column 0<br>Found 2 values in 5 lines |
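
A grouped sum of this kind can be sketched in Python as below. It is an illustration under stated assumptions, not the tool's code: `sum_by_group` is a hypothetical name, the grouping column m is `group_col`, the summed column n is `value_col`, and both are zero-based columns of tab-separated input.

```python
def sum_by_group(lines, group_col=0, value_col=2, sep="\t"):
    """Sum value_col separately for each distinct value of group_col.

    The returned dict lists groups in order of first appearance.
    """
    sums = {}
    for line in lines:
        fields = line.rstrip("\n").split(sep)
        key = fields[group_col]
        sums[key] = sums.get(key, 0.0) + float(fields[value_col])
    return sums


# Same data as exon_length.tab in the example.
rows = [
    "Hsp90\texon1\t300",
    "Hsp90\texon2\t100",
    "Hsp90\texon3\t250",
    "apo1\texon1\t100",
    "apo1\texon2\t350",
]
gene_lengths = sum_by_group(rows, group_col=0, value_col=2)
for gene, length in gene_lengths.items():
    print(f"{gene}\t{length:g}")
```

Rows with the same value do not need to be adjacent in this sketch, because sums are accumulated per key rather than per run of consecutive lines.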
Count lines or records in a file
Count lines in a file (calc_num_lines)
Simply give a count of the number of lines in a file. (The result is printed to an output file as well as the screen.)
Example: Count how many genes are in file gene_list.txt by running the above script.

| Input file (gene_list.txt) | Hsp90 Heat shock protein<br>apo1 apoptosis-related protein<br>glu7 Glucose metabolism |
|---|---|
| Output file (gene_count.txt) | Counted 3 lines |
| Screen Output | Counted 3 lines |
UNIX/Mac users: also check out the wc command.
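
For comparison, a line count takes only a few lines of Python. The sketch below is illustrative and not the tool's own code; `count_lines` is a hypothetical name, and an in-memory `io.StringIO` stands in for an open file so the example is self-contained.

```python
import io


def count_lines(handle):
    """Count the lines in an open text file (or any iterable of lines)."""
    return sum(1 for _ in handle)


# Same content as gene_list.txt in the example.
text = (
    "Hsp90\tHeat shock protein\n"
    "apo1\tapoptosis-related protein\n"
    "glu7\tGlucose metabolism\n"
)
n = count_lines(io.StringIO(text))
print(f"Counted {n} lines")
```

One difference worth knowing: `wc -l` counts newline characters, so a final line that lacks a trailing newline is not counted by `wc -l` but is counted by this sketch.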
Count records in a FASTA file (calc_num_fasta_records)
Counts the number of records (and, for convenience, total sequence length) in a FASTA file. (The result is printed to an output file as well as the screen.)

| Input file (seqs.fasta) | >CG123 A small sequence<br>ACGTTGCA<br>GTTACCAG<br>>EG12<br>ACCGGA<br>>DG124 A smaller sequence<br>GTTACCAG |
|---|---|
| Output file (seqs_count.txt) | Read 3 FASTA records in 7 lines. Total sequence length: 30 |
| Screen Output | Read 3 FASTA records in 7 lines. Total sequence length: 30 |
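
The FASTA count can be sketched in Python as follows. This is an illustrative equivalent, not the script itself: `fasta_stats` is a hypothetical name, a record is assumed to start at each line beginning with `>`, and every other line is assumed to be sequence text whose length (after trimming whitespace) is added to the total.

```python
def fasta_stats(lines):
    """Return (records, line_count, total_sequence_length) for FASTA lines."""
    records = 0
    n_lines = 0
    total_len = 0
    for line in lines:
        n_lines += 1
        line = line.strip()
        if line.startswith(">"):
            records += 1            # header line starts a new record
        else:
            total_len += len(line)  # sequence line (blank lines add 0)
    return records, n_lines, total_len


# Same content as seqs.fasta in the example.
fasta = [
    ">CG123 A small sequence",
    "ACGTTGCA",
    "GTTACCAG",
    ">EG12",
    "ACCGGA",
    ">DG124 A smaller sequence",
    "GTTACCAG",
]
records, n_lines, total_len = fasta_stats(fasta)
print(f"Read {records} FASTA records in {n_lines} lines. Total sequence length: {total_len}")
```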