Calculate simple statistics
The tools in this section calculate very simple statistics about the data (often tabular data) in given input files.
To use a script, copy the code from the light green or blue box, paste it into a terminal window, change the bold, red text as needed, and press Enter.
See More Information for notes on using these tools.
Calculate statistics about a column
Calculate sum of the nth column of tabular data (calc_col_sum)
(This tool should not be confused with calc_row_sum, which calculates the sum of all columns for each row.)
Example: Sum the second column (column 1, since columns are counted from 0) of file.tab by running the above script.

Input file (file.tab):

    Fly	7
    Human	14
    Worm	28
    Yeast	35

Screen output:

    Sum of column 1 for 4 lines 84
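The calc_col_sum script itself is not reproduced in this text, so the following is only a minimal Python sketch of the same idea, not the original tool. The function name `sum_column` and its parameters are hypothetical; columns are counted from 0, as in the example.

```python
def sum_column(lines, col=1, sep="\t"):
    """Return (total, line_count) for the zero-based column `col`.

    Hypothetical helper: assumes every non-blank line has a numeric
    value in the chosen column.
    """
    total = 0.0
    count = 0
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # ignore blank lines
        fields = line.split(sep)
        total += float(fields[col])
        count += 1
    return total, count


# Same data as file.tab in the example.
rows = ["Fly\t7", "Human\t14", "Worm\t28", "Yeast\t35"]
total, count = sum_column(rows, col=1)
print(f"Sum of column 1 for {count} lines: {total:g}")
```

To run it on a real file, pass an open file handle instead of the list, for example `sum_column(open("file.tab"), col=1)`.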
Calculate length of a given column on each line (calc_col_length)
For a given column of a tab-separated file, calculate the length of the text in that column. For each line, add a column at the end of the line with the length of the chosen column.
Example: Take a FASTA file, seqs.tab, that we've converted to tabular format using change_fasta_to_tab. Calculate the length of column 2 (the third column), which holds the sequence, by running the above script. This creates a new 4-column file, seqs_length.tab.

Input file (seqs.tab):

    SEQ1	First seq	ACTGACTG
    SEQ2	Second seq	ACTG
    SEQ3	Third seq	
    SEQ4	Third seq	ACTGACTGACTG

Output file (seqs_length.tab):

    SEQ1	First seq	ACTGACTG	8
    SEQ2	Second seq	ACTG	4
    SEQ3	Third seq		0
    SEQ4	Third seq	ACTGACTGACTG	12

Screen output:

    Added column with length of column 2 for 4 lines.
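As a rough illustration of what calc_col_length does, here is a short Python sketch. It is not the original script; `add_column_length` is a hypothetical name, and treating a missing column as length 0 is an assumption.

```python
def add_column_length(lines, col=2, sep="\t"):
    """Append the text length of zero-based column `col` to each line.

    Hypothetical helper: a missing or empty column counts as length 0.
    """
    result = []
    for line in lines:
        line = line.rstrip("\n")
        fields = line.split(sep)
        length = len(fields[col]) if col < len(fields) else 0
        result.append(f"{line}{sep}{length}")
    return result


# Same data as seqs.tab in the example (SEQ3 has an empty sequence).
rows = [
    "SEQ1\tFirst seq\tACTGACTG",
    "SEQ2\tSecond seq\tACTG",
    "SEQ3\tThird seq\t",
    "SEQ4\tThird seq\tACTGACTGACTG",
]
for out_line in add_column_length(rows, col=2):
    print(out_line)
```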
Calculate statistics about each row/line
Insert line numbers (calc_line_numbers)
For each line in a file, print the line number followed by a separator (by default, a tab, represented by \t), and then the rest of the line.
Example: Add line numbers to a list of genes, gene_list.txt, to generate a numbered list, numbered_gene_list.txt, by running the above script.

Input file (gene_list.txt):

    Hsp90	Heat shock protein
    apo1	apoptosis-related protein
    glu7	Glucose metabolism

Output file (numbered_gene_list.txt):

    1	Hsp90	Heat shock protein
    2	apo1	apoptosis-related protein
    3	glu7	Glucose metabolism

Screen output:

    Inserted line numbers for 3 lines, with separator
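The numbering step can be sketched in a few lines of Python. This is an illustration only, not the calc_line_numbers script; the function name `number_lines` and the `start` parameter are assumptions.

```python
def number_lines(lines, sep="\t", start=1):
    """Prefix each line with its line number and a separator (tab by default)."""
    numbered = []
    for number, line in enumerate(lines, start=start):
        text = line.rstrip("\n")
        numbered.append(f"{number}{sep}{text}")
    return numbered


# Same data as gene_list.txt in the example.
genes = [
    "Hsp90\tHeat shock protein",
    "apo1\tapoptosis-related protein",
    "glu7\tGlucose metabolism",
]
for out_line in number_lines(genes):
    print(out_line)
```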
Calculate sum of two or more columns for each row (calc_row_sum)
Calculate how many times each value appears in a given column (calc_repeats_for_each_value_in_col)
For a given column of a tab-separated file, count how many times each value appears in that column. Each line of the output will have a value, a tab, and the number of times it appears. Values will be printed in the order of their first appearance.
Example: Given a list of genes with associated GO terms, gene_go.txt, make a new file, go_repeats.txt, showing how many times each GO term is found by running the above script. This could be used to find whether a certain biological process is over-represented in a list of genes, for example. (Note: changing the column to 0 would calculate how many GO terms each gene is associated with.)

Input file (gene_go.txt):

    Hsp90	GO:00171
    apo1	GO:00012
    apo1	GO:00233
    apo1	GO:01234
    glu7	GO:00012
    glu7	GO:56785

Output file (go_repeats.txt):

    GO:00171	1
    GO:00012	2
    GO:00233	1
    GO:01234	1
    GO:56785	1

Screen output:

    Printed number of occurrences for 5 values in 6 lines.
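Counting values in order of first appearance can be sketched in Python with a plain dictionary, which preserves insertion order (Python 3.7 and later). This is not the original script; `count_values` is a hypothetical name.

```python
def count_values(lines, col=1, sep="\t"):
    """Count occurrences of each value in zero-based column `col`.

    The returned dict keeps values in order of first appearance.
    """
    counts = {}
    for line in lines:
        value = line.rstrip("\n").split(sep)[col]
        counts[value] = counts.get(value, 0) + 1
    return counts


# Same data as gene_go.txt in the example.
rows = [
    "Hsp90\tGO:00171",
    "apo1\tGO:00012",
    "apo1\tGO:00233",
    "apo1\tGO:01234",
    "glu7\tGO:00012",
    "glu7\tGO:56785",
]
for value, n in count_values(rows, col=1).items():
    print(f"{value}\t{n}")
```

Calling `count_values(rows, col=0)` instead gives the number of GO terms per gene, as the note in the example describes.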
Calculate sum of values in a column for each value in another column (calc_sum_of_col_for_groups_of_lines)
Find sets of rows that have the same value in column m. Then get the sum of column n for those sets of rows. Each line of the output will have a value, a tab, and the sum for that value. Values will be printed in the order of their first appearance.
Example: Calculate the sum of the length of the exons for each gene.

Input file (exon_length.tab):

    Hsp90	exon1	300
    Hsp90	exon2	100
    Hsp90	exon3	250
    apo1	exon1	100
    apo1	exon2	350

Output file (gene_length.tab):

    Hsp90	650
    apo1	450

Screen output:

    Printed sum of column 2 for each value in column 0
    Found 2 values in 5 lines
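The grouping described above (same value in column m, sum of column n) could look like this in Python. It is a sketch under assumptions, not the original tool: `sum_by_group`, `group_col`, and `value_col` are hypothetical names, and the summed column is assumed to be numeric.

```python
def sum_by_group(lines, group_col=0, value_col=2, sep="\t"):
    """Sum `value_col` over all lines sharing the same `group_col` value.

    The returned dict keeps groups in order of first appearance.
    """
    sums = {}
    for line in lines:
        fields = line.rstrip("\n").split(sep)
        key = fields[group_col]
        sums[key] = sums.get(key, 0.0) + float(fields[value_col])
    return sums


# Same data as exon_length.tab in the example.
rows = [
    "Hsp90\texon1\t300",
    "Hsp90\texon2\t100",
    "Hsp90\texon3\t250",
    "apo1\texon1\t100",
    "apo1\texon2\t350",
]
for gene, total in sum_by_group(rows).items():
    print(f"{gene}\t{total:g}")
```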
Count lines or records in a file
Count lines in a file (calc_num_lines)
Simply give a count of the number of lines in a file. (The result is printed to an output file as well as the screen.)
Example: Count how many genes are in the file gene_list.txt by running the above script.

Input file (gene_list.txt):

    Hsp90	Heat shock protein
    apo1	apoptosis-related protein
    glu7	Glucose metabolism

Output file (gene_count.txt):

    Counted 3 lines

Screen output:

    Counted 3 lines
UNIX/Mac users: also check out the wc command.
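For comparison, a line count takes only a few lines of Python. This is an illustrative sketch, not the calc_num_lines script; `count_lines` is a hypothetical name, and an in-memory `io.StringIO` object stands in for the file so the example runs without touching the disk.

```python
import io


def count_lines(handle):
    """Count the lines in any iterable of lines, such as an open file."""
    return sum(1 for _ in handle)


# In-memory stand-in for gene_list.txt from the example.
gene_list = io.StringIO(
    "Hsp90\tHeat shock protein\n"
    "apo1\tapoptosis-related protein\n"
    "glu7\tGlucose metabolism\n"
)
print(f"Counted {count_lines(gene_list)} lines")
```

With a real file, `count_lines(open("gene_list.txt"))` works the same way.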
Count records in a FASTA file (calc_num_fasta_records)
Counts the number of records (and, for convenience, total sequence length) in a FASTA file. (The result is printed to an output file as well as the screen.)

Input file (seqs.fasta):

    >CG123 A small sequence
    ACGTTGCA
    GTTACCAG
    >EG12
    ACCGGA
    >DG124 A smaller sequence
    GTTACCAG

Output file (seqs_count.txt):

    Read 3 FASTA records in 7 lines. Total sequence length: 30

Screen output:

    Read 3 FASTA records in 7 lines. Total sequence length: 30
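The counting logic can be sketched in Python as follows. This is not the calc_num_fasta_records script; `count_fasta_records` is a hypothetical name, and the sketch assumes that every line starting with `>` is a header and every other line is sequence.

```python
def count_fasta_records(lines):
    """Return (records, line_count, total_sequence_length) for FASTA lines.

    Assumption: lines starting with '>' are headers; all other lines are
    sequence, with surrounding whitespace ignored.
    """
    records = 0
    line_count = 0
    total_length = 0
    for line in lines:
        line_count += 1
        text = line.strip()
        if text.startswith(">"):
            records += 1
        else:
            total_length += len(text)
    return records, line_count, total_length


# Same data as seqs.fasta in the example.
fasta = [
    ">CG123 A small sequence",
    "ACGTTGCA",
    "GTTACCAG",
    ">EG12",
    "ACCGGA",
    ">DG124 A smaller sequence",
    "GTTACCAG",
]
records, line_count, total_length = count_fasta_records(fasta)
print(f"Read {records} FASTA records in {line_count} lines.")
print(f"Total sequence length: {total_length}")
```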