Change files
The scripts in this section perform simple transformations of entire files or lines in files.
To use a script, cut and paste the code from the light green or blue box into a terminal window, change the bold, red text as needed, and hit Enter.
See More Information for notes on using these tools.
Change columns in each line of files
Change tabular data with a given separator to tab-separated values (change_any_separator_to_tab)
This tool is important because most of the Scriptome tools require tab-separated data.
Warning: a few weird separators (like ' or ``) might not work. (It might help to put a backslash before it.)
Example: Change comma-separated file.csv to tab-separated file.tab by running the above script.
Input file (file.csv) | Output file (file.tab) | Screen Output |
---|---|---|
Fly,7<br>Human,14<br>Worm,28<br>Yeast,35 | Fly 7<br>Human 14<br>Worm 28<br>Yeast 35 | Changed , to tab on 4 lines |
Example 2: Given a list of Swiss-Prot identifiers, split the protein name and the species abbreviation into two columns. Run the above script using $sep="_".
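To make the transformation concrete, here is a minimal Python sketch of the same idea. It is not the Scriptome script itself: the function name `change_separator_to_tab` is hypothetical, the separator is treated as a literal string, and `CYC_HUMAN` is used only as an illustrative Swiss-Prot-style entry name.

```python
def change_separator_to_tab(lines, sep=","):
    """Replace every occurrence of a literal separator with a tab.

    Returns the converted lines and the number of lines that changed.
    """
    converted = []
    changed = 0
    for line in lines:
        new_line = line.replace(sep, "\t")
        if new_line != line:
            changed += 1
        converted.append(new_line)
    return converted, changed


# The comma-separated example from the table above.
rows = ["Fly,7", "Human,14", "Worm,28", "Yeast,35"]
tab_rows, n_changed = change_separator_to_tab(rows, sep=",")
print("\n".join(tab_rows))
print(f"Changed , to tab on {n_changed} lines")

# Example 2: split an entry name at the underscore.
swiss_rows, _ = change_separator_to_tab(["CYC_HUMAN"], sep="_")
print(swiss_rows[0])
```

Because the separator is handled as a plain string here, no backslash escaping is needed in this sketch; that caveat applies to the original script.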
Change tab-separated data to use a different delimiter (change_tab_to_any_separator)
Replace the tabs in tab-separated data with some other separator. The separator does not have to be one character: "---" would work, for example, or even "" to merge all columns.
This tool is important because most of the Scriptome tools require tab-separated data. After running one or more Scriptome tools, use this script to export data back to other programs which expect comma-separated data, for example.
Warning: a few weird separators (like ' or ``) might not work. Also, if there's a comma in your data, and you change to a comma separator, you'll get too many columns.
Example: Change tab-separated file.tab to comma-separated file.csv by running the above script.
Input file (file.tab) | Output file (file.csv) | Screen Output |
---|---|---|
Fly 7<br>Human 14<br>Worm 28<br>Yeast 35 | Fly,7<br>Human,14<br>Worm,28<br>Yeast,35 | Changed tab to , on 4 lines |
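The reverse direction can be sketched in Python in much the same way. This is an illustration rather than the Scriptome script: both function names are hypothetical, and the second one is only a suggested check for the "comma in your data" problem mentioned in the warning.

```python
def change_tab_to_separator(lines, sep=","):
    """Replace every tab with the given separator (any string, even "")."""
    return [line.replace("\t", sep) for line in lines]


def has_separator_in_data(lines, sep):
    """Return True if any tab-separated field already contains sep.

    If it does, the converted file will appear to have extra columns.
    """
    if not sep:
        return False  # an empty separator cannot create extra columns
    return any(sep in field for line in lines for field in line.split("\t"))


rows = ["Fly\t7", "Human\t14", "Worm\t28", "Yeast\t35"]
print("\n".join(change_tab_to_separator(rows, sep=",")))

# A field that already contains a comma would be split in two after export.
print(has_separator_in_data(["Smith, J\t3"], ","))
```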
Reorder columns
Use choose_cols. Choose some or all of the columns, in whatever order you want. To switch the order of the first two columns of a ten-column file, you could set @cols to be 1, 0, 2..9.
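As a rough illustration of what that column list means, here is a small Python sketch. It is not the choose_cols script: the function name `reorder_columns` is hypothetical, and it assumes zero-based column numbers, as in the `1, 0, 2..9` example.

```python
def reorder_columns(lines, cols):
    """Keep the listed zero-based columns of each tab-separated line, in order."""
    reordered = []
    for line in lines:
        fields = line.split("\t")
        reordered.append("\t".join(fields[i] for i in cols))
    return reordered


# Swap the first two columns of a ten-column line: 1, 0, 2..9
cols = [1, 0] + list(range(2, 10))
line = "\t".join(f"c{i}" for i in range(10))
print(reorder_columns([line], cols)[0])
```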
Change entire lines in files
Remove spaces in a line (change_remove_spaces)
Remove all spaces (but not tabs) from a line.
Example: TODO
See Also: remove empty lines (Choose)
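Since the example for this tool is still to be written, here is a minimal Python sketch of the behavior described above (spaces removed, tabs kept). The function name `remove_spaces` and the sample line are made up for illustration.

```python
def remove_spaces(line):
    """Remove every space character from a line, leaving tabs untouched."""
    return line.replace(" ", "")


# Spaces disappear, but the tab between the two columns survives.
print(repr(remove_spaces("A C G\tT T")))
```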
Change all characters to upper case (change_upper_case)
Change all characters in each line to upper case. Numbers and punctuation will not be changed. (Change "uc" in the script to "lc" to get lower case.)
Example: Change a list of gene names to upper-case, to compare with another list.
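The comparison in that example might look like the following Python sketch. The function name and the gene names are placeholders chosen for illustration, not data from the Scriptome.

```python
def to_upper_case(lines):
    """Upper-case every line; digits and punctuation are unaffected."""
    return [line.upper() for line in lines]


# Two hypothetical gene lists that differ only in capitalization.
list_a = ["brca1", "Tp53", "MYC"]
list_b = ["BRCA1", "tp53", "egfr"]

shared = sorted(set(to_upper_case(list_a)) & set(to_upper_case(list_b)))
print(shared)
```

Swapping `upper()` for `lower()` gives the lower-case variant, mirroring the uc/lc switch in the script.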
Change between different biological data formats
Change a FASTA file into tabular format (change_fasta_to_tab)
Change each FASTA sequence in a file into one line of three, tab-separated columns: the ID (not including the '>'); the rest of the description line (or an empty column if the description line contains only an ID); and the sequence itself.
Once you have run this script, you can use the many Scriptome tools that work on tab-separated data.
Note: converting a FASTA file to tabular format and back will generate a file with the same information, but the two files may not be identical. This tool replaces any tabs with single spaces (otherwise the tabular output file would have too many columns) and removes any spaces from the amino acid or nucleic acid sequences.
Example: Run the above script on seqs.fna to get seqs.tab.
Input file (seqs.fna) | Output file (seqs.tab) | Screen Output |
---|---|---|
>CG123 A small sequence<br>ACGTTGCA<br>GTTACCAG<br>>EG12<br>ACCGGA<br>>DG124 A smaller sequence<br>GTTACCAG | CG123 A small sequence ACGTTGCAGTTACCAG<br>EG12 ACCGGA<br>DG124 A smaller sequence GTTACCAG | Converted 3 FASTA records in 7 lines to tabular format<br>Total sequence length: 30 |
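The conversion described in this section can be sketched in Python as follows. This is an approximation written for illustration, not the Scriptome script: the function name `fasta_to_tab` is hypothetical, and it assumes the ID is everything between the '>' and the first whitespace on the description line.

```python
def fasta_to_tab(lines):
    """Turn FASTA lines into 'ID<TAB>description<TAB>sequence' lines."""
    records = []  # each entry: [id, description, list of sequence pieces]
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith(">"):
            # Tabs in the header would create extra columns, so use spaces.
            header = line[1:].replace("\t", " ")
            parts = header.split(None, 1)
            seq_id = parts[0] if parts else ""
            description = parts[1] if len(parts) > 1 else ""
            records.append([seq_id, description, []])
        elif records and line.strip():
            # Drop any whitespace inside the sequence itself.
            records[-1][2].append("".join(line.split()))
    return ["\t".join([i, d, "".join(s)]) for i, d, s in records]


fasta = [
    ">CG123 A small sequence", "ACGTTGCA", "GTTACCAG",
    ">EG12", "ACCGGA",
    ">DG124 A smaller sequence", "GTTACCAG",
]
table = fasta_to_tab(fasta)
print("\n".join(table))
print("Total sequence length:", sum(len(row.split("\t")[2]) for row in table))
```

A record with no description, such as EG12, gets an empty middle column, so every output line still has three columns.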
Change a tabular file into FASTA format (change_tab_to_fasta)
Change each line in a three-column, tab-separated file (containing ID, description and sequence, e.g., a file created by the above change_fasta_to_tab tool) into a FASTA sequence.
Note: translating to FASTA format and back will generate a file with the same information, but the files may not be identical. This tool will put a single space between the ID and the description, and will put 60 characters per line in the sequence portion.
Example: Run the above script on seqs.tab to get seqs.fasta.
Input file (seqs.tab) | Output file (seqs.fasta) | Screen Output |
---|---|---|
CG123 A small sequence ACGTTGCAGTTACCAG<br>EG12 ACCGGA<br>DG124 A smaller sequence GTTACCAG | >CG123 A small sequence<br>ACGTTGCAGTTACCAG<br>>EG12<br>ACCGGA<br>>DG124 A smaller sequence<br>GTTACCAG | Converted 3 tab-delimited lines to FASTA format<br>Total sequence length: 30 |
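Going the other way might look like this in Python. Again, this is only a sketch under stated assumptions: the function name `tab_to_fasta` is hypothetical, each input line is assumed to have exactly three tab-separated columns, and leaving out the space after the ID when the description is empty is a choice made here, not something the source specifies.

```python
def tab_to_fasta(lines, width=60):
    """Turn 'ID<TAB>description<TAB>sequence' lines into FASTA lines."""
    fasta = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            continue  # skip empty lines
        # Raises ValueError if a line does not have exactly three columns.
        seq_id, description, sequence = line.split("\t")
        header = ">" + seq_id
        if description:
            header += " " + description
        fasta.append(header)
        # Wrap the sequence at `width` characters per line.
        for start in range(0, len(sequence), width):
            fasta.append(sequence[start:start + width])
    return fasta


table = [
    "CG123\tA small sequence\tACGTTGCAGTTACCAG",
    "EG12\t\tACCGGA",
    "DG124\tA smaller sequence\tGTTACCAG",
]
print("\n".join(tab_to_fasta(table)))
```

The 60-character wrapping is why a round trip through the two tools can change line breaks inside a sequence without changing the sequence itself.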
Change from one biological format to another (change_bio_format_to_bio_format)
Change files with one or more sequences into a different format. The input and output formats can be embl, fasta, gcg, genbank, swiss, or a whole bunch of other formats: see The Bioperl SeqIO HOWTO for details.
Warning: Converting from genbank to FASTA (for example) will necessarily lose some annotation information.
This script requires Bioperl to be installed (on whichever machine the script runs on). Many biology computers will have it installed. If the script breaks because it "can't locate Bio/Perl.pm", you can download Bioperl from bioperl.org.
Example: TODO
Change entire files
Transpose a table (change_transpose_table)
Change rows to columns and vice versa, for a tab-separated file. Data should have the same number of columns in every row.
Example: Transpose the table original.tab to get transposed.tab.
Input file (original.tab) | Top Col2 Col3<br>Row2 r2c2 r2c3<br>Row3 r3c2 r3c3<br>Row4 r4c2 r4c3<br>Row5 r5c2 r5c3 |
---|---|
Output file (transposed.tab) | Top Row2 Row3 Row4 Row5<br>Col2 r2c2 r3c2 r4c2 r5c2<br>Col3 r2c3 r3c3 r4c3 r5c3 |
Screen Output | Transposed table: result has 5 columns and 3 rows |
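A transpose of this kind can be sketched in a few lines of Python. This is not the Scriptome script: the function name `transpose_table` is hypothetical, and raising an error on ragged rows is an assumption made here to enforce the "same number of columns in every row" requirement.

```python
def transpose_table(lines):
    """Swap rows and columns of a tab-separated table."""
    rows = [line.split("\t") for line in lines]
    if len({len(row) for row in rows}) > 1:
        raise ValueError("every row must have the same number of columns")
    # zip(*rows) yields the first field of every row, then the second, etc.
    return ["\t".join(column) for column in zip(*rows)]


original = [
    "Top\tCol2\tCol3",
    "Row2\tr2c2\tr2c3",
    "Row3\tr3c2\tr3c3",
    "Row4\tr4c2\tr4c3",
    "Row5\tr5c2\tr5c3",
]
transposed = transpose_table(original)
print("\n".join(transposed))
n_cols = len(transposed[0].split("\t"))
print(f"Transposed table: result has {n_cols} columns and {len(transposed)} rows")
```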
Split big FASTA file into smaller files (change_split_fasta)
Split one big FASTA file into multiple smaller ones. If the output filename template is small_NUMBER.fasta, the output files will be called small_1.fasta, small_2.fasta, etc.
Example: Split big.fasta, with five sequences, into two files, small_1.fasta and small_2.fasta. (Since there are only five sequences, the second file has only two sequences in it.)
Input file (big.fasta) | >seq1<br>ACCTTGTCGCA<br>>seq2<br>ACCTTGTCGCAAAGC<br>>seq3<br>ACCTTGTCGCACCGGAACGA<br>>seq4<br>ACCTTGTCGCACCGGAACGACCGGAACGA<br>>seq5<br>GTCGCA |
---|---|
Output file 1 (small_1.fasta) | >seq1<br>ACCTTGTCGCA<br>>seq2<br>ACCTTGTCGCAAAGC<br>>seq3<br>ACCTTGTCGCACCGGAACGA |
Output file 2 (small_2.fasta) | >seq4<br>ACCTTGTCGCACCGGAACGACCGGAACGA<br>>seq5<br>GTCGCA |
Screen Output | Split 5 FASTA records in 10 lines, with total sequence length 81<br>Created 2 files like small_2.fasta |
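The splitting logic can be sketched in Python like this. It is an illustration only: the function name `split_fasta` and the `per_file` parameter (sequences per output file, three in the example above) are assumptions, and the sketch returns the file contents in a dictionary instead of writing files to disk.

```python
def split_fasta(lines, per_file, template="small_NUMBER.fasta"):
    """Group FASTA lines into chunks of `per_file` records.

    Returns a dict mapping each output filename to its list of lines.
    """
    files = {}
    n_records = 0
    current = None
    for line in lines:
        if line.startswith(">"):
            # Start a new output file every `per_file` records.
            if n_records % per_file == 0:
                number = n_records // per_file + 1
                current = template.replace("NUMBER", str(number))
                files[current] = []
            n_records += 1
        if current is not None:
            files[current].append(line)
    return files


big = [
    ">seq1", "ACCTTGTCGCA",
    ">seq2", "ACCTTGTCGCAAAGC",
    ">seq3", "ACCTTGTCGCACCGGAACGA",
    ">seq4", "ACCTTGTCGCACCGGAACGACCGGAACGA",
    ">seq5", "GTCGCA",
]
for name, content in split_fasta(big, per_file=3).items():
    print(name, "->", sum(1 for l in content if l.startswith(">")), "sequences")
```

To produce real files, each dictionary entry could be written out with `open(name, "w")`.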