Integrating R and Perl to speed up random sampling, without replacement, of a huge numeric range

I want to randomly select a specific number of rows from a fastq file pair, within a Perl script. After some Googling I was surprised not to find a simple (and scalable) method using Perl alone, at least for potentially huge files. The best suggestion looked to be something I gleaned from PerlMonks, posted around 2009, which involved creating an array of numbers (an index), shuffling it, and then taking what you need. This breaks down if you want to sample from a huge range, even if you only need a small sample (e.g., 1000 random integers from the range (1, 10000000000)). Proposed workarounds to the “without replacement” requirement over a huge range involved generating random numbers and storing them in a hash to check for repeats (see the sketch below), but this gets slower and slower as you sample a larger proportion of the range.
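For illustration, here is a minimal sketch of that hash-based workaround as I understand it (the subroutine name and numbers are my own, not from the PerlMonks thread): keep drawing random integers and let the hash silently reject duplicates until you have enough unique values.

#!/usr/bin/perl
use strict;
use warnings;

# Hash-based rejection sampling: draw random integers in 1..$max and
# skip any we've already seen, until we have $n unique values.
sub sample_without_replacement {
    my ($max, $n) = @_;
    my %seen;
    while (keys %seen < $n) {
        my $candidate = int(rand($max)) + 1;   # random integer in 1..$max
        $seen{$candidate} = 1;                 # duplicates simply overwrite
    }
    return keys %seen;
}

my @sample = sample_without_replacement(10_000_000_000, 1000);
print "@sample[0..9]\n";

This is fine for a small sample, but as the sample size approaches the size of the range the loop spends more and more time re-drawing values it has already seen.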
The ‘sample’ function in R does this very efficiently, and it’s even faster using ‘sample.int,’ if integers are what you’re after. Calling these functions from Perl is pretty simple using the Statistics::R module.
First, let’s see how these work just in R:
sample(1:1000000000, 10)
 [1] 708203128 315330985 962188958  49319866 132041223 226927166 810373363 790653185 876269370 494039577
 
sample.int(1000000000, 10)
 [1] 730459751 994357917  69492291 565544309 450504715 334939544 190179439 218113785 808402368 289432141
To use these functions in Perl you need to install the Statistics::R module, and to use ‘shuffle’ you’ll need List::Util. So let’s compare the suggested Perl-only approach to the two Perl-R approaches:
#!/usr/bin/perl
use strict;
use warnings;
use Statistics::R;
use List::Util 'shuffle';
 
my $max_number = 100000000;
 
# The Perl-only way: build a full index, shuffle it, take the first 10
my $start = time;
my @my_array = 1..$max_number;
my @shuffled_array = shuffle(@my_array);
print "\nShuffle approach:\t", (time - $start),"sec\n";
print "\n@shuffled_array[1..10]\n";
 
# R way 1
$start = time;
my $R = Statistics::R->new();
$R->set('x', $max_number);
$R->run( q'sample_for_perl = sample(1:x, 10)' );
my $Rsample = $R->get('sample_for_perl');
print "\nR sample:\t\t", (time - $start),"sec\n";
print "@$Rsample\n";
$R->stop();
 
# R way 2
$start = time;
my $R2 = Statistics::R->new();
$R2->set('x', $max_number);
$R2->run( q'sample_for_perl = sample.int(x, 10)' );
my $Rsample2 = $R2->get('sample_for_perl');
print "\nR sample.int:\t\t", (time - $start),"sec\n";
print "@$Rsample2\n";
$R2->stop();
 
# Shuffle approach: 70sec
  41313044 28177140 86142374 84745209 32922084 40075115 41773959 71032809 53190908 70537181
 
# R sample:         3sec
  3569784 94379881 22320132 19199157 54432698 53404741 70516998 28210788 96616635 22291399
 
# R sample.int:     0sec
  66635186 41389081 94760016 89810397 22677469 90768414 89380410 69136310 65013711 60463558
And here are some benchmarks over various array sizes:
[Figure: benchmark comparison of the shuffle approach vs. R sample/sample.int across a range of array sizes]
That’s a pretty substantial speed-up! To be fair, if you needed a large sample from a huge range you would still have to load the values into a Perl array after generating them in R, but not having to build and shuffle a huge index is a massive saving in both time and memory (a sketch of how this wires back into the original fastq problem follows below). I’m quite interested in hearing drawbacks to my approach, or better/more modern alternatives in Perl alone.
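Here is a rough sketch of how I would feed the R-generated sample back into the original fastq problem. The file name, read count, and single-file handling are assumptions for illustration; a real fastq pair would need the same set of record numbers applied to both files.

#!/usr/bin/perl
use strict;
use warnings;
use Statistics::R;

my $fastq     = 'reads_R1.fastq';   # hypothetical input file
my $n_records = 50_000_000;         # assumed total number of reads in the file
my $n_wanted  = 1000;

# Draw the record numbers in R, without replacement
my $R = Statistics::R->new();
$R->set('total_reads', $n_records);
$R->set('n_sample', $n_wanted);
$R->run( q'sample_for_perl = sample.int(total_reads, n_sample)' );
my $sampled = $R->get('sample_for_perl');
$R->stop();

# Store the wanted record numbers in a hash for constant-time lookup
my %wanted = map { $_ => 1 } @$sampled;

# Stream the fastq once, printing only the sampled records (4 lines each)
open my $fh, '<', $fastq or die "Cannot open $fastq: $!";
my $record = 0;
while (my $header = <$fh>) {
    $record++;
    my $seq  = <$fh>;
    my $plus = <$fh>;
    my $qual = <$fh>;
    print $header, $seq, $plus, $qual if $wanted{$record};
}
close $fh;

Streaming the file once and checking a hash of wanted record numbers keeps memory use flat no matter how big the fastq is; only the sample itself ever has to live in memory.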