If you have map or ped file (called: WBDC_Bopa1and2.ped and WBDC_Bopa1and2.map ), and you can use a command line --recode to do converting:
plink --file WBDC_Bopa1and2 --allow-extra-chr --recode vcf --out test
Since I have unknown chromosome (Unchr), I need to add --allow-extra-chr.
Then you will get a vcf file named as test.vcf.
But if you compare this vcf file with your refered SNPs vcf files (sorted_all_BOPA_masked_95idt.vcf), you will find there are issues in this test.vcf files; for example, the strand is unflipped and some of the reference alleles are assigned as alternative alleles.
Then you need to think about which variants need to flip stand, and which one need to force reference, and which one need to flip stand first and then force reference.
vcftools has a nice function --diff to compare two vcf files and let you figure out which SNPs are common and which one is discordance, and which one is unique called, etc.
Here we can use --diff --diff-site to compare those two vcf files. Before we run bcftools --diff, we need to sort those vcf files. We can use a command called vcf-sort from vcftools to sort the
Then we can compare those two sorted vcf files with vcftools:
Then you will get a file test_BOPA.diff.sites_in_files:
B: means both file get the same SNPs calling and sites
2: one file unique called SNPs
O: Inconcordance SNPS called in those two files
Then use awk to extract the disconcordance SNPs:
awk -F"\t" -v OFS="\t" '$4 == "O" {print $0}' test_BOPA.diff.sites_in_files >diff_test_BOPA
Check which ones are flipped and get the SNP ids:
Then flipped the strand with plink:
plink --vcf test.vcf --allow-extra-chr --flip one_fliped_allele --keep-allele-order --recode vcf --out flipped_test
plink --vcf test.vcf --allow-extra-chr --flip one_fliped_allele --keep-allele-order --recode vcf --out flipped_test
--keep-allele-order is most important flag when we flip strand or force reference alleles. Without this flag, the vcf file will come back to the old status.
When we got the flipped_test.vcf, we compared the vcf files with the reference one again.
vcftools --vcf sorted_test.vcf --diff sorted_all_BOPA_masked_95idt.vcf --diff-site --out flipped_test_BOPA
Find the difference:
awk -F"\t" -v OFS="\t" '$4 == "O" {print $0}' test_BOPA.diff.sites_in_files >diff_flipped_test_BOPA
Get the SNP_ID which need to force reference alleles:
grep -f <(awk -F"\t" -v OFS="\t" '$5 == $8 && $6 == $7 { print $0 }' <(grep "O" diff_flipped_test_BOPA.diff.sites_in_files)|cut -f1,2) sorted_all_BOPA_masked_95idt.vcf |cut -f3,4>forced_ref_alleles
forced_ref_alleles file is like this:
SNP_id Reference_allele
Then force reference alleles with plink:
plink --vcf flipped_test.vcf --allow-extra-chr --a2-allele forced_ref_alleles --keep-allele-order --recode vcf --out forced_ref_flipped_test
Here be careful of --a2-allele, which is reference allele, and also always add --keep-allele-order, which will help to change the forced_reference allele status.
Hopefully this will help you to convert files easily!!!
No comments:
Post a Comment