Jump to Content
http://davetang.org/wiki/tiki-index.php?page=SAMTools#Converting_a_SAM_file_to_a_BAM_file
SAMTools
SAMTools provides various tools for manipulating alignments in the SAM/BAM format. The SAM (Sequence Alignment/Map) format (BAM is just the binary form of SAM) is currently the de facto standard for storing large nucleotide sequence alignments. If you are dealing with high-throughput sequencing data, at some point you will probably have to deal with SAM/BAM files, so familiarise yourself with them!
Here are some basic commands to get your started. For more information on the SAM/BAM format see this page.
Table of contents
Basic usage
Converting a SAM file to a BAM file
Sorting a BAM file
Converting SAM directly to a sorted BAM file
Creating a BAM index file
Converting a BAM file to a SAM file
Filtering out unmapped reads in BAM files
Extracting SAM entries mapping to a specific loci
Extracting only the first read from paired end BAM files
Simple stats using SAMTools flagstat
Interpreting the BAM flags
SAMTools calmd/fillmd
Creating FASTQ files from a BAM file
Random subsampling of BAM file
More information
BASIC USAGE
samtools
samtools <command> [options]
If you run samtools on the terminal without any parameters, all the available utilities are listed:
samtools
Program: samtools (Tools for alignments in the SAM format)
Version: 0.1.18 (r982:295)
Usage: samtools <command> [options]
Command: view SAM<->BAM conversion
sort sort alignment file
mpileup multi-way pileup
depth compute the depth
faidx index/extract FASTA
tview text alignment viewer
index index alignment
idxstats BAM index stats (r595 or later)
fixmate fix mate information
flagstat simple stats
calmd recalculate MD/NM tags and '=' bases
merge merge sorted alignments
rmdup remove PCR duplicates
reheader replace BAM header
cat concatenate BAMs
targetcut cut fosmid regions (for fosmid pool only)
phase phase heterozygotes
Most of the times I just use view, sort, index and flagstats.
CONVERTING A SAM FILE TO A BAM FILE
A BAM file is just a SAM file but stored in binary; you should always convert your SAM files into BAM to save storage space and BAM files are faster to manipulate.
To get started, view the first couple of lines of your SAM file by typing on the terminal:
shell
head test.sam
The first 10 lines on your terminal after typing "head test.sam", should be lines starting with the "@" sign, which is an indicator for a header line. If you don't see lines starting with the "@" sign, the header information is most likely missing.
If the header is absent from the SAM file use the command below, where reference.fa is the reference fasta file used to map the reads:
samtools
samtools view -bT reference.fa test.sam > test.bam
If the header information is available:
samtools
samtools view -bS test.sam > test.bam
SORTING A BAM FILE
Always sort your BAM files; many downstream programs only take sorted BAM files.
samtools
samtools sort test.bam test_sorted
CONVERTING SAM DIRECTLY TO A SORTED BAM FILE
Like many Unix tools, SAMTools is able to read directly from a stream i.e. stdout.
samtools
samtools view -bS file.sam | samtools sort - file_sorted
CREATING A BAM INDEX FILE
A BAM index file is usually needed when visualising a BAM file.
samtools
samtools index test_sorted.bam test_sorted.bai
CONVERTING A BAM FILE TO A SAM FILE
Note: remember to use -h to ensure the converted SAM file contains the header information. Generally, I recommend storing only sorted BAM files as they use much less disk space and are faster to process.
samtools
samtools view -h NA06984.chrom16.ILLUMINA.bwa.CEU.low_coverage.20100517.bam > NA06984.chrom16.ILLUMINA.bwa.CEU.low_coverage.20100517.sam
FILTERING OUT UNMAPPED READS IN BAM FILES
samtools
samtools view -h -F 4 blah.bam > blah_only_mapped.sam
OR
samtools view -h -F 4 -b blah.bam > blah_only_mapped.bam
EXTRACTING SAM ENTRIES MAPPING TO A SPECIFIC LOCI
Quite commonly, we want to extract all reads that map within a specific genomic loci.
First make a BED file, which in its simplest form is a tab-delimited file with 3 columns. Say you want to extract reads mapping within the genomic coordinates 250,000 to 260,000 on chromosome 1. Then your BED file (which I will call test.bed) will be:
chr1 250000 260000
Then run samtools as such:
samtools
samtools view -L test.bed test.bam | awk '$2 != 4 {print}'
OR
samtools view -S -L test.bed test.sam | awk '$2 != 4 {print}'
OR
samtools view -F 4 -S -L test.bed test.sam
Note: even if you put strand information into your BED file, the above command ignores strandedness, so you get all reads mapping within the region specified by the BED file, regardless of whether it maps to the positive or negative strand. The AWK line above is to remove unmapped reads, which also get returned or you could add the -F 4 parameter.
Alternatively one can use BEDTools:
BEDTools
intersectBed -abam file.bam -b test.bed
EXTRACTING ONLY THE FIRST READ FROM PAIRED END BAM FILES
Sometimes you only want the first pair of a mate
samtools
samtools view -h -f 0x0040 test.bam > test_first_pair.sam
0x0040 is hexadecimal for 64 (i.e. 16 * 4), which is binary for 1000000, corresponding to the read in the first read pair.
SIMPLE STATS USING SAMTOOLS FLAGSTAT
The flagstat command provides simple statistics on a BAM file. Here I downloaded a BAM file from the 1000 genome project: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data/NA06984/alignment/NA06984.chrom20.ILLUMINA.bwa.CEU.low_coverage.20111114.bam
Then simply:
samtools flagstat
samtools flagstat NA06984.chrom20.ILLUMINA.bwa.CEU.low_coverage.20111114.bam
1 6874858 + 0 in total (QC-passed reads + QC-failed reads)
2 90281 + 0 duplicates
3 6683299 + 0 mapped (97.21%)
4 6816083 + 0 paired in sequencing
5 3408650 + 0 read1
6 3407433 + 0 read2
7 6348470 + 0 properly paired (93.14NaV)
8 6432965 + 0 with itself and mate mapped
9 191559 + 0 singletons (2.81NaV)
10 57057 + 0 with mate mapped to a different chr
11 45762 + 0 with mate mapped to a different chr (mapQ>=5)
To confirm some of the numbers returned by flagstat:
flagstat
samtools view NA06984.chrom20.ILLUMINA.bwa.CEU.low_coverage.20111114.bam | wc
6874858 #same as line 1
samtools view -F 4 NA06984.chrom20.ILLUMINA.bwa.CEU.low_coverage.20111114.bam | wc
6683299 #same as line 3
samtools view -f 0x0040 NA06984.chrom20.ILLUMINA.bwa.CEU.low_coverage.20111114.bam | wc
3408650 #same as line 5
samtools view -f 0x0080 NA06984.chrom20.ILLUMINA.bwa.CEU.low_coverage.20111114.bam | wc
3407433 #same as line 6
and so on...
Here's an example of a properly paired read:
SRR035024.17204235 163 20 60126 60 68M8S = 60466 383 CCACCATGGACCTCTGGGATCCTAGCTTTAAGAGATCCCATCACCCACATGAACGTTTGAATTGACAGGGGGAGCG @FEBBABEEDDGIGJIIIHIHJKHIIKAEHKEHEHI>JIJBDJHIJJ5CIFH4;;9=CDB8F8F>5B######### X0:i:1 X1:i:0 XC:i:68 MD:Z:68 RG:Z:SRR035024 AM:i:37 NM:i:0 SM:i:37 MQ:i:60 XT:A:U BQ:Z:BH@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
SRR035024.17204235 83 20 60466 60 32S44M = 60126 -383 GGCCTCCCCCCGGGCCCCTCTTGTGTGCACACAGCACAGCCTCTACTGCTACACCTGAGTACTTTGCCAGTGGCCT #################################>D:LIJEJBJIFJJJJIHKIJJJIKHIHIKJJIJJKGIIFEDB X0:i:1 X1:i:0 XC:i:44 MD:Z:44 RG:Z:SRR035024 AM:i:37 NM:i:0 SM:i:37 MQ:i:60 XT:A:U BQ:Z:@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
For more statistics of SAM/BAM files see the SAMStat program.
See the section "Interpreting the BAM flags" below for more information on BAM flags.
INTERPRETING THE BAM FLAGS
The second column in a SAM/BAM file is the flag column. They may seem confusing at first but the encoding allows details about a read to be stored by just using a few digits. The trick is to convert the numerical digit into binary, and then use the table to interpret the binary numbers, where 1 = true and 0 = false.
Here are some common BAM flags:
163: 10100011 in binary
147: 10010011 in binary
99: 1100011 in binary
83: 1010011 in binary
Interpretation of 10100011 (reading the binary from left to right):
1 the read is paired in sequencing, no matter whether it is mapped in a pair
1 the read is mapped in a proper pair (depends on the protocol, normally inferred during alignment)
0 the query sequence itself is unmapped
0 the mate is unmapped
0 strand of the query (0 for forward; 1 for reverse strand)
1 strand of the mate
0 the read is the first read in a pair
1 the read is the second read in a pair
163 second read of a pair on the positive strand with negative strand mate
147 second read of a pair on the negative strand with positive strand mate
99 first read of a pair on the forward strand with negative strand mate
83 first read of a pair on the reverse strand with positive strand mate
IF you observe a BAM flag of 512 (0x200), it indicates a read that did not pass quality control.
Below is a very raw Perl script that may help interpret BAM flags. I make the assumption that when a read or its mate is not flagged as unmapped (i.e. having a 0 value), there is information regarding its strandedness. Also if a read is flagged as not having a proper pair, I ignore the flags that store information for the mate.
bam_flag.pl
#!/usr/bin/perl
use strict;
use warnings;
my $usage = "Usage: $0 <bam_flag>\n";
my $bam_flag = shift or die $usage;
my $binary = dec2bin($bam_flag);
$binary = reverse($binary);
#163 -> 10100011
my @desc = (
'The read is paired in sequencing',
'The read is mapped in a proper pair',
'The query sequence itself is unmapped',
'The mate is unmapped',
'strand of the query',
'strand of the mate',
'The read is the first read in a pair',
'The read is the second read in a pair',
'The alignment is not primery (a read having split hits may have multiple primary alignment records)',
'The read fails quality checks',
'The read is either a PCR duplicate or an optical duplicate',
);
my $query_mapped = '0';
my $mate_mapped = '0';
my $proper_pair = '0';
print "$binary\n";
for (my $i=0; $i< length($binary); ++$i){
my $flag = substr($binary,$i,1);
#print "\$i = $i and \$flag = $flag\n";
if ($i == 1){
if ($flag == 1){
$proper_pair = '1';
}
}
if ($i == 2){
if ($flag == 0){
$query_mapped = '1';
}
}
if ($i == 3){
if ($flag == 0){
$mate_mapped = '1';
}
}
if ($i == 4){
next if $query_mapped == 0;
if ($flag == 0){
print "The read is mapped on the forward strand\n";
} else {
print "The read is mapped on the reverse strand\n";
}
} elsif ($i == 5){
next if $mate_mapped == 0;
next if $proper_pair == 0;
if ($flag == 0){
print "The mate is mapped on the forward strand\n";
} else {
print "The mate is mapped on the reverse strand\n";
}
} else {
if ($flag == 1){
print "$desc[$i]\n";
}
}
}
exit(0);
#from The Perl Cookbook
sub dec2bin {
my $str = unpack("B32", pack("N", shift));
$str =~ s/^0+(?=\d)//; # otherwise you'll get leading zeros
return $str;
}
__END__
SAMTOOLS CALMD/FILLMD
The calmd or fillmd tool is useful for visualising mismatches and insertions in an alignment of a read to a reference genome. For example:
fillmd
#random BAM file mapped with BWA
samtools view blah.bam
blah 0 chr1 948827 37 1M2D17M1I8M * 0 0 GGTGTGTGCCTCAGGCTTAATAATAGG ggcgde`a`egfbfgghdfa_cfea^_ XT:A:U NM:i:3 X0:i:1 X1:i:0 XM:i:1 XO:i:1 XG:i:1 MD:Z:1^AC25
#the -e changes identical bases between the read and reference into ='s
samtools view -b blah.bam | samtools fillmd -e - ref.fa
blah 0 chr1 948827 37 1M2D17M1I8M * 0 0 ==================A======== ggcgde`a`egfbfgghdfa_cfea^_ XT:A:U NM:i:3 X0:i:1 X1:i:0 XM:i:1 XO:i:1 XG:i:1 MD:Z:1^AC25
The original read sequence is:
blah G--GTGTGTGCCTCAGGCTTAATAATAGG
| |||||||||||||||||| |||||||
genome GACGTGTGTGCCTCAGGCTTA-TAATAGG
The CIGAR string (1M2D17M1I8M) is shown with respect to the read, i.e. a deletion means a deletion in the read and an insertion is an insertion in the read, as you can see above. SAMTools fillmd shows the insertions (as above) and mismatches (not in the example above) but not deletions.
CREATING FASTQ FILES FROM A BAM FILE
I found this great tool at http://www.hudsonalpha.org/gsl/software/bam2fastq.php
For example to extract ONLY unaligned from a bam file:
samtools
bam2fastq -o blah_unaligned.fastq --no-aligned blah.bam
To extract ONLY aligned reads from a bam file:
samtools
bam2fastq -o blah_aligned.fastq --no-unaligned blah.bam
RANDOM SUBSAMPLING OF BAM FILE
The SAMTools view -s parameter allows you to randomly sample lines of a BAM file
samtools view -s 0.5 -b file.bam > random_half_of_file.sam
MORE INFORMATION
http://samtools.sourceforge.net/
http://sourceforge.net/apps/mediawiki/samtools/index.php?title=SAM_FAQ
Comments
Powered by Tiki Wiki CMS Groupware | | Theme: Feb12
Wiki