evolution & bioinformatics

Li's blogger

Jump to Content

http://davetang.org/wiki/tiki-index.php?page=SAMTools#Converting_a_SAM_file_to_a_BAM_file

SAMTools

SAMTools provides various tools for manipulating alignments in the SAM/BAM format. The SAM (Sequence Alignment/Map) format (BAM is just the binary form of SAM) is currently the de facto standard for storing large nucleotide sequence alignments. If you are dealing with high-throughput sequencing data, at some point you will probably have to deal with SAM/BAM files, so familiarise yourself with them!

Here are some basic commands to get your started. For more information on the SAM/BAM format see this page.

Table of contents

Basic usage
Converting a SAM file to a BAM file
Sorting a BAM file
Converting SAM directly to a sorted BAM file
Creating a BAM index file
Converting a BAM file to a SAM file
Filtering out unmapped reads in BAM files
Extracting SAM entries mapping to a specific loci
Extracting only the first read from paired end BAM files
Simple stats using SAMTools flagstat
Interpreting the BAM flags
SAMTools calmd/fillmd
Creating FASTQ files from a BAM file
Random subsampling of BAM file
More information

BASIC USAGE

samtools
samtools <command> [options]

If you run samtools on the terminal without any parameters, all the available utilities are listed:

samtools
Program: samtools (Tools for alignments in the SAM format)
Version: 0.1.18 (r982:295)

Usage: samtools <command> [options]

Command: view SAM<->BAM conversion
sort sort alignment file
mpileup multi-way pileup
depth compute the depth
faidx index/extract FASTA
tview text alignment viewer
index index alignment
idxstats BAM index stats (r595 or later)
fixmate fix mate information
flagstat simple stats
calmd recalculate MD/NM tags and '=' bases
merge merge sorted alignments
rmdup remove PCR duplicates
reheader replace BAM header
cat concatenate BAMs
targetcut cut fosmid regions (for fosmid pool only)
phase phase heterozygotes

Most of the times I just use view, sort, index and flagstats.

CONVERTING A SAM FILE TO A BAM FILE

A BAM file is just a SAM file but stored in binary; you should always convert your SAM files into BAM to save storage space and BAM files are faster to manipulate.

To get started, view the first couple of lines of your SAM file by typing on the terminal:

shell
head test.sam

The first 10 lines on your terminal after typing "head test.sam", should be lines starting with the "@" sign, which is an indicator for a header line. If you don't see lines starting with the "@" sign, the header information is most likely missing.

If the header is absent from the SAM file use the command below, where reference.fa is the reference fasta file used to map the reads:

samtools
samtools view -bT reference.fa test.sam > test.bam

If the header information is available:

samtools
samtools view -bS test.sam > test.bam

SORTING A BAM FILE

Always sort your BAM files; many downstream programs only take sorted BAM files.

samtools
samtools sort test.bam test_sorted

CONVERTING SAM DIRECTLY TO A SORTED BAM FILE

Like many Unix tools, SAMTools is able to read directly from a stream i.e. stdout.

samtools
samtools view -bS file.sam | samtools sort - file_sorted

CREATING A BAM INDEX FILE

A BAM index file is usually needed when visualising a BAM file.

samtools
samtools index test_sorted.bam test_sorted.bai

CONVERTING A BAM FILE TO A SAM FILE

Note: remember to use -h to ensure the converted SAM file contains the header information. Generally, I recommend storing only sorted BAM files as they use much less disk space and are faster to process.

samtools
samtools view -h NA06984.chrom16.ILLUMINA.bwa.CEU.low_coverage.20100517.bam > NA06984.chrom16.ILLUMINA.bwa.CEU.low_coverage.20100517.sam

FILTERING OUT UNMAPPED READS IN BAM FILES

samtools
samtools view -h -F 4 blah.bam > blah_only_mapped.sam
OR
samtools view -h -F 4 -b blah.bam > blah_only_mapped.bam

EXTRACTING SAM ENTRIES MAPPING TO A SPECIFIC LOCI

Quite commonly, we want to extract all reads that map within a specific genomic loci.

First make a BED file, which in its simplest form is a tab-delimited file with 3 columns. Say you want to extract reads mapping within the genomic coordinates 250,000 to 260,000 on chromosome 1. Then your BED file (which I will call test.bed) will be:

chr1 250000 260000

Then run samtools as such:

samtools
samtools view -L test.bed test.bam | awk '$2 != 4 {print}'
OR
samtools view -S -L test.bed test.sam | awk '$2 != 4 {print}'
OR
samtools view -F 4 -S -L test.bed test.sam

Note: even if you put strand information into your BED file, the above command ignores strandedness, so you get all reads mapping within the region specified by the BED file, regardless of whether it maps to the positive or negative strand. The AWK line above is to remove unmapped reads, which also get returned or you could add the -F 4 parameter.

Alternatively one can use BEDTools:

BEDTools
intersectBed -abam file.bam -b test.bed

EXTRACTING ONLY THE FIRST READ FROM PAIRED END BAM FILES

Sometimes you only want the first pair of a mate

samtools
samtools view -h -f 0x0040 test.bam > test_first_pair.sam

0x0040 is hexadecimal for 64 (i.e. 16 * 4), which is binary for 1000000, corresponding to the read in the first read pair.

SIMPLE STATS USING SAMTOOLS FLAGSTAT

The flagstat command provides simple statistics on a BAM file. Here I downloaded a BAM file from the 1000 genome project: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data/NA06984/alignment/NA06984.chrom20.ILLUMINA.bwa.CEU.low_coverage.20111114.bam

Then simply:

samtools flagstat
samtools flagstat NA06984.chrom20.ILLUMINA.bwa.CEU.low_coverage.20111114.bam

1 6874858 + 0 in total (QC-passed reads + QC-failed reads)
2 90281 + 0 duplicates
3 6683299 + 0 mapped (97.21%)
4 6816083 + 0 paired in sequencing
5 3408650 + 0 read1
6 3407433 + 0 read2
7 6348470 + 0 properly paired (93.14NaV)
8 6432965 + 0 with itself and mate mapped
9 191559 + 0 singletons (2.81NaV)
10 57057 + 0 with mate mapped to a different chr
11 45762 + 0 with mate mapped to a different chr (mapQ>=5)

To confirm some of the numbers returned by flagstat:

flagstat
samtools view NA06984.chrom20.ILLUMINA.bwa.CEU.low_coverage.20111114.bam | wc
6874858 #same as line 1

samtools view -F 4 NA06984.chrom20.ILLUMINA.bwa.CEU.low_coverage.20111114.bam | wc
6683299 #same as line 3

samtools view -f 0x0040 NA06984.chrom20.ILLUMINA.bwa.CEU.low_coverage.20111114.bam | wc
3408650 #same as line 5

samtools view -f 0x0080 NA06984.chrom20.ILLUMINA.bwa.CEU.low_coverage.20111114.bam | wc
3407433 #same as line 6

and so on...

Here's an example of a properly paired read:

SRR035024.17204235 163 20 60126 60 68M8S = 60466 383 CCACCATGGACCTCTGGGATCCTAGCTTTAAGAGATCCCATCACCCACATGAACGTTTGAATTGACAGGGGGAGCG @FEBBABEEDDGIGJIIIHIHJKHIIKAEHKEHEHI>JIJBDJHIJJ5CIFH4;;9=CDB8F8F>5B######### X0:i:1 X1:i:0 XC:i:68 MD:Z:68 RG:Z:SRR035024 AM:i:37 NM:i:0 SM:i:37 MQ:i:60 XT:A:U BQ:Z:BH@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
SRR035024.17204235 83 20 60466 60 32S44M = 60126 -383 GGCCTCCCCCCGGGCCCCTCTTGTGTGCACACAGCACAGCCTCTACTGCTACACCTGAGTACTTTGCCAGTGGCCT #################################>D:LIJEJBJIFJJJJIHKIJJJIKHIHIKJJIJJKGIIFEDB X0:i:1 X1:i:0 XC:i:44 MD:Z:44 RG:Z:SRR035024 AM:i:37 NM:i:0 SM:i:37 MQ:i:60 XT:A:U BQ:Z:@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@

For more statistics of SAM/BAM files see the SAMStat program.

See the section "Interpreting the BAM flags" below for more information on BAM flags.

INTERPRETING THE BAM FLAGS

The second column in a SAM/BAM file is the flag column. They may seem confusing at first but the encoding allows details about a read to be stored by just using a few digits. The trick is to convert the numerical digit into binary, and then use the table to interpret the binary numbers, where 1 = true and 0 = false.

Here are some common BAM flags:

163: 10100011 in binary
147: 10010011 in binary
99: 1100011 in binary
83: 1010011 in binary

Interpretation of 10100011 (reading the binary from left to right):

1 the read is paired in sequencing, no matter whether it is mapped in a pair
1 the read is mapped in a proper pair (depends on the protocol, normally inferred during alignment)
0 the query sequence itself is unmapped
0 the mate is unmapped
0 strand of the query (0 for forward; 1 for reverse strand)
1 strand of the mate
0 the read is the first read in a pair
1 the read is the second read in a pair

163 second read of a pair on the positive strand with negative strand mate
147 second read of a pair on the negative strand with positive strand mate
99 first read of a pair on the forward strand with negative strand mate
83 first read of a pair on the reverse strand with positive strand mate

IF you observe a BAM flag of 512 (0x200), it indicates a read that did not pass quality control.

Below is a very raw Perl script that may help interpret BAM flags. I make the assumption that when a read or its mate is not flagged as unmapped (i.e. having a 0 value), there is information regarding its strandedness. Also if a read is flagged as not having a proper pair, I ignore the flags that store information for the mate.

bam_flag.pl
#!/usr/bin/perl

use strict;
use warnings;

my $usage = "Usage: $0 <bam_flag>\n";
my $bam_flag = shift or die $usage;

my $binary = dec2bin($bam_flag);
$binary = reverse($binary);

#163 -> 10100011

my @desc = (
'The read is paired in sequencing',
'The read is mapped in a proper pair',
'The query sequence itself is unmapped',
'The mate is unmapped',
'strand of the query',
'strand of the mate',
'The read is the first read in a pair',
'The read is the second read in a pair',
'The alignment is not primery (a read having split hits may have multiple primary alignment records)',
'The read fails quality checks',
'The read is either a PCR duplicate or an optical duplicate',
);

my $query_mapped = '0';
my $mate_mapped = '0';
my $proper_pair = '0';

print "$binary\n";

for (my $i=0; $i< length($binary); ++$i){
my $flag = substr($binary,$i,1);
#print "\$i = $i and \$flag = $flag\n";
if ($i == 1){
if ($flag == 1){
$proper_pair = '1';
}
}
if ($i == 2){
if ($flag == 0){
$query_mapped = '1';
}
}
if ($i == 3){
if ($flag == 0){
$mate_mapped = '1';
}
}
if ($i == 4){
next if $query_mapped == 0;
if ($flag == 0){
print "The read is mapped on the forward strand\n";
} else {
print "The read is mapped on the reverse strand\n";
}
} elsif ($i == 5){
next if $mate_mapped == 0;
next if $proper_pair == 0;
if ($flag == 0){
print "The mate is mapped on the forward strand\n";
} else {
print "The mate is mapped on the reverse strand\n";
}
} else {
if ($flag == 1){
print "$desc[$i]\n";
}
}
}

exit(0);

#from The Perl Cookbook
sub dec2bin {
my $str = unpack("B32", pack("N", shift));
$str =~ s/^0+(?=\d)//; # otherwise you'll get leading zeros
return $str;
}

__END__

SAMTOOLS CALMD/FILLMD

The calmd or fillmd tool is useful for visualising mismatches and insertions in an alignment of a read to a reference genome. For example:

fillmd
#random BAM file mapped with BWA
samtools view blah.bam

blah 0 chr1 948827 37 1M2D17M1I8M * 0 0 GGTGTGTGCCTCAGGCTTAATAATAGG ggcgde`a`egfbfgghdfa_cfea^_ XT:A:U NM:i:3 X0:i:1 X1:i:0 XM:i:1 XO:i:1 XG:i:1 MD:Z:1^AC25

#the -e changes identical bases between the read and reference into ='s
samtools view -b blah.bam | samtools fillmd -e - ref.fa

blah 0 chr1 948827 37 1M2D17M1I8M * 0 0 ==================A======== ggcgde`a`egfbfgghdfa_cfea^_ XT:A:U NM:i:3 X0:i:1 X1:i:0 XM:i:1 XO:i:1 XG:i:1 MD:Z:1^AC25

The original read sequence is:

blah G--GTGTGTGCCTCAGGCTTAATAATAGG
| |||||||||||||||||| |||||||
genome GACGTGTGTGCCTCAGGCTTA-TAATAGG

The CIGAR string (1M2D17M1I8M) is shown with respect to the read, i.e. a deletion means a deletion in the read and an insertion is an insertion in the read, as you can see above. SAMTools fillmd shows the insertions (as above) and mismatches (not in the example above) but not deletions.

CREATING FASTQ FILES FROM A BAM FILE

I found this great tool at http://www.hudsonalpha.org/gsl/software/bam2fastq.php

For example to extract ONLY unaligned from a bam file:

samtools
bam2fastq -o blah_unaligned.fastq --no-aligned blah.bam

To extract ONLY aligned reads from a bam file:

samtools
bam2fastq -o blah_aligned.fastq --no-unaligned blah.bam

RANDOM SUBSAMPLING OF BAM FILE

The SAMTools view -s parameter allows you to randomly sample lines of a BAM file

samtools view -s 0.5 -b file.bam > random_half_of_file.sam

MORE INFORMATION

http://samtools.sourceforge.net/

http://sourceforge.net/apps/mediawiki/samtools/index.php?title=SAM_FAQ

Comments
Powered by Tiki Wiki CMS Groupware | | Theme: Feb12
Wiki

Li's blogger

Software packages for next gen sequence analysis

28 Dec 2009: This thread has been closed. Please see our wiki software portal for information about each of these packages.

A reasonably thorough table of next-gen-seq software available in the commercial and public domain

Integrated solutions
* CLCbio Genomics Workbench - de novo and reference assembly of Sanger, Roche FLX, Illumina, Helicos, and SOLiD data. Commercial next-gen-seq software that extends the CLCbio Main Workbench software. Includes SNP detection, CHiP-seq, browser and other features. Commercial. Windows, Mac OS X and Linux.
* Galaxy - Galaxy = interactive and reproducible genomics. A job webportal.
* Genomatix - Integrated Solutions for Next Generation Sequencing data analysis.
* JMP Genomics - Next gen visualization and statistics tool from SAS. They are working with NCGR to refine this tool and produce others.
* NextGENe - de novo and reference assembly of Illumina, SOLiD and Roche FLX data. Uses a novel Condensation Assembly Tool approach where reads are joined via "anchors" into mini-contigs before assembly. Includes SNP detection, CHiP-seq, browser and other features. Commercial. Win or MacOS.
* SeqMan Genome Analyser - Software for Next Generation sequence assembly of Illumina, Roche FLX and Sanger data integrating with Lasergene Sequence Analysis software for additional analysis and visualization capabilities. Can use a hybrid templated/de novo approach. Commercial. Win or Mac OS X.
* SHORE - SHORE, for Short Read, is a mapping and analysis pipeline for short DNA sequences produced on a Illumina Genome Analyzer. A suite created by the 1001 Genomes project. Source for POSIX.
* SlimSearch - Fledgling commercial product.

Align/Assemble to a reference
* BFAST - Blat-like Fast Accurate Search Tool. Written by Nils Homer, Stanley F. Nelson and Barry Merriman at UCLA.
* Bowtie - Ultrafast, memory-efficient short read aligner. It aligns short DNA sequences (reads) to the human genome at a rate of 25 million reads per hour on a typical workstation with 2 gigabytes of memory. Uses a Burrows-Wheeler-Transformed (BWT) index. Link to discussion thread here. Written by Ben Langmead and Cole Trapnell. Linux, Windows, and Mac OS X.
* BWA - Heng Lee's BWT Alignment program - a progression from Maq. BWA is a fast light-weighted tool that aligns short sequences to a sequence database, such as the human reference genome. By default, BWA finds an alignment within edit distance 2 to the query sequence. C++ source.
* ELAND - Efficient Large-Scale Alignment of Nucleotide Databases. Whole genome alignments to a reference genome. Written by Illumina author Anthony J. Cox for the Solexa 1G machine.
* Exonerate - Various forms of pairwise alignment (including Smith-Waterman-Gotoh) of DNA/protein against a reference. Authors are Guy St C Slater and Ewan Birney from EMBL. C for POSIX.
* GenomeMapper - GenomeMapper is a short read mapping tool designed for accurate read alignments. It quickly aligns millions of reads either with ungapped or gapped alignments. A tool created by the 1001 Genomes project. Source for POSIX.
* GMAP - GMAP (Genomic Mapping and Alignment Program) for mRNA and EST Sequences. Developed by Thomas Wu and Colin Watanabe at Genentec. C/Perl for Unix.
* gnumap - The Genomic Next-generation Universal MAPper (gnumap) is a program designed to accurately map sequence data obtained from next-generation sequencing machines (specifically that of Solexa/Illumina) back to a genome of any size. It seeks to align reads from nonunique repeats using statistics. From authors at Brigham Young University. C source/Unix.
* MAQ - Mapping and Assembly with Qualities (renamed from MAPASS2). Particularly designed for Illumina with preliminary functions to handle ABI SOLiD data. Written by Heng Li from the Sanger Centre. Features extensive supporting tools for DIP/SNP detection, etc. C++ source
* MOSAIK - MOSAIK produces gapped alignments using the Smith-Waterman algorithm. Features a number of support tools. Support for Roche FLX, Illumina, SOLiD, and Helicos. Written by Michael Strömberg at Boston College. Win/Linux/MacOSX
* MrFAST and MrsFAST - mrFAST & mrsFAST are designed to map short reads generated with the Illumina platform to reference genome assemblies; in a fast and memory-efficient manner. Robust to INDELs and MrsFAST has a bisulphite mode. Authors are from the University of Washington. C as source.
* MUMmer - MUMmer is a modular system for the rapid whole genome alignment of finished or draft sequence. Released as a package providing an efficient suffix tree library, seed-and-extend alignment, SNP detection, repeat detection, and visualization tools. Version 3.0 was developed by Stefan Kurtz, Adam Phillippy, Arthur L Delcher, Michael Smoot, Martin Shumway, Corina Antonescu and Steven L Salzberg - most of whom are at The Institute for Genomic Research in Maryland, USA. POSIX OS required.
* Novocraft - Tools for reference alignment of paired-end and single-end Illumina reads. Uses a Needleman-Wunsch algorithm. Can support Bis-Seq. Commercial. Available free for evaluation, educational use and for use on open not-for-profit projects. Requires Linux or Mac OS X.
* PASS - It supports Illumina, SOLiD and Roche-FLX data formats and allows the user to modulate very finely the sensitivity of the alignments. Spaced seed intial filter, then NW dynamic algorithm to a SW(like) local alignment. Authors are from CRIBI in Italy. Win/Linux.
* RMAP - Assembles 20 - 64 bp Illumina reads to a FASTA reference genome. By Andrew D. Smith and Zhenyu Xuan at CSHL. (published in BMC Bioinformatics). POSIX OS required.
* SeqMap - Supports up to 5 or more bp mismatches/INDELs. Highly tunable. Written by Hui Jiang from the Wong lab at Stanford. Builds available for most OS's.
* SHRiMP - Assembles to a reference sequence. Developed with Applied Biosystem's colourspace genomic representation in mind. Authors are Michael Brudno and Stephen Rumble at the University of Toronto. POSIX.
* Slider- An application for the Illumina Sequence Analyzer output that uses the probability files instead of the sequence files as an input for alignment to a reference sequence or a set of reference sequences. Authors are from BCGSC. Paper is here.
* SOAP - SOAP (Short Oligonucleotide Alignment Program). A program for efficient gapped and ungapped alignment of short oligonucleotides onto reference sequences. The updated version uses a BWT. Can call SNPs and INDELs. Author is Ruiqiang Li at the Beijing Genomics Institute. C++, POSIX.
* SSAHA - SSAHA (Sequence Search and Alignment by Hashing Algorithm) is a tool for rapidly finding near exact matches in DNA or protein databases using a hash table. Developed at the Sanger Centre by Zemin Ning, Anthony Cox and James Mullikin. C++ for Linux/Alpha.
* SOCS - Aligns SOLiD data. SOCS is built on an iterative variation of the Rabin-Karp string search algorithm, which uses hashing to reduce the set of possible matches, drastically increasing search speed. Authors are Ondov B, Varadarajan A, Passalacqua KD and Bergman NH.
* SWIFT - The SWIFT suit is a software collection for fast index-based sequence comparison. It contains: SWIFT — fast local alignment search, guaranteeing to find epsilon-matches between two sequences. SWIFT BALSAM — a very fast program to find semiglobal non-gapped alignments based on k-mer seeds. Authors are Kim Rasmussen (SWIFT) and Wolfgang Gerlach (SWIFT BALSAM)
* SXOligoSearch - SXOligoSearch is a commercial platform offered by the Malaysian based Synamatix. Will align Illumina reads against a range of Refseq RNA or NCBI genome builds for a number of organisms. Web Portal. OS independent.
* Vmatch - A versatile software tool for efficiently solving large scale sequence matching tasks. Vmatch subsumes the software tool REPuter, but is much more general, with a very flexible user interface, and improved space and time requirements. Essentially a large string matching toolbox. POSIX.
* Zoom - ZOOM (Zillions Of Oligos Mapped) is designed to map millions of short reads, emerged by next-generation sequencing technology, back to the reference genomes, and carry out post-analysis. ZOOM is developed to be highly accurate, flexible, and user-friendly with speed being a critical priority. Commercial. Supports Illumina and SOLiD data.

De novo Align/Assemble
* ABySS - Assembly By Short Sequences. ABySS is a de novo sequence assembler that is designed for very short reads. The single-processor version is useful for assembling genomes up to 40-50 Mbases in size. The parallel version is implemented using MPI and is capable of assembling larger genomes. By Simpson JT and others at the Canada's Michael Smith Genome Sciences Centre. C++ as source.
* ALLPATHS - ALLPATHS: De novo assembly of whole-genome shotgun microreads. ALLPATHS is a whole genome shotgun assembler that can generate high quality assemblies from short reads. Assemblies are presented in a graph form that retains ambiguities, such as those arising from polymorphism, thereby providing information that has been absent from previous genome assemblies. Broad Institute.
* Edena - Edena (Exact DE Novo Assembler) is an assembler dedicated to process the millions of very short reads produced by the Illumina Genome Analyzer. Edena is based on the traditional overlap layout paradigm. By D. Hernandez, P. François, L. Farinelli, M. Osteras, and J. Schrenzel. Linux/Win.
* EULER-SR - Short read de novo assembly. By Mark J. Chaisson and Pavel A. Pevzner from UCSD (published in Genome Research). Uses a de Bruijn graph approach.
* MIRA2 - MIRA (Mimicking Intelligent Read Assembly) is able to perform true hybrid de-novo assemblies using reads gathered through 454 sequencing technology (GS20 or GS FLX). Compatible with 454, Solexa and Sanger data. Linux OS required.
* SEQAN - A Consistency-based Consensus Algorithm for De Novo and Reference-guided Sequence Assembly of Short Reads. By Tobias Rausch and others. C++, Linux/Win.
* SHARCGS - De novo assembly of short reads. Authors are Dohm JC, Lottaz C, Borodina T and Himmelbauer H. from the Max-Planck-Institute for Molecular Genetics.
* SSAKE - The Short Sequence Assembly by K-mer search and 3' read Extension (SSAKE) is a genomics application for aggressively assembling millions of short nucleotide sequences by progressively searching for perfect 3'-most k-mers using a DNA prefix tree. Authors are René Warren, Granger Sutton, Steven Jones and Robert Holt from the Canada's Michael Smith Genome Sciences Centre. Perl/Linux.
* SOAPdenovo - Part of the SOAP suite. See above.
* VCAKE - De novo assembly of short reads with robust error correction. An improvement on early versions of SSAKE.
* Velvet - Velvet is a de novo genomic assembler specially designed for short read sequencing technologies, such as Solexa or 454. Need about 20-25X coverage and paired reads. Developed by Daniel Zerbino and Ewan Birney at the European Bioinformatics Institute (EMBL-EBI).

SNP/Indel Discovery
* ssahaSNP - ssahaSNP is a polymorphism detection tool. It detects homozygous SNPs and indels by aligning shotgun reads to the finished genome sequence. Highly repetitive elements are filtered out by ignoring those kmer words with high occurrence numbers. More tuned for ABI Sanger reads. Developers are Adam Spargo and Zemin Ning from the Sanger Centre. Compaq Alpha, Linux-64, Linux-32, Solaris and Mac
* PolyBayesShort - A re-incarnation of the PolyBayes SNP discovery tool developed by Gabor Marth at Washington University. This version is specifically optimized for the analysis of large numbers (millions) of high-throughput next-generation sequencer reads, aligned to whole chromosomes of model organism or mammalian genomes. Developers at Boston College. Linux-64 and Linux-32.
* PyroBayes - PyroBayes is a novel base caller for pyrosequences from the 454 Life Sciences sequencing machines. It was designed to assign more accurate base quality estimates to the 454 pyrosequences. Developers at Boston College.

Genome Annotation/Genome Browser/Alignment Viewer/Assembly Database
* EagleView - An information-rich genome assembler viewer. EagleView can display a dozen different types of information including base quality and flowgram signal. Developers at Boston College.
* LookSeq - LookSeq is a web-based application for alignment visualization, browsing and analysis of genome sequence data. LookSeq supports multiple sequencing technologies, alignment sources, and viewing modes; low or high-depth read pileups; and easy visualization of putative single nucleotide and structural variation. From the Sanger Centre.
* MapView - MapView: visualization of short reads alignment on desktop computer. From the Evolutionary Genomics Lab at Sun-Yat Sen University, China. Linux.
* SAM - Sequence Assembly Manager. Whole Genome Assembly (WGA) Management and Visualization Tool. It provides a generic platform for manipulating, analyzing and viewing WGA data, regardless of input type. Developers are Rene Warren, Yaron Butterfield, Asim Siddiqui and Steven Jones at Canada's Michael Smith Genome Sciences Centre. MySQL backend and Perl-CGI web-based frontend/Linux.
* STADEN - Includes GAP4. GAP5 once completed will handle next-gen sequencing data. A partially implemented test version is available here
* XMatchView - A visual tool for analyzing cross_match alignments. Developed by Rene Warren and Steven Jones at Canada's Michael Smith Genome Sciences Centre. Python/Win or Linux.

Counting e.g. CHiP-Seq, Bis-Seq, CNV-Seq
* BS-Seq - The source code and data for the "Shotgun Bisulphite Sequencing of the Arabidopsis Genome Reveals DNA Methylation Patterning" Nature paper by Cokus et al. (Steve Jacobsen's lab at UCLA). POSIX.
* CHiPSeq - Program used by Johnson et al. (2007) in their Science publication
* CNV-Seq - CNV-seq, a new method to detect copy number variation using high-throughput sequencing. Chao Xie and Martti T Tammi at the National University of Singapore. Perl/R.
* FindPeaks - perform analysis of ChIP-Seq experiments. It uses a naive algorithm for identifying regions of high coverage, which represent Chromatin Immunoprecipitation enrichment of sequence fragments, indicating the location of a bound protein of interest. Original algorithm by Matthew Bainbridge, in collaboration with Gordon Robertson. Current code and implementation by Anthony Fejes. Authors are from the Canada's Michael Smith Genome Sciences Centre. JAVA/OS independent. Latest versions available as part of the Vancouver Short Read Analysis Package
* MACS - Model-based Analysis for ChIP-Seq. MACS empirically models the length of the sequenced ChIP fragments, which tends to be shorter than sonication or library construction size estimates, and uses it to improve the spatial resolution of predicted binding sites. MACS also uses a dynamic Poisson distribution to effectively capture local biases in the genome sequence, allowing for more sensitive and robust prediction. Written by Yong Zhang and Tao Liu from Xiaole Shirley Liu's Lab.
* PeakSeq - PeakSeq: Systematic Scoring of ChIP-Seq Experiments Relative to Controls. a two-pass approach for scoring ChIP-Seq data relative to controls. The first pass identifies putative binding sites and compensates for variation in the mappability of sequences across the genome. The second pass filters out sites that are not significantly enriched compared to the normalized input DNA and computes a precise enrichment and significance. By Rozowsky J et al. C/Perl.
* QuEST - Quantitative Enrichment of Sequence Tags. Sidow and Myers Labs at Stanford. From the 2008 publication Genome-wide analysis of transcription factor binding sites based on ChIP-Seq data. (C++)
* SISSRs - Site Identification from Short Sequence Reads. BED file input. Raja Jothi @ NIH. Perl.
**See also this thread for ChIP-Seq, until I get time to update this list.

Alternate Base Calling
* Rolexa - R-based framework for base calling of Solexa data. Project publication
* Alta-cyclic - "a novel Illumina Genome-Analyzer (Solexa) base caller"

Transcriptomics
* ERANGE - Mapping and Quantifying Mammalian Transcriptomes by RNA-Seq. Supports Bowtie, BLAT and ELAND. From the Wold lab.
* G-Mo.R-Se - G-Mo.R-Se is a method aimed at using RNA-Seq short reads to build de novo gene models. First, candidate exons are built directly from the positions of the reads mapped on the genome (without any ab initio assembly of the reads), and all the possible splice junctions between those exons are tested against unmapped reads. From CNS in France.
* MapNext - MapNext: A software tool for spliced and unspliced alignments and SNP detection of short sequence reads. From the Evolutionary Genomics Lab at Sun-Yat Sen University, China.
* QPalma - Optimal Spliced Alignments of Short Sequence Reads. Authors are Fabio De Bona, Stephan Ossowski, Korbinian Schneeberger, and Gunnar Rätsch. A paper is available.
* RSAT - RSAT: RNA-Seq Analysis Tools. RNASAT is developed and maintained by Hui Jiang at Stanford University.
* TopHat - TopHat is a fast splice junction mapper for RNA-Seq reads. It aligns RNA-Seq reads to mammalian-sized genomes using the ultra high-throughput short read aligner Bowtie, and then analyzes the mapping results to identify splice junctions between exons. TopHat is a collaborative effort between the University of Maryland and the University of California, Berkeley
last edited 16 Jun 09, Sci_guy

evolution & bioinformatics

Thursday, June 20, 2013

SAMTools

Wednesday, June 5, 2013

Software packages for next gen sequence analysis

About Me