Chunk #3 — 1 INTRODUCTION — 1.2 NGS data preprocessing for SNV detection

Source: SNVMix: predicting single nucleotide variants from next-generation sequencing of tumors.
Embedded: yes

Text

The data produced by NGS consists of millions of short reads ranging in length from approximately 30–400 nt (although this is steadily increasing with ongoing technology development). Here, we focus explicitly on the problem of inferring SNVs once these reads have been aligned to the genome. Numerous methods have been developed for short read alignment including Maq (Li,H. et al., 2008), BowTie (Langmead et al., 2009), ELAND (Illumina), SHRiMP (Rumble et al., 2009), BWA (Li and Durbin, 2009), SOAP (Li,R. et al., 2008) and Mosaik (http://bioinformatics.bc.edu/marthlab/Mosaik). We begin the discussion by describing two ways of preprocessing aligned data for input to SNV detection algorithms. The first method is shown in Figure 1A, where we show an example of aligned data where two SNVs are identified. The reads are positioned according to their alignment in the genome and the reference genome sequence is shown in blue. The first step involves transforming the aligned reads into allelic counts. This method assumes that the reads are correctly aligned and the nucleotide base calls are correct. Nucleotides that match the reference are shown in