Chunk #7 — Background — GemErr

Source: GemSIM: general, error-model based simulator of next-generation sequencing data.
Embedded: yes

Text

Reads are parsed sequentially, and the total number of reads and the read length distribution are tracked. For paired-end reads, the insert size, whether each read is the first or second in the pair, and the proportion of properly aligned pairs are also recorded. For each base of each read, the following information is then stored: a) nucleotide type and base position in the read; b) mismatch or true base for the position; c) indels following the current position; d) the preceding three bases in the read; e) the following base in the read; and f) quality scores for true and mismatch bases and for inserted bases. Although it is mainly the sequence preceding the current position that is known to affect error rates [8,9], the following base in the read is tracked to allow accurate simulation of indels within homopolymers. Sequence aligners record these errors at either the start or the end of a homopolymer. By taking the following base into consideration, an indel is inserted only once within a long homopolymer, at its end, rather than potentially multiple times within it. Empirical distributions of the tracked information are stored in a file and used as error models for input into GemReads.
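
The per-base bookkeeping described above can be illustrated with a minimal Python sketch. It is not the GemErr implementation: the function names `build_error_model` and `mismatch_rate` are hypothetical, and the input is assumed to be gap-free (read, reference) string pairs of equal length, so only mismatches are tallied and indels, quality scores, and paired-end statistics are left out.

```python
from collections import Counter, defaultdict


def build_error_model(aligned_pairs):
    """Tally per-base context statistics from (read, reference) string pairs.

    Hypothetical simplification: every pair is gap-free and of equal length,
    so only mismatches (not indels) are counted.
    """
    read_lengths = Counter()
    # Key: (position, preceding three read bases, true base, following read base)
    # Value: Counter mapping each observed base to how often it was seen.
    model = defaultdict(Counter)
    for read, ref in aligned_pairs:
        if len(read) != len(ref):
            raise ValueError("sketch assumes gap-free alignments")
        read_lengths[len(read)] += 1
        for pos, (observed, true_base) in enumerate(zip(read, ref)):
            preceding = read[max(0, pos - 3):pos]  # shorter near the read start
            following = read[pos + 1] if pos + 1 < len(read) else ""
            model[(pos, preceding, true_base, following)][observed] += 1
    return read_lengths, model


def mismatch_rate(model, key):
    """Return the empirical mismatch rate for one context, or None if unseen."""
    counts = model.get(key)
    if not counts:
        return None
    total = sum(counts.values())
    return 1.0 - counts[key[2]] / total  # key[2] is the true base


if __name__ == "__main__":
    pairs = [("ACGTA", "ACGTA"), ("ACGAA", "ACGTA")]
    lengths, model = build_error_model(pairs)
    print(lengths)
    print(mismatch_rate(model, (3, "ACG", "T", "A")))
```

Keying the counts on position, preceding bases, true base, and following base mirrors items a), b), d), and e) of the list above. A fuller model would add indel and quality-score tallies to each context.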