Chunk #21 — Results and Discussion — 5-mer presence and frequency

Source: GemSIM: general, error-model based simulator of next-generation sequencing data.
Embedded: yes

Text

Our approach to error modelling depends on the choice of k-mer length: the k-mer needs to be long enough to capture sequence-context information, but short enough to be well represented in both the reference genome to be simulated and the control genomes used for error modelling. All possible 5-mers were represented more than four times in the B. aphidicola reference genome, while 83 (or 2%) of all possible 6-mers were found fewer than four times. Furthermore, more than 90% of all possible 5-mers were found four or more times in both the PhiX and the plasmid genomes, used for modelling Illumina and Roche/454 errors, respectively. Fewer than 30% of all possible 6-mers were present four or more times in these two genomes, while all possible 4-mers were found more than four times in the plasmid genome, and all but one in the PhiX genome (Table 2). This suggests that a k-mer length of 5 provides an appropriate balance between capturing relevant sequence-context information and the risk of overfitting the data (with the associated wasted run time and memory requirements).
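
The representation check described above can be illustrated with a minimal sketch. It is not GemSIM's own code: the function names `count_kmers` and `kmer_coverage`, the `min_count=4` default, and the choice to count overlapping k-mers on a single strand of a linear sequence are all assumptions made for illustration. The original analysis may have handled strands or circular genomes differently.

```python
from collections import Counter
from itertools import product


def count_kmers(seq, k):
    """Count overlapping k-mers on the given strand (hypothetical helper).

    Windows containing characters other than A, C, G, T are skipped.
    """
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            counts[kmer] += 1
    return counts


def kmer_coverage(seq, k, min_count=4):
    """Return (number of possible k-mers seen at least min_count times, 4**k).

    The threshold of four mirrors the "four or more times" criterion in the
    text; treating it as a parameter is an assumption of this sketch.
    """
    counts = count_kmers(seq, k)
    all_kmers = ("".join(p) for p in product("ACGT", repeat=k))
    well_represented = sum(1 for kmer in all_kmers if counts[kmer] >= min_count)
    return well_represented, 4 ** k
```

Calling `kmer_coverage(genome, 5)` and `kmer_coverage(genome, 6)` on a genome sequence held as a string gives the pair of counts needed to compare how well 5-mers and 6-mers are represented, in the spirit of Table 2.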