Chunk #85 — Materials and methods — Computing coverage and counting errors from alignments

Source: Characterizing and measuring bias in sequence data.
Embedded: yes

Text

To track error biases, we counted the read bases aligned by the CIGAR operators M, =, or X at which the read nucleotide differed from the reference nucleotide, and recorded each as a mismatch at the corresponding reference position. Similarly, deletions at a reference base were counted by incrementing a counter each time a CIGAR D operator skipped that base. Insertion errors are more problematic because the inserted bases exist in the read but have no reference position. Some convention is necessary, so if an alignment contained an insertion of length L, denoted by 'LI' in the CIGAR string, we charged L insertions to the reference base immediately after the inserted sequence. For consistency, all error rates reported in the paper are computed relative to coverage: each rate is a fraction whose numerator is the error count in a region or motif and whose denominator is the number of mapped bases.
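
The counting conventions described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names count_errors and error_rate, the use of 0-based reference positions, and the choice to count only M, =, and X bases as "mapped bases" for coverage are all assumptions made for this illustration. The handling of the N, S, H, and P operators follows the SAM specification; the text does not say how the paper treats them.

```python
import re
from collections import Counter

# One CIGAR element: a length followed by an operator from the SAM specification.
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def count_errors(ref, read, pos, cigar):
    """Tally coverage and errors for one alignment (hypothetical helper).

    ref   -- reference sequence as a string
    read  -- read sequence as stored in the alignment record
    pos   -- 0-based leftmost reference position of the alignment
    cigar -- CIGAR string, e.g. "4M2I2M1D3M"
    Returns four Counters keyed by 0-based reference position.
    """
    coverage, mismatches = Counter(), Counter()
    deletions, insertions = Counter(), Counter()
    r, q = pos, 0  # cursors into the reference and the read
    for length, op in CIGAR_RE.findall(cigar):
        n = int(length)
        if op in "M=X":
            # Aligned bases: count coverage, and a mismatch where bases differ.
            for k in range(n):
                coverage[r + k] += 1
                if read[q + k] != ref[r + k]:
                    mismatches[r + k] += 1
            r += n
            q += n
        elif op == "D":
            # Each reference base skipped by D receives one deletion.
            for k in range(n):
                deletions[r + k] += 1
            r += n
        elif op == "I":
            # Charge all n inserted bases to the reference base that
            # immediately follows the insertion.
            insertions[r] += n
            q += n
        elif op == "N":
            r += n  # skipped reference region; not counted as deletion here
        elif op == "S":
            q += n  # soft-clipped read bases; not counted here
        # H and P consume neither the read nor the reference.
    return coverage, mismatches, deletions, insertions


def error_rate(errors, coverage, start, end):
    """Error count in [start, end) divided by mapped bases in [start, end)."""
    num = sum(errors[p] for p in range(start, end))
    den = sum(coverage[p] for p in range(start, end))
    return num / den if den else float("nan")


if __name__ == "__main__":
    ref = "ACGTACGTAC"
    read = "ACTTGGACTAC"  # one mismatch, a 2-base insertion, a 1-base deletion
    cov, mm, dels, ins = count_errors(ref, read, 0, "4M2I2M1D3M")
    print(dict(mm), dict(dels), dict(ins))
    print(error_rate(mm, cov, 0, len(ref)))
```

In this toy alignment the two inserted bases are both charged to reference position 4, the base that follows the insertion, and the deleted reference base at position 6 contributes no coverage under the assumption stated above.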