Chunk #88 — Materials and methods — Filtering NA12878 data for the discovery of uncharacterized bias

Source: Characterizing and measuring bias in sequence data.
Embedded: yes

Text

We took the previously published ALLPATHS-LG assembly of NA12878, produced from a different set of Illumina data [52], and aligned its contigs to the human reference hg19 with BWA-SW version 0.5.9 [38] using default arguments. Wherever a contig in the assembly contained a gap relative to the reference, we excluded the gap sequence from the undercovered set. Contigs longer than 100 kb were split before alignment to stay within the aligner's maximum read length; the splitting algorithm ensured that no resulting subsequence was shorter than 50 kb. When BWA-SW detects a large deletion in a contig relative to the reference, it splits the alignment, treating the contig as a chimeric read. Additionally, we scanned all aligned contigs and marked every sliding 100-base window containing more than five alignment errors (mismatches, deleted bases, or inserted bases) as a region that may have a high local rate of polymorphism. We excluded these regions from consideration because reads covering them may fail to align to the reference, which would reduce apparent coverage even in the absence of sequencing bias.
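The sliding-window scan described above can be illustrated with a minimal Python sketch. It is not the code used in the study. The function name `flag_high_error_windows`, the input representation (a per-base count of alignment errors along one aligned contig, with inserted bases assumed to be attributed to an adjacent position), and the merging of overlapping flagged windows into intervals are all assumptions made for illustration. Only the window size of 100 bases and the threshold of more than five errors come from the text.

```python
from typing import List, Tuple


def flag_high_error_windows(
    errors_per_base: List[int],
    window: int = 100,    # window size stated in the text
    max_errors: int = 5,  # windows with MORE than this many errors are flagged
) -> List[Tuple[int, int]]:
    """Return merged half-open [start, end) intervals covered by at least one
    window whose total error count exceeds `max_errors`.

    `errors_per_base` is a hypothetical input: element i is the number of
    mismatches, deleted bases, or inserted bases attributed to position i.
    """
    n = len(errors_per_base)
    if n < window:
        # Assumption: sequences shorter than one full window are not scanned.
        return []

    flagged: List[Tuple[int, int]] = []
    count = sum(errors_per_base[:window])  # errors in the first window
    start = 0
    while True:
        if count > max_errors:
            end = start + window
            if flagged and start <= flagged[-1][1]:
                # Overlaps or touches the previous flagged interval: extend it.
                flagged[-1] = (flagged[-1][0], end)
            else:
                flagged.append((start, end))
        if start + window >= n:
            break
        # Slide by one base: add the entering base, drop the leaving base.
        count += errors_per_base[start + window] - errors_per_base[start]
        start += 1
    return flagged


if __name__ == "__main__":
    # Six errors clustered at positions 100-105 of a 300-base alignment.
    profile = [0] * 300
    for pos in range(100, 106):
        profile[pos] = 1
    print(flag_high_error_windows(profile))
```

Because every window that contains the error cluster is flagged in full, the excluded interval extends up to one window length on either side of the cluster. This is a conservative reading of "marked any sliding 100-base windows"; the original analysis may have handled window boundaries differently.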