Chunk #89 — Materials and methods — Filtering NA12878 data for the discovery of uncharacterized bias

Source: Characterizing and measuring bias in sequence data.
Embedded: yes

Text

The assembly-based analysis can detect only variation that occurs within contigs. To test for biological variation that might lie in assembly gaps, we identified genome locations that were well covered in data sets mixing reads from diverse individuals but undercovered in multiple NA12878 data sets. First, we gathered two diverse sets of Illumina HiSeq sequencing data aligned to HG19, both from the 1000 Genomes Project [32]: one from 39 individuals sequenced with version 2 chemistry (198-fold total, data set A2) and one from 71 individuals sequenced with version 3 chemistry (253-fold total, data set A3); see Data for SRA accession numbers. Any reference base with relative coverage of at least 0.5 in either diverse data set was considered 'well covered'. Second, we gathered three NA12878 Illumina HiSeq data sets aligned to HG19: 152-fold from HiSeq v2 chemistry (the Phusion, Phusion + betaine, and AccuPrime data discussed previously, data sets 10 to 12), 110-fold from version 3 chemistry using low-input Fisher et al. library construction (data set 13 with four additional