Chunk #1 — ENCODE data production and initial analyses — Integration methodology

Source: An integrated encyclopedia of DNA elements in the human genome.
Embedded: yes

Text

For consistency, data were generated and processed using standardized guidelines, and for some assays new quality-control measures were designed (see refs 3, 12, http://encodeproject.org/ENCODE/dataStandards.html and Kundaje, A., personal communication). Uniform data-processing methods were developed for each assay (see Supplementary Information and Kundaje, A., personal communication), and most assay results can be represented both as signal information (a per-base estimate across the genome) and as discrete elements (regions computationally identified as enriched for signal). Extensive processing pipelines were developed to generate each representation (M.M. Hoffman et al., manuscript in preparation; Kundaje, A., personal communication). In addition, we developed the irreproducible discovery rate (IDR) measure (ref. 13) to provide a robust and conservative estimate of the threshold at which two ranked lists of results from biological replicates no longer agree (that is, are irreproducible), and we applied this measure to define sets of discrete elements. We identified, and excluded from most analyses, regions yielding untrustworthy signals likely to be artifactual (for example, multi-copy regions). Together, these excluded regions comprise 0.39% of the genome (see Supplementary Information). The accompanying poster represents different ENCODE-identified elements and their genome coverage.
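
The relationship between the two representations, and the exclusion of untrustworthy regions, can be illustrated with a minimal sketch. It is not the ENCODE pipeline: the fixed signal threshold stands in for the assay-specific element calling and IDR thresholding described above, and the function names (call_elements, exclude_flagged), the toy signal values and the flagged interval are all hypothetical, chosen only for illustration.

```python
from typing import List, Tuple

# Half-open genomic interval [start, end) on a single toy chromosome.
Interval = Tuple[int, int]


def call_elements(signal: List[float], threshold: float) -> List[Interval]:
    """Turn a per-base signal into discrete elements.

    Assumption: an element is a maximal run of positions whose signal is
    at or above a fixed threshold. Real pipelines use assay-specific callers.
    """
    elements: List[Interval] = []
    start = None
    for pos, value in enumerate(signal):
        if value >= threshold:
            if start is None:
                start = pos  # a new enriched run begins here
        elif start is not None:
            elements.append((start, pos))  # the run ended at the previous base
            start = None
    if start is not None:
        elements.append((start, len(signal)))  # run extends to the end
    return elements


def overlaps(a: Interval, b: Interval) -> bool:
    """True if two half-open intervals share at least one base."""
    return a[0] < b[1] and b[0] < a[1]


def exclude_flagged(elements: List[Interval], flagged: List[Interval]) -> List[Interval]:
    """Drop every element that overlaps a flagged (untrustworthy) region."""
    return [e for e in elements if not any(overlaps(e, f) for f in flagged)]


# Toy per-base signal over ten positions (hypothetical values).
signal = [0.1, 0.2, 3.0, 4.5, 3.2, 0.3, 0.1, 2.8, 2.9, 0.2]
elements = call_elements(signal, threshold=2.5)

# Hypothetical flagged region, e.g. a multi-copy region.
flagged = [(7, 9)]
kept = exclude_flagged(elements, flagged)

print("called elements:", elements)
print("after exclusion:", kept)
```

In this sketch an element is dropped whenever it touches a flagged region at all; trimming the overlapping bases instead would be an equally plausible design, and the source does not state which approach was used.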