Chunk #56 — Methods — Genome-wide distribution of genetic variation — Concatenated segment analysis

Source: Sequencing of 53,831 diverse genomes from the NHLBI TOPMed Program.
Embedded: yes

Text

Distinct base classifications were defined by both coding and noncoding annotations (based on Ensembl 94 [92]) and CADD in silico prediction scores [21] (downloaded from the CADD server for all possible SNVs). For each base, we used the maximum possible CADD score (when using the minimum CADD score, results were qualitatively the same). Bases beyond the final base with a CADD score per chromosome were excluded. This resulted in six distinct types of concatenated segments: high (CADD ≥ 20), medium (10 ≤ CADD < 20) and low (CADD < 10) CADD scores for coding and similarly for noncoding variants. Common (allele frequency ≥ 0.005) and rare (allele frequency < 0.005) variant counts were then calculated across these concatenated segments. Multi-allelic variants and those in regions masked due to accessibility were excluded. Counts in segments at the end of chromosomes were scaled up as in the contiguous analysis.
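
The classification and counting steps described above can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names, the dictionary-based input layout and the variant fields (`pos`, `af`, `multiallelic`, `masked`) are all hypothetical, chosen only to make the thresholds concrete. Concatenation into fixed-length segments and the end-of-chromosome scaling are not shown.

```python
from collections import Counter

AF_THRESHOLD = 0.005  # common if allele frequency >= 0.005, rare otherwise


def cadd_bin(score):
    """Map a CADD score to the high / medium / low bin from the text."""
    if score >= 20:
        return "high"
    if score >= 10:
        return "medium"
    return "low"


def classify_base(is_coding, snv_scores):
    """Classify one base into one of the six types.

    snv_scores holds the CADD scores of the possible SNVs at this base;
    the maximum is used, as in the text.
    """
    region = "coding" if is_coding else "noncoding"
    return region, cadd_bin(max(snv_scores))


def count_variants(variants, base_annotations):
    """Count common and rare variants per base classification.

    variants: iterable of dicts with hypothetical keys
        'pos', 'af', 'multiallelic', 'masked'.
    base_annotations: dict mapping position -> (is_coding, snv_scores).
        Positions absent from this dict are treated as having no CADD
        score and are skipped (an assumption of this sketch).
    """
    counts = Counter()
    for v in variants:
        # Multi-allelic variants and inaccessible regions are excluded.
        if v["multiallelic"] or v["masked"]:
            continue
        annotation = base_annotations.get(v["pos"])
        if annotation is None:
            continue
        region, cadd_class = classify_base(*annotation)
        freq_class = "common" if v["af"] >= AF_THRESHOLD else "rare"
        counts[(region, cadd_class, freq_class)] += 1
    return counts
```

Each key of the returned counter is a `(region, cadd_class, freq_class)` tuple, so the six base classifications split by frequency class give at most twelve counts.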