Chunk #55 — Methods — Genome-wide distribution of genetic variation — Contiguous segment analysis

Source: Sequencing of 53,831 diverse genomes from the NHLBI TOPMed Program.
Embedded: yes

Text

We excluded indels and multi-allelic variants, and categorized the remaining variants as common (allele frequency ≥ 0.005) or rare (allele frequency < 0.005), and as coding or noncoding based on protein-coding exons from Ensembl 94 (ref. 92). Variant counts were analysed across 2,739 non-empty (that is, with at least one variant) contiguous 1-Mb chromosomal segments, and counts in segments at the end of chromosomes with length L < 10⁶ bp were scaled up proportionally by the factor 10⁶ × L⁻¹. For each segment, the coding proportion, C, was calculated as the proportion of bases overlapping protein-coding exons. The distribution of C is fairly narrow, with 80% of segments having C ≤ 0.0195, 99% of segments having C ≤ 0.067 and only 3 segments having C ≥ 0.10. Owing to the significant negative correlation between C and the number of variants in a segment, and potential mapping effects, we used linear regression to adjust the variant counts per segment according to the model count = β × C + A + count_adj, where A is the proportion of segment bases overlapping the accessibility mask (Supplementary Information 1.5) and count_adj is the adjusted count. Unless otherwise noted, we present analyses and results that use these adjusted count values.
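
The scaling of terminal segments and the regression adjustment described above could be sketched as follows. This is a minimal illustration in plain Python, not the authors' pipeline. The function names (scale_count, solve_linear, adjust_counts) are hypothetical, and the fit assumes an intercept and a separate coefficient for A, neither of which the model as written states explicitly. The adjusted count is taken here to be the ordinary least-squares residual.

```python
SEGMENT_BP = 1_000_000  # nominal segment length: 1 Mb


def scale_count(count, length_bp):
    """Scale a count from a terminal segment shorter than 1 Mb by 10^6 / L."""
    if length_bp <= 0:
        raise ValueError("segment length must be positive")
    if length_bp >= SEGMENT_BP:
        return float(count)  # full-length segments are left unchanged
    return count * SEGMENT_BP / length_bp


def solve_linear(a, b):
    """Solve a small dense system a x = b (Gaussian elimination, partial pivoting)."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("design matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def adjust_counts(counts, coding, accessible):
    """Regress counts on [1, C, A] by OLS; return (coefficients, residuals).

    Assumption: intercept plus one coefficient each for C and A.
    The residuals play the role of count_adj.
    """
    rows = [[1.0, c, a] for c, a in zip(coding, accessible)]
    k = 3
    # Normal equations: (X^T X) beta = X^T y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, counts)) for i in range(k)]
    coef = solve_linear(xtx, xty)
    resid = [y - sum(b * v for b, v in zip(coef, r)) for r, y in zip(rows, counts)]
    return coef, resid
```

Because the sketch includes an intercept, the residuals are centred on zero. Whether the published adjusted counts were re-centred is not stated in this section.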