A comparison of BeadChip and WGS genotyping outputs using partial validation by Sanger sequencing.
- Authors
- Danilov, Kirill A; Nikogosov, Dimitri A; Musienko, Sergey V; Baranova, Ancha V
- Year
- 2020
- Journal
- BMC Genomics
- PMID
- 32912136
- DOI
- 10.1186/s12864-020-06919-x
- PMCID
- PMC7488117
BACKGROUND: A head-to-head comparison of the precision of BeadChip and WGS/WES genotyping techniques is far from straightforward. Validation tools for high-throughput genotyping calls, such as Sanger sequencing, are neither scalable nor practical for large-scale DNA processing. Here we report a cross-validation analysis of genotyping calls obtained with the Illumina GSA BeadChip and with WGS (Illumina HiSeq X Ten).
RESULTS: When the two techniques were compared to each other, the average precision and accuracy of BeadChip and WGS genotyping exceeded 0.991 and 0.997, respectively. The average fraction of discordant variants between the two platforms was 0.639%. A sliding window approach was used to find genomic regions no longer than 500 bp that contained the largest number of discordant variants, for further validation by Sanger sequencing. Notably, 12 of the 26 variants located within the eight identified regions were consistently discordant in related calls made by WGS and BeadChip. Sanger sequencing successfully resolved 16 of these genotypes, indicating that for this genotype subset the precision of WGS and BeadChip genotyping was 0.81 and 0.5, respectively, with accuracy values of 0.87 and 0.61.
CONCLUSIONS: We conclude that WGS genotype calling exhibits higher overall precision within the selected set of discordantly genotyped variants, though the number of validated variants remained insufficient.
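The sliding window step mentioned in the abstract could be sketched roughly as follows. The abstract states only the 500 bp limit, so the function name `densest_window`, the input (positions of discordant variants on a single chromosome), the definition of span as last position minus first position, and the tie-breaking rule (first window found wins) are all assumptions made for illustration, not the authors' implementation.

```python
def densest_window(positions, max_span=500):
    """Return (start, end, count) for the window spanning at most max_span bp
    that contains the most discordant variant positions.

    Hypothetical helper: positions are 1-based coordinates on one chromosome.
    """
    pos = sorted(positions)
    best = (None, None, 0)
    left = 0
    for right in range(len(pos)):
        # Advance the left edge until the window fits within max_span.
        while pos[right] - pos[left] > max_span:
            left += 1
        count = right - left + 1
        if count > best[2]:
            best = (pos[left], pos[right], count)
    return best


# Made-up positions, not data from the study.
example_positions = [100, 250, 480, 590, 2000, 2100]
window = densest_window(example_positions)
```

Because the positions are sorted once and each edge only moves forward, the scan is linear after sorting, which matters when the discordant set is genome-wide.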
Whole genome depth of coverage distributions. Metrics for sample_001 (a), sample_002 (b), sample_003 (c) and breadth of coverage for the specified depth thresholds (d) averaged for all three samples are shown with 95% confidence intervals, n = 3
LLM interpretation
This figure consists of three histograms (A, B, C) showing the distribution of whole genome depth of coverage for three separate samples, with observation frequency on the y-axis and depth of coverage on the x-axis. All three samples exhibit a similar peak distribution centered around 30x coverage. Panel D is a bar chart showing the percentage of bases exceeding specific depth thresholds (5x to 30x), demonstrating a decrease in the percentage of bases as the depth threshold increases, including 95% confidence intervals for n=3.
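As an illustration of the breadth-of-coverage metric in panel d, a minimal sketch is given below. It assumes per-position depths are already available as a list of integers; the function name, the intermediate thresholds between 5x and 30x, and the use of "greater than or equal to" for the threshold comparison are assumptions, since the caption does not specify them.

```python
def breadth_of_coverage(depths, thresholds=(5, 10, 15, 20, 25, 30)):
    """Percentage of positions whose depth reaches each threshold.

    Hypothetical helper: depths is a non-empty sequence of per-base depths.
    """
    total = len(depths)
    if total == 0:
        raise ValueError("depths must not be empty")
    return {
        t: 100.0 * sum(1 for d in depths if d >= t) / total
        for t in thresholds
    }


# Made-up depths for eight positions, not data from the study.
example_depths = [0, 4, 5, 12, 30, 31, 28, 33]
breadth = breadth_of_coverage(example_depths, thresholds=(5, 10, 30))
```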
Fractions of discordant results for three samples. Percentage of discordant results per each chromosome is shown where applicable
LLM interpretation
This grouped bar chart displays the percentage of discordant results per chromosome for three samples (sample_001, sample_002, and sample_003). The x-axis lists chromosomes 1–22, X, Y, and MT, while the y-axis measures "Discordance, %" from 0 to 25. Discordance remains low and relatively stable across chromosomes 1–22, with a marked increase in the mitochondrial (MT) genome, where sample_001 shows the highest discordance at approximately 25%.
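A per-chromosome discordance percentage of the kind plotted here could be computed along these lines. The sketch assumes each variant is represented as a tuple of chromosome, WGS genotype, and BeadChip genotype, with genotypes already normalized to the same string form; the function name and the input layout are hypothetical.

```python
from collections import defaultdict


def discordance_by_chromosome(calls):
    """Percentage of variants per chromosome where the two call sets differ.

    calls: iterable of (chrom, wgs_genotype, beadchip_genotype) tuples.
    """
    totals = defaultdict(int)
    discordant = defaultdict(int)
    for chrom, wgs_gt, chip_gt in calls:
        totals[chrom] += 1
        if wgs_gt != chip_gt:
            discordant[chrom] += 1
    return {c: 100.0 * discordant[c] / totals[c] for c in totals}


# Made-up calls, not data from the study.
example_calls = [
    ("1", "A/A", "A/A"),
    ("1", "A/B", "A/A"),
    ("1", "B/B", "B/B"),
    ("1", "A/B", "A/B"),
    ("MT", "A/A", "B/B"),
    ("MT", "B/B", "B/B"),
]
per_chrom = discordance_by_chromosome(example_calls)
```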
Distance maps for the analyzed samples. a — sample_001, b — sample_002, c — sample_003, concordant and discordant variants are marked in green and orange, respectively
LLM interpretation
This figure consists of three scatter plots with overlaid density contours (A, B, and C) representing distance maps for three different samples. The x-axis shows "Distance before, $\log_{10}(\text{bp})$" and the y-axis shows "Distance after, $\log_{10}(\text{bp})$." In each plot, concordant variants are clustered in green/cyan at lower distance values, while discordant variants are clustered in orange at higher distance values.
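One plausible reading of the two axes is that each variant is placed by its distance to the preceding and to the following variant on the same chromosome, on a log10 scale. The sketch below follows that reading; it is an assumption about how such a map could be built, and the function name and the handling of the first and last variant (reported as `None`) are illustrative choices.

```python
import math


def neighbor_distances(positions):
    """For each variant position on one chromosome, return a tuple of
    (log10 distance to previous variant, log10 distance to next variant).

    Duplicated positions are merged; missing neighbours are None.
    """
    pos = sorted(set(positions))
    result = []
    for i, p in enumerate(pos):
        before = math.log10(p - pos[i - 1]) if i > 0 else None
        after = math.log10(pos[i + 1] - p) if i < len(pos) - 1 else None
        result.append((before, after))
    return result


# Made-up positions: gaps of 10 bp and 1000 bp.
distances = neighbor_distances([1000, 1010, 2010])
```

Under this reading, variants sitting in dense clusters land near the lower-left corner of the map, and isolated variants land toward the upper right.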
Confusion matrices calculated for the call sets obtained by WGS and BeadChip. WGS was defined as “true” call set, BeadChip — “test” call set, data is shown for sample_001 (male, chromosomes MT, X, Y were excluded from analysis), sample_002 (female, chromosome MT was excluded from analysis) and sample_003 (male, chromosomes MT, X, Y were excluded from analysis)
LLM interpretation
This figure consists of three confusion matrices comparing genotype call sets from BeadChip (test) against WGS (true) for three samples (sample_001, sample_002, and sample_003). The x-axis represents the WGS call set and the y-axis represents the BeadChip call set, with both axes labeled with genotypes (A/A, A/B, B/B, A/C, B/C, C/C). The heatmaps show a strong diagonal trend, indicating high agreement between the two methods, with the highest counts concentrated in the matching genotype cells.
BeadChip genotyping quality metrics with highlighted Sanger-validated variants. Theta, R, GC Score values for sample_002 are shown; histograms show the corresponding distributions of plotted metrics in a 1-dimensional space; concordant and discordant variants are marked in blue and orange, respectively; genotypes which are not consistent with Sanger sequencing in both WGS and BeadChip results are marked with a star, matches between Sanger and BeadChip are marked with triangles, matches between Sanger and WGS are marked with circles, variants which were not successfully genotyped by Sanger are marked with crosses
LLM interpretation
This figure consists of two scatter plots with marginal histograms showing BeadChip genotyping quality metrics (GC score vs. Theta value and R value vs. Theta value) for sample_002. The plots display a distribution of concordant (blue) and discordant (orange) variants, with specific variants highlighted by symbols (stars, triangles, circles, and crosses) to indicate validation status against Sanger sequencing and WGS. The marginal histograms illustrate the 1-dimensional distribution of the Theta, GC score, and R value metrics.
Example calculation of confusion matrices. The shown dimensionality reduction is used for accuracy and other metrics calculation; a, b, c, d, …, ai, aj — sample counts of each class; A/A, A/B, B/B, A/C, B/C, C/C — diploid genotypes observed in data; A — reference allele, B and C — alternative alleles. TRUTH — a call set produced by an orthogonal method (comparator), TEST — a call set produced by a test method
LLM interpretation
This figure consists of two diagrams illustrating the calculation of confusion matrices for genotype calls. Each panel shows a large matrix comparing a "TEST" method against a "TRUTH" method across six diploid genotypes (A/A, A/B, B/B, A/C, B/C, C/C), with individual cells representing sample counts. Arrows indicate how specific cells from the large matrix are aggregated into smaller 2x2 binary confusion matrices to calculate accuracy and other metrics for a single genotype.
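The reduction from a multi-genotype matrix to a 2x2 matrix for a single genotype can be sketched as a one-vs-rest collapse. In the code below, the orientation (rows are TEST calls, columns are TRUTH calls), the function name, and the three-genotype example are assumptions for illustration; the figure itself uses six genotype classes.

```python
def reduce_confusion(matrix, labels, genotype):
    """Collapse a square confusion matrix into one-vs-rest counts.

    matrix[i][j] is the number of variants called labels[i] by TEST
    and labels[j] by TRUTH (assumed orientation).
    """
    k = labels.index(genotype)
    total = sum(sum(row) for row in matrix)
    tp = matrix[k][k]
    fp = sum(matrix[k]) - tp                 # TEST says genotype, TRUTH differs
    fn = sum(row[k] for row in matrix) - tp  # TRUTH says genotype, TEST differs
    tn = total - tp - fp - fn                # neither call set says genotype
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn}


# Made-up counts for three genotype classes, not data from the study.
labels = ["A/A", "A/B", "B/B"]
matrix = [
    [50, 2, 0],
    [3, 30, 1],
    [0, 4, 10],
]
reduced = reduce_confusion(matrix, labels, "A/B")
```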
Quality metrics calculation for the initial and the “reduced confusion” matrices. Each metric is calculated as a ratio of blue elements to orange-outlined elements; A/A, A/B, B/B, A/C, B/C, C/C — diploid genotypes observed in data; A — reference allele, B and C — alternative alleles; N/N — any diploid genotype category (A/A, A/B, B/B, A/C, B/C, C/C). TRUTH — a call set produced by an orthogonal method (comparator), TEST — a call set produced by a test method
LLM interpretation
This figure consists of several diagrams illustrating the calculation of quality metrics using confusion matrices. The top row shows three $6 \times 6$ matrices comparing "TRUTH" and "TEST" diploid genotypes (A/A through C/C) to define genotype concordance and non-reference genotype sensitivity/concordance. The bottom row displays four $2 \times 2$ matrices that simplify genotypes into "N/N" and "other" categories to calculate sensitivity, specificity, precision, and accuracy. Blue shading indicates the elements used in the numerator for each specific metric calculation.
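The four metrics named for the 2x2 matrices follow the standard definitions, which can be written out as a short sketch. The function names and the example counts are hypothetical, and treating genotype concordance as the diagonal sum divided by the total is an assumption about the figure rather than a definition taken from the paper.

```python
def binary_metrics(tp, fp, fn, tn):
    """Standard metrics from one-vs-rest counts for a single genotype."""
    def ratio(num, den):
        # Undefined ratios are returned as NaN instead of raising.
        return num / den if den else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "precision": ratio(tp, tp + fp),
        "accuracy": ratio(tp + tn, tp + fp + fn + tn),
    }


def genotype_concordance(matrix):
    """Fraction of variants with identical TEST and TRUTH genotypes
    (assumed definition: diagonal sum over total)."""
    total = sum(sum(row) for row in matrix)
    diagonal = sum(matrix[i][i] for i in range(len(matrix)))
    return diagonal / total if total else float("nan")


# Made-up counts, not data from the study.
metrics = binary_metrics(tp=30, fp=4, fn=6, tn=60)
concordance = genotype_concordance([[50, 2, 0], [3, 30, 1], [0, 4, 10]])
```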