Chunk #14 — Method — Genome-wide Scoring Procedure

Source: Three mutually informative ways to understand the genetic relationships among behavioral disinhibition, alcohol use, drug use, nicotine use/dependence, and their co-occurrence: twin biometry, GCTA, and genome-wide scoring.
Embedded: yes

Text

Gross overfitting is expected when the same sample is used both to generate and to validate the SNP score, especially when the number of predictors is much greater than the number of subjects. To control for overfitting we employed k-fold cross-validation (Breiman & Spector, 1992; Hastie, Tibshirani, & Friedman, 2009) with k = 10. Subjects were split into 10 roughly equal subsamples (707, 734, 719, 718, 724, 690, 737, 734, 725, and 700 subjects). The scoring algorithm described above was run on nine of the subsamples combined, yielding a set of SNP weights based on those nine subsamples. These weights were then applied to the minor allele counts in the tenth subsample, and the resulting score was correlated with the phenotype in that subsample, producing an unbiased estimate of the cross-validated validity of the SNP score. This procedure was repeated with each of the 10 subsamples held out in turn, so that every subject was in a development sample nine times and in the test sample once.
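
The cross-validation loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation. All function names are hypothetical, subjects are assigned to folds by a seeded random shuffle (the text does not say how the split was made), and `fit_weights` is only a stand-in for the scoring algorithm, using each SNP's marginal correlation with the phenotype in the development sample as its weight.

```python
import random


def pearson(x, y):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


def make_folds(n_subjects, k=10, seed=0):
    """Assumption: shuffle subject indices, then deal them into k roughly equal folds."""
    idx = list(range(n_subjects))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def fit_weights(genotypes, phenotype, rows):
    """Hypothetical placeholder for the scoring algorithm: one weight per SNP,
    estimated from the development rows only."""
    y = [phenotype[r] for r in rows]
    n_snps = len(genotypes[0])
    return [pearson([genotypes[r][j] for r in rows], y) for j in range(n_snps)]


def cross_validated_validity(genotypes, phenotype, k=10, seed=0):
    """Return the folds and, for each fold, the correlation between the SNP score
    and the phenotype in the held-out subjects."""
    folds = make_folds(len(phenotype), k, seed)
    validities = []
    for test_rows in folds:
        held_out = set(test_rows)
        # Development sample: the other k - 1 folds combined.
        dev_rows = [r for r in range(len(phenotype)) if r not in held_out]
        weights = fit_weights(genotypes, phenotype, dev_rows)
        # Apply the weights to minor allele counts in the held-out fold.
        scores = [sum(w * g for w, g in zip(weights, genotypes[r]))
                  for r in test_rows]
        validities.append(pearson(scores, [phenotype[r] for r in test_rows]))
    return folds, validities
```

Here `genotypes` is assumed to be a list of per-subject lists of minor allele counts (0, 1, or 2) and `phenotype` a list of per-subject values. The key property is that the weights used to score a held-out subject are never estimated from that subject's own data.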