Chunk #10 — Method — Feature selection and classification model estimation

Source: Predicting risk for Alcohol Use Disorder using longitudinal data with multimodal biomarkers and family history: a machine learning study.
Embedded: yes

Text

Feature selection, model estimation, and validation were performed separately for each group (i.e., EEG only, SNPs only, combined EEG+SNP, male, female, AA, EA, and the different age groups). To guard against overfitting, we used a regularization method [4, 34], which improves the prediction accuracy and interpretability of the statistical model. Specifically, for feature selection we used the least absolute shrinkage and selection operator (LASSO) penalty described by Tibshirani (1996) [35]. The sparsity property of LASSO (i.e., it shrinks some coefficient estimates to exactly zero) makes it attractive for feature selection, as it reduces estimation variance while yielding a more interpretable final model [36]. Its application to genomic data [37, 38] has shown that selecting a small number of representative features can achieve satisfactory classification. We first determined the regularization parameter using a 10-fold cross-validation (CV) procedure, with the class label (control vs. AUD) as the response variable. All features with a non-zero coefficient were retained for subsequent analyses. This reduced set of the most discriminant features was then fed into the classifier to assign study participants to their respective groups, i.e., either AUD or control.
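The selection step described above can be illustrated with a minimal sketch. It is not the study's implementation: the text does not state the software used or whether the LASSO loss was squared-error or logistic, so the sketch assumes a squared-error LASSO fitted by coordinate descent on a 0/1 label (control = 0, AUD = 1). The function names (`lasso_fit`, `choose_lambda_by_cv`), the candidate grid of regularization values, and the synthetic data are all hypothetical. Features are assumed to be on a comparable scale; real EEG and SNP features would normally be standardized first.

```python
import random

def soft_threshold(z, lam):
    """Soft-thresholding operator; returns exactly 0.0 when |z| <= lam."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

def lasso_fit(X, y, lam, max_iter=200, tol=1e-6):
    """Coordinate descent for (1/(2n)) * ||y - b0 - X b||^2 + lam * ||b||_1."""
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    # Centering X and y lets the unpenalized intercept be recovered afterwards.
    cols = [[row[j] - x_mean[j] for row in X] for j in range(p)]
    col_sq = [sum(v * v for v in col) / n for col in cols]
    beta = [0.0] * p
    resid = [yi - y_mean for yi in y]
    for _ in range(max_iter):
        max_change = 0.0
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # a constant column carries no information
            rho = sum(c * r for c, r in zip(cols[j], resid)) / n + col_sq[j] * beta[j]
            new = soft_threshold(rho, lam) / col_sq[j]
            delta = new - beta[j]
            if delta != 0.0:
                resid = [r - delta * c for r, c in zip(resid, cols[j])]
                beta[j] = new
                max_change = max(max_change, abs(delta))
        if max_change < tol:
            break
    intercept = y_mean - sum(b * m for b, m in zip(beta, x_mean))
    return intercept, beta

def choose_lambda_by_cv(X, y, lambdas, k=10, seed=0):
    """Return the candidate lambda with the lowest k-fold CV squared error."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    best_lam, best_err = None, float("inf")
    for lam in lambdas:
        err = 0.0
        for fold in folds:
            held_out = set(fold)
            train = [i for i in idx if i not in held_out]
            b0, beta = lasso_fit([X[i] for i in train], [y[i] for i in train], lam)
            for i in fold:
                pred = b0 + sum(b * x for b, x in zip(beta, X[i]))
                err += (y[i] - pred) ** 2
        if err < best_err:
            best_lam, best_err = lam, err
    return best_lam

# Synthetic stand-in data (assumption): 150 participants, 10 features,
# of which only features 0 and 1 carry signal about the label.
rng = random.Random(42)
n, p = 150, 10
X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y = [1.0 if row[0] + row[1] + 0.1 * rng.gauss(0, 1) > 0 else 0.0 for row in X]

lambdas = [0.1, 0.05, 0.02, 0.01]               # hypothetical candidate grid
best_lam = choose_lambda_by_cv(X, y, lambdas, k=10)
_, beta = lasso_fit(X, y, best_lam)             # refit on all data
selected = [j for j, b in enumerate(beta) if b != 0.0]
X_reduced = [[row[j] for j in selected] for row in X]  # input for the classifier
print("chosen lambda:", best_lam, "selected features:", selected)
```

The sketch stops at `X_reduced`, since the classifier itself is not specified in this passage. In the procedure described above, this selection would be repeated independently for each group (EEG only, SNPs only, combined, and so on).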