Three different CV approaches were used to assess the influence of sample heterogeneity. Results for the various classification algorithms are summarized in Fig. 1. Classification performance (AUC) using site-stratified CV (with training on combined samples and equal fold sizes) ranged between 0.57 (95% confidence interval (CI) = 0.51–0.63; p_corrected = 0.19) and 0.62 (95% CI = 0.56–0.67; p_corrected < 0.001) across classifiers. All models had statistically significant performance after correction for multiple comparisons, except for the PCA + LR, PCA + SVM and NN classifiers. LOSO-CV led to lower classification performance, with AUC ranging from 0.51 (95% CI = 0.40–0.62; p_corrected = 1) to 0.54 (95% CI = 0.42–0.65; p_corrected = 1), relatively high variance across folds (SD = 0.07–0.11), and no classifier surviving correction for multiple comparisons. AUC values obtained through site-stratified CV with varying fold sizes were similar to those obtained with equal fold sizes, ranging between 0.56 (95% CI = 0.45–0.67; p_corrected > 0.99) and 0.62 (95% CI = 0.51–0.73; p_corrected = 0.55). However, variance across CV folds was higher and comparable to that from LOSO-CV (SD; site-stratified fixed: 0.02–0.04; site-stratified variable: