The remedy for this pitfall is external validation. When independent data sets are not available, internal cross-validation is the only option. In cross-validation, it is important not to update the predictor on the basis of results derived from the validation sample, because doing so destroys the independence of discovery and validation samples that the strategy set out to achieve57. Sample overlap can be checked as part of quality control (QC) of the prediction pipeline by estimating pairwise relatedness from SNP data, but this requires access to full genotype data for both the discovery and validation samples. Many software tools can do this, including PLINK58 and GCTA59.
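To make the overlap check concrete, the following is a minimal sketch in plain Python of estimating pairwise relatedness between discovery and validation individuals from SNP allele counts. It uses a standardized-genotype estimator of the kind used to build genomic relationship matrices; it is not the implementation in PLINK or GCTA. The function names, the 0/1/2 genotype coding, the pooled allele-frequency estimate, and the default threshold of 0.05 are all assumptions made for illustration. Real analyses would normally run an established tool on the full genotype data.

```python
import random


def allele_frequencies(genotypes):
    """Estimate the counted-allele frequency at each SNP.

    genotypes: list of individuals, each a list of allele counts (0, 1 or 2).
    """
    n = len(genotypes)
    n_snps = len(genotypes[0])
    return [sum(ind[i] for ind in genotypes) / (2.0 * n) for i in range(n_snps)]


def standardize(genotypes, freqs):
    """Centre and scale allele counts; monomorphic SNPs are dropped."""
    keep = [i for i, p in enumerate(freqs) if 0.0 < p < 1.0]
    return [
        [(ind[i] - 2.0 * freqs[i]) / (2.0 * freqs[i] * (1.0 - freqs[i])) ** 0.5
         for i in keep]
        for ind in genotypes
    ]


def cross_sample_relatedness(discovery, validation, threshold=0.05):
    """Return (discovery_index, validation_index, relatedness) for flagged pairs.

    Relatedness is the mean product of standardized genotypes across SNPs.
    The threshold is an illustrative assumption, not a recommended value.
    """
    # Allele frequencies are estimated from the pooled samples (an assumption).
    freqs = allele_frequencies(discovery + validation)
    z_disc = standardize(discovery, freqs)
    z_val = standardize(validation, freqs)
    n_snps = len(z_disc[0])
    if n_snps == 0:
        raise ValueError("no polymorphic SNPs available")
    flagged = []
    for j, zj in enumerate(z_disc):
        for k, zk in enumerate(z_val):
            a_jk = sum(a * b for a, b in zip(zj, zk)) / n_snps
            if a_jk > threshold:
                flagged.append((j, k, a_jk))
    return flagged


def simulate_genotypes(n_individuals, freqs, rng):
    """Simulate unrelated individuals by drawing two alleles per SNP."""
    return [
        [(rng.random() < p) + (rng.random() < p) for p in freqs]
        for _ in range(n_individuals)
    ]


if __name__ == "__main__":
    rng = random.Random(1)
    freqs = [rng.uniform(0.1, 0.9) for _ in range(2000)]
    discovery = simulate_genotypes(20, freqs, rng)
    validation = simulate_genotypes(20, freqs, rng)
    validation[3] = list(discovery[7])  # plant one duplicated individual
    for j, k, a_jk in cross_sample_relatedness(discovery, validation, threshold=0.2):
        print(f"discovery {j} / validation {k}: relatedness {a_jk:.2f}")
```

In the simulated example, one discovery individual is copied into the validation sample; a pair like this should be removed from one of the two samples before the predictor is evaluated. The sketch needs individual-level genotypes from both samples, which is the access requirement noted above.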