To allow new users to familiarize themselves with RICOPILI and experienced users to develop new functionality for the pipeline, we simulated a freely available GWAS dataset using HAPGEN (Su et al., 2011) (Supplementary Section S6). The dataset comprises 6200 ‘individuals’ genotyped at ∼600 000 markers based on the Illumina OmniExpress, a widely used genotyping platform. For training and development purposes, we introduced population stratification, cross-sample relatedness and technical errors into the simulated data. The sample is split into five datasets (‘HapGen5’) that are packaged with RICOPILI (https://docs.google.com/document/d/1ux_FbwnvSzaiBVEwgS7eWJoYlnc_o0YHFb07SPQsYjI/). The data and results are described in further detail in the Extended Data Analysis and User Guide.