For each batch, we filtered out variants with a genotyping call rate <0.98 and samples with a call rate <0.98, and removed variants that were duplicated, monomorphic, or not correctly mapped to a genomic position. We then merged the TWB samples with 1KG phase 3 data (N=2504) and selected high-quality, common variants shared between the two datasets. Next, we performed LD pruning (PLINK --indep-pairwise 200 100 0.1) and computed PCs of the merged genotype data using the LD-pruned variants. Using the population labels of the 1KG samples as the reference, we trained a random forest model on the top 6 PCs to classify TWB samples into 1KG super-population groups. We retained TWB samples that could be assigned to a homogeneous East Asian group with a predicted probability >0.8 (Additional File 1: Fig. S2). After population assignment, we filtered out outliers in heterozygosity rate and population-specific PCs, and samples with sex mismatch. Imputation was performed using Eagle v2.4 (for pre-phasing) [35] and Minimac4 [36] with 1KG phase 3 data as the reference panel. We randomly removed one sample from each related pair of individuals within or across batches,