Chunk #6 — Online methods — Sample filtering

Source: A reference panel of 64,976 haplotypes for genotype imputation.
Embedded: yes

Text

To detect possible duplicates, we used the original genotype calls submitted by the individual studies. We selected 1,000 random sites that (1) were biallelic; (2) had European minor allele frequency > 5% in 1000GP3; and (3) had no missing data in any of the individual studies. Using the 'bcftools gtcheck' command, we counted the number of genotypes that differed between each pair of samples. A clear set of 269 sample pairs differed at very few of the 1,000 sites. We identified these samples as duplicates, either within or between studies, and removed one sample from each pair as described in Supplementary Table 8. Because some samples were represented more than twice, a total of 261 samples were removed as duplicates. Before genotype calling, we also removed (i) 9 samples for which we had Complete Genomics data, so that we could use these samples for testing purposes; (ii) 31 related samples from 1000GP3 (see URLs); and (iii) 8 samples from the HELIC, AMD and ProjectMinE studies with sample labeling inconsistencies. After these filters, 32,611 samples were used for the genotype calling and phasing steps.
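The duplicate-detection logic can be illustrated with a minimal Python sketch. It is not the pipeline used in the study, which relied on 'bcftools gtcheck'. The sketch counts discordant genotypes for every sample pair, flags pairs at or below a threshold, and keeps one sample per duplicate cluster. The function names, the toy genotypes, the `max_diff` threshold and the rule of keeping the alphabetically first sample are all assumptions made for illustration; the text gives no numeric cutoff, and the samples actually removed are those listed in Supplementary Table 8.

```python
from itertools import combinations


def count_discordant(genotypes):
    """Count differing genotypes for every sample pair.

    genotypes: dict mapping sample name -> list of genotype codes
    (0, 1 or 2 alternate alleles) over the same ordered set of sites.
    """
    samples = sorted(genotypes)
    return {
        (a, b): sum(x != y for x, y in zip(genotypes[a], genotypes[b]))
        for a, b in combinations(samples, 2)
    }


def duplicate_pairs(discordance, max_diff):
    """Return pairs whose discordance count is at most max_diff (hypothetical cutoff)."""
    return [pair for pair, n in discordance.items() if n <= max_diff]


def samples_to_remove(pairs):
    """Cluster duplicate pairs with union-find and keep one sample per cluster.

    The kept sample is the alphabetically first one, an arbitrary choice
    made only for this sketch.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b in pairs:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)  # smallest name stays the root

    return sorted(s for s in list(parent) if find(s) != s)


# Toy data: S1, S2 and S3 are near-identical; S4 is a distinct individual.
genotypes = {
    "S1": [0, 1, 2, 1, 0, 2],
    "S2": [0, 1, 2, 1, 0, 2],
    "S3": [0, 1, 2, 1, 0, 1],
    "S4": [2, 0, 0, 2, 1, 0],
}

discordance = count_discordant(genotypes)
pairs = duplicate_pairs(discordance, max_diff=1)
removed = samples_to_remove(pairs)
```

In this toy example, the three near-identical samples produce three duplicate pairs but only two removals, which is the same effect that makes the number of removed samples (261) smaller than the number of duplicate pairs (269) when a sample is represented more than twice.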