Chunk #36 — Methods — 1000GP Illumina Omni2.5 SNP array data

Source: Integrating sequence and array data to create an improved 1000 Genomes Project haplotype reference panel.
Embedded: yes

Text

For the haplotype scaffold, we used a set of 2,141 samples genotyped on Illumina Omni2.5M. This set of samples includes all the 1000GP Phase 1 samples. This dataset contains some parent-child duos and mother-father-child trios, and in some cases just a subset of each family has been sequenced. Supplementary Table 1 gives details of sequenced and non-sequenced samples. We found that 380 and 30 Phase 1 1000GP sequenced samples are part of trios and duos in this data set. SNPs with a missing data rate above 10% and a Mendel error rate above 5% were removed, leaving a total of 2,368,234 SNPs ready for phasing. We phased this data using SHAPEIT2 (r644) using all default settings (W = 2 Mb, K = 100 haplotypes, iterations=45) and using all available family information. We used the resulting haplotypes as a scaffold to call the variant sites in 1000GP. The whole genome overlap between both data sets contains 2,183,314 SNPs.