Sequencing reads were aligned to the NCBI36 reference genome (details in Supplementary Information) and made available in the BAM file format (ref. 17), an early innovation of the project for storing and sharing high-throughput sequencing data. Accurate identification of genetic variation depends on alignment of the sequence data to the correct genomic location. We restricted most variant calling to the “accessible genome”, defined as that portion of the reference sequence that remains after excluding regions with many ambiguously placed reads or unexpectedly high or low numbers of aligned reads (Supplementary Information). This approach balances the need to reduce incorrect alignments and false-positive variant calls against the goal of maximizing the proportion of the genome that can be interrogated.
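To make the definition of the accessible genome concrete, the following is a minimal sketch of how such a mask might be computed from per-position summaries. It is not the project's actual procedure, which is described in the Supplementary Information. The function names (`accessible_mask`, `accessible_fraction`), the input layout, and every threshold (the ambiguous-read fraction of 0.2 and the depth bounds of 0.5x and 2x the median) are hypothetical choices made for illustration only.

```python
from statistics import median


def accessible_mask(depth, ambiguous, max_ambiguous_frac=0.2,
                    low_factor=0.5, high_factor=2.0):
    """Return a list of booleans, True where a position is treated as accessible.

    depth[i]     : number of reads aligned over position i
    ambiguous[i] : number of those reads that are ambiguously placed
                   (for example, reads reported with mapping quality 0)

    All thresholds are illustrative assumptions, not values from the paper.
    """
    if len(depth) != len(ambiguous):
        raise ValueError("depth and ambiguous must have the same length")

    # Use the median as a robust estimate of the typical depth.
    typical = median(depth) if depth else 0

    mask = []
    for d, a in zip(depth, ambiguous):
        # Exclude positions with unexpectedly low or high depth.
        depth_ok = low_factor * typical <= d <= high_factor * typical
        # Exclude positions where too many reads are ambiguously placed;
        # positions with no reads at all are never accessible.
        ambiguity_ok = d > 0 and a / d <= max_ambiguous_frac
        mask.append(depth_ok and ambiguity_ok)
    return mask


def accessible_fraction(mask):
    """Proportion of positions retained by the mask."""
    return sum(mask) / len(mask) if mask else 0.0


# Toy example: eight positions with a median depth of 30.
depth = [30, 32, 29, 31, 90, 2, 30, 28]
ambiguous = [0, 1, 0, 20, 0, 0, 2, 0]
mask = accessible_mask(depth, ambiguous)
print(accessible_fraction(mask))  # 0.625
```

In this toy example, three positions are excluded: one with mostly ambiguous reads, one with excess depth, and one with very low depth. Tightening the thresholds removes more potentially misaligned sequence but lowers the accessible fraction, which is the trade-off described above.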