The list of 'bad promoters' was derived from data for 39 individuals sequenced on Illumina HiSeq v2 for the 1000 Genomes Project (198-fold total coverage, data set A2). For each transcription start site in the RefSeq database [50], we computed the ratio of the average coverage in the surrounding 200 bases to the average coverage in the surrounding 3,000 bases. The 1,000 sites with the lowest ratios were designated 'bad promoters' and are listed in Additional file 1. Where the database contained multiple entries for the same gene, only the entry with the lowest coverage ratio was kept. For comparison, we applied the same algorithm to a HiSeq v3 1000 Genomes data set (A3, 253-fold coverage of 71 individuals) to generate Additional file 2.
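
The ranking procedure can be illustrated with a minimal Python sketch. It is not the code used to produce the additional files. It assumes that per-base coverage is available as one in-memory sequence per chromosome, that both windows are centered on the transcription start site (the text says only "surrounding"), and that duplicate gene entries are collapsed before ranking. The names `TssEntry`, `mean_coverage`, `coverage_ratio` and `lowest_ratio_sites` are hypothetical.

```python
from typing import Dict, List, NamedTuple, Sequence, Tuple


class TssEntry(NamedTuple):
    """One transcription start site record (hypothetical layout)."""
    gene: str
    chrom: str
    pos: int  # 0-based coordinate of the transcription start site


def mean_coverage(cov: Sequence[float], center: int, width: int) -> float:
    """Mean coverage over `width` bases centered on `center` (assumed
    centering), clipped at the ends of the sequence."""
    start = max(0, center - width // 2)
    end = min(len(cov), center + width // 2)
    window = cov[start:end]
    return sum(window) / len(window) if len(window) else 0.0


def coverage_ratio(cov: Sequence[float], pos: int,
                   inner: int = 200, outer: int = 3000) -> float:
    """Ratio of mean coverage in the inner window to that in the outer window."""
    outer_mean = mean_coverage(cov, pos, outer)
    if outer_mean == 0:
        # No coverage in the outer window: the ratio is undefined, so the
        # site is ranked last (an assumption; the text does not cover this).
        return float("inf")
    return mean_coverage(cov, pos, inner) / outer_mean


def lowest_ratio_sites(entries: Sequence[TssEntry],
                       coverage: Dict[str, Sequence[float]],
                       n: int = 1000) -> List[Tuple[float, TssEntry]]:
    """Keep the lowest-ratio entry per gene, then return the n lowest overall."""
    best: Dict[str, Tuple[float, TssEntry]] = {}
    for entry in entries:
        ratio = coverage_ratio(coverage[entry.chrom], entry.pos)
        if entry.gene not in best or ratio < best[entry.gene][0]:
            best[entry.gene] = (ratio, entry)
    ranked = sorted(best.values(), key=lambda item: item[0])
    return ranked[:n]
```

With real data, `coverage` would be built from a per-base depth track summed over all individuals in the data set, and `entries` from the RefSeq annotation. Centering the windows and rounding at sequence ends are illustrative choices and could shift individual ratios slightly.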