available, it may be necessary to establish frequency-based thresholds for defining “common” variation that is unlikely to be causal. A third concern is that the specificity of this approach is currently reduced by a subset of genes that recurrently appear enriched for novel variants. These include long genes, which present a larger target for variation, as well as genes that are subject to systematic technical artifacts (e.g., mis-mapped reads due to duplicated or highly similar sequence in the genome). For sequences that are known to be duplicated or to have paralogues (e.g., genes from large gene families, or pseudogenes), these artifacts are mostly removed during read alignment, because reads with non-unique placements are removed from consideration. However, duplicated sequences that are not represented in the reference genome are not removed, and the affected genes spuriously appear enriched for novel variants (e.g., CDC27).
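
The two filters discussed above can be illustrated with a minimal sketch. It is not the pipeline used in this study: the record layout (a dict with `gene` and `allele_freq` keys), the function names, and both cutoffs (`max_allele_freq=0.01`, `min_fraction=0.5`) are assumptions chosen for illustration, and the gene names other than CDC27 are placeholders. The first function drops variants above a frequency threshold; the second flags genes that carry novel variants in a large fraction of unrelated control exomes, which is one possible way to identify recurrently enriched genes.

```python
from collections import Counter


def filter_common(variants, max_allele_freq=0.01):
    """Keep variants that are unobserved in the reference panel
    (allele_freq is None) or observed at or below the cutoff.
    The 0.01 default is an illustrative assumption."""
    return [
        v for v in variants
        if v.get("allele_freq") is None or v["allele_freq"] <= max_allele_freq
    ]


def recurrent_genes(control_exomes, min_fraction=0.5):
    """Return the set of genes with at least one novel variant in at
    least min_fraction of the control exomes. Such genes are candidates
    for exclusion or down-weighting (hypothetical criterion)."""
    n = len(control_exomes)
    if n == 0:
        return set()
    counts = Counter()
    for variants in control_exomes:
        # Count each gene once per exome, however many variants it has.
        counts.update({v["gene"] for v in variants})
    return {gene for gene, c in counts.items() if c / n >= min_fraction}


# Toy data: GENE_A and GENE_B are placeholders, not real findings.
controls = [
    [{"gene": "CDC27", "allele_freq": None}, {"gene": "GENE_A", "allele_freq": None}],
    [{"gene": "CDC27", "allele_freq": None}],
    [{"gene": "CDC27", "allele_freq": None}, {"gene": "GENE_B", "allele_freq": 0.2}],
]

flagged = recurrent_genes(controls)
kept = filter_common(controls[2])
```

In this toy example, only the gene seen in every control exome is flagged, and the variant with an allele frequency of 0.2 is removed by the frequency filter. In practice, both cutoffs would need to be calibrated against the number of available control exomes and the expected prevalence of the disorder.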