While the concept of using EHR systems as a tool for discovery in genome science is appealing, a major initial obstacle was establishing whether the clinical data recorded in EHR systems were in fact useful for defining important human phenotypes. One of the major challenges has been to understand how best to analyze the multiple types of data contained in an EHR in order to develop algorithms that identify subjects with a target disease (cases) and those without it (controls). Some phenotypes may be relatively “easy” to ascertain. For example, if an investigator is interested in identifying cases of atrial fibrillation and decides that a 12-lead electrocardiogram recording the abnormal rhythm is required to establish a subject as a case, all that is required is to search electrocardiograms for instances of atrial fibrillation. Even here, however, algorithms may be imperfect: the electrocardiogram may be misread, or the rhythm may be documented only in text notes or in poorly reproduced rhythm strips. While such records might not meet the case definition, the subjects they describe would be inappropriate to include as controls.
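The atrial fibrillation example implies three outcomes rather than two: case, control, and excluded from both groups. A minimal sketch of that logic follows, offered as an illustration only. The record layout, the field names, and the keyword match are all assumptions made for this example and do not describe any particular EHR system or published phenotype algorithm.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Status(Enum):
    CASE = "case"
    CONTROL = "control"
    EXCLUDED = "excluded"  # neither a case nor a safe control


@dataclass
class Subject:
    # Hypothetical record layout; these field names are illustrative only.
    subject_id: str
    ecg_12_lead_reads: List[str] = field(default_factory=list)
    rhythm_strip_reads: List[str] = field(default_factory=list)
    notes: List[str] = field(default_factory=list)


def mentions_af(text: str) -> bool:
    # Naive keyword match (an assumption of this sketch). It does not handle
    # negation ("no atrial fibrillation"), abbreviations, or misspellings.
    return "atrial fibrillation" in text.lower()


def classify(subject: Subject) -> Status:
    # Case definition from the text: a 12-lead ECG recording the rhythm.
    if any(mentions_af(read) for read in subject.ecg_12_lead_reads):
        return Status.CASE
    # Weaker evidence (notes or rhythm strips) fails the case definition,
    # but the subject should not be counted as a control either.
    weaker_evidence = subject.rhythm_strip_reads + subject.notes
    if any(mentions_af(text) for text in weaker_evidence):
        return Status.EXCLUDED
    return Status.CONTROL


cohort = [
    Subject("s1", ecg_12_lead_reads=["Atrial fibrillation, rapid ventricular response"]),
    Subject("s2", notes=["History of atrial fibrillation per outside records"]),
    Subject("s3", ecg_12_lead_reads=["Normal sinus rhythm"]),
]
labels = {s.subject_id: classify(s) for s in cohort}
```

Keeping a separate excluded group means that uncertain subjects are removed from the analysis instead of diluting the control set. A misread electrocardiogram cannot be detected by logic of this kind, so this sketch would not remove the need for manual review of a sample of records.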