Chunk #23 — 2 METHODS — 2.6 Datasets

Source: SNVMix: predicting single nucleotide variants from next-generation sequencing of tumors.
Embedded: yes

Text

We used three datasets to evaluate our models. The first (Supplementary Dataset 1A and B) consists of 16 ovarian cancer transcriptomes sequenced using the Illumina GA II platform, RNA-Seq paired end protocol. (Note that this data has been generated as part of an ongoing study to profile ovarian carcinoma subtypes, and the full datasets will be available as part of forthcoming manuscripts. However, all SNV data referenced in this manuscript is available as Supplementary Materials). For each of these cases, we obtained Affymetrix SNP 6.0 high-density genotyping arrays from the corresponding DNA. We examined coding positions in the transcriptome data for which there was a corresponding high-confidence (>0.99) genotyping call from the array predicted using the CRLMM algorithm (Lin et al., 2008). This resulted in an average of approximately 9000 positions from each case and a total of 144 271 positions. These data were used in the cross-validation experiment, described below. The second dataset (Supplementary Dataset 2A–D) consisted of 497 positions from a lobular breast tumor genome predicted as SNVs using SNVMix1 model from data generated using the Illumina GA