GemSIM: general, error-model based simulator of next-generation sequencing data.
- Authors: McElroy, Kerensa E; Luciani, Fabio; Thomas, Torsten
- Year: 2012
- Journal: BMC Genomics
- PMID: 22336055
- DOI: 10.1186/1471-2164-13-74
- PMCID: PMC3305602
BACKGROUND: GemSIM, or General Error-Model based SIMulator, is a next-generation sequencing simulator capable of generating single or paired-end reads for any sequencing technology compatible with the generic formats SAM and FASTQ (including Illumina and Roche/454). GemSIM creates and uses empirically derived, sequence-context based error models to realistically emulate individual sequencing runs and/or technologies. Empirical fragment length and quality score distributions are also used. Reads may be drawn from one or more genomes or haplotype sets, facilitating simulation of deep sequencing, metagenomic, and resequencing projects.

RESULTS: We demonstrate GemSIM's value by deriving error models from two different Illumina sequencing runs and one Roche/454 run, and comparing the resulting error profiles of each run. Overall error rates varied dramatically between individual Illumina runs, between the first and second reads in each pair, and between datasets from Illumina and Roche/454 technologies. Indels were markedly more frequent in Roche/454 than in Illumina, and both technologies suffered from an increase in error rates near the end of each read. The effects of these different profiles on low-frequency SNP-calling accuracy were investigated by analysing simulated sequencing data for a mixture of bacterial haplotypes. In general, SNP-calling using VarScan was only accurate for SNPs with frequency > 3%, independent of which error model was used to simulate the data. Variation between error profiles interacted strongly with VarScan's 'minimum average quality' parameter, resulting in different optimal settings for different sequencing runs.

CONCLUSIONS: Next-generation sequencing has unprecedented potential for assessing genetic diversity; however, analysis is complicated because error profiles can vary noticeably even between different runs of the same technology. Simulation with GemSIM can help overcome this problem by providing insights into the error profiles of individual sequencing runs and allowing researchers to assess the effects of these errors on downstream data analysis.
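To make the idea of simulating reads from a haplotype mixture more concrete, here is a minimal Python sketch. It is not GemSIM's code or interface: the function names, the linear error profile, and all parameter values are hypothetical, and the model is deliberately simplified to position-dependent substitutions only (no sequence context, indels, quality scores, or paired ends).

```python
import random

def simulate_reads(haplotypes, freqs, n_reads, read_len, error_rate_at, rng):
    """Draw reads from a haplotype mixture and add substitution errors.

    haplotypes    -- list of sequences (hypothetical toy input)
    freqs         -- relative abundance of each haplotype
    error_rate_at -- function mapping read position -> substitution probability
    """
    reads = []
    for _ in range(n_reads):
        # Pick a source haplotype in proportion to its abundance.
        hap = rng.choices(haplotypes, weights=freqs, k=1)[0]
        start = rng.randrange(len(hap) - read_len + 1)
        bases = list(hap[start:start + read_len])
        for pos, base in enumerate(bases):
            # Substitute a different base with position-dependent probability.
            if rng.random() < error_rate_at(pos):
                bases[pos] = rng.choice([b for b in "ACGT" if b != base])
        reads.append((start, "".join(bases)))
    return reads

def rising_error(pos, read_len=50, base_rate=0.001, end_rate=0.02):
    """Toy profile (assumed values): error rate rises linearly along the read."""
    return base_rate + (end_rate - base_rate) * pos / (read_len - 1)

rng = random.Random(0)
ref = "".join(rng.choice("ACGT") for _ in range(500))
# A second haplotype carrying one SNP at position 250, present at 3% frequency.
variant = ref[:250] + ("A" if ref[250] != "A" else "C") + ref[251:]
reads = simulate_reads([ref, variant], [0.97, 0.03], 1000, 50, rising_error, rng)
```

In an empirically derived model, the toy `rising_error` function would be replaced by rates estimated from real aligned reads; the sketch only shows where such a profile plugs in.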
True positives, false positives, and accuracy for increasing values of minimum average quality (M.A.Q.). Graphs for true positives and false positives are in absolute numbers, while accuracy is on a scale from zero to one. (Accuracy is defined as (true positives)/((total SNP no.) + (false positives)). One equals perfect accuracy.) False positive graphs for 'SNP frequency = 1%' and 'all SNP freq. together' are on a logarithmic scale. For false positive graphs, any false positives within ±1% of the specified frequency are included in the graph.
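The accuracy measure in this caption can be written out as a small function. This is an illustrative sketch only: the function name and the example counts are hypothetical and are not taken from the paper's results.

```python
def snp_calling_accuracy(true_positives, total_snps, false_positives):
    """Accuracy as defined in the caption: TP / (total SNPs + FP)."""
    denominator = total_snps + false_positives
    if denominator == 0:
        raise ValueError("need at least one true SNP or one false positive")
    return true_positives / denominator

# Hypothetical counts: 45 of 50 true SNPs recovered, plus 10 false calls.
print(snp_calling_accuracy(45, 50, 10))  # 45 / 60 = 0.75
```

Under this definition, missed SNPs and false positives both lower the score, and a value of one requires every true SNP to be called with no false positives.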
| # | Section | Preview |
|---|---|---|
| 40 | Availability and requirements | Project home page: http://sourceforge.net/projects/gemsim/ |
| 41 | Availability and requirements | Operating system(s): platform independent. |
| 42 | Availability and requirements | Programming language: Python 2.6 |
| 43 | Availability and requirements | Other requirements: Numpy, Python 2.6 |
| 44 | Availability and requirements | License: GNU GPL v3. |
| 45 | Availability and requirements | Any restrictions to use by non-academics: none. |
| 46 | Competing interests | The authors declare that they have no competing interests. |
| 47 | Authors' contributions | KM wrote the GemSIM code, KM, FL and TT participated in data analysis, KM drafted the manuscript and… |
In this knowledge base
| Title | Year | PMID |
|---|---|---|
| Statistical modeling for sensitive detection of low-frequency single nucleotide variants. | 2016 | 27556804 |
External
| Title | Authors | Year |
|---|---|---|
| A 14-Day Double-Blind, Randomized, Controlled Crossover Intervention Study with Anti-Bacterial Benzyl Isothiocyanate from Nasturtium (*Tropaeolum majus*) on Human Gut Microbiome and Host Defense. | Pfäffle SP et al. | 2024 |
| Apclusterv: Refinement of Viral Genome Clustering with Affinity Propagation | Haobin Y et al. | 2024 |
| Phylogenomic and genomic analysis reveals unique and shared genetic signatures of *Mycobacterium kansasii* complex species. | Machado E et al. | 2024 |
| Simulation of nanopore sequencing signal data with tunable parameters. | Gamaarachchi H et al. | 2024 |
| SWAMPy: simulating SARS-CoV-2 wastewater amplicon metagenomes. | Boulton W et al. | 2024 |
| Boquila: NGS read simulator to eliminate read nucleotide bias in sequence analysis. | Akköse Ü et al. | 2023 |
| Evaluation of computational phage detection tools for metagenomic datasets. | Schackart KE et al. | 2023 |
| Identification of representative species-specific genes for abundance measurements. | Zachariasen T et al. | 2023 |
| Recommendations for the Use of in Silico Approaches for Next-Generation Sequencing Bioinformatic Pipeline Validation: A Joint Report of the Association for Molecular Pathology, Association for Pathology Informatics, and College of American Pathologists. | Duncavage EJ et al. | 2023 |
| Genome sequence assembly algorithms and misassembly identification methods. | Meng Y et al. | 2022 |
| J-SPACE: a Julia package for the simulation of spatial models of cancer evolution and of sequencing experiments. | Angaroni F et al. | 2022 |
| A comprehensive evaluation of binning methods to recover human gut microbial species from a non-redundant reference gene catalog. | Borderes M et al. | 2021 |
| Phylotranscriptomic analysis of *Dillenia indica* L. (Dilleniales, Dilleniaceae) and its systematics implication. | Ali MA | 2021 |
| Prophage Tracer: precisely tracing prophages in prokaryotic genomes using overlapping split-read alignment. | Tang K et al. | 2021 |
| SimFFPE and FilterFFPE: improving structural variant calling in FFPE samples. | Wei L et al. | 2021 |
| SomatoSim: precision simulation of somatic single nucleotide variants. | Hawari MA et al. | 2021 |
| The cp genome characterization of Adenium obesum: Gene content, repeat organization and phylogeny. | Alanazi KM et al. | 2021 |
| A broad survey of DNA sequence data simulation tools. | Alosaimi S et al. | 2020 |
| Biases in genome reconstruction from metagenomic data. | Nelson WC et al. | 2020 |
| Clinical Massively Parallel Sequencing. | Gao G et al. | 2020 |
| jackalope: A swift, versatile phylogenomic and high-throughput sequencing simulator. | Nell LA | 2020 |
| SCSIM: Jointly simulating correlated single-cell and bulk next-generation DNA sequencing data. | Giguere C et al. | 2020 |
| SECNVs: A Simulator of Copy Number Variants and Whole-Exome Sequences From Reference Genomes. | Xing Y et al. | 2020 |
| SimuSCoP: reliably simulate Illumina sequencing data based on position and context dependent profiles. | Yu Z et al. | 2020 |
| Cell-level somatic mutation detection from single-cell RNA sequencing. | Vu TN et al. | 2019 |
| Free-access copy-number variant detection tools for targeted next-generation sequencing data. | Roca I et al. | 2019 |
| GenHap: a novel computational method based on genetic algorithms for haplotype assembly. | Tangherloni A et al. | 2019 |
| Libra: scalable k-mer-based tool for massive all-vs-all metagenome comparisons. | Choi I et al. | 2019 |
| MetaCHIP: community-level horizontal gene transfer identification through the combination of best-match and phylogenetic approaches. | Song W et al. | 2019 |
| MetaSMC: a coalescent-based shotgun sequence simulator for evolving microbial populations. | Liao KH et al. | 2019 |
| Simulating Illumina metagenomic data with InSilicoSeq. | Gourlé H et al. | 2019 |
| Simulation of heterogeneous tumour genomes with HeteroGenesis and in silico whole exome sequencing. | Tanner G et al. | 2019 |
| Structural variation and fusion detection using targeted sequencing data from circulating cell free DNA. | Gawroński AR et al. | 2019 |
| NPBSS: a new PacBio sequencing simulator for generating the continuous long reads with an empirical model. | Wei ZG et al. | 2018 |
| Phylogenomics of tomato chloroplasts using assembly and alignment-free method. | Amado Cattáneo RM et al. | 2018 |
| Population level mitogenomics of long-lived bats reveals dynamic heteroplasmy and challenges the Free Radical Theory of Ageing. | Jebb D et al. | 2018 |
| SVEngine: an efficient and versatile simulator of genome structural variations with features of cancer clonal evolution. | Xia LC et al. | 2018 |
| Xome-Blender: A novel cancer genome simulator. | Semeraro R et al. | 2018 |
| De novo transcriptome assembly facilitates characterisation of fast-evolving gene families, MHC class I in the bank vole (Myodes glareolus). | Migalska M et al. | 2017 |
| DUDE-Seq: Fast, flexible, and robust denoising for targeted amplicon sequencing. | Lee B et al. | 2017 |
| In Silico Proficiency Testing for Clinical Next-Generation Sequencing. | Duncavage EJ et al. | 2017 |
| IntSIM: An Integrated Simulator of Next-Generation Sequencing Data. | Yuan X et al. | 2017 |
| Large-scale comparative metagenomics of Blastocystis, a common member of the human gut microbiome. | Beghini F et al. | 2017 |
| MetaMLST: multi-locus strain-level bacterial typing from metagenomic samples. | Zolfo M et al. | 2017 |
| Microbial strain-level population structure and genetic diversity from metagenomes. | Truong DT et al. | 2017 |
| Promises and pitfalls of Illumina sequencing for HIV resistance genotyping. | Brumme CJ et al. | 2017 |
| Systematic review of next-generation sequencing simulators: computational tools, features and perspectives. | Zhao M et al. | 2017 |
| Testing genotyping strategies for ultra-deep sequencing of a co-amplifying gene family: MHC class I in a passerine bird. | Biedrzycka A et al. | 2017 |
| A comparison of tools for the simulation of genomic next-generation sequencing data. | Escalona M et al. | 2016 |
| AlignerBoost: A Generalized Software Toolkit for Boosting Next-Gen Sequencing Mapping Accuracy Using a Bayesian-Based Mapping Quality Framework. | Zheng Q et al. | 2016 |
| Contig-Layout-Authenticator (CLA): A Combinatorial Approach to Ordering and Scaffolding of Bacterial Contigs for Comparative Genomics and Molecular Epidemiology. | Shaik S et al. | 2016 |
| Potential and pitfalls of eukaryotic metagenome skimming: a test case for lichens. | Greshake B et al. | 2016 |
| Redundans: an assembly pipeline for highly heterozygous genomes. | Pryszcz LP et al. | 2016 |
| Refined analyses suggest that recombination is a minor source of genomic diversity in *Pseudomonas aeruginosa* chronic cystic fibrosis infections. | Williams D et al. | 2016 |
| RNF: a general framework to evaluate NGS read mappers. | Břinda K et al. | 2016 |
| Simulating Next-Generation Sequencing Datasets from Empirical Mutation and Sequencing Models. | Stephens ZD et al. | 2016 |
| Single-cell TCRseq: paired recovery of entire T-cell alpha and beta chain transcripts in T-cell receptors from single-cell RNAseq. | Redmond D et al. | 2016 |
| Statistical modeling for sensitive detection of low-frequency single nucleotide variants. | Hao Y et al. | 2016 |
| Strain-level microbial epidemiology and population genomics from shotgun metagenomics. | Scholz M et al. | 2016 |
| The PARA-suite: PAR-CLIP specific sequence read simulation and processing. | Kloetgen A et al. | 2016 |
| Vecuum: identification and filtration of false somatic variants caused by recombinant vector contamination. | Kim J et al. | 2016 |
| A simple data-adaptive probabilistic variant calling model. | Hoffmann S et al. | 2015 |
| Best practices for evaluating single nucleotide variant calling methods for microbial genomics. | Olson ND et al. | 2015 |
| cFinder: definition and quantification of multiple haplotypes in a mixed sample. | Niklas N et al. | 2015 |
| Frequency-based haplotype reconstruction from deep sequencing data of bacterial populations. | Pulido-Tamayo S et al. | 2015 |
| High-Specificity Targeted Functional Profiling in Microbial Communities with ShortBRED. | Kaminski J et al. | 2015 |
| High-Throughput, Amplicon-Based Sequencing of the CREBBP Gene as a Tool to Develop a Universal Platform-Independent Assay. | Fuellgrabe MW et al. | 2015 |
| Hybrid de novo tandem repeat detection using short and long reads. | Fertin G et al. | 2015 |
| International interlaboratory study comparing single organism 16S rRNA gene sequencing data: Beyond consensus sequence comparisons. | Olson ND et al. | 2015 |
| Metagenomics: Retrospect and Prospects in High Throughput Age. | Kumar S et al. | 2015 |
| misFinder: identify mis-assemblies in an unbiased manner using reference and paired-end reads. | Zhu X et al. | 2015 |
| Polyester: simulating RNA-seq datasets with differential transcript expression. | Frazee AC et al. | 2015 |
| SoloDel: a probabilistic model for detecting low-frequent somatic deletions from unmatched sequencing data. | Kim J et al. | 2015 |
| A better sequence-read simulator program for metagenomics. | Johnson S et al. | 2014 |
| CASPER: context-aware scheme for paired-end reads from high-throughput amplicon sequencing. | Kwon S et al. | 2014 |
| Deep sequencing of evolving pathogen populations: applications, errors, and bioinformatic solutions. | McElroy K et al. | 2014 |
| Evaluation of viral genome assembly and diversity estimation in deep metagenomes. | Aguirre de Cárcer D et al. | 2014 |
| FASTQSim: platform-independent data characterization and in silico read generation for NGS datasets. | Shcherbina A | 2014 |
| FOCUS: an alignment-free model to identify organisms in metagenomes using non-negative least squares. | Silva GG et al. | 2014 |
| HIVE-hexagon: high-performance, parallelized sequence alignment for next-generation sequencing data analysis. | Santana-Quintero L et al. | 2014 |
| Parametric modeling of whole-genome sequencing data for CNV identification. | Vardhanabhuti S et al. | 2014 |
| PERGA: a paired-end read guided de novo assembler for extending contigs using SVM and look ahead approach. | Zhu X et al. | 2014 |
| SInC: an accurate and fast error-model based simulator for SNPs, Indels and CNVs coupled with a read generator for short-read sequence data. | Pattnaik S et al. | 2014 |
| Somatic deletions implicated in functional diversity of brain cells of individuals with schizophrenia and unaffected controls. | Kim J et al. | 2014 |
| Strain-specific parallel evolution drives short-term diversification during Pseudomonas aeruginosa biofilm formation. | McElroy KE et al. | 2014 |
| TASSEL-GBS: a high capacity genotyping by sequencing analysis pipeline. | Glaubitz JC et al. | 2014 |
| XS: a FASTQ read simulator. | Pratas D et al. | 2014 |
| Accurate single nucleotide variant detection in viral populations by combining probabilistic clustering with a statistical test of strand bias. | McElroy K et al. | 2013 |
| Benchmarking short sequence mapping tools. | Hatem A et al. | 2013 |
| Combining de novo and reference-guided assembly with scaffold_builder. | Silva GG et al. | 2013 |
| Computational meta'omics for microbial community studies. | Segata N et al. | 2013 |
| Computational methods for detecting copy number variations in cancer genome using next generation sequencing: principles and challenges. | Liu B et al. | 2013 |
| Empirical assessment of sequencing errors for high throughput pyrosequencing data. | da Fonseca PG et al. | 2013 |
| Evaluating genome architecture of a complex region via generalized bipartite matching. | Lo C et al. | 2013 |
| NeSSM: a Next-generation Sequencing Simulator for Metagenomics. | Jia B et al. | 2013 |
| Reconstructing the genomic content of microbiome taxa through shotgun metagenomic deconvolution. | Carr R et al. | 2013 |
| Routine performance and errors of 454 HLA exon sequencing in diagnostics. | Niklas N et al. | 2013 |
| Short barcodes for next generation sequencing. | Mir K et al. | 2013 |
| Virmid: accurate detection of somatic mutations with sample impurity inference. | Kim S et al. | 2013 |
| Wessim: a whole-exome sequencing simulator based on in silico exome capture. | Kim S et al. | 2013 |
| bgc: Software for Bayesian estimation of genomic clines. | Gompert Z et al. | 2012 |
| Reconstruction of ribosomal RNA genes from metagenomic data. | Fan L et al. | 2012 |