Chunk #25 — Results — Alignment-Independent Methods

Source: Optimized splitting of mixed-species RNA sequencing data.
Embedded: yes

Text

The aim of an alignment-independent method is to build a classifier that distinguishes sequence reads from different species without aligning them to individual reference genomes. This requires us to capture the hidden information directly from the nucleotide sequences of both mouse and human. We first sought to apply a classical probabilistic approach, hidden Markov models (HMMs), to discover the underlying variance between human and mouse sequences. HMMs are used in sequence data analysis with many bioinformatics applications,45, 46 including gene identification, motif finding, and metagenomic taxonomic classification.47–49 However, third-order HMMs did not separate sequence fragments from mouse and human (Fig. 2A). Even with higher-order Markov models (8th order or higher), which have successfully performed metagenomic sequence classification,50 the separation of human and mouse reads was not ideal. The receiver operating characteristic curve51 indicates that this binary classifier improves only slightly with higher-order models (Fig. 2B). Moreover, we noticed that HMMs require a substantial amount of memory and compute time. For a sequence of length l, finding the best path through a model with s states and e edges requires memory proportional to sl and time proportional to el.
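To make the idea of an alignment-independent classifier concrete, the following is a minimal sketch, not the implementation used in this study. It swaps in a plain k-th order Markov chain (simpler than an HMM, with no hidden states): one model is trained per species, and a read is assigned to the species whose model gives it the higher log-likelihood. The function names, the pseudocount, the order K = 3, and the toy training sequences are all assumptions made for illustration.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"
K = 3  # Markov order; an illustrative choice, not a recommended setting


def train_markov(seqs, k, pseudocount=1.0):
    """Return a function giving log P(base | previous k bases).

    Probabilities are estimated from k-mer transition counts with
    additive (pseudocount) smoothing so unseen contexts stay finite.
    """
    counts = defaultdict(lambda: defaultdict(float))
    for seq in seqs:
        for i in range(k, len(seq)):
            counts[seq[i - k:i]][seq[i]] += 1.0

    def log_prob(context, base):
        ctx = counts.get(context, {})
        total = sum(ctx.values()) + pseudocount * len(ALPHABET)
        return math.log((ctx.get(base, 0.0) + pseudocount) / total)

    return log_prob


def log_likelihood(read, log_prob, k):
    """Sum the log transition probabilities along the read."""
    return sum(log_prob(read[i - k:i], read[i]) for i in range(k, len(read)))


def classify(read, human_model, mouse_model, k):
    """Label a read by the sign of the human-vs-mouse log-likelihood ratio."""
    score = log_likelihood(read, human_model, k) - log_likelihood(read, mouse_model, k)
    return ("human" if score > 0 else "mouse"), score


# Toy training data (hypothetical, NOT real genomic sequence).
human_train = ["ACGACGACGACGACGACGACGACGACGACG"]
mouse_train = ["TTGCATTGCATTGCATTGCATTGCATTGCA"]

human_model = train_markov(human_train, K)
mouse_model = train_markov(mouse_train, K)

label_a, score_a = classify("ACGACGACG", human_model, mouse_model, K)
label_b, score_b = classify("TTGCATTGCA", human_model, mouse_model, K)
print(label_a, label_b)
```

On real data the two species share most k-mer statistics, so the log-likelihood ratio would be expected to sit much closer to zero than in this toy case, which is consistent with the weak separation reported above. Sweeping a threshold over the score, rather than fixing it at zero, is what traces out a receiver operating characteristic curve.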