Chunk #28 — Results — Alignment-Independent Methods

Source: Optimized splitting of mixed-species RNA sequencing data.
Embedded: yes

Text

Our approach used raw reads taken directly from FASTQ files. Each read in a FASTQ file can be represented as a linear sequence of L characters, or nucleotides. Each nucleotide was coded using a finite alphabet, denoted A, of five symbols: the nucleotides A, T, C, and G, plus N. These symbols map to the integers 1, 2, 3, 4, and 5, respectively, where N denotes an ambiguous base call resulting from low sequencing quality [62]. The percentage of N bases per read was calculated for each species to confirm that it did not differ substantially between species (Supplemental Fig. 1C). The sample space, S, is the set of all strings that can be obtained by concatenating characters from the alphabet A. Each sequencing read, r, is mapped from the sample space, S, to the feature space, F, by a function, also denoted F, that uses the alphabet index. We then represent each string r of length L as a multidimensional feature vector, x, in the 5L-dimensional feature space, with x = F(r) according to the alphabet table.
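The encoding described above can be illustrated with a minimal Python sketch. The alphabet table (A, T, C, G, N mapped to 1 through 5) is taken from the text; the function names `encode_read`, `one_hot_read`, and `percent_n` are hypothetical, and the one-hot expansion is only an assumption about how a read of length L could yield a 5L-dimensional vector, since the text does not specify the exact construction.

```python
from typing import List

# Alphabet table from the text: A, T, C, G, N -> 1, 2, 3, 4, 5.
ALPHABET_INDEX = {"A": 1, "T": 2, "C": 3, "G": 4, "N": 5}


def encode_read(read: str) -> List[int]:
    """Map a read to its integer codes using the alphabet table.

    Raises KeyError for any character outside the five-symbol alphabet.
    """
    return [ALPHABET_INDEX[base] for base in read.upper()]


def one_hot_read(read: str) -> List[int]:
    """Expand a read of length L into a flat 5L-dimensional 0/1 vector.

    Assumption: one-hot encoding is one way to reach 5L dimensions;
    the source does not state that this is the construction used.
    """
    vector = [0] * (5 * len(read))
    for position, code in enumerate(encode_read(read)):
        # Each position owns a block of five slots; set the slot for its code.
        vector[5 * position + (code - 1)] = 1
    return vector


def percent_n(read: str) -> float:
    """Percentage of ambiguous (N) bases in a single read."""
    if not read:
        return 0.0
    return 100.0 * read.upper().count("N") / len(read)
```

For example, `encode_read("ACGTN")` returns `[1, 3, 4, 2, 5]`, and `one_hot_read("ACGTN")` has length 25. Averaging `percent_n` over the reads of each species would give a per-species summary comparable in spirit to the check reported in Supplemental Fig. 1C.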