
Chunk #28 — Results — Alignment-Independent Methods

Source
Optimized splitting of mixed-species RNA sequencing data.

Text

Our approach was to work directly with raw reads from FASTQ files. Each read in a FASTQ file can be represented as a linear succession of l characters, or nucleotides. Each nucleotide was coded using a finite alphabet, A, of five symbols (the nucleotides A, T, C, and G, plus N), which map to the integers 1, 2, 3, 4, and 5, respectively; ‘N’ denotes an ambiguous base call caused by low quality during sequencing [62]. The percentage of N bases per read was calculated for each species to confirm that there was no substantial difference between species (Supplemental Fig. 1C). The sample space, S, is the set of all strings that can be obtained by concatenating characters from the alphabet. Each sequencing read, r, is mapped from the sample space, S, to the feature space, F, by an encoding function, also written F, that uses the alphabet index. We then represent each string r of length l as a multidimensional feature vector, x, in the 5l-dimensional feature space, with x = F(r) according to the alphabet table.
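The encoding described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the alphabet index (A, T, C, G, N to 1 through 5) comes from the text, but the function names are hypothetical, and the one-hot layout is only one assumed way to obtain a 5l-dimensional vector, since the text does not spell out how each position is laid out in feature space.

```python
# Alphabet index taken from the text: A, T, C, G, N -> 1..5.
ALPHABET_INDEX = {"A": 1, "T": 2, "C": 3, "G": 4, "N": 5}


def encode_integers(read: str) -> list[int]:
    """Map each nucleotide of a read to its alphabet index (1-5)."""
    return [ALPHABET_INDEX[base] for base in read.upper()]


def encode_one_hot(read: str) -> list[int]:
    """Assumed reading of x = F(r): five slots per position, 5*l in total.

    The one-hot layout is an assumption made for this sketch; the source
    only states that the feature space is 5l-dimensional.
    """
    k = len(ALPHABET_INDEX)
    x = [0] * (k * len(read))
    for pos, idx in enumerate(encode_integers(read)):
        # Set the slot for this position's nucleotide (index is 1-based).
        x[pos * k + (idx - 1)] = 1
    return x


def percent_n(read: str) -> float:
    """Percentage of ambiguous 'N' bases in a single read."""
    return 100.0 * read.upper().count("N") / len(read) if read else 0.0


read = "ACGTN"
print(encode_integers(read))      # [1, 3, 4, 2, 5]
print(len(encode_one_hot(read)))  # 25, i.e. 5 * l for l = 5
print(percent_n(read))            # 20.0
```

Averaging `percent_n` over all reads of each species would give a per-species comparison of the kind the text refers to in Supplemental Fig. 1C.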