paperKB
coga / coga-kb
Help
Sign in

Chunk #35 — 4 Merits of a high-level language — 4.4 Performance and scalability

Source
Orchestrating high-throughput genomic analysis with Bioconductor.
Embedded
yes

Text

Effectively working with large data requires programming practices that match memory and processor use to available resources. R is efficient when operating on vectors or arrays, so a pattern used by high-performing and scalable algorithms is to split the data into manageable chunks and to iterate over them. An example is the yieldSize argument of functions that process BAM, FASTQ or VCF files [20]. Chunks can be evaluated in parallel to gain speed. The BiocParallel package helps developers employ parallel evaluation across different computing environments while shielding users from having to configure the technicalities. It connects to back ends for shared memory and cluster configurations. The GenomicFiles package ties parallelization to chunkwise operations across multiple files. Bioconductor is available as a virtual machine image configured for high-performance computing in Amazon's Elastic Compute Cloud (EC2).