Chunk #42 — Methods — Access to sequence data

Source
Sequencing of 53,831 diverse genomes from the NHLBI TOPMed Program.
Embedded
yes

Text

Copies of individual-level sequence data for each study participant are stored on both Google and Amazon clouds. Access requires an approved dbGaP data access request and is mediated by the NCBI Sequence Data Delivery Pilot mechanism. This mechanism uses fusera software [82] running on the user’s cloud instance to handle authentication and authorization with dbGaP. It provides read access to sequence data for one or more TOPMed (or other) samples as .cram files (with associated .crai index files) within a FUSE virtual file system mounted on the cloud computing instance. Samples are identified by ‘SRR’ run accession numbers assigned in the NCBI Sequence Read Archive (SRA) database and shown under each study’s phs number in the SRA Run Selector (https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi). The phs numbers for all TOPMed studies can be found by searching dbGaP for the string ‘TOPMed’. The fusera software is limited to running on Google or Amazon cloud instances to avoid incurring data egress charges. Fusera, samtools and other tools are also packaged in a Docker container for ease of use and are available for download from Docker Hub [83].
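Once the virtual file system is mounted, a common first step is to check which samples are visible and whether each .cram file has its .crai index. The following is a minimal Python sketch of that check, not part of the paper's tooling. It assumes a layout in which each sample appears as a directory named by its ‘SRR’ run accession under the mount point, and that each index file is named `<file>.cram.crai`. The source specifies neither detail, and the function names `is_run_accession` and `find_cram_pairs` are hypothetical.

```python
import re
from pathlib import Path

# Run accessions as described in the text: 'SRR' followed by digits.
SRR_PATTERN = re.compile(r"SRR\d+")


def is_run_accession(name):
    """Return True if name looks like an SRA 'SRR' run accession."""
    return SRR_PATTERN.fullmatch(name) is not None


def find_cram_pairs(mount_point):
    """Map each SRR directory under mount_point to its (cram, crai) pairs.

    ASSUMPTION: one directory per run accession, with the index stored
    next to the data file as '<file>.cram.crai'. A missing index is
    reported as None so the caller can decide how to handle it.
    """
    pairs = {}
    for sample_dir in sorted(Path(mount_point).iterdir()):
        if not sample_dir.is_dir() or not is_run_accession(sample_dir.name):
            continue  # ignore anything that is not an SRR directory
        for cram in sorted(sample_dir.glob("*.cram")):
            crai = cram.with_name(cram.name + ".crai")
            entry = (cram, crai if crai.exists() else None)
            pairs.setdefault(sample_dir.name, []).append(entry)
    return pairs
```

Calling `find_cram_pairs("/path/to/mount")` on a mounted directory returns a dictionary keyed by accession. Entries whose second element is `None` mark .cram files without a matching index, which tools such as samtools would need for random access.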