Chunk #12 — INTRODUCTION — Overview of the procedure

Source: Detecting ultralow-frequency mutations by Duplex Sequencing.
Embedded: yes

Text

Each read obtained from a DS run begins with a 12-nt tag sequence, followed by an invariant 5-bp sequence corresponding to the ligation site. First, the invariant 5-bp sequence is computationally removed from each read, and the 12-nt tags from the two paired-end reads are combined into a single 24-nt tag that is stored in the read header. Reads whose tag contains ambiguous nucleotides or a homopolymer run longer than nine bases are discarded. These steps are all performed by a custom Python script called ‘tag_to_header.py’ (Supplementary Fig. 1). The reads are then aligned to the reference genome using BWA (ref. 37). After alignment, reads sharing the same tag sequence and genomic coordinates are identified and grouped into ‘tag families’ by a Python script called ‘ConsensusMaker.py’. By default, the script requires at least three members to form a tag family. The family members are then compared at each sequence position, and the identity of a position is kept only when at least 70% of the members have the same nucleotide at that position. Positions that cannot form a consensus are
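
The tag handling and consensus rules described above can be illustrated with a minimal Python sketch. It is not the code of ‘tag_to_header.py’ or ‘ConsensusMaker.py’: all function names and the tuple layout of a read are hypothetical. The sketch assumes that the homopolymer filter is applied to the combined 24-nt tag and that family members have equal length. The text above is cut off before it says how positions without a consensus are handled, so the `placeholder` character used here is also an assumption.

```python
from collections import Counter, defaultdict

TAG_LEN = 12         # tag length per read (from the text)
SPACER_LEN = 5       # invariant ligation-site sequence (from the text)
MAX_HOMOPOLYMER = 9  # longer single-base runs in the tag are rejected
MIN_FAMILY_SIZE = 3  # default minimum number of family members
CUTOFF = 0.7         # minimum fraction of members agreeing at a position


def split_read(seq):
    """Return (tag, remainder), dropping the invariant spacer."""
    return seq[:TAG_LEN], seq[TAG_LEN + SPACER_LEN:]


def tag_is_usable(tag):
    """Reject ambiguous bases and homopolymer runs longer than nine bases."""
    if set(tag) - set("ACGT"):
        return False
    return not any(base * (MAX_HOMOPOLYMER + 1) in tag for base in "ACGT")


def combined_tag(read1, read2):
    """Join the two 12-nt tags of a read pair; None if the tag is filtered."""
    tag = split_read(read1)[0] + split_read(read2)[0]
    return tag if tag_is_usable(tag) else None


def group_families(reads):
    """Group (tag, chrom, pos, seq) tuples by tag and genomic coordinates."""
    families = defaultdict(list)
    for tag, chrom, pos, seq in reads:
        families[(tag, chrom, pos)].append(seq)
    return families


def consensus(members, cutoff=CUTOFF, min_size=MIN_FAMILY_SIZE,
              placeholder="N"):
    """Per-position consensus; None if the family is too small.

    `placeholder` marks positions below the cutoff (an assumption of
    this sketch, not taken from the text).
    """
    if len(members) < min_size:
        return None
    out = []
    for column in zip(*members):
        base, count = Counter(column).most_common(1)[0]
        out.append(base if count / len(members) >= cutoff else placeholder)
    return "".join(out)
```

With the default settings, a family of three reads keeps a position only when all three agree, because two of three (about 67%) falls below the 70% cutoff. A family of four tolerates a single disagreeing read, since three of four is 75%.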