Pipeline Steps

1. Split genome into sub-intervals for parallelization

Split the reference genome into intervals for parallel processing. If params.parallelize_by_chromosome is set, the genome is split by chromosome; otherwise, it is split into up to params.scatter_count intervals.
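The two splitting modes can be illustrated with a minimal Python sketch. This is not the pipeline's implementation: the function name `split_genome`, the `contig_lengths` input, and the even-sized packing strategy are assumptions made for illustration. Only the two parameters mirror params.parallelize_by_chromosome and params.scatter_count.

```python
from typing import Dict, List, Tuple

Interval = Tuple[str, int, int]  # (contig, 1-based start, inclusive end)


def split_genome(
    contig_lengths: Dict[str, int],
    parallelize_by_chromosome: bool,
    scatter_count: int,
) -> List[List[Interval]]:
    """Return interval groups, one group per parallel job (hypothetical helper)."""
    if parallelize_by_chromosome:
        # One group per contig, each covering the whole contig.
        return [[(name, 1, length)] for name, length in contig_lengths.items()]

    if scatter_count < 1:
        raise ValueError("scatter_count must be at least 1")

    total = sum(contig_lengths.values())
    # Ceiling division, so that no more than scatter_count groups are produced.
    target = -(-total // scatter_count)

    groups: List[List[Interval]] = []
    current: List[Interval] = []
    room = target  # bases that still fit in the current group
    for name, length in contig_lengths.items():
        start = 1
        while start <= length:
            end = min(length, start + room - 1)
            current.append((name, start, end))
            room -= end - start + 1
            start = end + 1
            if room == 0:
                # The current group is full: close it and start a new one.
                groups.append(current)
                current, room = [], target
    if current:
        groups.append(current)
    return groups
```

In this sketch a group may cut a contig at an arbitrary position. Reads that span such a cut can be processed in both neighboring intervals, which is why a deduplication step becomes relevant when the genome is not split by chromosome.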

2. Realign indels

Generate indel realignment targets and realign indels per interval.

3. Generate BQSR (Base Quality Score Recalibration) table

Assess how sequencing errors correlate with four covariates (assigned quality score, read group, machine cycle producing the base, and the current and immediately upstream bases) and output a base quality score recalibration table.
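As a rough illustration of what a recalibration table holds, the sketch below tabulates observations and mismatches per covariate combination and converts the smoothed error rate into a Phred-scaled empirical quality. It is a simplification under stated assumptions: the function `build_recal_table`, the joint key over all four covariates, and the add-one smoothing are illustrative choices, not the behavior of the recalibration tool the pipeline runs.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# (read group, assigned quality, machine cycle, two-base context)
Key = Tuple[str, int, int, str]
Observation = Tuple[str, int, int, str, bool]  # Key fields plus "is mismatch"


def build_recal_table(
    observations: Iterable[Observation],
) -> Dict[Key, Tuple[int, int, float]]:
    """Map each covariate combination to (observations, errors, empirical quality)."""
    counts: Dict[Key, List[int]] = defaultdict(lambda: [0, 0])
    for read_group, quality, cycle, context, is_mismatch in observations:
        entry = counts[(read_group, quality, cycle, context)]
        entry[0] += 1                 # bases observed with these covariates
        entry[1] += int(is_mismatch)  # bases that disagree with the reference

    table: Dict[Key, Tuple[int, int, float]] = {}
    for key, (n_obs, n_err) in counts.items():
        # Add-one smoothing (an assumption) avoids log10(0) for error-free bins.
        error_rate = (n_err + 1) / (n_obs + 2)
        table[key] = (n_obs, n_err, -10 * math.log10(error_rate))
    return table
```

Comparing the empirical quality in each bin with the assigned quality shows where the sequencer over- or under-estimates its own error rate, which is the information the recalibration step applies to each interval.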

4. Apply BQSR per split interval in parallel

Apply the base quality score recalibration to each interval and reheader the output as necessary.

5. Merge interval-level BAMs

Merge BAMs from each interval to generate a whole sample BAM.

5a. Deduplicate BAM

If params.parallelize_by_chromosome is not set, run a deduplication process to remove reads that were duplicated because they overlap interval split sites.
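The idea behind this step can be sketched as follows: a read that spans a split site is emitted by both neighboring intervals, so the merged BAM contains the same record twice. The sketch keeps the first copy of each identical record. The `Read` fields and the `drop_split_duplicates` helper are hypothetical, and the tool the pipeline uses for this step may identify duplicates differently.

```python
from typing import Iterable, Iterator, NamedTuple, Set


class Read(NamedTuple):
    """A simplified alignment record (hypothetical subset of BAM fields)."""
    name: str
    flag: int
    contig: str
    position: int
    cigar: str


def drop_split_duplicates(reads: Iterable[Read]) -> Iterator[Read]:
    """Yield each distinct record once, preserving the input order."""
    seen: Set[Read] = set()
    for read in reads:
        if read in seen:
            # Same record already emitted by the neighboring interval.
            continue
        seen.add(read)
        yield read
```

This differs from PCR or optical duplicate marking: only records that are identical in every field are removed, since they are copies of one alignment rather than separate reads of the same fragment.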

6. Index BAM file

Generate a BAI index file for fast random access of the whole sample BAM.

7. Get pileup summaries

Tabulate pileup metrics for inferring contamination. Summarize counts of reads that support reference, alternate and other alleles for given sites.
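For a single site, the tabulation amounts to counting the bases in the pileup by allele. A minimal sketch, assuming the pileup is available as a string of base calls and that `summarize_site` is a hypothetical helper rather than part of the pipeline:

```python
from collections import Counter
from typing import Dict


def summarize_site(bases: str, ref: str, alt: str) -> Dict[str, int]:
    """Count reads supporting the reference, alternate, and any other allele."""
    counts = Counter(bases.upper())
    ref_count = counts[ref.upper()]
    alt_count = counts[alt.upper()]
    return {
        "ref": ref_count,
        "alt": alt_count,
        # Everything that is neither reference nor alternate.
        "other": len(bases) - ref_count - alt_count,
    }
```

For example, `summarize_site("AAAAGGTa", "A", "G")` counts five reference reads, two alternate reads, and one read supporting another allele.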

8. Calculate contamination

Calculate the fraction of reads coming from cross-sample contamination, given the results from step 7. For paired samples, generate an additional output table containing segmentation of the tumor by minor allele fraction.

9. DepthOfCoverage

If params.is_DOC_run is set, generate coverage summary information for the whole sample BAM from step 5, partitioned by sample, read group, and library.

10. Generate sha512 checksum

Generate SHA-512 checksums for the final BAM and BAI files.
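A checksum of this kind can be reproduced or verified with Python's standard `hashlib` module. The sketch below reads the file in chunks so that a large BAM does not need to fit in memory. The helper names and the `.sha512` output file naming are assumptions for illustration, not necessarily what the pipeline writes.

```python
import hashlib
from pathlib import Path


def sha512_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-512 digest of a file, read in fixed-size chunks."""
    digest = hashlib.sha512()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_checksum(path: Path) -> Path:
    """Write '<hex digest>  <file name>' next to the file (naming is an assumption)."""
    checksum_path = path.with_name(path.name + ".sha512")
    checksum_path.write_text(f"{sha512_of_file(path)}  {path.name}\n")
    return checksum_path
```

The two-space `<digest>  <file name>` layout matches what `sha512sum` prints, so a file written this way can be checked with `sha512sum -c`.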