Pipeline Steps
1. Split genome into sub-intervals for parallelization
Split the reference genome into intervals for parallel processing. If params.parallelize_by_chromosome is set, the genome is split by chromosome; otherwise it is split into up to params.scatter_count intervals.
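The two splitting modes can be illustrated with a minimal Python sketch. This is not the pipeline's implementation: the function name split_genome, the Interval tuple, and the even-by-base-count strategy are assumptions made for illustration; only the two parameter names come from the description above.

```python
from typing import Dict, List, Tuple

# Hypothetical interval representation: (contig, 1-based start, inclusive end)
Interval = Tuple[str, int, int]


def split_genome(
    contig_lengths: Dict[str, int],
    parallelize_by_chromosome: bool,
    scatter_count: int,
) -> List[List[Interval]]:
    """Return groups of intervals; each group would be processed by one parallel job."""
    if scatter_count < 1:
        raise ValueError("scatter_count must be at least 1")

    if parallelize_by_chromosome:
        # One group per contig, each covering the whole contig.
        return [[(name, 1, length)] for name, length in contig_lengths.items()]

    # Assumed strategy: give each group roughly the same number of bases.
    total = sum(contig_lengths.values())
    target = -(-total // scatter_count)  # ceiling division

    groups: List[List[Interval]] = []
    current: List[Interval] = []
    room = target  # bases still available in the current group
    for name, length in contig_lengths.items():
        start = 1
        while start <= length:
            end = min(length, start + room - 1)
            current.append((name, start, end))
            room -= end - start + 1
            start = end + 1
            if room == 0:
                groups.append(current)
                current, room = [], target
    if current:
        groups.append(current)
    return groups
```

Because each group holds at most ceil(total / scatter_count) bases, the sketch never produces more than scatter_count groups, which matches the "up to" wording above.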
2. Realign indels
Generate indel realignment targets and realign indels per interval.
3. Generate BQSR (Base Quality Score Recalibration)
Assess how sequencing errors correlate with four covariates (assigned quality score, read group, machine cycle producing the base, and the current base together with the immediately upstream base) and output a base quality score recalibration table.
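The idea behind a recalibration table can be sketched in Python as follows. This is a simplified illustration, not the recalibration tool's actual algorithm: the function name empirical_qualities, the input layout, and the add-one smoothing are assumptions, and the real estimator may differ. It counts mismatches per covariate bin and converts the observed error rate to a Phred-scaled quality.

```python
import math
from collections import defaultdict
from typing import Dict, Hashable, Iterable, Tuple


def empirical_qualities(
    observations: Iterable[Tuple[Hashable, bool]],
) -> Dict[Hashable, float]:
    """Map each covariate bin to an empirical Phred quality.

    Each observation is (covariate_key, is_mismatch), where covariate_key is any
    hashable bin label, e.g. (read_group, reported_quality). Hypothetical layout.
    """
    counts = defaultdict(lambda: [0, 0])  # key -> [total bases, mismatches]
    for key, is_mismatch in observations:
        counts[key][0] += 1
        counts[key][1] += int(is_mismatch)

    table = {}
    for key, (total, errors) in counts.items():
        # Add-one smoothing (an assumption) avoids log10(0) for error-free bins.
        error_rate = (errors + 1) / (total + 2)
        table[key] = -10 * math.log10(error_rate)
    return table
```

A bin whose reported quality is 30 but whose empirical quality comes out near 20 would indicate that the sequencer overestimated its confidence for bases in that bin.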
4. Apply BQSR per split interval in parallel
Apply the base quality score recalibration to each interval and reheader output as necessary.
5. Merge interval-level BAMs
Merge BAMs from each interval to generate a whole sample BAM.
5a. Deduplicate BAM
If params.parallelize_by_chromosome is not set, run a deduplication process to remove reads that were duplicated because they overlap interval split sites.
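A read that spans a split site can be emitted by both neighboring intervals, so the merged BAM contains two identical copies of it. A minimal Python sketch of removing such copies from a coordinate-sorted stream is shown below. The record layout and the function name drop_exact_duplicates are hypothetical, and this is not the tool the pipeline runs; it only illustrates the idea.

```python
from typing import Iterable, Iterator, Tuple

# Hypothetical record: (contig, position, read name, flag)
Record = Tuple[str, int, str, int]


def drop_exact_duplicates(records: Iterable[Record]) -> Iterator[Record]:
    """Yield each record once, assuming input is sorted by (contig, position).

    Identical copies share a coordinate, so only the records seen at the
    current coordinate need to be remembered.
    """
    current_key = None
    seen = set()
    for record in records:
        key = record[:2]
        if key != current_key:
            # New coordinate: earlier records can no longer be duplicated.
            current_key = key
            seen = set()
        if record not in seen:
            seen.add(record)
            yield record
```

Resetting the set at each new coordinate keeps memory use proportional to the depth at one position rather than to the size of the whole file.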
6. Index BAM file
Generate a BAI index file for fast random access of the whole sample BAM.
7. Get pileup summaries
Tabulate pileup metrics for inferring contamination. Summarize counts of reads that support reference, alternate and other alleles for given sites.
8. Calculate contamination
Calculate the fraction of reads coming from cross-sample contamination, given the results from step 7. For paired samples, generate an additional output table containing the segmentation of the tumor by minor allele fraction.
9. DepthOfCoverage
If params.is_DOC_run is set, generate coverage summary information for the whole sample BAM from step 5, partitioned by sample, read group, and library.
10. Generate sha512 checksum
Generate sha512 checksums for the final BAM and BAI files.
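To verify a delivered file independently, the checksum can be recomputed with a short Python sketch such as the one below. It uses only the standard hashlib module; the function name sha512_of_file and the chunk size are illustrative choices, and the file name in the usage comment is a placeholder.

```python
import hashlib


def sha512_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-512 digest of a file, reading it in chunks."""
    digest = hashlib.sha512()
    with open(path, "rb") as handle:
        # Chunked reads keep memory use flat even for very large BAM files.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Example usage (placeholder file name):
# print(sha512_of_file("sample.bam"))
```

The resulting hex string can be compared with the checksum produced by the pipeline to confirm that the file was not altered or truncated in transit.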