Pipeline Steps

1. Split genome or target intervals into sub-intervals for parallelization

Split the input target intervals or the whole-genome intervals into sub-intervals for parallel processing.
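As an illustration of the splitting idea, here is a minimal Python sketch. It is not the pipeline's actual splitter: the function name `split_intervals`, the `(contig, start, end)` tuple layout with 1-based inclusive coordinates, and the fixed `max_size` chunking rule are all assumptions made for the example.

```python
from typing import List, Tuple

# Hypothetical interval layout: (contig, start, end), 1-based and inclusive.
Interval = Tuple[str, int, int]


def split_intervals(intervals: List[Interval], max_size: int) -> List[Interval]:
    """Cut each interval into consecutive pieces of at most max_size bases."""
    if max_size < 1:
        raise ValueError("max_size must be at least 1")
    pieces: List[Interval] = []
    for contig, start, end in intervals:
        pos = start
        while pos <= end:
            # The last piece of an interval may be shorter than max_size.
            piece_end = min(pos + max_size - 1, end)
            pieces.append((contig, pos, piece_end))
            pos = piece_end + 1
    return pieces


if __name__ == "__main__":
    for contig, start, end in split_intervals([("chr1", 1, 2500)], 1000):
        print(f"{contig}:{start}-{end}")
```

Each resulting sub-interval can then be handed to an independent job in Step 2.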

2. HaplotypeCaller

Run HaplotypeCaller on each sub-interval to call SNPs and INDELs, generating a raw VCF and a GVCF per interval.

3. Merge raw VCFs and GVCFs

Merge the per-interval raw VCFs and GVCFs from Step 2 into whole-sample files.
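The merge can be pictured with the simplified Python sketch below. It is a stand-in for whatever merge tool the pipeline actually runs, and it assumes that the per-interval files are uncompressed VCF text, share the same header, and are supplied in genomic order. The name `merge_vcf_texts` is hypothetical.

```python
from typing import Iterable, List


def merge_vcf_texts(chunks: Iterable[str]) -> str:
    """Concatenate per-interval VCF texts, keeping only the first header."""
    header: List[str] = []
    records: List[str] = []
    for index, text in enumerate(chunks):
        for line in text.splitlines():
            if not line:
                continue
            if line.startswith("#"):
                # Header lines are taken from the first chunk only.
                if index == 0:
                    header.append(line)
            else:
                records.append(line)
    return "\n".join(header + records) + "\n"
```

A real merge would also validate that headers agree and that records stay sorted, which this sketch does not attempt.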

4. VQSR - SNPs

Generate a VQSR (Variant Quality Score Recalibration) model for SNPs.

5. VQSR - INDELs

Generate a VQSR model for INDELs.

6. VQSR - Apply SNP model

Take the whole-sample raw VCF from Step 3 as input, and apply the model from Step 4 to produce a VCF in which only SNPs are recalibrated.

7. VQSR - Apply INDEL model

Take the output from Step 6 as input, and apply the model in Step 5 to recalibrate only INDELs.

Steps 4 through 7 model the technical profile of variants in a training set and use that model to filter probable artifacts out of the raw VCF. The result of these four steps is a recalibrated VCF.
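The chaining of Steps 6 and 7 can be sketched in Python as follows. The sketch only builds command lines and does not run them. It assumes GATK 4 `ApplyVQSR` syntax, which may differ from the version this pipeline uses. The helper names, the file-naming scheme, and the `sample` prefix are hypothetical, and options such as the reference and the truth sensitivity level are left out.

```python
from typing import List


def apply_vqsr_command(mode: str, input_vcf: str, recal_file: str,
                       tranches_file: str, output_vcf: str) -> List[str]:
    """Build (not run) an ApplyVQSR command, assuming GATK 4 option names."""
    return [
        "gatk", "ApplyVQSR",
        "-V", input_vcf,
        "--recal-file", recal_file,
        "--tranches-file", tranches_file,
        "-mode", mode,
        "-O", output_vcf,
    ]


def apply_chain(raw_vcf: str, prefix: str = "sample") -> List[List[str]]:
    """Step 6 recalibrates SNPs; Step 7 takes that output and recalibrates INDELs."""
    snp_out = f"{prefix}.snp_recalibrated.vcf.gz"  # hypothetical file name
    final_out = f"{prefix}.recalibrated.vcf.gz"    # hypothetical file name
    return [
        apply_vqsr_command("SNP", raw_vcf, f"{prefix}.snp.recal",
                           f"{prefix}.snp.tranches", snp_out),
        apply_vqsr_command("INDEL", snp_out, f"{prefix}.indel.recal",
                           f"{prefix}.indel.tranches", final_out),
    ]


if __name__ == "__main__":
    for command in apply_chain("sample.raw.vcf.gz"):
        print(" ".join(command))
```

The point of the sketch is the data flow: the INDEL pass reads the SNP-recalibrated file rather than the raw VCF, so the final output carries both recalibrations.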

8. Filter gSNP - Filter out ambiguous variants

Use a custom Perl script to filter out ambiguous variants.

9. Generate SHA-512 checksums

Generate a SHA-512 checksum for each VCF and GVCF.
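For reference, a checksum of this kind can be computed with Python's standard `hashlib` module, as in the minimal sketch below. The function name `sha512_of_file` and the 1 MiB read size are choices made for the example, and the pipeline itself may use a different tool.

```python
import hashlib


def sha512_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-512 digest of a file, reading it in blocks."""
    digest = hashlib.sha512()
    with open(path, "rb") as handle:
        # Read in blocks so that large VCF/GVCF files are never fully in memory.
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()
```

Comparing the stored digest with a freshly computed one after a transfer shows whether the file was altered or truncated.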