Pipeline Steps
1. Split genome or target intervals into sub-intervals for parallelization
Take either the input target intervals or the whole-genome intervals and split them into sub-intervals for parallel processing.
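As an illustration of the splitting step, here is a minimal Python sketch. It assumes intervals are 1-based, inclusive (contig, start, end) tuples and that sub-intervals are capped at a fixed size. The function name `split_intervals` and the `max_size` parameter are hypothetical and not taken from the pipeline itself, which may split differently (for example, by balancing total bases per chunk).

```python
from typing import List, Tuple

# (contig, start, end), assumed 1-based and inclusive
Interval = Tuple[str, int, int]


def split_intervals(intervals: List[Interval], max_size: int) -> List[Interval]:
    """Cut each interval into consecutive pieces of at most max_size bases."""
    if max_size < 1:
        raise ValueError("max_size must be a positive number of bases")
    pieces: List[Interval] = []
    for contig, start, end in intervals:
        pos = start
        while pos <= end:
            # The last piece of an interval may be shorter than max_size.
            piece_end = min(pos + max_size - 1, end)
            pieces.append((contig, pos, piece_end))
            pos = piece_end + 1
    return pieces
```

For example, `split_intervals([("chr1", 1, 250)], 100)` returns `[("chr1", 1, 100), ("chr1", 101, 200), ("chr1", 201, 250)]`, and each piece can then be handed to a separate job.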
2. HaplotypeCaller
Run HaplotypeCaller on each sub-interval to generate a VCF and a GVCF, each covering both SNPs and INDELs.
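The per-interval calls could be assembled along the following lines. This sketch assumes a GATK4-style command line (`gatk HaplotypeCaller` with `-R`, `-I`, `-L`, `-O`, and `-ERC GVCF`). The pipeline's actual GATK version, options, and file naming are not stated here, so the helper name `haplotypecaller_cmd` and every path in the usage note are hypothetical.

```python
from typing import List, Tuple

# (contig, start, end), assumed 1-based and inclusive
Interval = Tuple[str, int, int]


def haplotypecaller_cmd(
    reference: str,
    bam: str,
    interval: Interval,
    output: str,
    gvcf: bool,
) -> List[str]:
    """Build an argument list for one HaplotypeCaller run on one sub-interval.

    Assumes GATK4 syntax; older GATK releases use a different invocation.
    """
    contig, start, end = interval
    cmd = [
        "gatk", "HaplotypeCaller",
        "-R", reference,
        "-I", bam,
        "-L", f"{contig}:{start}-{end}",  # restrict calling to this sub-interval
        "-O", output,
    ]
    if gvcf:
        # Emit reference-confidence records so the output is a GVCF.
        cmd += ["-ERC", "GVCF"]
    return cmd
```

Calling the helper twice per sub-interval, once with `gvcf=False` and once with `gvcf=True`, would yield the two outputs described above. The returned list can be passed to `subprocess.run` or written into a job script.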
3. Merge raw VCFs and GVCFs
Merge the raw per-interval VCFs and GVCFs into a single whole-sample VCF and GVCF.
4. VQSR - SNPs
Generate VQSR (Variant Quality Score Recalibration) model for SNPs.
5. VQSR - INDELs
Generate VQSR model for INDELs.
6. VQSR - Apply SNP model
Take the whole-sample raw VCF from Step 3 as input and apply the model from Step 4, producing a VCF in which only SNPs are recalibrated.
7. VQSR - Apply INDEL model
Take the output from Step 6 as input, and apply the model in Step 5 to recalibrate only INDELs.
Steps 4 through 7 model the technical profile of variants in a training set and use that model to filter probable artifacts out of the raw VCF. After these four steps, a recalibrated VCF is generated.
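The ordering of Steps 6 and 7 matters: the INDEL pass reads the output of the SNP pass, not the raw VCF. The sketch below shows that chaining, assuming GATK4's `ApplyVQSR` tool and its `-V`, `-O`, `--recal-file`, `--tranches-file`, `--truth-sensitivity-filter-level`, and `-mode` arguments. The function names, the output file names, and the default sensitivity of 99.0 are placeholders for illustration, not values taken from this pipeline.

```python
from typing import List, Tuple

# (recal_file, tranches_file) produced by a model-building step
Model = Tuple[str, str]


def apply_vqsr_cmd(
    input_vcf: str,
    model: Model,
    mode: str,
    output_vcf: str,
    sensitivity: float = 99.0,  # placeholder threshold, not the pipeline's value
) -> List[str]:
    """Build an ApplyVQSR argument list (GATK4 syntax assumed)."""
    if mode not in ("SNP", "INDEL"):
        raise ValueError("mode must be 'SNP' or 'INDEL'")
    recal_file, tranches_file = model
    return [
        "gatk", "ApplyVQSR",
        "-V", input_vcf,
        "--recal-file", recal_file,
        "--tranches-file", tranches_file,
        "--truth-sensitivity-filter-level", str(sensitivity),
        "-mode", mode,
        "-O", output_vcf,
    ]


def recalibration_chain(
    raw_vcf: str, snp_model: Model, indel_model: Model, prefix: str
) -> List[List[str]]:
    """Return the two apply commands in order: SNPs first, then INDELs."""
    snp_out = f"{prefix}.snp_recal.vcf.gz"      # hypothetical file name
    final_out = f"{prefix}.recalibrated.vcf.gz"  # hypothetical file name
    return [
        apply_vqsr_cmd(raw_vcf, snp_model, "SNP", snp_out),
        # The INDEL pass consumes the SNP-recalibrated VCF, not the raw one.
        apply_vqsr_cmd(snp_out, indel_model, "INDEL", final_out),
    ]
```

Running the two commands in sequence leaves a single VCF in which both variant types have been recalibrated.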
8. Filter gSNP - Filter out ambiguous variants
Use a custom Perl script to filter out ambiguous variants.
9. Generate SHA-512 checksums
Generate a SHA-512 checksum for each output VCF and GVCF.
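A checksum step of this kind could look like the following Python sketch, which uses the standard `hashlib` module and reads the file in chunks so that large VCFs need not fit in memory. The function names and the `.sha512` sidecar naming are assumptions for illustration; the pipeline may use a command-line tool instead.

```python
import hashlib
from pathlib import Path


def sha512_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-512 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha512()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes (EOF).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_checksum(path: str) -> Path:
    """Write '<digest>  <file name>' next to the file and return that path."""
    checksum_path = Path(str(path) + ".sha512")  # hypothetical naming scheme
    checksum_path.write_text(f"{sha512_of_file(path)}  {Path(path).name}\n")
    return checksum_path
```

The two-space "digest, file name" layout matches what the `sha512sum` utility writes, so the sidecar file can be verified with `sha512sum -c` from the directory that holds the VCF.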