Pipeline Steps

Discovery

The "discovery" branch of the call-gSV pipeline allows you to identify germline SVs and CNVs using either Delly or Manta. After variants are identified, basic quality checks are performed on their outputs.

1. Calling Structural Variants

The first step of the pipeline takes an aligned and sorted BAM file and its BAM index as inputs for variant calling with Delly or Manta.

Delly combines short-range and long-range paired-end mapping with split-read analysis to discover balanced and unbalanced SVs (deletions, tandem duplications, inversions and translocations) at single-nucleotide breakpoint resolution. SVs are called, annotated and merged into a single BCF file, which is then used to output a gzipped VCF file for user convenience. Delly's default exclude map can be supplied as an input; it removes the telomeric and centromeric regions of all human chromosomes, since these regions cannot be accurately analyzed with short-read data.

Manta calls SVs and indels from mapped paired-end sequencing reads. It is optimized for analysis of germline variation in small sets of individuals and somatic variation in tumor/normal sample pairs. Manta discovers, assembles and scores large-scale SVs, medium-sized indels and large insertions within a single efficient workflow.
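
As a rough illustration of the Delly inputs described above, the sketch below assembles a `delly call` command line. The helper name `build_delly_call` and all file names are hypothetical, and the pipeline's actual invocation may differ; the long options used (`--genome`, `--exclude`, `--outfile`, `--map-qual`) are Delly's own.

```python
from typing import List, Optional


def build_delly_call(bam: str, reference: str, out_bcf: str,
                     exclude_map: Optional[str] = None,
                     map_qual: int = 20) -> List[str]:
    """Assemble an argv list for `delly call` (illustrative sketch only).

    Delly expects the BAM index to sit next to the BAM file, so the index
    is not passed as a separate argument here.
    """
    cmd = ["delly", "call",
           "--genome", reference,
           "--outfile", out_bcf,
           "--map-qual", str(map_qual)]
    if exclude_map is not None:
        # Optional exclude map covering telomeric and centromeric regions.
        cmd += ["--exclude", exclude_map]
    cmd.append(bam)
    return cmd


# Hypothetical file names, for illustration only.
cmd = build_delly_call("sample.bam", "genome.fa", "sample_SV.bcf",
                       exclude_map="exclude_regions.tsv")
print(" ".join(cmd))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` on a system where Delly is installed.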

Currently the following filters are applied by Delly when calling SVs. Parameters with a "call-gSV default" can be updated in the nextflow.config file.

| Parameter | Delly default | call-gSV default | Description |
| --- | --- | --- | --- |
| `svtype` | ALL | | SV type to compute (DEL, INS, DUP, INV, BND, ALL) |
| `map-qual` | 1 | 20 | Minimum paired-end (PE) mapping quality |
| `qual-tra` | 20 | | Minimum PE quality for translocation |
| `mad-cutoff` | 9 | | Insert size cutoff, median+s*MAD (deletions only) |
| `minclip` | 25 | | Minimum clipping length |
| `min-clique-size` | 2 | | Minimum PE/SR clique size |
| `minrefsep` | 25 | | Minimum reference separation |
| `maxreadsep` | 40 | | Maximum read separation |
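
The relationship between the two default columns can be sketched as a simple layering: Delly's defaults apply unless call-gSV sets its own default, and a user-supplied value takes precedence over both. The dictionary and function names below are hypothetical and do not reflect the actual parameter names in `nextflow.config`; the values come from the table above.

```python
# Values taken from the table above.
DELLY_DEFAULTS = {
    "svtype": "ALL",
    "map-qual": 1,
    "qual-tra": 20,
    "mad-cutoff": 9,
    "minclip": 25,
    "min-clique-size": 2,
    "minrefsep": 25,
    "maxreadsep": 40,
}

# Only parameters with a call-gSV default appear here.
CALL_GSV_DEFAULTS = {"map-qual": 20}


def effective_params(user_overrides=None):
    """Layer the settings: Delly default < call-gSV default < user value.

    Assumption: user values are merged last; how the pipeline actually
    resolves its configuration may differ.
    """
    params = dict(DELLY_DEFAULTS)
    params.update(CALL_GSV_DEFAULTS)
    params.update(user_overrides or {})
    return params


print(effective_params()["map-qual"])                  # 20
print(effective_params({"map-qual": 30})["map-qual"])  # 30
```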

2. Calling Copy Number Variants

The second step of the pipeline identifies CNVs. To do this, Delly requires an aligned and sorted BAM file, as well as the BCF output from the SV calling step (to refine breakpoints) and a mappability map. Any CNVs identified are annotated and output as a single BCF file. For convenience, the pipeline also outputs a gzipped CNV VCF file.
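
The inputs listed above map onto a `delly cnv` command roughly as follows. This is a sketch under assumptions: the helper name `build_delly_cnv` and the file names are hypothetical, and the pipeline may pass additional options; the long options shown (`--genome`, `--mappability`, `--svfile`, `--outfile`) are Delly's own.

```python
from typing import List


def build_delly_cnv(bam: str, reference: str, mappability_map: str,
                    sv_bcf: str, out_bcf: str) -> List[str]:
    """Assemble an argv list for `delly cnv` (illustrative sketch only)."""
    return [
        "delly", "cnv",
        "--genome", reference,
        "--mappability", mappability_map,
        # BCF from the SV calling step, used to refine CNV breakpoints.
        "--svfile", sv_bcf,
        "--outfile", out_bcf,
        bam,
    ]


# Hypothetical file names, for illustration only.
cnv_cmd = build_delly_cnv("sample.bam", "genome.fa", "genome.map.fa.gz",
                          "sample_SV.bcf", "sample_CNV.bcf")
print(" ".join(cnv_cmd))
```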

Currently the following filters are applied by Delly when calling CNVs. Parameters with a "call-gSV default" can be updated in the sample-specific nextflow config file.

| Parameter | Delly default | Description |
| --- | --- | --- |
| `quality` | 10 | Minimum mapping quality |
| `ploidy` | 2 | Baseline ploidy |
| `sdrd` | 2 | Minimum SD read-depth shift |
| `cn-offset` | 0.1 | Minimum CN offset |
| `cnv-size` | 1000 | Minimum CNV size |
| `window-size` | 10000 | Window size |
| `window-offset` | 10000 | Window offset |
| `fraction-window` | 0.25 | Minimum callable window fraction [0,1] |
| `scan-window` | 10000 | Scanning window size |
| `fraction-unique` | 0.8 | Uniqueness filter for scan windows [0,1] |
| `mad-cutoff` | 3 | Median + 3 * MAD count cutoff |
| `percentile` | 0.0005 | Excl. extreme GC fraction |
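
To make the `mad-cutoff` description concrete, the sketch below computes a "median + k * MAD" threshold on a small set of made-up counts. It only illustrates the arithmetic; it is not Delly's implementation, and the function name and input values are hypothetical.

```python
from statistics import median


def mad_cutoff(values, k=3):
    """Return median + k * MAD, where MAD is the median absolute deviation."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    return med + k * mad


# Made-up per-window counts with one obvious outlier (50).
counts = [10, 12, 11, 13, 12, 50]
cutoff = mad_cutoff(counts, k=3)
print(cutoff)  # 15.0 (median 12, MAD 1)
print([c for c in counts if c > cutoff])  # [50]
```

Because both the median and the MAD are robust to outliers, the single extreme value barely moves the threshold, which is why this kind of cutoff is preferred over mean plus standard deviation for skewed count data.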

3. Check Output Quality

For Delly, VCF files are generated from the BCFs so that the vcf-validate command from VCFtools and the vcfstats command from RTG Tools can be run on them. The outputs of both provide preliminary summary statistics that can be viewed and evaluated in preparation for downstream cohort-wide re-calling and re-genotyping. In the Manta branch of the pipeline, a stats directory is generated under the specific output directory /Manta-/results/stats, where information about the identified SVs can be found.
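
For a quick look at a call set alongside those tools, a small script can tally records by the `SVTYPE` key in the VCF INFO column. This is a minimal sketch that is independent of vcf-validate and vcfstats and does not reproduce their output; the function name and the example records are made up for illustration.

```python
from collections import Counter


def count_svtypes(vcf_lines):
    """Count VCF records per SVTYPE value found in the INFO column."""
    counts = Counter()
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header lines and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 8:
            continue  # not a complete VCF record
        svtype = "UNKNOWN"
        for entry in fields[7].split(";"):
            if entry.startswith("SVTYPE="):
                svtype = entry.split("=", 1)[1]
                break
        counts[svtype] += 1
    return counts


# Made-up records, for illustration only.
example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t1000\tDEL001\tN\t<DEL>\t.\tPASS\tPRECISE;SVTYPE=DEL;END=2000",
    "chr1\t5000\tDUP001\tN\t<DUP>\t.\tPASS\tIMPRECISE;SVTYPE=DUP;END=9000",
    "chr2\t3000\tDEL002\tN\t<DEL>\t.\tLowQual\tPRECISE;SVTYPE=DEL;END=3500",
]
svtype_counts = count_svtypes(example)
print(dict(svtype_counts))  # {'DEL': 2, 'DUP': 1}
```

To run this on a gzipped VCF, the lines can be read with `gzip.open(path, "rt")` from the standard library.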

Regenotyping

The "regenotyping" branch of the call-gSV pipeline allows you to regenotype previously identified SVs or CNVs using Delly.

1. Regenotyping Structural Variants

As in the "discovery" process, the first step of the regenotyping pipeline takes an aligned and sorted BAM file, its BAM index, and a merged sites BCF (from the merge-SVsites pipeline) as inputs for SV regenotyping with Delly. The provided sample is genotyped against the merged sites list. SVs are annotated and merged into a single BCF file. Delly's default exclude map can be supplied as an input; it removes the telomeric and centromeric regions of all human chromosomes, since these regions cannot be accurately analyzed with short-read data.

2. Regenotyping Copy Number Variants

The second possible step of the regenotyping pipeline requires an aligned and sorted BAM file, BAM index, and a merged sites BCF as inputs, as well as the BCF output from the initial SV calling (to refine breakpoints) and a mappability map. Any CNVs identified are annotated and output as a single BCF file.
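
The two regenotyping steps differ from discovery mainly in that a merged sites BCF is supplied to Delly. The sketch below shows how the inputs might be assembled for either step. The helper name `build_regenotype_cmd` and the file names are hypothetical, and the pipeline's real commands may include further options; `--vcffile` is the Delly option for passing a sites file for genotyping.

```python
from typing import List, Optional


def build_regenotype_cmd(mode: str, bam: str, reference: str,
                         sites_bcf: str, out_bcf: str,
                         exclude_map: Optional[str] = None,
                         mappability_map: Optional[str] = None,
                         sv_bcf: Optional[str] = None) -> List[str]:
    """Assemble a Delly regenotyping argv list (illustrative sketch only)."""
    if mode == "SV":
        cmd = ["delly", "call", "--genome", reference,
               "--vcffile", sites_bcf, "--outfile", out_bcf]
        if exclude_map is not None:
            cmd += ["--exclude", exclude_map]
    elif mode == "CNV":
        if mappability_map is None:
            raise ValueError("CNV regenotyping needs a mappability map")
        cmd = ["delly", "cnv", "--genome", reference,
               "--mappability", mappability_map,
               "--vcffile", sites_bcf, "--outfile", out_bcf]
        if sv_bcf is not None:
            # BCF from the initial SV calling, used to refine breakpoints.
            cmd += ["--svfile", sv_bcf]
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    cmd.append(bam)
    return cmd


# Hypothetical file names, for illustration only.
sv_cmd = build_regenotype_cmd("SV", "sample.bam", "genome.fa",
                              "merged_sites.bcf", "sample_regeno_SV.bcf")
cnv_regeno_cmd = build_regenotype_cmd("CNV", "sample.bam", "genome.fa",
                                      "merged_sites.bcf", "sample_regeno_CNV.bcf",
                                      mappability_map="genome.map.fa.gz",
                                      sv_bcf="sample_SV.bcf")
print(" ".join(sv_cmd))
print(" ".join(cnv_regeno_cmd))
```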