Pipeline Steps

Discovery

The "discovery" branch of the call-gSV pipeline identifies germline structural variants (SVs) and copy number variants (CNVs) using either Delly or Manta. After variants are called, basic quality checks are run on the outputs.

1. Calling Structural Variants

The first step of the pipeline takes an aligned and sorted BAM file and its BAM index as input for variant calling with Delly or Manta.

Delly combines short-range and long-range paired-end mapping with split-read analysis to discover balanced and unbalanced SVs (deletions, tandem duplications, inversions and translocations) at single-nucleotide breakpoint resolution. SVs are called, annotated and merged into a single BCF file. Delly's default exclude map can be supplied as an additional input; it removes the telomeric and centromeric regions of all human chromosomes, which cannot be analyzed accurately with short-read data.

Manta calls SVs and indels from mapped paired-end sequencing reads. It is optimized for analysis of germline variation in small sets of individuals and somatic variation in tumor/normal sample pairs. Manta discovers, assembles and scores large-scale SVs, medium-sized indels and large insertions within a single efficient workflow.
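For the Delly path, the inputs described above map onto a `delly call` invocation roughly as sketched here. This is an illustration only, not the pipeline's actual process definition: the helper name, file names and argument order are assumptions, and the function only assembles the argument list without running Delly. The `--map-qual` value of 20 mirrors the call-gSV default listed in the parameter table.

```python
from typing import List, Optional


def delly_call_command(
    bam: str,
    reference_fasta: str,
    out_bcf: str,
    exclude_map: Optional[str] = None,
    map_qual: int = 20,
) -> List[str]:
    """Assemble (but do not execute) a `delly call` command line.

    All file names are hypothetical placeholders supplied by the caller.
    """
    cmd = [
        "delly", "call",
        "--genome", reference_fasta,
        "--outfile", out_bcf,
        "--map-qual", str(map_qual),
    ]
    # The exclude map is optional, so only add the flag when one is given.
    if exclude_map is not None:
        cmd += ["--exclude", exclude_map]
    # The BAM file is the positional argument and goes last.
    cmd.append(bam)
    return cmd


if __name__ == "__main__":
    example = delly_call_command(
        "sample.bam", "genome.fa", "sample_SV.bcf", exclude_map="human.excl.tsv"
    )
    print(" ".join(example))
```

Returning a list rather than a single string keeps the command safe to hand to `subprocess.run` without shell quoting concerns.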

Currently the following filters are applied by Delly when calling SVs. Parameters with a "call-gSV default" can be updated in the nextflow.config file.

| Parameter | Delly default | call-gSV default | Description |
|---|---|---|---|
| svtype | ALL | | SV type to compute (DEL, INS, DUP, INV, BND, ALL) |
| map-qual | 1 | 20 | Minimum paired-end (PE) mapping quality |
| qual-tra | 20 | | Minimum PE quality for translocation |
| mad-cutoff | 9 | | Insert size cutoff, median+s*MAD (deletions only) |
| minclip | 25 | | Minimum clipping length |
| min-clique-size | 2 | | Minimum PE/SR clique size |
| minrefsep | 25 | | Minimum reference separation |
| maxreadsep | 40 | | Maximum read separation |
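The `mad-cutoff` description, median+s*MAD, can be made concrete with a small worked example. The sketch below computes the cutoff from a list of insert sizes using the plain median absolute deviation. The function name and the sample values are made up for illustration, and Delly's internal computation may differ in detail.

```python
from statistics import median
from typing import Sequence


def insert_size_cutoff(insert_sizes: Sequence[float], s: float = 9) -> float:
    """Return median + s * MAD for the given insert sizes.

    MAD is the median of absolute deviations from the median.
    The default s=9 matches the mad-cutoff default in the table.
    """
    med = median(insert_sizes)
    mad = median([abs(x - med) for x in insert_sizes])
    return med + s * mad


if __name__ == "__main__":
    # Hypothetical insert sizes: median is 300, MAD is 10.
    sizes = [280, 290, 300, 310, 320]
    print(insert_size_cutoff(sizes))  # 300 + 9 * 10 = 390
```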

2. Calling Copy Number Variants

The second step of the pipeline identifies CNVs. To do this, Delly requires an aligned and sorted BAM file, as well as the BCF output from the SV calling step (to refine breakpoints) and a mappability map. Any CNVs identified are annotated and output as a single BCF file.
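The three inputs named above (BAM, SV BCF and mappability map) correspond to options of `delly cnv`. The following sketch assembles such a command without executing it. The helper name and file names are placeholders rather than names used by call-gSV, and a reference FASTA is assumed as an additional input.

```python
from typing import List


def delly_cnv_command(
    bam: str,
    reference_fasta: str,
    mappability_map: str,
    sv_bcf: str,
    out_bcf: str,
) -> List[str]:
    """Assemble (but do not execute) a `delly cnv` command line."""
    return [
        "delly", "cnv",
        "--genome", reference_fasta,
        "--mappability", mappability_map,
        # SV calls from the previous step, used to refine CNV breakpoints.
        "--svfile", sv_bcf,
        "--outfile", out_bcf,
        bam,
    ]


if __name__ == "__main__":
    print(" ".join(delly_cnv_command(
        "sample.bam", "genome.fa", "genome.map.fa.gz",
        "sample_SV.bcf", "sample_CNV.bcf",
    )))
```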

Currently the following filters are applied by Delly when calling CNVs. Parameters with a "call-gSV default" can be updated in the sample-specific nextflow config file.

| Parameter | Delly default | call-gSV default | Description |
|---|---|---|---|
| quality | 10 | | Minimum mapping quality |
| ploidy | 2 | | Baseline ploidy |
| sdrd | 2 | | Minimum SD read-depth shift |
| cn-offset | 0.100000001 | | Minimum CN offset |
| cnv-size | 1000 | | Minimum CNV size |
| window-size | 10000 | | Window size |
| window-offset | 10000 | | Window offset |
| fraction-window | 0.25 | | Minimum callable window fraction [0,1] |
| scan-window | 10000 | | Scanning window size |
| fraction-unique | 0.800000012 | | Uniqueness filter for scan windows [0,1] |
| mad-cutoff | 3 | | Median + 3 * mad count cutoff |
| percentile | 0.000500000024 | | Exclude extreme GC fraction |

3. Check Output Quality

For Delly, the BCF files are converted to VCF so that vcf-validator from VCFtools and vcfstats from RTG Tools can be run on them. Outputs from both provide preliminary summary statistics that can be reviewed in preparation for downstream cohort-wide re-calling and re-genotyping. In the Manta branch of the pipeline, a stats directory is generated under the specific output directory /Manta-/results/stats, which contains information about the SVs identified.
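The Delly quality-check sequence can be pictured as three commands: convert the BCF to VCF, validate it, then summarize it. The sketch below lists them without running anything. Using `bcftools view` for the conversion is an assumption, since the text does not say which tool the pipeline uses, and the helper and file names are hypothetical.

```python
from pathlib import Path
from typing import List


def qc_commands(bcf: str) -> List[List[str]]:
    """Return the QC commands for one Delly BCF, in execution order."""
    # Derive the VCF name by swapping the extension, e.g. x.bcf -> x.vcf.
    vcf = str(Path(bcf).with_suffix(".vcf"))
    return [
        # Assumed conversion step: write an uncompressed VCF.
        ["bcftools", "view", "-O", "v", "-o", vcf, bcf],
        # VCFtools validator.
        ["vcf-validator", vcf],
        # RTG Tools summary statistics.
        ["rtg", "vcfstats", vcf],
    ]


if __name__ == "__main__":
    for command in qc_commands("sample_SV.bcf"):
        print(" ".join(command))
```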

Regenotyping

The "regenotyping" branch of the call-gSV pipeline allows you to regenotype previously identified SVs or CNVs using Delly.

1. Regenotyping Structural Variants

As in the "discovery" branch, the first step of the regenotyping pipeline takes an aligned and sorted BAM file and its BAM index as inputs, plus a merged sites BCF (from the merge-SVsites pipeline) for SV regenotyping with Delly. The provided sample is genotyped against the merged sites list. SVs are annotated and merged into a single BCF file. Delly's default exclude map can be supplied as an additional input; it removes the telomeric and centromeric regions of all human chromosomes, which cannot be analyzed accurately with short-read data.
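Regenotyping differs from discovery mainly in that a sites file is passed to `delly call`. A minimal sketch, assuming a reference FASTA is also supplied and using placeholder file names; the function only builds the argument list and is not taken from the pipeline's code.

```python
from typing import List, Optional


def delly_regenotype_command(
    bam: str,
    reference_fasta: str,
    sites_bcf: str,
    out_bcf: str,
    exclude_map: Optional[str] = None,
) -> List[str]:
    """Assemble (but do not execute) a `delly call` regenotyping command."""
    cmd = [
        "delly", "call",
        "--genome", reference_fasta,
        # Merged sites list to genotype the sample against.
        "--vcffile", sites_bcf,
        "--outfile", out_bcf,
    ]
    if exclude_map is not None:
        cmd += ["--exclude", exclude_map]
    cmd.append(bam)
    return cmd


if __name__ == "__main__":
    print(" ".join(delly_regenotype_command(
        "sample.bam", "genome.fa", "merged_sites.bcf", "sample_regenotyped.bcf"
    )))
```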

2. Regenotyping Copy Number Variants

The second, optional step of the regenotyping pipeline takes an aligned and sorted BAM file, its BAM index, and a merged sites BCF as inputs, together with the BCF output from the initial SV calling (to refine breakpoints) and a mappability map. Any CNVs identified are annotated and output as a single BCF file.