Pipeline Steps
Discovery
The "discovery" branch of the call-gSV pipeline identifies germline structural variants (SVs) and copy number variants (CNVs) using either Delly or Manta. After variants are identified, basic quality checks are run on the outputs of each process.
1. Calling Structural Variants
The first step of the pipeline takes an aligned and sorted BAM file and its BAM index as input for variant calling with Delly or Manta.

Delly combines short-range and long-range paired-end mapping with split-read analysis to discover balanced and unbalanced SVs (deletions, tandem duplications, inversions and translocations) at single-nucleotide breakpoint resolution. SVs are called, annotated and merged into a single BCF file. Delly's default exclude map can be supplied as an additional input; it removes the telomeric and centromeric regions of all human chromosomes, since these regions cannot be accurately analyzed with short-read data.

Manta calls SVs and indels from mapped paired-end sequencing reads. It is optimized for analysis of germline variation in small sets of individuals and somatic variation in tumor/normal sample pairs. Manta discovers, assembles and scores large-scale SVs, medium-sized indels and large insertions within a single efficient workflow.
Currently the following filters are applied by Delly when calling SVs. Parameters with a "call-gSV default" can be updated in the nextflow.config file.
Parameter | Delly default | call-gSV default | Description |
---|---|---|---|
svtype | ALL | | SV type to compute (DEL, INS, DUP, INV, BND, ALL) |
map-qual | 1 | 20 | Minimum paired-end (PE) mapping quality |
qual-tra | 20 | | Minimum PE quality for translocation |
mad-cutoff | 9 | | Insert size cutoff, median+s*MAD (deletions only) |
minclip | 25 | | Minimum clipping length |
min-clique-size | 2 | | Minimum PE/SR clique size |
minrefsep | 25 | | Minimum reference separation |
maxreadsep | 40 | | Maximum read separation |
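
As an illustration of how the defaults in the table above could be turned into a command line, here is a minimal Python sketch. It is not the pipeline's actual implementation. The helper `build_delly_call` and the file names are hypothetical, and the `--genome`, `--exclude` and `--outfile` options reflect our understanding of Delly's command-line interface; confirm them (and the parameter names from the table) against `delly call --help` for your installed version.

```python
import shlex

# Delly defaults, copied from the table above.
DELLY_CALL_DEFAULTS = {
    "svtype": "ALL",
    "map-qual": 1,
    "qual-tra": 20,
    "mad-cutoff": 9,
    "minclip": 25,
    "min-clique-size": 2,
    "minrefsep": 25,
    "maxreadsep": 40,
}

# call-gSV defaults from the table; only map-qual differs from Delly's own default.
CALL_GSV_DEFAULTS = {"map-qual": 20}


def build_delly_call(bam, reference, out_bcf, exclude=None, overrides=None):
    """Return an argv list for `delly call` (hypothetical helper, for illustration)."""
    # Later dictionaries win: Delly default < call-gSV default < user override.
    params = {**DELLY_CALL_DEFAULTS, **CALL_GSV_DEFAULTS, **(overrides or {})}
    unknown = set(params) - set(DELLY_CALL_DEFAULTS)
    if unknown:
        raise ValueError(f"unknown Delly parameter(s): {sorted(unknown)}")

    argv = ["delly", "call", "--genome", reference, "--outfile", out_bcf]
    if exclude is not None:
        # Optional exclude map (telomeric and centromeric regions).
        argv += ["--exclude", exclude]
    for name, value in params.items():
        argv += [f"--{name}", str(value)]
    argv.append(bam)  # the BAM index is expected to sit next to the BAM file
    return argv


# Placeholder file names; print the command instead of running it.
print(shlex.join(build_delly_call("sample.bam", "ref.fa", "sample.bcf",
                                  exclude="exclude.tsv")))
```

Layering the dictionaries in this order mirrors the table: a value in the "call-gSV default" column replaces the Delly default, and a user-supplied override replaces both.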
2. Calling Copy Number Variants
The second step of the pipeline identifies CNVs. To do this, Delly requires an aligned and sorted BAM file, as well as the BCF output from the SV calling step (to refine breakpoints) and a mappability map. Any CNVs identified are annotated and output as a single BCF file.
Currently the following filters are applied by Delly when calling CNVs. Parameters with a "call-gSV default" can be updated in the sample-specific nextflow config file.
Parameter | Delly default | call-gSV default | Description |
---|---|---|---|
quality | 10 | | Minimum mapping quality |
ploidy | 2 | | Baseline ploidy |
sdrd | 2 | | Minimum SD read-depth shift |
cn-offset | 0.100000001 | | Minimum CN offset |
cnv-size | 1000 | | Minimum CNV size |
window-size | 10000 | | Window size |
window-offset | 10000 | | Window offset |
fraction-window | 0.25 | | Minimum callable window fraction [0,1] |
scan-window | 10000 | | Scanning window size |
fraction-unique | 0.800000012 | | Uniqueness filter for scan windows [0,1] |
mad-cutoff | 3 | | Median + 3 * mad count cutoff |
percentile | 0.000500000024 | | Excl. extreme GC fraction |
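
The inputs described for this step (BAM, SV BCF for breakpoint refinement, mappability map) can be pictured as a single `delly cnv` invocation. The following Python sketch is an assumption-laden illustration rather than the pipeline's code: `build_delly_cnv` and all file names are hypothetical, and the `--genome`, `--mappability`, `--svfile` and `--outfile` options are our reading of Delly's CLI, to be checked against `delly cnv --help`.

```python
def build_delly_cnv(bam, reference, mappability, sv_bcf, out_bcf, **params):
    """Return an argv list for `delly cnv` (hypothetical helper, for illustration).

    Keyword arguments use underscores in place of hyphens, so
    cnv_size=5000 becomes `--cnv-size 5000`. Parameters that are not
    passed are left to Delly's own defaults from the table above.
    """
    argv = [
        "delly", "cnv",
        "--genome", reference,
        "--mappability", mappability,  # mappability map required for CNV calling
        "--svfile", sv_bcf,            # SV calls used to refine CNV breakpoints
        "--outfile", out_bcf,
    ]
    for name, value in params.items():
        argv += ["--" + name.replace("_", "-"), str(value)]
    argv.append(bam)
    return argv


# Placeholder file names; only the minimum CNV size is overridden here.
cmd = build_delly_cnv("sample.bam", "ref.fa", "ref.map", "sample.sv.bcf",
                      "sample.cnv.bcf", cnv_size=5000)
print(" ".join(cmd))
```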
3. Check Output Quality
For Delly, VCF files are generated from the BCFs so that the vcf-validate command from VCFTools and vcfstats from RTGTools can be run on them. The outputs of both tools provide preliminary summary statistics that can be viewed and evaluated in preparation for downstream cohort-wide re-calling and re-genotyping. In the Manta branch of the pipeline, a stats directory is generated under the corresponding output directory.
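
The Delly quality-check sequence can be sketched as three commands per BCF file. This Python example only builds the command lists and does not run them. Several details are assumptions: the text above does not say which tool performs the BCF-to-VCF conversion, so `bcftools view` is used here as a stand-in; the validator is invoked as `vcf-validator`, the executable name VCFtools uses in the distributions we are aware of; and `qc_commands` is a hypothetical helper.

```python
from pathlib import Path


def qc_commands(bcf_path):
    """Return the QC command lists for one Delly BCF (hypothetical helper)."""
    bcf = Path(bcf_path)
    vcf = bcf.with_suffix(".vcf")  # e.g. sample.bcf -> sample.vcf
    return [
        # Assumption: convert BCF to uncompressed VCF with bcftools.
        ["bcftools", "view", "--output-type", "v", "--output", str(vcf), str(bcf)],
        # Assumption: VCFtools validator executable name.
        ["vcf-validator", str(vcf)],
        # RTG Tools summary statistics.
        ["rtg", "vcfstats", str(vcf)],
    ]


for command in qc_commands("output/sample.bcf"):
    print(" ".join(command))
```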
Regenotyping
The "regenotyping" branch of the call-gSV pipeline allows you to regenotype previously identified SVs or CNVs using Delly.
1. Regenotyping Structural Variants
As in the "discovery" branch, the first step of the regenotyping pipeline requires an aligned and sorted BAM file and its BAM index; in addition, it requires a merged sites BCF (from the merge-SVsites pipeline) as input for SV regenotyping with Delly. The provided sample is genotyped against the merged sites list. SVs are annotated and merged into a single BCF file. Delly's default exclude map can be supplied as an additional input; it removes the telomeric and centromeric regions of all human chromosomes, since these regions cannot be accurately analyzed with short-read data.
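
The difference from discovery is the extra sites list. As a rough Python sketch (not the pipeline's actual code), regenotyping can be pictured as a `delly call` invocation that also receives the merged sites BCF. The helper `build_delly_regenotype` and the file names are hypothetical, and `--vcffile` is our understanding of the Delly option that supplies sites for genotyping; verify it with `delly call --help`.

```python
def build_delly_regenotype(bam, reference, sites_bcf, out_bcf, exclude=None):
    """Return an argv list for genotyping known sites (hypothetical helper)."""
    argv = [
        "delly", "call",
        "--genome", reference,
        "--vcffile", sites_bcf,  # merged sites list from the merge-SVsites pipeline
        "--outfile", out_bcf,
    ]
    if exclude is not None:
        # Optional exclude map (telomeric and centromeric regions).
        argv += ["--exclude", exclude]
    argv.append(bam)
    return argv


# Placeholder file names; print the command instead of running it.
print(" ".join(build_delly_regenotype("sample.bam", "ref.fa",
                                      "merged_sites.bcf", "sample.geno.bcf")))
```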
2. Regenotyping Copy Number Variants
The second possible step of the regenotyping pipeline requires an aligned and sorted BAM file, BAM index, and a merged sites BCF as inputs, as well as the BCF output from the initial SV calling (to refine breakpoints) and a mappability map. Any CNVs identified are annotated and output as a single BCF file.