Testing and Validation
Test Data Set
Testing was performed using aligned and sorted BAMs generated with bwa-mem2-2.1
against the GRCh38 reference (SMC-HET was aligned against hs37d5):
- A-mini: BWA-MEM2-2.1_TEST0000000_TWGSAMIN000001-T001-S01-F.bam
- A-partial: BWA-MEM2-2.1_TEST0000000_TWGSAPRT000001-T001-S01-F.bam
- A-full: a-full-CPCG0196-B1.bam*
- A-partial: CPCG0196-B1-downsampled-a-partial-sorted.bam*
- SMC-HET: HG002.N.bam
* Delly v1.1.3 introduced a coverage check that assesses coverage quality in a given window before CNV calling. Successful CNV calling was observed on samples with coverage across the genome, such as a-full-CPCG0196-B1.bam and CPCG0196-B1-downsampled-a-partial-sorted.bam (WGS samples). For more details, please refer to Discussion #64.
Test runs for the A-mini/partial/full samples were performed using the following reference files:
- reference_fasta: /hot/ref/reference/GRCh38-BI-20160721/Homo_sapiens_assembly38.fasta
- exclusion_file: /hot/ref/tool-specific-input/Delly/GRCh38/human.hg38.excl.tsv
- mappability_map: /hot/ref/tool-specific-input/Delly/GRCh38/Homo_sapiens.GRCh38.dna.primary_assembly.fa.r101.s501.blacklist.gz
Performance Validation
with Delly <= v0.9.1 in the pipeline
Testing was performed primarily on the Boutros Lab Slurm Development cluster; additional functional tests were performed on the SGE cluster on 2021-02-26 and on the Slurm Covid cluster. The metrics below will be updated where relevant as further testing and tuning results become available.
Test Case | Test Date | Node Type | Duration | CPU Hours | Memory Usage (RAM), peak RSS |
---|---|---|---|---|---|
A-mini | 2021-02-12 | F2 | 1m 29s | a few seconds | 208.8 MB |
A-partial | 2021-02-10 | F72 | 42m 5s | 48.8 | 8.9 GB |
A-full | 2021-02-10 | F72 | 7h 10m 43s | 509.0 | 10.9 GB |
SMC-HET | 2021-02-12 | F72 | 3h 10m 0s | 223.5 | 8.9 GB |
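
To relate the Duration and CPU Hours columns, the following minimal sketch parses a duration string from the table and divides CPU hours by wall-clock hours to estimate the average number of busy cores. The helper names `parse_duration_hours` and `average_cores` are hypothetical and not part of the pipeline, and the calculation assumes that CPU hours are summed across all cores used by the run.

```python
import re


def parse_duration_hours(text: str) -> float:
    """Convert a duration such as '7h 10m 43s' into fractional hours."""
    seconds_per_unit = {"h": 3600, "m": 60, "s": 1}
    parts = re.findall(r"(\d+)\s*([hms])", text)
    if not parts:
        raise ValueError(f"unrecognized duration: {text!r}")
    total_seconds = sum(int(value) * seconds_per_unit[unit] for value, unit in parts)
    return total_seconds / 3600


def average_cores(cpu_hours: float, duration: str) -> float:
    """Estimate average busy cores (assumes CPU hours are summed over cores)."""
    return cpu_hours / parse_duration_hours(duration)


if __name__ == "__main__":
    # Values copied from the A-partial and A-full rows of the table above.
    rows = [("A-partial", "42m 5s", 48.8), ("A-full", "7h 10m 43s", 509.0)]
    for name, duration, cpu_hours in rows:
        print(f"{name}: ~{average_cores(cpu_hours, duration):.1f} cores on average")
```

An estimate close to the node's core count suggests the run kept most cores busy for its duration; a much lower value suggests the run was largely serial or I/O bound.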
with Delly >= v1.0.3 in the pipeline
Metrics below are based on the integration of Delly v1.1.3 in the call-gSV pipeline.
Test Case | Test Date | Node Type | Duration | CPU Hours | Memory Usage (RAM), peak RSS |
---|---|---|---|---|---|
CPCG0196-B1 A-partial | 2022-08-08 | F72 | 1h 9m 15s | 2.2 | 10.85 GB |
CPCG0196-B1 A-full | 2022-08-06 | F72 | 21h 3m 19s | 37.3 | 24.68 GB |
Quality Check Result Comparison
Metric | A-mini | A-partial | A-full | SMC-HET | Source |
---|---|---|---|---|---|
Count Pass | 3 | 2593 | 62704 | 15196 | (grep -c -w "PASS" filename.vcf) - 1 |
Count Deletion | 2 | 1475 | 49433 | 9317 | grep -c -w "SVTYPE=DEL" filename.vcf |
Count Duplication | 1 | 170 | 2311 | 1705 | grep -c -w "SVTYPE=DUP" filename.vcf |
Count Inversion | 0 | 317 | 2801 | 2197 | grep -c -w "SVTYPE=INV" filename.vcf |
Count Translocation | 0 | 384 | 7439 | 0 | grep -c -w "SVTYPE=BND" filename.vcf |
Count Insertion | 0 | 267 | 1265 | 2059 | grep -c -w "SVTYPE=INS" filename.vcf |
PRECISE Calls | 3 | 1850 | 11541 | 8267 | grep -c -w "PRECISE" filename.vcf |
IMPRECISE Calls | 2 | 764 | 51709 | 7012 | grep -c -w "IMPRECISE" filename.vcf |
Failed Filters | 0 | 653 | 44991 | 2619 | .stats.txt |
Passed Filters | 3 | 1959 | 18257 | 12658 | .stats.txt |
SV breakends | 0 | 219 | 1124 | 0 | .stats.txt |
Symbolic SVs | 2 | 1559 | 12500 | 11156 | .stats.txt |
Same as reference | 1 | 263 | 4595 | 1471 | .stats.txt |
Missing Genotype | 0 | 8 | 38 | 31 | .stats.txt |
Total Het/Hom ratio | (2/0) | 1.00 (843/845) | 2.37 (9580/4044) | 1.86 (7251/3905) | .stats.txt |
Breakend Het/Hom ratio | (0/0) | 0.84 (59/70) | 13.41 (1046/78) | (0/0) | .stats.txt |
Symbolic SV Het/Hom ratio | (2/0) | 1.01 (784/775) | 2.15 (8534/3966) | 1.86 (7251/3905) | .stats.txt |
Duplicate entries | 0 errors total | 1 error chr8:3893339 | 1 error chr1:16050024 | 1 error chr1:187464829 | .validate.txt |
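
The `grep -c -w` commands in the Source column count every matching line, header lines included. As a sketch of the same tallies restricted to variant records, the following Python reads VCF text and counts records by FILTER value, `SVTYPE`, and the `PRECISE`/`IMPRECISE` flags. The function name `summarize_sv_records` and the two-record sample are hypothetical, made up for illustration; they are not pipeline output.

```python
from collections import Counter


def summarize_sv_records(vcf_text: str) -> Counter:
    """Tally PASS, SVTYPE=..., PRECISE and IMPRECISE over VCF data lines only."""
    counts = Counter()
    for line in vcf_text.splitlines():
        # Skip blank lines and header lines ('##' meta lines and '#CHROM').
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 8:
            continue  # not a complete VCF record
        filter_value, info = fields[6], fields[7]
        if filter_value == "PASS":
            counts["PASS"] += 1
        for entry in info.split(";"):
            if entry.startswith("SVTYPE=") or entry in ("PRECISE", "IMPRECISE"):
                counts[entry] += 1
    return counts


# Hypothetical two-record VCF, used only to exercise the function.
SAMPLE_VCF = "\n".join([
    "##fileformat=VCFv4.2",
    '##FILTER=<ID=PASS,Description="All filters passed">',
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t1000\tDEL0001\tN\t<DEL>\t.\tPASS\tPRECISE;SVTYPE=DEL;END=2000",
    "chr2\t5000\tINV0001\tN\t<INV>\t.\tLowQual\tIMPRECISE;SVTYPE=INV;END=9000",
])

if __name__ == "__main__":
    for key, value in sorted(summarize_sv_records(SAMPLE_VCF).items()):
        print(key, value)
```

Because header lines are skipped here, these tallies can come out lower than the corresponding `grep -c` counts for the same file.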
Human Genome Benchmarks
Note: according to Nature, the following benchmark estimates exist for the human genome: “Structural variants affect more bases: the typical genome contains an estimated 2,100 to 2,500 structural variants (∼1,000 large deletions, ∼160 copy-number variants, ∼915 Alu insertions, ∼128 L1 insertions, ∼51 SVA insertions, ∼4 NUMTs, and ∼10 inversions), affecting ∼20 million bases of sequence.”
Validation Tool
A template for validating your input files is included. For more information on the tool, see https://github.com/uclahs-cds/public-tool-PipeVal