Testing and Validation

Test Data Set

Testing was performed on aligned and sorted BAMs generated with bwa-mem2-2.1 against the GRCh38 reference (SMC-HET was aligned against hs37d5):

  • A-mini: BWA-MEM2-2.1_TEST0000000_TWGSAMIN000001-T001-S01-F.bam
  • A-partial: BWA-MEM2-2.1_TEST0000000_TWGSAPRT000001-T001-S01-F.bam
  • A-full: a-full-CPCG0196-B1.bam*
  • A-partial: CPCG0196-B1-downsampled-a-partial-sorted.bam*
  • SMC-HET: HG002.N.bam

* Delly v1.1.3 introduced a coverage check that assesses coverage quality in a given window before CNV calling. CNV calling succeeded on samples with coverage across the genome (WGS samples), such as a-full-CPCG0196-B1.bam and CPCG0196-B1-downsampled-a-partial-sorted.bam. For more details, please refer to Discussion #64.

Test runs for the A-mini/partial/full samples were performed using the following reference files:

  • reference_fasta: /hot/ref/reference/GRCh38-BI-20160721/Homo_sapiens_assembly38.fasta
  • exclusion_file: /hot/ref/tool-specific-input/Delly/GRCh38/human.hg38.excl.tsv
  • mappability_map: /hot/ref/tool-specific-input/Delly/GRCh38/Homo_sapiens.GRCh38.dna.primary_assembly.fa.r101.s501.blacklist.gz
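
Before launching a test run, it can help to confirm that these reference files are reachable from the node. The following is a minimal Python sketch, not part of the pipeline: the dictionary layout and the helper name `find_missing_files` are hypothetical, and only the three paths are taken from the list above.

```python
from pathlib import Path

# Paths copied from the list above; the dictionary layout is illustrative only.
REFERENCE_FILES = {
    "reference_fasta": "/hot/ref/reference/GRCh38-BI-20160721/Homo_sapiens_assembly38.fasta",
    "exclusion_file": "/hot/ref/tool-specific-input/Delly/GRCh38/human.hg38.excl.tsv",
    "mappability_map": "/hot/ref/tool-specific-input/Delly/GRCh38/Homo_sapiens.GRCh38.dna.primary_assembly.fa.r101.s501.blacklist.gz",
}


def find_missing_files(files):
    """Return the sorted names of entries whose path is not an existing regular file."""
    return sorted(name for name, path in files.items() if not Path(path).is_file())


if __name__ == "__main__":
    missing = find_missing_files(REFERENCE_FILES)
    if missing:
        print("Missing reference files:", ", ".join(missing))
    else:
        print("All reference files found.")
```

This only checks that each path exists. It does not verify file contents or index files, which is the kind of check a dedicated validation tool is better suited for.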

Performance Validation

with Delly <= v0.9.1 in the pipeline

Testing was performed primarily on the Boutros Lab Slurm Development cluster, with additional functional tests on the SGE cluster (2021-02-26) and the Slurm Covid cluster. The metrics below will be updated where relevant with outputs from additional testing and tuning.

| Test Case | Test Date | Node Type | Duration | CPU Hours | Virtual Memory Usage (RAM) - peak rss |
|---|---|---|---|---|---|
| A-mini | 2021-02-12 | F2 | 1m 29s | a few seconds | 208.8 MB |
| A-partial | 2021-02-10 | F72 | 42m 5s | 48.8 | 8.9 GB |
| A-full | 2021-02-10 | F72 | 7h 10m 43s | 509.0 | 10.9 GB |
| SMC-HET | 2021-02-12 | F72 | 3h 9m 60s | 223.5 | 8.9 GB |

with Delly >= v1.0.3 in the pipeline

The metrics below are based on the integration of Delly v1.1.3 in the call-gSV pipeline.

| Test Case | Test Date | Node Type | Duration | CPU Hours | Virtual Memory Usage (RAM) - peak rss |
|---|---|---|---|---|---|
| CPCG0196-B1 A-partial | 2022-08-08 | F72 | 1h 9m 15s | 2.2 | 10.85 GB |
| CPCG0196-B1 A-full | 2022-08-06 | F72 | 21h 3m 19s | 37.3 | 24.68 GB |
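
To compare runs, the Duration and CPU Hours columns can be reduced to comparable numbers. The sketch below is illustrative Python, not pipeline code. It assumes that CPU Hours is total core time, so that CPU hours divided by wall-clock hours approximates the average number of busy cores. The helper names `duration_to_seconds` and `mean_busy_cores` are hypothetical, and the rows are copied from the two performance tables above (A-mini is omitted because its CPU Hours value is not numeric).

```python
import re

# Seconds per unit for durations written like "7h 10m 43s".
UNIT_SECONDS = {"d": 86400, "h": 3600, "m": 60, "s": 1}


def duration_to_seconds(text):
    """Convert a duration such as '1h 9m 15s' to a whole number of seconds."""
    # The lookahead rejects unit-like prefixes of longer words (for example "ms").
    parts = re.findall(r"(\d+)\s*([dhms])(?![a-z])", text)
    if not parts:
        raise ValueError(f"unrecognized duration: {text!r}")
    return sum(int(value) * UNIT_SECONDS[unit] for value, unit in parts)


def mean_busy_cores(cpu_hours, duration):
    """Approximate the average number of busy cores: CPU hours / wall-clock hours."""
    wall_hours = duration_to_seconds(duration) / 3600
    return cpu_hours / wall_hours


# (label, duration, CPU hours) copied from the two tables above.
RUNS = [
    ("Delly <= v0.9.1, A-partial", "42m 5s", 48.8),
    ("Delly <= v0.9.1, A-full", "7h 10m 43s", 509.0),
    ("Delly <= v0.9.1, SMC-HET", "3h 9m 60s", 223.5),
    ("Delly >= v1.0.3, A-partial", "1h 9m 15s", 2.2),
    ("Delly >= v1.0.3, A-full", "21h 3m 19s", 37.3),
]

for label, duration, cpu_hours in RUNS:
    seconds = duration_to_seconds(duration)
    cores = mean_busy_cores(cpu_hours, duration)
    print(f"{label}: {seconds} s wall clock, ~{cores:.1f} cores busy on average")
```

Under that assumption, the runs with Delly <= v0.9.1 kept roughly 70 cores busy on average, while the runs with Delly >= v1.0.3 averaged about 2, which is worth bearing in mind when comparing the wall-clock durations of the two tables.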

Quality Check Result Comparison

| Metric | A-mini | A-partial | A-full | SMC-HET | Source |
|---|---|---|---|---|---|
| Count Pass | 3 | 2593 | 62704 | 15196 | grep -c -w "PASS" filename.vcf -1 |
| Count Deletion | 2 | 1475 | 49433 | 9317 | grep -c -w "SVTYPE=DEL" filename.vcf |
| Count Duplication | 1 | 170 | 2311 | 1705 | grep -c -w "SVTYPE=DUP" filename.vcf |
| Count Inversion | 0 | 317 | 2801 | 2197 | grep -c -w "SVTYPE=INV" filename.vcf |
| Count Translocation | 0 | 384 | 7439 | 0 | grep -c -w "SVTYPE=BND" filename.vcf |
| Count Insertion | 0 | 267 | 1265 | 2059 | grep -c -w "SVTYPE=INS" filename.vcf |
| PRECISE Calls | 3 | 1850 | 11541 | 8267 | grep -c -w "PRECISE" filename.vcf |
| IMPRECISE Calls | 2 | 764 | 51709 | 7012 | grep -c -w "IMPRECISE" filename.vcf |
| Failed Filters | 0 | 653 | 44991 | 2619 | .stats.txt |
| Passed Filters | 3 | 1959 | 18257 | 12658 | .stats.txt |
| SV breakends | 0 | 219 | 1124 | 0 | .stats.txt |
| Symbolic SVs | 2 | 1559 | 12500 | 11156 | .stats.txt |
| Same as reference | 1 | 263 | 4595 | 1471 | .stats.txt |
| Missing Genotype | 0 | 8 | 38 | 31 | .stats.txt |
| Total Het/Hom ratio | (2/0) | 1.00 (843/845) | 2.37 (9580/4044) | 1.86 (7251/3905) | .stats.txt |
| Breakend Het/Hom ratio | (0/0) | 0.84 (59/70) | 13.41 (1046/78) | (0/0) | .stats.txt |
| Symbolic SV Het/Hom ratio | (2/0) | 1.01 (784/775) | 2.15 (8534/3966) | 1.86 (7251/3905) | .stats.txt |
| Duplicate entries | 0 errors total | 1 error chr8:3893339 | 1 error chr1:16050024 | 1 error chr1:187464829 | .validate.txt |
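
The grep-based counts and the Het/Hom ratios in the table can also be reproduced programmatically. The following Python sketch is an illustration under stated assumptions, not the method used to produce the table: it assumes an uncompressed, tab-delimited VCF with FILTER in column 7 and INFO in column 8, and the function names `count_sv_records` and `het_hom_ratio` as well as the example records are hypothetical. Unlike `grep -c`, which counts every matching line, it skips header lines, so its totals may differ from grep counts wherever a header line also contains the search term.

```python
from collections import Counter


def count_sv_records(vcf_lines):
    """Tally FILTER status, SVTYPE and PRECISE/IMPRECISE flags over VCF data records."""
    counts = Counter()
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and header lines
        fields = line.split("\t")
        if len(fields) < 8:
            continue  # not a complete VCF record
        filter_value, info = fields[6], fields[7]
        counts["PASS" if filter_value == "PASS" else "FAILED"] += 1
        for entry in info.split(";"):
            key, _, value = entry.partition("=")
            if key == "SVTYPE":
                counts["SVTYPE=" + value] += 1
            elif key in ("PRECISE", "IMPRECISE"):
                counts[key] += 1
    return counts


def het_hom_ratio(het, hom):
    """Return het/hom rounded to two decimals, or None when hom is 0."""
    if hom == 0:
        return None
    return round(het / hom, 2)


# Hypothetical records for illustration only; not taken from the test data.
EXAMPLE_VCF = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\tDEL0001\tA\t<DEL>\t.\tPASS\tPRECISE;SVTYPE=DEL;END=500",
    "chr1\t900\tDUP0001\tC\t<DUP>\t.\tLowQual\tIMPRECISE;SVTYPE=DUP;END=1500",
]

print(dict(count_sv_records(EXAMPLE_VCF)))
print(het_hom_ratio(9580, 4044))  # A-full Total Het/Hom counts from the table
```

For a real file, the same function can be fed an open file handle, for example `count_sv_records(open("filename.vcf"))`. `het_hom_ratio` returns None for a zero denominator, which corresponds to the cells above that list only the raw counts, such as (2/0).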

Human Genome Benchmarks

For reference, Nature reports the following benchmarks for the human genome: “Structural variants affect more bases: the typical genome contains an estimated 2,100 to 2,500 structural variants (∼1,000 large deletions, ∼160 copy-number variants, ∼915 Alu insertions, ∼128 L1 insertions, ∼51 SVA insertions, ∼4 NUMTs, and ∼10 inversions), affecting ∼20 million bases of sequence.”

Validation Tool

A template for validating your input files is included. For more information on the tool, see: https://github.com/uclahs-cds/public-tool-PipeVal