Pipeline Steps
1. Depth Calculation
Per-base depth is calculated from the input BAM file at coordinates specified by the input target BED file using
samtools depth
. Ifoff_target_depth
is set totrue
, per-base read depth is also calculated genome-wide at dbSNP loci with a dbSNP reference VCF used as the coordinate file tosamtools depth
.
2. BED Formatting
TSV output from
samtools depth
is converted into BED format usingawk
with read depth reported in the fourth column. Per-base read depth across multiple-base-pair target intervals is collapsed into a comma-separated list of read depth values, one for each base pair encompassed by the interval (bedtools merge
).
3. dbSNP off-target site filtering
dbSNP coordinates are filtered to keep only off-target regions. This is done by excluding coordinates specified in the target BED file from the dbSNP read depth BED using
bedtools intersect
. Near-target regions (+/- 500bp by default) are also excluded by first adding near-target buffers to the specified target intervals usingbedtools slop
.
4. dbSNP enriched read depth filtering
dbSNP coordinates from step 2 are filtered to keep sites exceeding a minimum read depth threshold (30x by default) using
awk
.
5. dbSNP coverage-enriched interval expansion
Filtered dbSNP coordinates from step 4 are expanded to include nearby basepairs, so that sites that are close together can be subsequently be merged into one interval (
bedtools slop
).
6. On-target and enriched off-target interval merging
Coverage enriched dbSNP intervals are merged with the original target intervals into one BED file using a series of bash commands that concatenate and sort the two files, then merge with
bedtools
.
7. Metrics Reporting
Target BED file and optional bait file are converted to INTERVAL_LIST format using picard
BedToIntervalList
then used to report metrics on input BAM with picardCollectHsMetrics
.