Pipeline Steps

1. Alignment

The first step of the pipeline uses BWA-MEM2 or HISAT2 to align paired reads. BWA-MEM2 is the successor to the well-known aligner BWA. The bwa-mem2 mem command uses the -M option to mark shorter split hits as secondary. This ensures compatibility with Picard Tools in downstream processes and, in particular, prevents the underlying library of Picard Tools from treating these split hits as duplicate reads (read names). Additionally, the -t option is used to increase the number of alignment threads. Because of BWA-MEM2's large memory usage, the number of threads used in this step is set by default to allow at least 2.5 GB of memory per CPU. This can be overridden by setting the bwa_mem_number_of_cpus parameter in the config file.
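For illustration, a minimal sketch of the alignment call described above, assuming a pre-built BWA-MEM2 index; file names and the thread count are placeholders, and the pipeline's actual invocation may differ:

```bash
# Build the BWA-MEM2 index once per reference (required before alignment).
bwa-mem2 index reference.fa

# -M: mark shorter split hits as secondary, for Picard Tools compatibility downstream
# -t: number of alignment threads
bwa-mem2 mem -M -t 8 reference.fa sample_R1.fastq.gz sample_R2.fastq.gz > sample.sam
```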

2. Convert Aligned SAM Files to BAM Format

The SAMtools view command is used to convert the aligned SAM files into the compressed BAM format. The -S option indicates that the input is in SAM format (recent versions of SAMtools detect the input format automatically and keep this option for compatibility with older versions). Additionally, the -b option ensures the output is in BAM format, and the -@ option is used to increase the number of threads.
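A minimal sketch of this conversion, with placeholder file names and thread count:

```bash
# -S: input is SAM (auto-detected by recent SAMtools versions)
# -b: write the output in BAM format
# -@: number of additional threads
samtools view -S -b -@ 4 -o sample.bam sample.sam
```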

3. Sort BAM Files in Coordinate or Queryname Order

The next step of the pipeline uses the SAMtools sort command to sort the aligned BAM files in coordinate or queryname order, as required by the downstream duplicate-marking tool. Specifically, the sort_order option ensures the file is sorted in coordinate order for Picard or in queryname order for Spark.
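For illustration, both sort orders with placeholder file names (the pipeline selects one based on sort_order):

```bash
# Coordinate order, as required by Picard MarkDuplicates
samtools sort -@ 4 -o sample.sorted.bam sample.bam

# Queryname order (-n), as required by MarkDuplicatesSpark
samtools sort -n -@ 4 -o sample.qsorted.bam sample.bam
```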

For certain use cases, the pipeline may be configured to stop after this step by setting the mark_duplicates=false parameter in the config file. This option is intended for datasets generated with targeted sequencing panels (such as our custom Proseq-G Prostate panel). High-coverage target enrichment sequencing (such as Illumina's protocol) results in a large amount of read duplication that is not an artifact of PCR amplification. Marking these reads as duplicates would severely reduce coverage, so it is recommended that the pipeline be configured not to mark duplicates in this case.
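As a hypothetical sketch, the config parameters mentioned in this section might be set as below; the parameter names come from this document, but the file layout and accepted values are assumptions and should be checked against the pipeline's config template:

```bash
# Write a hypothetical config via a shell heredoc (layout and values are assumptions).
cat > pipeline.config <<'EOF'
bwa_mem_number_of_cpus = 8      // override the default memory-based thread count
sort_order = 'coordinate'       // 'coordinate' for Picard, 'queryname' for Spark
mark_duplicates = false         // skip duplicate marking (e.g. targeted panels)
EOF
```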

4. Mark Duplicates in BAM Files

If mark_duplicates=true, the next step of the pipeline uses Picard Tools' MarkDuplicates command to mark duplicates in the BAM files. The MarkDuplicates command uses the VALIDATION_STRINGENCY=LENIENT option to control how errors are handled and to keep the process running when possible. Additionally, PROGRAM_RECORD_ID is set to MarkDuplicates.
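A minimal sketch of this Picard invocation, with placeholder file names (Picard also accepts the legacy OPTION=value syntax):

```bash
java -jar picard.jar MarkDuplicates \
    --INPUT sample.sorted.bam \
    --OUTPUT sample.dup_marked.bam \
    --METRICS_FILE sample.dup_metrics.txt \
    --VALIDATION_STRINGENCY LENIENT \
    --PROGRAM_RECORD_ID MarkDuplicates
```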

A faster Spark implementation of MarkDuplicates can also be used (MarkDuplicatesSpark from GATK). It matches the output of Picard's MarkDuplicates with significant runtime improvements. Note, however, that the Spark version requires more disk space and can fail due to insufficient disk space on large inputs when multiple aligners are specified. In such cases, Picard's MarkDuplicates should be used instead.
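For comparison, a minimal sketch of the Spark-based alternative, again with placeholder file names and using the queryname-sorted BAM from step 3:

```bash
gatk MarkDuplicatesSpark \
    -I sample.qsorted.bam \
    -O sample.dup_marked.bam \
    -M sample.dup_metrics.txt
```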

5. Index BAM Files

After marking duplicate reads, the BAM files are indexed by passing --CREATE_INDEX true to Picard's MarkDuplicates, or --create-output-bam-index to MarkDuplicatesSpark. As in the previous step, the VALIDATION_STRINGENCY=LENIENT option is used to control how errors are handled and to keep the process running when possible.
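For illustration, the same duplicate-marking commands as in step 4 with index creation enabled; file names are placeholders, and both tools write the .bai index alongside the output BAM:

```bash
java -jar picard.jar MarkDuplicates --INPUT sample.sorted.bam --OUTPUT sample.dup_marked.bam \
    --METRICS_FILE sample.dup_metrics.txt --VALIDATION_STRINGENCY LENIENT --CREATE_INDEX true

gatk MarkDuplicatesSpark -I sample.qsorted.bam -O sample.dup_marked.bam \
    -M sample.dup_metrics.txt --create-output-bam-index true
```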