How To Run

Requirements

Currently supported Nextflow versions: 23.04.2

Below is a summary of how to run the pipeline. See here for full instructions.

Pipelines should be run WITH A SINGLE SAMPLE AT A TIME. Otherwise, resource allocation and Nextflow errors could cause the pipeline to fail.

Note: Because this pipeline uses images stored in the GitHub Container Registry, you must set up a personal access token (PAT) for your GitHub account and log in to the registry on the cluster before running this pipeline.

  1. The recommended way of running the pipeline is to directly use the source code located here: /hot/software/pipeline/pipeline-align-DNA/Nextflow/release, rather than cloning a copy of the pipeline.

  2. The source code should never be modified when running our pipelines.

  3. Create a config file for input, output, and parameters. An example config file can be found here. See Inputs for a detailed description of each variable in the config file. The config file can also be generated using a Python script (see below).

  4. Do not modify the source template.config directly. Instead, copy it from the pipeline release folder to your project-specific folder and modify it there.

  5. Create the input csv using the template. The example csv describes a single-lane sample; however, this pipeline can also take multi-lane samples, with each record in the csv file representing a lane (a pair of fastq files). All records must have the same value in the sample column. See Inputs for a detailed description of each column. All columns must exist for the pipeline to run successfully.

  6. Again, do not directly modify the source template csv file. Instead, copy it from the pipeline release folder to your project-specific folder and modify it there.

  7. The pipeline can be executed locally using the command below:

nextflow run path/to/main.nf -config path/to/sample-specific.config
  • For example, path/to/main.nf could be: /hot/software/pipeline/pipeline-align-DNA/Nextflow/release/8.0.0/pipeline/align-DNA.nf
  • path/to/sample-specific.config is the path to where you saved your project-specific copy of template.config
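
Step 5 requires every record in the input csv to share one value in the sample column. A minimal Python sketch of a check to run before launching is given here. The function name check_single_sample is hypothetical and not part of the pipeline, and the sketch assumes only that the csv has a header row containing a sample column.

```python
import csv
from pathlib import Path


def check_single_sample(csv_path):
    """Return the sample name if every record shares it; raise ValueError otherwise."""
    with Path(csv_path).open(newline="") as handle:
        reader = csv.DictReader(handle)
        # The only column this sketch relies on is "sample".
        if reader.fieldnames is None or "sample" not in reader.fieldnames:
            raise ValueError("input csv has no 'sample' column")
        samples = {row["sample"] for row in reader}
    if not samples:
        raise ValueError("input csv has no records")
    if len(samples) > 1:
        raise ValueError(f"expected one sample, found: {sorted(samples)}")
    return samples.pop()
```

Calling check_single_sample("/my/path/to/sample_name.csv") before nextflow run would surface a mixed-sample csv early instead of partway through a run.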

To submit to UCLAHS-CDS's Azure cloud, use the submission script here with the command below:

python path/to/submit_nextflow_pipeline.py \
    --nextflow_script path/to/main.nf \
    --nextflow_config path/to/sample-specific.config \
    --pipeline_run_name <sample_name> \
    --partition_type F72 \
    --email jdoe@ucla.edu
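
The submission command above can also be assembled programmatically, for instance when queuing several samples one at a time. The sketch below only builds the argument list from the flags shown above. The helper name build_submit_command is hypothetical, the paths are placeholders, and nothing is executed.

```python
import shlex


def build_submit_command(submit_script, nextflow_script, nextflow_config,
                         sample_name, email, partition_type="F72"):
    """Return the argument list for one submission; does not run it."""
    return [
        "python", str(submit_script),
        "--nextflow_script", str(nextflow_script),
        "--nextflow_config", str(nextflow_config),
        "--pipeline_run_name", str(sample_name),
        "--partition_type", str(partition_type),
        "--email", str(email),
    ]


command = build_submit_command(
    "path/to/submit_nextflow_pipeline.py",
    "path/to/main.nf",
    "path/to/sample-specific.config",
    "sample_name",  # placeholder for the real sample name
    "jdoe@ucla.edu",
)
print(shlex.join(command))  # inspect the command before handing the list to subprocess.run
```

Keeping the command as a list avoids shell quoting problems if a path contains spaces.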

BWA-MEM2 Genome Index

The reference genome index must be generated by the correct version of BWA-MEM2. An index generated by an older BWA-MEM2 version or by the original BWA is not accepted.

The index can be generated with the generate-genome-index.nf Nextflow pipeline. To run it, create a config file from this template that specifies the paths of reference_fasta and temp_dir. The temp_dir stores Nextflow's intermediate files. The pipeline saves the genome index files to the same directory as the input reference FASTA. Use the command below to run the genome index pipeline:

nextflow run path/to/generate-genome-index.nf -config path/to/genome-specific.config

This can also be submitted to UCLAHS-CDS's Azure cloud using the submission script, as mentioned above.

BWA-MEM2 expects the reference genome index to be in the same directory as the reference genome FASTA, so it is important to keep them together.
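
One way to confirm that the index and the FASTA have stayed together is to look for the index files next to the FASTA before starting a run. The sketch below is an illustration, not part of the pipeline. The helper name missing_index_files is hypothetical, and the suffix list is an assumption based on what recent bwa-mem2 index releases write, so adjust it to the BWA-MEM2 version you use.

```python
from pathlib import Path

# Assumed suffixes appended to the FASTA file name by `bwa-mem2 index`;
# this list is an assumption and may differ between BWA-MEM2 versions.
ASSUMED_INDEX_SUFFIXES = (".0123", ".amb", ".ann", ".bwt.2bit.64", ".pac")


def missing_index_files(reference_fasta, suffixes=ASSUMED_INDEX_SUFFIXES):
    """List expected index files that are not next to the reference FASTA."""
    fasta = Path(reference_fasta)
    expected = [fasta.with_name(fasta.name + suffix) for suffix in suffixes]
    return [path for path in expected if not path.is_file()]
```

An empty list means every expected file was found beside the FASTA. A non-empty list names the files to regenerate or move back.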

HISAT2 Genome Index

The reference genome index must be generated by HISAT2 using hisat2-build. When passing the HISAT2 index to the config, specify only the path up to the index prefix (basename).

The basename is the name of any of the index files up to but not including the final .1.ht2 / etc. hisat2 looks for the specified index first in the current directory, then in the directory specified in the HISAT2_INDEXES environment variable.
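
The basename rule can be sketched in Python as follows. The function name hisat2_basename is hypothetical and the example paths are placeholders. The pattern also accepts the .ht2l extension that HISAT2 uses for large indexes.

```python
import re

# Matches the trailing ".<number>.ht2" or ".<number>.ht2l" of a HISAT2 index file.
_INDEX_SUFFIX = re.compile(r"\.\d+\.ht2l?$")


def hisat2_basename(index_file):
    """Strip the final .N.ht2 (or .N.ht2l) suffix from an index file path."""
    basename, replaced = _INDEX_SUFFIX.subn("", str(index_file))
    if replaced == 0:
        raise ValueError(f"not a HISAT2 index file: {index_file}")
    return basename


print(hisat2_basename("/path/to/index/genome.1.ht2"))  # prints /path/to/index/genome
```

The returned string is the value to put in the config, not the path of any single index file.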

Generating the config file using a script

To learn how to run the script, use one of the following commands:

python path/to/pipeline-align-DNA/script/write_dna_align_config_file.py -h
python path/to/pipeline-align-DNA/script/write_dna_align_config_file.py param
python path/to/pipeline-align-DNA/script/write_dna_align_config_file.py example

For example:

python path/to/pipeline-align-DNA/script/write_dna_align_config_file.py \
    /my/path/to/sample_name.csv \
    bwa-mem2 \
    /hot/ref/tool-specific-input/BWA-MEM2-2.2.1/GRCh38-BI-20160721/index/genome.fa \
    /my/path/to/output_directory \
    /my/path/to/temp_directory \
    --save_intermediate_files \
    --cache_intermediate_pipeline_steps