Inputs

To run the pipeline, one input.yaml and one input.config are needed, as follows.

input.yaml (see template)

| Input | Type | Description |
|:------|:-----|:------------|
| patient_id | string | The name/ID of the patient |
| tumor_BAM | path | The path to the tumor .bam file (.bai file must exist in same directory) |
| normal_BAM | path | The path to the normal .bam file (.bai file must exist in same directory) |
| contamination_table | path | Optional; applies to tumor samples only. The path to the `contamination.table` generated by GATK's `CalculateContamination` in `pipeline-call-gSNP`. It can be found in the `QC` folder of the `pipeline-call-gSNP` output |

input.yaml should follow the standardized structure:

patient_id: 'patient_id'
input:
  normal:
    - BAM: /path/to/normal.bam
  tumor:
    - BAM: /path/to/tumor.bam
      contamination_table: /path/to/contamination.table

Mutect2 also accepts two non-standard inputs: a tumor-only sample, and multiple samples from one patient. For tumor-only samples, remove the normal input from input.yaml, e.g. template_tumor_only.yaml. For multiple samples, list all the input BAMs in input.yaml, e.g. template_multi_sample.yaml. Note that for these non-standard inputs, the configuration file must list 'mutect2' as the only algorithm.

input.config (see template)

| Input | Required | Type | Description |
|:------|:---------|:-----|:------------|
| `algorithm` | yes | list | List containing a combination of somaticsniper, strelka2, mutect2 and muse |
| `reference` | yes | string | The reference .fa file (.fai and .dict files must exist in same directory) |
| `intersect_regions`* | yes | string | A bed file listing the genomic regions for variant calling. Excluding `decoy` regions is HIGHLY recommended |
| `output_dir` | yes | string | The location where outputs will be saved |
| `dataset_id` | yes | string | The name/ID of the dataset |
| `exome` | yes | boolean | Used by `Strelka2` and `MuSE`. When `true`, adds the `--exome` option to Manta and Strelka2, and the `-E` option to MuSE |
| `save_intermediate_files` | yes | boolean | Whether to save intermediate files |
| `work_dir` | no | string | The path of the working directory for Nextflow, storing intermediate files and logs. The default is `/scratch` with `ucla_cds` and should only be changed for testing/development. Changing this directory to `/hot` or `/tmp` can lead to high server latency and potential disk space limitations, respectively |
| `docker_container_registry` | no | string | Registry containing tool Docker images. Default: `ghcr.io/uclahs-cds` |
| `base_resource_update` | no | namespace | Namespace of parameters to update base resource allocations in the pipeline. Usage and structure are detailed in `template.config` and below. |

*Providing `intersect_regions` is required and limits the final output to just those regions. All regions of the reference genome can be provided as a bed file listing all contigs; however, it is HIGHLY recommended to remove decoy contigs from the human reference genome. Including these thousands of small contigs requires the user to increase the memory available to Mutect2 and causes a very long runtime for Strelka2. See Discussion here. For uclahs-cds users, a GRCh38 bed.gz file can be found here: /hot/resource/tool-specific-input/pipeline-call-sSNV-6.0.0/GRCh38-BI-20160721/Homo_sapiens_assembly38_no-decoy.bed.gz.
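As a rough illustration of removing decoy contigs from a bed file, the Python sketch below drops every region whose contig name ends in `_decoy`. The helper name `drop_decoy_regions` is hypothetical and not part of the pipeline. The `_decoy` suffix is an assumption based on GRCh38 analysis-set contig naming, and the example coordinates are for illustration only. Other references may name decoys differently.

```python
def drop_decoy_regions(bed_lines):
    """Return BED lines whose contig name does not end in '_decoy'.

    Assumption: decoy contigs carry a '_decoy' suffix (GRCh38-style
    naming). Blank lines are dropped; header and comment lines
    (track, browser, #) are kept unchanged.
    """
    kept = []
    for line in bed_lines:
        stripped = line.strip()
        if not stripped:
            continue
        if stripped.startswith(("#", "track", "browser")):
            kept.append(stripped)
            continue
        contig = stripped.split()[0]  # first BED column is the contig name
        if not contig.endswith("_decoy"):
            kept.append(stripped)
    return kept


# Illustrative regions only; real bed files list one region per line
example = [
    "chr1\t0\t248956422",
    "chrUn_JTFH01000001v1_decoy\t0\t25139",
    "chrM\t0\t16569",
]
filtered = drop_decoy_regions(example)
print("\n".join(filtered))
```

The filtered lines can then be written to a new bed file and passed as `intersect_regions`.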

Base resource allocation updaters

To optionally update the base resource (cpus or memory) allocations for processes, add the necessary parts of the following structure to the input.config file. The default allocations can be found in config/resources.json. If available resources have matched cpus and memory within 90% - 1GB of one of the pre-specified configurations, that configuration will be used. Otherwise, the default configuration will be used. A spreadsheet view of the resource configuration as of Dec 2024 is here. For very large or challenging input samples, we suggest using the m64 configuration or similar.

base_resource_update {
    memory = [
        [['process_name', 'process_name2'], <multiplier for resource>],
        [['process_name3', 'process_name4'], <different multiplier for resource>]
    ]
    cpus = [
        [['process_name', 'process_name2'], <multiplier for resource>],
        [['process_name3', 'process_name4'], <different multiplier for resource>]
    ]
}

Note: Resource updates are applied in the order they're provided, so if a process is included twice in the memory list, it will be updated twice, in the order given.
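The effect of this ordering can be illustrated with a small Python sketch. It is not the pipeline's implementation: `apply_updates` and the base allocations are hypothetical. It assumes, following the examples in this section, that an empty process list targets every process and that a bare string names a single process.

```python
def apply_updates(base, updates):
    """Apply [processes, multiplier] pairs, in order, to base allocations.

    Hypothetical helper for illustration only. Assumes an empty process
    list means "all processes" and a bare string names one process.
    """
    result = dict(base)
    for processes, multiplier in updates:
        if isinstance(processes, str):
            processes = [processes]
        targets = processes if processes else list(result)
        for name in targets:
            result[name] = result[name] * multiplier
    return result


# Made-up base memory allocations, in GB
base_memory_gb = {"call_sSNV_Mutect2": 4, "run_sump_MuSE": 24}

updated = apply_updates(
    base_memory_gb,
    [
        ["call_sSNV_Mutect2", 2],                     # 4 GB -> 8 GB
        [["call_sSNV_Mutect2", "run_sump_MuSE"], 3],  # 8 GB -> 24 GB, 24 GB -> 72 GB
    ],
)
print(updated)
```

Because `call_sSNV_Mutect2` appears in both entries, its multipliers compound (2 × 3 = 6) rather than the later entry replacing the earlier one.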

Examples:

To double memory of all processes:

base_resource_update {
    memory = [
        [[], 2]
    ]
}

To double memory for call_sSNV_Mutect2 and triple memory for run_validate_PipeVal and run_sump_MuSE:

base_resource_update {
    memory = [
        ['call_sSNV_Mutect2', 2],
        [['run_validate_PipeVal', 'run_sump_MuSE'], 3]
    ]
}

To double CPUs and memory for run_sump_MuSE and double memory for run_validate_PipeVal:

base_resource_update {
    cpus = [
        ['run_sump_MuSE', 2]
    ]
    memory = [
        [['run_sump_MuSE', 'run_validate_PipeVal'], 2]
    ]
}

Module Specific Configuration

| Input | Required | Type | Description |
|:------|:---------|:-----|:------------|
| bgzip_extra_args | no | string | The extra option used for compressing VCFs |
| tabix_extra_args | no | string | The extra option used for indexing VCFs |

Mutect2 Specific Configuration

| Input | Required | Type | Description |
|:------|:---------|:-----|:------------|
| split_intervals_extra_args | no | string | Additional arguments for the SplitIntervals command |
| mutect2_extra_args | no | string | Additional arguments for the Mutect2 command |
| filter_mutect_calls_extra_args | no | string | Additional arguments for the FilterMutectCalls command |
| gatk_command_mem_diff | yes | nextflow.util.MemoryUnit | Amount to subtract from the task's allocated memory; the remainder is used as the Java heap maximum. Should not be changed unless a task fails for memory-related reasons |
| scatter_count | yes | int | Number of intervals to split the desired interval into. Mutect2 will call each interval separately. |
| germline_resource_gnomad_vcf | no | path | A version of the gnomAD VCF stripped of all unneeded INFO fields, keeping only AF. Currently available for GRCh38: `/hot/resource/tool-specific-input/GATK/GRCh38/af-only-gnomad.hg38.vcf.gz` and GRCh37: `/hot/resource/tool-specific-input/GATK/GRCh37/af-only-gnomad.raw.sites.vcf`. |
| panel_of_normals_vcf | no | path | VCF file of sites observed in normal samples. Currently available for GRCh38: `/hot/resource/tool-specific-input/GATK/GRCh38/1000g_pon.hg38.vcf.gz`. This could be useful for tumor-only mode. |
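The arithmetic behind `gatk_command_mem_diff` can be sketched in Python as follows. The function name and the memory values are hypothetical, and how the pipeline actually passes the resulting heap size to GATK is not shown here. The sketch only illustrates that the Java heap maximum is the task's allocated memory minus the configured difference.

```python
def java_heap_max_mb(task_memory_mb, mem_diff_mb):
    """Java heap maximum (MB) = allocated task memory - gatk_command_mem_diff.

    Hypothetical helper; both arguments are in megabytes.
    """
    heap = task_memory_mb - mem_diff_mb
    if heap <= 0:
        raise ValueError("gatk_command_mem_diff must be smaller than the task memory")
    return heap


# Made-up values: a 4 GB task with a 512 MB difference
heap_mb = java_heap_max_mb(task_memory_mb=4096, mem_diff_mb=512)
print(f"-Xmx{heap_mb}m")  # standard JVM flag for the maximum heap size
```

The memory left outside the heap gives the JVM room for non-heap overhead, which is why the value should normally be left alone.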

MuSE Specific Configuration

| Input | Required | Type | Description |
|:------|:---------|:-----|:------------|
| dbSNP | yes | path | The path to NCBI's dbSNP database of known SNPs in VCF format, e.g. `GCF_000001405.40.gz` |

Variant Intersection Specific Configuration

| Input | Required | Type | Description |
|:------|:---------|:-----|:------------|
| ncbi_build | yes | string | The reference genome build ID required by vcf2maf, e.g. GRCh38 |
| vcf2maf_extra_args | no | string | Additional arguments for the vcf2maf command |