Inputs

Running the pipeline requires one input.yaml and one input.config, as described below.

input.yaml (see template)

| Input | Type | Description |
|---|---|---|
| patient_id | string | The name/ID of the patient |
| tumor_BAM | path | The path to the tumor .bam file (the .bai file must exist in the same directory) |
| normal_BAM | path | The path to the normal .bam file (the .bai file must exist in the same directory) |
| contamination_table | path | Optional, and only applies to tumor samples. The path to the contamination.table generated by GATK's CalculateContamination in pipeline-call-gSNP. The contamination.table can be found under pipeline-call-gSNP's output QC folder |
  • input.yaml should follow the standardized structure:
patient_id: 'patient_id'
input:
  normal:
    - BAM: /path/to/normal.bam
  tumor:
    - BAM: /path/to/tumor.bam
      contamination_table: /path/to/contamination.table
  • Mutect2 also accepts two non-standard input types: tumor-only samples and multiple samples from one patient. For tumor-only samples, remove the normal input from input.yaml, e.g. template_tumor_only.yaml. For multiple samples, list all of the input BAMs in input.yaml, e.g. template_multi_sample.yaml. Note that for these non-standard inputs, the configuration file must list 'mutect2' as the only algorithm.
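
The rule in the last bullet can be sketched as a small pre-flight check. This is a hypothetical helper, not part of the pipeline: the names `is_standard_input` and `check_algorithms` are invented for illustration, and the sketch assumes that a "standard" input means exactly one normal and one tumor BAM, as in the structure shown above.

```python
def is_standard_input(inputs):
    """Return True for a paired run: exactly one normal and one tumor BAM."""
    normal = inputs.get("normal") or []
    tumor = inputs.get("tumor") or []
    return len(normal) == 1 and len(tumor) == 1


def check_algorithms(inputs, algorithms):
    """Raise ValueError if non-standard inputs are paired with other algorithms."""
    if not is_standard_input(inputs) and list(algorithms) != ["mutect2"]:
        raise ValueError(
            "tumor-only and multi-sample inputs require algorithm = ['mutect2']"
        )


# These dicts mirror the `input` section of input.yaml after parsing
# (the paths are placeholders).
paired = {
    "normal": [{"BAM": "/path/to/normal.bam"}],
    "tumor": [{"BAM": "/path/to/tumor.bam"}],
}
tumor_only = {"tumor": [{"BAM": "/path/to/tumor.bam"}]}

check_algorithms(paired, ["somaticsniper", "strelka2", "mutect2", "muse"])  # accepted
check_algorithms(tumor_only, ["mutect2"])  # accepted
```

Calling `check_algorithms(tumor_only, ["mutect2", "strelka2"])` raises a `ValueError` in this sketch, which matches the restriction described above.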

input.config (see template)

| Input | Required | Type | Description |
|---|---|---|---|
| algorithm | yes | list | List containing a combination of somaticsniper, strelka2, mutect2 and muse |
| reference | yes | string | The reference .fa file (the .fai and .dict files must exist in the same directory) |
| intersect_regions* | yes | string | A bed file listing the genomic regions for variant calling. Excluding decoy regions is HIGHLY recommended |
| output_dir | yes | string | The location where outputs will be saved |
| dataset_id | yes | string | The name/ID of the dataset |
| exome | yes | boolean | Used by Strelka2 and MuSE. When true, adds the --exome option to Manta and Strelka2, and the -E option to MuSE |
| save_intermediate_files | yes | boolean | Whether to save intermediate files |
| work_dir | no | string | The path of the working directory for Nextflow, which stores intermediate files and logs. The default is /scratch with ucla_cds and should only be changed for testing/development. Changing this directory to /hot or /tmp can lead to high server latency and potential disk space limitations, respectively |
| docker_container_registry | no | string | Registry containing the tool Docker images. Default: ghcr.io/uclahs-cds |
| base_resource_update | no | namespace | Namespace of parameters to update base resource allocations in the pipeline. Usage and structure are detailed in template.config and below |

*Providing intersect_regions is required and will limit the final output to just those regions. All regions of the reference genome could be provided as a bed file containing all contigs; however, it is HIGHLY recommended to remove decoy contigs from the human reference genome. Including these thousands of small contigs requires the user to increase the memory available to Mutect2 and causes a very long runtime for Strelka2. See Discussion here. For uclahs-cds users, a GRCh38 bed.gz file can be found here: /hot/ref/tool-specific-input/pipeline-call-sSNV-6.0.0/GRCh38-BI-20160721/Homo_sapiens_assembly38_no-decoy.bed.gz.
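
For users who need to build such a file themselves, removing decoy contigs from an existing bed file could look roughly like the sketch below. It is not the procedure used to produce the file above. It assumes that decoy contigs are named with a `_decoy` suffix, which should be checked against the reference in use, and the function names are invented for illustration. The output is written uncompressed; any compression or indexing the pipeline expects is not handled here.

```python
import gzip


def is_decoy(contig):
    # Assumption: decoy contigs carry a "_decoy" suffix in their name.
    return contig.endswith("_decoy")


def drop_decoy_regions(lines):
    """Yield bed lines whose contig (first column) is not a decoy."""
    for line in lines:
        stripped = line.strip()
        # Pass blank lines and header/comment lines through unchanged.
        if not stripped or stripped.startswith(("#", "track", "browser")):
            yield line
            continue
        contig = stripped.split()[0]
        if not is_decoy(contig):
            yield line


def filter_bed(src_path, dst_path):
    """Write a copy of a plain or gzipped bed file without decoy contigs."""
    opener = gzip.open if src_path.endswith(".gz") else open
    with opener(src_path, "rt") as src, open(dst_path, "w") as dst:
        dst.writelines(drop_decoy_regions(src))
```

A call such as `filter_bed("all_contigs.bed", "no_decoy.bed")` (both file names hypothetical) would then produce the filtered file.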

Base resource allocation updaters

To optionally update the base resource (cpus or memory) allocations for processes, add the following structure, with the necessary parts filled in, to the input.config file. The default allocations can be found in the node-specific config files.

base_resource_update {
    memory = [
        [['process_name', 'process_name2'], <multiplier for resource>],
        [['process_name3', 'process_name4'], <different multiplier for resource>]
    ]
    cpus = [
        [['process_name', 'process_name2'], <multiplier for resource>],
        [['process_name3', 'process_name4'], <different multiplier for resource>]
    ]
}

Note: Resource updates are applied in the order they are provided, so if a process is included twice in the memory list, it will be updated twice, in the order given.

Examples:

  • To double memory of all processes:
base_resource_update {
    memory = [
        [[], 2]
    ]
}
  • To double memory for call_sSNV_Mutect2 and triple memory for run_validate_PipeVal and run_sump_MuSE:
base_resource_update {
    memory = [
        ['call_sSNV_Mutect2', 2],
        [['run_validate_PipeVal', 'run_sump_MuSE'], 3]
    ]
}
  • To double CPUs and memory for run_sump_MuSE and double memory for run_validate_PipeVal:
base_resource_update {
    cpus = [
        ['run_sump_MuSE', 2]
    ]
    memory = [
        [['run_sump_MuSE', 'run_validate_PipeVal'], 2]
    ]
}
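
The effect of the ordering note can be illustrated with a small simulation. This is only a model of the behaviour described above, not the pipeline's code: `apply_updates` is a hypothetical function, the base allocations are made-up numbers, and any upper limit the pipeline may impose on resources is not modelled.

```python
def apply_updates(base, updates):
    """Apply (process_names, multiplier) updates in order; return new allocations."""
    result = dict(base)  # leave the caller's allocations untouched
    for names, multiplier in updates:
        if isinstance(names, str):
            names = [names]  # a single process name may be given without a list
        targets = names or list(result)  # an empty list means every process
        for name in targets:
            result[name] = result[name] * multiplier
    return result


# Hypothetical base memory allocations in GB.
base_memory = {"call_sSNV_Mutect2": 4, "run_validate_PipeVal": 1, "run_sump_MuSE": 8}

# call_sSNV_Mutect2 appears twice, so it is doubled and then tripled.
updates = [
    ("call_sSNV_Mutect2", 2),
    (["call_sSNV_Mutect2", "run_sump_MuSE"], 3),
]
print(apply_updates(base_memory, updates))
# {'call_sSNV_Mutect2': 24, 'run_validate_PipeVal': 1, 'run_sump_MuSE': 24}
```

Under this model a process listed twice ends up with the product of its multipliers, so repeated entries are worth checking when a config grows over time.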

Module Specific Configuration

| Input | Required | Type | Description |
|---|---|---|---|
| bgzip_extra_args | no | string | Extra options to use when compressing VCFs |
| tabix_extra_args | no | string | Extra options to use when indexing VCFs |

Mutect2 Specific Configuration

| Input | Required | Type | Description |
|---|---|---|---|
| split_intervals_extra_args | no | string | Additional arguments for the SplitIntervals command |
| mutect2_extra_args | no | string | Additional arguments for the Mutect2 command |
| filter_mutect_calls_extra_args | no | string | Additional arguments for the FilterMutectCalls command |
| gatk_command_mem_diff | yes | nextflow.util.MemoryUnit | How much to subtract from the task's allocated memory; the remainder is used as the maximum Java heap size. Should not be changed unless a task fails for memory-related reasons |
| scatter_count | yes | int | Number of intervals to split the desired interval into. Mutect2 will call each interval separately |
| germline_resource_gnomad_vcf | no | path | A version of the gnomAD VCF stripped of all unneeded INFO fields, keeping only AF. Currently available for GRCh38: /hot/ref/tool-specific-input/GATK/GRCh38/af-only-gnomad.hg38.vcf.gz and GRCh37: /hot/ref/tool-specific-input/GATK/GRCh37/af-only-gnomad.raw.sites.vcf |
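
As a worked example of gatk_command_mem_diff, the arithmetic can be sketched as below. The function name and the numbers are hypothetical, and how the pipeline actually passes the resulting value to GATK is not shown in this section; `-Xmx` is the standard JVM flag for the maximum heap size.

```python
def java_heap_option(task_memory_mb, mem_diff_mb):
    """Return a JVM -Xmx flag for the memory left after subtracting mem_diff."""
    heap_mb = task_memory_mb - mem_diff_mb
    if heap_mb <= 0:
        raise ValueError("gatk_command_mem_diff must be smaller than the task memory")
    return f"-Xmx{heap_mb}m"


# Hypothetical values: an 8 GB task with a 512 MB difference.
print(java_heap_option(8 * 1024, 512))  # -Xmx7680m
```

The subtracted amount leaves headroom for memory the JVM uses outside the heap, which is why a larger difference can help when a task fails for memory-related reasons.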

MuSE Specific Configuration

| Input | Required | Type | Description |
|---|---|---|---|
| dbSNP | yes | path | The path to NCBI's dbSNP database of known SNPs in VCF format, e.g. GCF_000001405.40.gz |

Variant Intersection Specific Configuration

| Input | Required | Type | Description |
|---|---|---|---|
| ncbi_build | yes | string | The reference genome build ID required by vcf2maf, e.g. GRCh38 |
| vcf2maf_extra_args | no | string | Additional arguments for the vcf2maf command |