To run the pipeline, one input.yaml and one input.config are needed, as follows.
| Input | Type | Description |
|:------|:-----|:------------|
| patient_id | string | The name/ID of the patient |
| tumor_BAM | path | The path to the tumor .bam file (.bai file must exist in same directory) |
| normal_BAM | path | The path to the normal .bam file (.bai file must exist in same directory) |
| contamination_table | path | Optional, but only for tumor samples. The path of the contamination.table, which is generated from the GATK's CalculateContamination in pipeline-call-gSNP. The contamination.table path can be found under pipeline-call-gSNP's output QC folder |
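The table above requires each .bam to have its .bai index in the same directory. A minimal pre-flight sketch of that check follows. The helper name `find_bam_index` is hypothetical, and it assumes the two common index naming conventions (`sample.bam.bai` and `sample.bai`); which of these the pipeline itself accepts is not stated here.

```python
from pathlib import Path
from typing import Optional, Union


def find_bam_index(bam_path: Union[str, Path]) -> Optional[Path]:
    """Return the .bai index next to a BAM file, or None if none is found.

    Assumption: the index is named either <name>.bam.bai or <name>.bai
    and sits in the same directory as the BAM.
    """
    bam = Path(bam_path)
    candidates = [
        bam.with_name(bam.name + ".bai"),  # sample.bam.bai
        bam.with_suffix(".bai"),           # sample.bai
    ]
    for candidate in candidates:
        if candidate.is_file():
            return candidate
    return None
```

Calling `find_bam_index('/path/to/tumor.bam')` before launching a run gives an early signal that an index is missing, rather than a failure partway through the pipeline.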
input.yaml should follow the standardized structure:
patient_id: 'patient_id'
input:
  normal:
    - BAM: /path/to/normal.bam
  tumor:
    - BAM: /path/to/tumor.bam
      contamination_table: /path/to/contamination.table
Mutect2 can take two other kinds of input: tumor-only samples and multiple samples from one patient. For tumor-only samples, remove the normal input from input.yaml, e.g. template_tumor_only.yaml. For multiple samples, put all the input BAMs in input.yaml, e.g. template_multi_sample.yaml. Note that for these non-standard inputs, the configuration file must have 'mutect2' listed as the only algorithm.
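The rule above can be sketched as a small check on the parsed contents of input.yaml. This is an illustration only, not the pipeline's own validation: the function names are hypothetical, and it assumes that "standard" means exactly one normal BAM and exactly one tumor BAM.

```python
def is_standard_input(sample_input: dict) -> bool:
    """True when there is exactly one normal and one tumor entry (assumed definition)."""
    normals = sample_input.get("normal") or []
    tumors = sample_input.get("tumor") or []
    return len(normals) == 1 and len(tumors) == 1


def check_algorithms(sample_input: dict, algorithms: list) -> None:
    """Raise ValueError if a non-standard input is paired with anything but mutect2 alone."""
    if not is_standard_input(sample_input) and list(algorithms) != ["mutect2"]:
        raise ValueError(
            "Tumor-only and multi-sample inputs require algorithm = ['mutect2']"
        )


# The 'input' section of the standardized structure, as a Python dict.
standard = {
    "normal": [{"BAM": "/path/to/normal.bam"}],
    "tumor": [{
        "BAM": "/path/to/tumor.bam",
        "contamination_table": "/path/to/contamination.table",
    }],
}
# Tumor-only input: the same structure with the normal entry removed.
tumor_only = {"tumor": standard["tumor"]}

check_algorithms(standard, ["mutect2", "strelka2"])  # passes
check_algorithms(tumor_only, ["mutect2"])            # passes
```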
| Input | Required | Type | Description |
|:------|:---------|:-----|:------------|
| algorithm | yes | list | List containing a combination of somaticsniper, strelka2, mutect2 and muse |
| reference | yes | string | The reference .fa file (.fai and .dict files must exist in the same directory) |
| intersect_regions* | yes | string | A bed file listing the genomic regions for variant calling. Excluding decoy regions is HIGHLY recommended* |
| output_dir | yes | string | The location where outputs will be saved |
| dataset_id | yes | string | The name/ID of the dataset |
| exome | yes | boolean | This option is used by Strelka2 and MuSE. When true, it adds the --exome option to Manta and Strelka2, and the -E option to MuSE |
| save_intermediate_files | yes | boolean | Whether to save intermediate files |
| work_dir | no | string | The path of the working directory for Nextflow, storing intermediate files and logs. The default is /scratch with ucla_cds and should only be changed for testing/development. Changing this directory to /hot or /tmp can lead to high server latency and potential disk space limitations, respectively |
| docker_container_registry | no | string | Registry containing tool Docker images. Default: ghcr.io/uclahs-cds |
| base_resource_update | no | namespace | Namespace of parameters to update base resource allocations in the pipeline. Usage and structure are detailed in template.config and below. |
*Providing intersect_regions is required and will limit the final output to just those regions. All regions of the reference genome could be provided as a bed file with all contigs; however, it is HIGHLY recommended to remove decoy contigs from the human reference genome. Including these thousands of small contigs will require the user to increase available memory for Mutect2 and will cause a very long runtime for Strelka2. See Discussion here. For uclahs-cds users, a GRCh38 bed.gz file can be found here: /hot/ref/tool-specific-input/pipeline-call-sSNV-6.0.0/GRCh38-BI-20160721/Homo_sapiens_assembly38_no-decoy.bed.gz.
Base resource allocation updaters
To optionally update the base resource (cpus or memory) allocations for processes, add the following structure, with the necessary parts filled in, to the input.config file. The default allocations can be found in the node-specific config files.
base_resource_update {
    memory = [
        [['process_name', 'process_name2'], <multiplier for resource>],
        [['process_name3', 'process_name4'], <different multiplier for resource>]
    ]
    cpus = [
        [['process_name', 'process_name2'], <multiplier for resource>],
        [['process_name3', 'process_name4'], <different multiplier for resource>]
    ]
}
Note: Resource updates are applied in the order they are provided, so if a process is included twice in the memory list, it will be updated twice, in the order given.
Examples:
- To double memory of all processes:
base_resource_update {
    memory = [
        [[], 2]
    ]
}
- To double memory for call_sSNV_Mutect2 and triple memory for run_validate_PipeVal and run_sump_MuSE:
base_resource_update {
    memory = [
        ['call_sSNV_Mutect2', 2],
        [['run_validate_PipeVal', 'run_sump_MuSE'], 3]
    ]
}
- To double CPUs and memory for run_sump_MuSE and double memory for run_validate_PipeVal:
base_resource_update {
    cpus = [
        ['run_sump_MuSE', 2]
    ]
    memory = [
        [['run_sump_MuSE', 'run_validate_PipeVal'], 2]
    ]
}
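To make the ordering rule concrete, here is a small Python simulation of how a list of updates might compound. It is a sketch and not the pipeline's implementation: the base memory values are invented for illustration, `apply_updates` is a hypothetical name, and it assumes each update multiplies the current allocation, that an empty process list means all processes, and that a bare string names a single process.

```python
def apply_updates(base: dict, updates: list) -> dict:
    """Apply [processes, multiplier] pairs in order and return the new allocations."""
    result = dict(base)
    for processes, multiplier in updates:
        if isinstance(processes, str):
            processes = [processes]  # a bare name targets one process
        # An empty list is treated as "every process" (as in the first example).
        targets = processes if processes else list(result)
        for name in targets:
            if name in result:
                result[name] *= multiplier
    return result


# Hypothetical base memory allocations in GB (not the pipeline's defaults).
base_memory_gb = {
    "call_sSNV_Mutect2": 4,
    "run_validate_PipeVal": 1,
    "run_sump_MuSE": 8,
}

# Double everything, then triple Mutect2: Mutect2 is updated twice, in order.
updates = [[[], 2], ["call_sSNV_Mutect2", 3]]
updated = apply_updates(base_memory_gb, updates)
print(updated)
```

Under these assumptions, call_sSNV_Mutect2 ends at 4 x 2 x 3 = 24 GB, while the other two processes are only doubled.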
Module Specific Configuration
| Input | Required | Type | Description |
|:------|:---------|:-----|:------------|
| bgzip_extra_args | no | string | The extra option used for compressing VCFs |
| tabix_extra_args | no | string | The extra option used for indexing VCFs |
Mutect2 Specific Configuration
| Input | Required | Type | Description |
|:------|:---------|:-----|:------------|
| split_intervals_extra_args | no | string | Additional arguments for the SplitIntervals command |
| mutect2_extra_args | no | string | Additional arguments for the Mutect2 command |
| filter_mutect_calls_extra_args | no | string | Additional arguments for the FilterMutectCalls command |
| gatk_command_mem_diff | yes | nextflow.util.MemoryUnit | How much to subtract from the task's allocated memory; the remainder is the Java heap max. Should not be changed unless a task fails for memory-related reasons |
| scatter_count | yes | int | Number of intervals to split the desired interval into. Mutect2 will call each interval separately. |
| germline_resource_gnomad_vcf | no | path | A version of the gnomAD VCF stripped of all unneeded INFO fields, keeping only AF. Currently available for GRCh38: /hot/ref/tool-specific-input/GATK/GRCh38/af-only-gnomad.hg38.vcf.gz and GRCh37: /hot/ref/tool-specific-input/GATK/GRCh37/af-only-gnomad.raw.sites.vcf |
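The gatk_command_mem_diff description amounts to a subtraction, sketched below in Python. The numbers are made up, the function name is hypothetical, and formatting the result as a JVM `-Xmx` option is an assumption about how a heap maximum is typically expressed, not a statement of how the pipeline passes it to GATK.

```python
def java_heap_option(task_memory_mb: int, mem_diff_mb: int) -> str:
    """Return a JVM max-heap flag for the memory left after subtracting the diff."""
    heap_mb = task_memory_mb - mem_diff_mb
    if heap_mb <= 0:
        raise ValueError("gatk_command_mem_diff must be smaller than the task memory")
    return f"-Xmx{heap_mb}m"


# Hypothetical values: a task allocated 8 GB with a 2 GB difference.
print(java_heap_option(8 * 1024, 2 * 1024))
```

With these example values the heap maximum is 6144 MB, leaving 2 GB of the task's allocation for non-heap JVM overhead.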
MuSE Specific Configuration
| Input | Required | Type | Description |
|:------|:---------|:-----|:------------|
| dbSNP | yes | path | The path to NCBI's dbSNP database of known SNPs in VCF format, e.g. GCF_000001405.40.gz |
Variant Intersection Specific Configuration
| Input | Required | Type | Description |
|:------|:---------|:-----|:------------|
| ncbi_build | yes | string | The reference genome build ID required by vcf2maf, e.g. GRCh38 |
| vcf2maf_extra_args | no | string | Additional arguments for the vcf2maf command |