Inputs

Input CSV Fields

The input CSV must contain all of the columns below, in the same order. An example input CSV can be found here.

Field	Type	Description
read_group_identifier	string	The read group each read belongs to. This is concatenated with the `lane` column (see below) and then passed to the `ID` field of the final BAM. No white space is allowed. For more detail see here.
sequencing_center	string	The sequencing center where the data were produced. This is passed to the `CN` field of the final BAM. No white space is allowed. For more detail see here.
library_identifier	string	The library identifier to be passed to the `LB` field of the final BAM. No white space is allowed. For more detail see here.
platform_technology	string	The platform or technology used to produce the reads. This is passed to the `PL` field of the final BAM. No white space is allowed. For more detail see here.
platform_unit	string	The platform unit to be passed to the `PU` field of the final BAM. No white space is allowed. For more detail see here.
lane	string	The lane name or index. This is concatenated with the `read_group_identifier` column (see above) and then passed to the `ID` field of the final BAM. No white space is allowed. For more detail see here.
sample	string	The sample name to be passed to the `SM` field of the final BAM. No white space is allowed. For more detail see here.
read1_fastq	path	Absolute path to the R1 fastq file.
read2_fastq	path	Absolute path to the R2 fastq file.
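
The field rules above (column order, no white space, absolute FASTQ paths) can be checked before a run. The following is a minimal sketch, not part of the pipeline: `validate_input_csv` is a hypothetical helper, and it assumes the CSV starts with a header row containing the column names exactly as listed in the table.

```python
import csv
import io
import posixpath

# Column names and order taken from the field table above.
EXPECTED_COLUMNS = [
    "read_group_identifier",
    "sequencing_center",
    "library_identifier",
    "platform_technology",
    "platform_unit",
    "lane",
    "sample",
    "read1_fastq",
    "read2_fastq",
]
PATH_COLUMNS = {"read1_fastq", "read2_fastq"}


def validate_input_csv(handle):
    """Return a list of problems found in the CSV (empty if none).

    Hypothetical helper; assumes the first row is a header row.
    """
    problems = []
    reader = csv.reader(handle)
    header = next(reader, None)
    if header != EXPECTED_COLUMNS:
        problems.append(f"header mismatch: expected {EXPECTED_COLUMNS}, got {header}")
        return problems
    for row_number, row in enumerate(reader, start=2):
        if not row:
            continue  # skip blank lines
        if len(row) != len(EXPECTED_COLUMNS):
            problems.append(
                f"row {row_number}: expected {len(EXPECTED_COLUMNS)} fields, got {len(row)}"
            )
            continue
        for name, value in zip(EXPECTED_COLUMNS, row):
            if name in PATH_COLUMNS:
                # FASTQ paths must be absolute (POSIX-style paths assumed).
                if not posixpath.isabs(value):
                    problems.append(f"row {row_number}: {name} is not an absolute path")
            elif not value or any(ch.isspace() for ch in value):
                # Read group fields must be non-empty and free of white space.
                problems.append(f"row {row_number}: {name} is empty or contains white space")
    return problems


if __name__ == "__main__":
    example = ",".join(EXPECTED_COLUMNS) + "\n" + (
        "rg1,CENTER,lib1,ILLUMINA,unit1,L001,sampleA,"
        "/data/a_R1.fastq.gz,/data/a_R2.fastq.gz\n"
    )
    for problem in validate_input_csv(io.StringIO(example)):
        print(problem)
```

The sketch only checks the form of each value; it does not verify that the FASTQ files exist or that the read group values are meaningful.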

Config File Parameters

Input Parameter	Required	Type	Description
`sample_name`	yes	string	The sample name. This is ignored if the output files are saved directly to the Boutros Lab data storage registry by setting `ucla_cds_registered_dataset_output = true`.
`input_csv`	yes	path	Absolute path to the input CSV. See here for an example and the table above for details of the required fields.
`reference_fasta_bwa`	yes for BWA-MEM2	path	Absolute path to the reference genome `fasta` file. The reference genome is used by BWA-MEM2 for alignment.
`reference_fasta_hisat2`	yes for HISAT2	path	Absolute path to the reference genome `fasta` file. The reference genome is used by HISAT2 for alignment.
`hisat2_index_prefix`	yes for HISAT2	path	Absolute path up to the genome index basename. The index must be generated by the `hisat2-build` command.
`aligner`	yes	list	The aligners to use, given as a list of strings. Current options: `BWA-MEM2, HISAT2`.
`output_dir`	yes	path	Absolute path to the directory where the output files are to be saved. This is ignored if the output files are saved directly to the Boutros Lab data storage registry by setting `ucla_cds_registered_dataset_output = true`.
`save_intermediate_files`	yes	boolean	Whether to save intermediate files. If true, the unmerged, unsorted, and duplicates-unmarked BAM files are saved in addition to the final BAM.
`cache_intermediate_pipeline_steps`	yes	boolean	Enable caching so that a failed pipeline can resume from the end of the last successfully completed process (if true, the default submission script must be modified).
`mark_duplicates`	no	boolean	Whether to mark duplicates. When false, the processes that mark duplicates are disabled and the pipeline stops at the sorting step, outputting a sorted, indexed, unmerged BAM with unmarked duplicates. Setting this to false is recommended for high-coverage targeted panel sequencing datasets. Defaults to true, which marks duplicates as usual.
`enable_spark`	yes	boolean	Enable use of Spark processes. When true, `MarkDuplicatesSpark` will be used. When false, `MarkDuplicates` will be used. Default value is true.
`spark_temp_dir`	no	path	Path to temp dir for Spark processes. When included in the sample config file, Spark intermediate files will be saved to this directory. Defaults to `/scratch` and should only be changed for testing/development. Changing this directory to `/hot` or `/tmp` can lead to high server latency and potential disk space limitations, respectively.
`spark_metrics`	no	boolean	Whether Spark should generate `*.mark_dup.metrics`.
`work_dir`	no	path	Path of working directory for Nextflow. When included in the sample config file, Nextflow intermediate files and logs will be saved to this directory. With ucla_cds, the default is `/scratch` and should only be changed for testing/development. Changing this directory to `/hot` or `/tmp` can lead to high server latency and potential disk space limitations, respectively.
`max_number_of_parallel_jobs`	no	int	The maximum number of jobs or steps of the pipeline that can be run in parallel. Default is 1. Be very cautious about setting this to any value larger than 1, as it may cause out-of-memory errors. It may be helpful when running on a high-memory computing node.
`bwa_mem_number_of_cpus`	no	int	Number of cores to use for BWA-MEM2. If not set, this will be calculated to ensure at least 2.5 GB of memory per core.
`ucla_cds_registered_dataset_input`	yes	boolean	Input FASTQs are from the Boutros Lab data registry.
`ucla_cds_registered_dataset_output`	yes	boolean	Enable saving final files, including the BAM and BAM index, and logging files directly to the Boutros Lab data registry.
`dataset_id`	no	string	The registered dataset ID of this dataset from the Boutros Lab data registry. Ignored if `ucla_cds_registered_dataset_input = true` or `ucla_cds_registered_dataset_output = false`.
`patient_id`	no	string	The registered patient ID of this sample from the Boutros Lab data registry. Ignored if `ucla_cds_registered_dataset_input = true` or `ucla_cds_registered_dataset_output = false`.
`sample_id`	no	string	The registered sample ID from the Boutros Lab data registry. Ignored if `ucla_cds_registered_dataset_input = true` or `ucla_cds_registered_dataset_output = false`.
`disable_alt_aware`	yes	boolean	Whether to disable the default alt-aware mode for BWA-MEM2. The default behavior of alt-aware mode is to consider the `.alt` file if it exists in the directory with the reference file.
`ucla_cds_data_dir`	no	string	The directory where registered data is located. Default: `/hot/data`
`ucla_cds_reference_genome_version`	no	string	Identifier for the version of the reference genome.
`check_node_config`	no	boolean	Whether to check the pre-configured node settings used to set CPU and memory constraints. The default behavior, whether the parameter is `true` or undefined, is to check the pre-configured node settings. Set to `false` to skip this check.
`docker_container_registry`	no	string	Registry containing tool Docker images. Default: `ghcr.io/uclahs-cds`
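
Some parameters in the table are required only for a particular aligner. The following is a minimal sketch of that dependency, not the pipeline's own validation: `check_params` is a hypothetical helper, and the plain dictionary stands in for the values set in the config file.

```python
# Parameters marked "yes" in the Required column of the table above.
ALWAYS_REQUIRED = [
    "sample_name",
    "input_csv",
    "aligner",
    "output_dir",
    "save_intermediate_files",
    "cache_intermediate_pipeline_steps",
    "enable_spark",
    "ucla_cds_registered_dataset_input",
    "ucla_cds_registered_dataset_output",
    "disable_alt_aware",
]

# Parameters marked "yes for BWA-MEM2" or "yes for HISAT2".
ALIGNER_REQUIRED = {
    "BWA-MEM2": ["reference_fasta_bwa"],
    "HISAT2": ["reference_fasta_hisat2", "hisat2_index_prefix"],
}


def check_params(params):
    """Return a list of problems with a parameter dictionary (empty if none).

    Hypothetical helper; `params` is an assumed dict view of the config file.
    """
    problems = []
    for name in ALWAYS_REQUIRED:
        if name not in params:
            problems.append(f"missing required parameter: {name}")
    for aligner in params.get("aligner", []):
        if aligner not in ALIGNER_REQUIRED:
            problems.append(f"unknown aligner: {aligner}")
            continue
        for name in ALIGNER_REQUIRED[aligner]:
            if name not in params:
                problems.append(f"{aligner} requires parameter: {name}")
    return problems


if __name__ == "__main__":
    # Placeholder values for illustration only.
    example = {name: True for name in ALWAYS_REQUIRED}
    example["aligner"] = ["BWA-MEM2", "HISAT2"]
    example["reference_fasta_bwa"] = "/path/to/genome.fa"
    for problem in check_params(example):
        print(problem)
```

The sketch checks only that parameters are present; it does not check types, paths, or the registry-related parameters that are ignored under certain settings.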