Inputs

Input CSV Fields

The input CSV must contain all of the columns below, in the same order. An example input CSV can be found here.

Field Type Description
read_group_identifier string The read group each read belongs to. This is concatenated with the lane column (see below) and then passed to the ID field of the final BAM. No white space is allowed. For more detail see here.
sequencing_center string The sequencing center where the data were produced. This is passed to the CN field of the final BAM. No white space is allowed. For more detail see here.
library_identifier string The library identifier to be passed to the LB field of the final BAM. No white space is allowed. For more detail see here.
platform_technology string The platform or technology used to produce the reads. This is passed to the PL field of the final BAM. No white space is allowed. For more detail see here.
platform_unit string The platform unit to be passed to the PU field of the final BAM. No white space is allowed. For more detail see here.
lane string The lane name or index. This is concatenated with the read_group_identifier column (see above) and then passed to the ID field of the final BAM. No white space is allowed. For more detail see here.
sample string The sample name to be passed to the SM field of the final BAM. No white space is allowed. For more detail see here.
read1_fastq path Absolute path to the R1 fastq file.
read2_fastq path Absolute path to the R2 fastq file.
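
The field rules above (column order, no white space, absolute FASTQ paths) can be checked before launching the pipeline. The following is a minimal sketch, not part of the pipeline itself: the function name validate_input_csv is hypothetical, and it assumes the CSV starts with a header row containing the field names exactly as listed and that paths are POSIX-style.

```python
import csv
import io
import posixpath

# Column names and order as listed in the field table above.
EXPECTED_COLUMNS = [
    "read_group_identifier",
    "sequencing_center",
    "library_identifier",
    "platform_technology",
    "platform_unit",
    "lane",
    "sample",
    "read1_fastq",
    "read2_fastq",
]
PATH_COLUMNS = {"read1_fastq", "read2_fastq"}


def validate_input_csv(handle):
    """Return a list of problems found in an open CSV text stream.

    Hypothetical helper: assumes a header row and POSIX-style paths.
    """
    reader = csv.reader(handle)
    header = next(reader, None)
    if header != EXPECTED_COLUMNS:
        return [f"header must be exactly: {','.join(EXPECTED_COLUMNS)}"]

    problems = []
    for line_number, row in enumerate(reader, start=2):
        if not row:
            continue  # skip completely blank lines
        if len(row) != len(EXPECTED_COLUMNS):
            problems.append(
                f"line {line_number}: expected {len(EXPECTED_COLUMNS)} "
                f"fields, found {len(row)}"
            )
            continue
        for name, value in zip(EXPECTED_COLUMNS, row):
            if name in PATH_COLUMNS:
                # FASTQ paths must be absolute.
                if not posixpath.isabs(value):
                    problems.append(
                        f"line {line_number}: {name} is not an absolute path"
                    )
            elif value == "" or any(ch.isspace() for ch in value):
                # Read group fields must be non-empty with no white space.
                problems.append(
                    f"line {line_number}: {name} is empty or contains white space"
                )
    return problems


# Example with made-up values; real use would pass open("input.csv", newline="").
example = ",".join(EXPECTED_COLUMNS) + "\n" + (
    "rg1,CenterA,lib1,ILLUMINA,unit1,L001,sampleA,"
    "/data/a_R1.fastq.gz,/data/a_R2.fastq.gz\n"
)
print(validate_input_csv(io.StringIO(example)))
```

The check only covers the constraints stated in the table; it does not confirm that the FASTQ files exist or that the values are accepted by downstream tools.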

Config File Parameters

Input Parameter Required Type Description
sample_name yes string The sample name. This is ignored if the output files are saved directly to the Boutros Lab data storage registry by setting ucla_cds_registered_dataset_output = true.
input_csv yes path Absolute path to the input CSV. See here for an example and the section above for details of the required fields.
reference_fasta_bwa yes for BWA-MEM2 path Absolute path to the reference genome fasta file. The reference genome is used by BWA-MEM2 for alignment.
reference_fasta_hisat2 yes for HISAT2 path Absolute path to the reference genome fasta file. The reference genome is used by HISAT2 for alignment.
hisat2_index_prefix yes for HISAT2 path Absolute path up to the genome index basename. The index must be generated by the hisat2-build command.
aligner yes list The aligners to use, given as a list of strings. Current options: BWA-MEM2, HISAT2.
output_dir yes path Absolute path to the directory where the output files are to be saved. This is ignored if the output files are saved directly to the Boutros Lab data storage registry by setting ucla_cds_registered_dataset_output = true.
save_intermediate_files yes boolean Whether to save intermediate files. If true, the unmerged, unsorted, and duplicates-unmarked BAM files are saved in addition to the final BAM.
cache_intermediate_pipeline_steps yes boolean Enable caching so that a failed pipeline can resume from the end of the last successfully completed process (if true, the default submission script must be modified).
mark_duplicates no boolean Whether to run the processes that mark duplicates. When false, the pipeline stops at the sorting step and outputs a sorted, indexed, unmerged BAM with unmarked duplicates; this is recommended for high-coverage targeted panel sequencing datasets. Defaults to true, which marks duplicates as usual.
enable_spark yes boolean Enable use of Spark processes. When true, MarkDuplicatesSpark will be used. When false, MarkDuplicates will be used. Default value is true.
spark_temp_dir no path Path to temp dir for Spark processes. When included in the sample config file, Spark intermediate files will be saved to this directory. Defaults to /scratch and should only be changed for testing/development. Changing this directory to /hot or /tmp can lead to high server latency and potential disk space limitations, respectively.
spark_metrics no boolean Whether Spark should generate the *.mark_dup.metrics file.
work_dir no path Path of working directory for Nextflow. When included in the sample config file, Nextflow intermediate files and logs will be saved to this directory. With ucla_cds, the default is /scratch and should only be changed for testing/development. Changing this directory to /hot or /tmp can lead to high server latency and potential disk space limitations, respectively.
max_number_of_parallel_jobs no int The maximum number of jobs or steps of the pipeline that can be run in parallel. Default is 1. Be very cautious about setting this to any value larger than 1, as it may cause out-of-memory errors. A larger value may be helpful when running on a big-memory computing node.
bwa_mem_number_of_cpus no int Number of cores to use for BWA-MEM2. If not set, this will be calculated to ensure at least 2.5 GB of memory per core.
ucla_cds_registered_dataset_input yes boolean Input FASTQs are from the Boutros Lab data registry.
ucla_cds_registered_dataset_output yes boolean Enable saving final files, including the BAM, BAM index, and logging files, directly to the Boutros Lab data registry.
dataset_id no string The registered dataset ID of this dataset from the Boutros Lab data registry. Ignored if ucla_cds_registered_dataset_input = true or ucla_cds_registered_dataset_output = false.
patient_id no string The registered patient ID of this sample from the Boutros Lab data registry. Ignored if ucla_cds_registered_dataset_input = true or ucla_cds_registered_dataset_output = false.
sample_id no string The registered sample ID from the Boutros Lab data registry. Ignored if ucla_cds_registered_dataset_input = true or ucla_cds_registered_dataset_output = false.
disable_alt_aware yes boolean Whether to disable the default alt-aware mode for BWA-MEM2. The default behavior of alt-aware mode is to consider the .alt file if it exists in the directory with the reference file.
ucla_cds_data_dir no string The directory where registered data is located. Default: /hot/data
ucla_cds_reference_genome_version no string Identifier for the version of the reference genome.
check_node_config no boolean Whether to check pre-configured node settings used to set CPU and memory constraints. The default behavior, whether the value is true or undefined, is to check the pre-configured node settings. Set to false to skip this check.
docker_container_registry no string Registry containing tool Docker images. Default: ghcr.io/uclahs-cds
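
The Required column above depends on which aligners are selected: reference_fasta_bwa is needed only for BWA-MEM2, while reference_fasta_hisat2 and hisat2_index_prefix are needed only for HISAT2. The sketch below illustrates that rule with a plain Python dictionary standing in for the config file. The function name check_params and the dictionary representation are assumptions made for illustration; the pipeline itself reads its own config file format, and the paths shown are made up.

```python
# Parameters the table marks as required regardless of the aligner chosen.
ALWAYS_REQUIRED = [
    "sample_name",
    "input_csv",
    "aligner",
    "output_dir",
    "save_intermediate_files",
    "cache_intermediate_pipeline_steps",
    "enable_spark",
    "ucla_cds_registered_dataset_input",
    "ucla_cds_registered_dataset_output",
    "disable_alt_aware",
]

# Parameters the table marks as required only for a given aligner.
REQUIRED_BY_ALIGNER = {
    "BWA-MEM2": ["reference_fasta_bwa"],
    "HISAT2": ["reference_fasta_hisat2", "hisat2_index_prefix"],
}


def check_params(params):
    """Return a list of problems for a dict of config parameters.

    Hypothetical helper: the dict is a stand-in for the real config file.
    """
    problems = [
        f"missing required parameter: {name}"
        for name in ALWAYS_REQUIRED
        if name not in params
    ]
    for aligner in params.get("aligner", []):
        if aligner not in REQUIRED_BY_ALIGNER:
            problems.append(f"unknown aligner: {aligner}")
            continue
        for name in REQUIRED_BY_ALIGNER[aligner]:
            if name not in params:
                problems.append(f"missing parameter required for {aligner}: {name}")
    return problems


# Example values are made up for illustration.
example_params = {
    "sample_name": "sampleA",
    "input_csv": "/path/to/input.csv",
    "aligner": ["BWA-MEM2"],
    "reference_fasta_bwa": "/path/to/genome.fa",
    "output_dir": "/path/to/output",
    "save_intermediate_files": False,
    "cache_intermediate_pipeline_steps": False,
    "enable_spark": True,
    "ucla_cds_registered_dataset_input": False,
    "ucla_cds_registered_dataset_output": False,
    "disable_alt_aware": False,
}
print(check_params(example_params))
```

Selecting both aligners in the aligner list means both sets of aligner-specific parameters must be present.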