Config

Field name	Required	Description
`input_csv`	yes	Path to the input CSV file See Input CSV.
`output_dir`	yes	Output directory.
`dataset_id`	yes	Dataset ID.
`sample_id`	yes	Sample ID.
`index_dir`	yes	Path the the genome index directory, generated by `moPepGen generateIndex`. See here for the detail of this command.
`ucla_cds`	no	Whether to use UCLA-CDS' cluster specific configuration. Defaults to `true`.
`save_intermediate_files`	no	Whether to save intermediate files. Defaults to `false`.
`entrypoint`	no	When set to `parser`, it expects to receive raw variant files. When set to `gvf`, it expects to receive GVF files that are already parsed by moPepGen's parsers.
`variant_peptide`	no	Path to the variant peptide FASTA file. Only need when using 'fasta' entrypoint.
`novel_orf_peptide`	no	Noncoding peptide database generated by moPepGen `callNovelORF`, to be split together (default: None)
`alt_translation_peptide`	no	Alternative translation peptide database generated by moPepGen `callAltTranslation`, to be split together (default: None)
`enable_filter_fasta`	no	Whether to run `filterFasta` on the variant, noncoding, and/or the merged FASTA file. Defaults to `false`. If `true` is given, corresponding namespaces must be specified under `params.enable_filter_fasta` according to `database_processing_modes`.
`exprs_table`	no	Gene expression table used to filter variant peptide FASTA. Required when `enable_filter_fasta` is `true`.
`database_processing_modes`	yes	Database postprocessing modes. Must be at least one of 'merge', 'split' and 'plain'. For 'merge', noncoding and variant peptides are merged into one database FASTA. For 'split', noncoding and variant peptides are split into separate database files. For 'plain', the FASTA file output by moPepGen is first filtered (if specified) and then encoded and decoyed. Filter (if specifed), encode and decoy database are done in the same way as 'plain' for 'merge' and 'split'.
`process_unfiltered_fasta`	no	Whether the unfiltered fasta files should be processed (filtering, encode and decoy). Defaults to `true` unless using FASTA entrypoint or `enable_filter_fasta` is `false`.
`enable_encode_fasta`	no	Whether to run `encodeFasta` on the variant peptide FASTA called by `callVariant` （runs once after `filterFasta` and `splitFasta`, if used). Defaults to `false`.
`enable_decoy_fasta`	no	Whether to run `decoyFasta` on the variant peptide FASTA called by `callVariant` (runs once after `filterFasta`, `splitFasta` and `encodeFasta`, if used). Defaults to `false`.

Tool specific namespaces

The variables below are set under tool specific namespaces. See this example config to see how they are set. If the tool is not used, the namespace does not needs to be set. For example, if REDItools results is not included in the input CSV, moPepGen parseREDItools won't be called, so the parseREDItools namespace does not need to be present in the config file.

parseREDItools

Field name	Required	Description
`transcript_id_column`	no	The column index for transcript ID. If your REDItools table doesnot contains it, use the `AnnotateTable.py` from the REDItoolspackage. (default: 16)
`min_coverage_alt`	no	Minimal read coverage of alterations to be parsed. (default: 3)
`min_frequency_alt`	no	Minimal frequency of alteration to be parsed. (default: 0.1)
`min_coverage_dna`	no	Minimal read coverage at the alteration site of WGS. Set it to -1 to skip checking this. (default: 10)

parseSTARFusion

Field name	Required	Description
`min_est_j`	no	Minimal estimated junction reads to be included. (default: 5.0)

parseArriba

Field name	Required	Description
`min_split_read1`	no	Minimal `split_read1` value. (default: 1)
`min_split_read2`	no	Minimal `split_read2` value. (default: 1)
`min_confidence`	no	Minimal confidence value. (default: medium)

parseFusionCatcher

Field name	Required	Description
`max_common_mapping`	no	Maximal number of common mapping reads. (default: 0)
`min_spanning_unique`	no	Minimal spanning unique reads. (default: 5)

parseCIRCexplorer

Field name	Required	Description
`min_read_number`	no	Minimal number of junction read counts. (default: 1)
`min_fpb_circ`	no	Minimal CRICscore value for CIRCexplorer3. Recommends to 1, defaults to None (default: None)
`min_circ_score`	no	Minimal CIRCscore value for CIRCexplorer3. Recommends to 1, defaults to None (default: None)
`intron_start_range`	no	The range of difference allowed between the intron start and the reference position. (default: -2,0)
`intron_end_range`	no	The range of difference allowed between the intron end and the reference position. (default: -100,5)

callVariant

Field name	Required	Description
`max_variants_per_node`	no	Maximal number of variants per node. This argument can be useful when there are local regions that are heavily mutated. When creating the cleavage graph, nodes containing variants larger than this value are skipped. Set to -1 to avoid this check. When multiple values are specified, they will be used as retry stretagy. (default: [7])
`additional_variants_per_misc`	no	Additional variants allowed for every miscleavage. This argument is used together with --max-variants-per-node to handle hypermutated regions. Set to -1 to avoid this check. When multiple values are specified, they will be used as retry stretagy. (default: [2])
`max_adjacent_as_mnv`	no	Max number of adjacent variants that should be merged. (default: 2)
`min_nodes_to_collapse`	no	When making the cleavage graph, the minimal number of nodes to trigger pop collapse. (default: 30)
`naa_to_collapse`	no	The number of bases used for pop collapse. (default: 5)
`selenocysteine-termination`	no	Include peptides of selenoproteins where the UGA is treated as termination instead of Sec.
`w2f_reassignment`	no	Include peptides with W > F (Tryptophan to Phenylalanine) reassignment.
`cleavage_rule`	no	Enzymatic cleavage rule. (default: trypsin)
`miscleavage`	no	Number of cleavages to allow per non-canonical peptide. (default: 2)
`min_mw`	no	The minimal molecular weight of the non-canonical peptides. (default: 500.0)
`min_length`	no	The minimal length of non-canonical peptides, inclusive. (default: 7)
`max_length`	no	The maximum length of non-canonical peptides, inclusive. (default: 25)
`timeout_seconds`	no	Time out in seconds for each transcript. (default: 1800)

filterFasta

Filter fasta can run separately for variant, noncoding, and alternative translation peptide FASTA, so this section can take up to four namespaces, named variant_peptide, novel_orf_peptide and alt_translation_peptide. The parameters allowed in each namespace are listed below. You can set quant_cutoff for variant peptides as 200 and for noncoding peptides as 100. If either namespace is not defined, the corresponding filter won't run.

Field name	Required	Description
`skip_lines`	no	Number of lines to skip when reading the expression table.Defaults to 0 (default: 0)
`delimiter`	no	Delimiter of the expression table. Defaults to tab. (default: '\t')
`tx_id_col`	yes	The index for transcript ID in the RNAseq quantification results. Index is 1-based. (default: None)
`quant_col`	yes	The column index number for quantification. Index is 1-based. (default: None)
`quant_cutoff`	yes	Quantification cutoff. (default: None)
`keep_all_coding`	no	Keep all coding genes, regardless of their expression level. (default: false)
`keep_all_noncoding`	no	Keep all noncoding genes, regardless of their expression level. (default: false)

splitFasta

Field name	Required	Description
`order_source`	no	Order of sources, separate by comma. E.g., SNP,SNV,Fusion (default: None)
`group_source`	no	Group sources. E.g., PointMutation:gSNP,sSNV INDEL:gINDEL,sINDEL (default: None)
`max_source_groups`	no	Maximal number of different source groups to be separate intoindividual database FASTA files. Defaults to 1 (default: 1)
`additional_split`	no	For peptides that were not already split into FASTAs up tomax_source_groups, those involving the following source will be splitinto additional FASTAs with decreasing priority (default: None)

summarizeFasta

Field name	Required	Description
`order_source`	no	Order of sources, separate by comma. E.g., SNP,SNV,Fusion (default: None)
`cleavage_rule`	no	Enzymatic cleavage rule. (default: trypsin)
`invalid_protein_as_noncoding`	no	Treat any transcript that the protein sequence is invalid (contains the * symbol) as noncoding. (default: False)

decoyFasta

Field name	Required	Description
`decoy_string`	no	The decoy string that is combined with the FASTA header for decoy sequences. `str` Default: `DECOY_`
`decoy_string_position`	no	Should the decoy string be placed at the start or end of FASTA headers? `str` Default: 'prefix', Choices: ['prefix', 'suffix']
`method`	no	Method to be used to generate the decoy sequences from target sequences. `str`. Default: 'reverse'. Choices: ['reverse', 'shuffle']
`non_shuffle_pattern`	no	Residues to not shuffle and keep at the original position. Separate by common (e.g. "K,R") str
`shuffle_max_attempts`	no	Maximal attempts to shuffle a sequence to avoid any identical decoy sequence. `int` Default: 30
`seed`	no	Random seed number. `int`
`order`	no	Order of target and decoy sequences to write in the output FASTA. `str` Default: 'juxtaposed'. Choices: ['juxtaposed', 'target_first', 'decoy_first']
`keep_peptide_nterm`	Whether to keep the peptide N terminus constant. `str`. Default: 'true' Choices: ['true', 'false']
`keep_peptide_cterm`	no	Whether to keep the peptide C terminus constant. `str` Default: 'true'. Choices: ['true', 'false']