callVariant

callVariant is the core of moPepGen. It takes multiple GVF files, generated by any moPepGen parser, and calls variant peptides caused by genomic variants using a graph-based algorithm. For each transcript, it creates a three-frame transcript variant graph by incorporating all variants from all sources (SNV, INDEL, fusion, alternative splicing, RNA editing, and circRNA). The transcript variant graph is then translated into a peptide variant graph, which is in turn converted to a cleavage graph based on the enzymatic cleavage rule. The cleavage graph is then used to call variant peptides that contain at least one variant and are not present in the canonical peptide pool.
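The last step, keeping only peptides that are absent from the canonical peptide pool, can be illustrated with a deliberately simplified sketch. This is not moPepGen's graph algorithm: it digests plain linear sequences with a simplified trypsin rule (cleave after K or R, except before P) and takes a set difference. The function names and the toy sequences are hypothetical.

```python
def cleavage_pieces(protein):
    """Split a protein after every K or R that is not followed by P."""
    pieces, start = [], 0
    for i, aa in enumerate(protein):
        next_aa = protein[i + 1] if i + 1 < len(protein) else ""
        if aa in "KR" and next_aa != "P":
            pieces.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        pieces.append(protein[start:])
    return pieces


def digest(protein, miscleavages=2):
    """Return all peptides with up to `miscleavages` missed cleavage sites."""
    pieces = cleavage_pieces(protein)
    peptides = set()
    for i in range(len(pieces)):
        last = min(i + miscleavages, len(pieces) - 1)
        for j in range(i, last + 1):
            peptides.add("".join(pieces[i:j + 1]))
    return peptides


def call_variant_peptides(canonical, variant, miscleavages=2,
                          min_length=7, max_length=25):
    """Keep variant peptides that are not in the canonical peptide pool."""
    novel = digest(variant, miscleavages) - digest(canonical, miscleavages)
    return {p for p in novel if min_length <= len(p) <= max_length}


# Toy sequences that differ by a single substitution (G -> D at position 12).
canonical = "MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRK"
variant = "MTEYKLVVVGADGVGKSALTIQLIQNHFVDEYDPTIEDSYRK"
print(sorted(call_variant_peptides(canonical, variant, miscleavages=1)))
```

With one allowed miscleavage, only the peptides that carry the substitution and fall within the length limits are reported; peptides shared with the canonical digest are dropped.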

Reference

Reference data, including the reference genome, genome annotation, and protein coding translation, are required. There are two ways of specifying reference data:

1. Using the index dir created by the generateIndex command.
2. Specifying each reference file needed.

Option 1 is highly recommended as it is faster and helps ensure that the same reference data are used across the project.
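The two ways of supplying reference data can be sketched as a small helper that builds the argument list for a callVariant run. The helper name and all file names are hypothetical; only the flags themselves come from the usage text on this page.

```python
import shlex


def reference_args(index_dir=None, genome=None, annotation=None, proteome=None):
    """Return the reference arguments for one of the two supported options."""
    if index_dir is not None:
        # Option 1: one index directory replaces the individual files.
        return ["--index-dir", index_dir]
    files = {"genome": genome, "annotation": annotation, "proteome": proteome}
    missing = [name for name, value in files.items() if value is None]
    if missing:
        raise ValueError("missing reference files: " + ", ".join(missing))
    # Option 2: every reference file is given explicitly.
    return [
        "--genome-fasta", genome,
        "--annotation-gtf", annotation,
        "--proteome-fasta", proteome,
    ]


# Hypothetical input and output paths, used for illustration only.
command = [
    "moPepGen", "callVariant",
    "-i", "variants.gvf",
    "-o", "variant_peptides.fasta",
    *reference_args(index_dir="index"),
]
print(shlex.join(command))
```

Building the list in one place makes it harder to mix an index directory from one release with individual files from another.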

Codon Table & Start Codon

Codon Table

The NCBI standard codon table is used by default; it applies to the majority of nuclear gene translation in eukaryotic cells. See the NCBI genetic codes page for a complete list of NCBI codon tables.

The default codon table can be overridden using --codon-table. For example:

--codon-table 'Ciliate Nuclear'

The --chr-codon-table argument can be used to specify the codon table used for a specific chromosome. The example below uses the 'Vertebrate Mitochondrial' (SGC1) codon table for genes on the mitochondrial chromosome, and the standard codon table otherwise.

--codon-table Standard \
--chr-codon-table 'chrM:SGC1'
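As a rough illustration of what a chromosome-specific codon table changes, the sketch below parses 'chrM:SGC1'-style values and looks up a codon under the table selected for a chromosome. It is not moPepGen's parser; the function names are hypothetical and the tables are intentionally partial, listing only the four codons that differ between the standard and the vertebrate mitochondrial code.

```python
# Partial tables: only the codons that differ between the two codes.
CODON_TABLES = {
    "Standard": {"TGA": "*", "AGA": "R", "AGG": "R", "ATA": "I"},
    "SGC1": {"TGA": "W", "AGA": "*", "AGG": "*", "ATA": "M"},
}


def parse_chr_codon_table(specs):
    """Turn values such as 'chrM:SGC1' into a {chromosome: table} mapping."""
    mapping = {}
    for spec in specs:
        chrom, sep, table = spec.partition(":")
        if not sep or not chrom or not table:
            raise ValueError(f"expected '<chromosome>:<table>', got {spec!r}")
        mapping[chrom] = table
    return mapping


def decode(codon, chrom, default_table, chr_tables):
    """Translate one codon with the table that applies to `chrom`."""
    table = chr_tables.get(chrom, default_table)
    return CODON_TABLES[table][codon]


chr_tables = parse_chr_codon_table(["chrM:SGC1"])
print(decode("TGA", "chr1", "Standard", chr_tables))  # stop codon
print(decode("TGA", "chrM", "Standard", chr_tables))  # tryptophan
```

The same codon is read as a stop on a nuclear chromosome and as tryptophan on chrM, which is why the chromosome name in the argument has to match the reference exactly.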

Start Codons

Start codons usually do not need to be specified. The standard start codon ATG is used by default, and it is translated as Methionine both as a start codon and during elongation. However, in some cases, for example in mitochondria, ATA and ATT may also be used as start codons. While ATT is translated into Isoleucine during elongation, it is still translated as Methionine when used as a start codon.

Similar to the codon table, the default start codons can be overridden using --start-codons.

--start-codons ATG

The --chr-start-codons argument can also be used to assign start codons to a specific chromosome. The example below assigns ATG, ATA, and ATT to the mitochondrial chromosome.

--chr-start-codons 'chrM:ATG,ATA,ATT'
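The difference between a codon used as a start codon and the same codon used during elongation can be shown with a minimal sketch. It assumes a partial vertebrate mitochondrial table containing only the codons in the toy sequence; the function names and the sequence are hypothetical and do not reflect moPepGen's internals.

```python
# Partial vertebrate mitochondrial table, limited to the codons used below.
ELONGATION = {"ATG": "M", "ATA": "M", "ATT": "I", "GCC": "A", "TAA": "*"}


def parse_chr_start_codons(spec):
    """Split a value such as 'chrM:ATG,ATA,ATT' into (chromosome, codons)."""
    chrom, sep, codons = spec.partition(":")
    if not sep or not chrom or not codons:
        raise ValueError(f"expected '<chromosome>:<codon>,...', got {spec!r}")
    return chrom, set(codons.split(","))


def translate_orf(sequence, start_codons):
    """Translate an ORF whose first codon must be an allowed start codon."""
    codons = [sequence[i:i + 3] for i in range(0, len(sequence) - 2, 3)]
    if not codons or codons[0] not in start_codons:
        raise ValueError("sequence does not begin with an allowed start codon")
    peptide = ["M"]  # any accepted start codon is read as Methionine
    for codon in codons[1:]:
        amino_acid = ELONGATION[codon]
        if amino_acid == "*":
            break
        peptide.append(amino_acid)
    return "".join(peptide)


chrom, start_codons = parse_chr_start_codons("chrM:ATG,ATA,ATT")
print(translate_orf("ATTGCCATTTAA", start_codons))
```

In the toy sequence the first ATT is read as Methionine, while the second ATT, inside the ORF, is read as Isoleucine.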

Default

The chromosome names must be specified correctly, matching those used in the genome FASTA and annotation GTF files. By default, moPepGen infers the reference source of the annotation (i.e., GENCODE or ENSEMBL), and uses the 'SGC1' codon table for the mitochondrial chromosome. So the default is equivalent to:

--reference-source GENCODE \
--codon-table Standard \
--chr-codon-table 'chrM:SGC1' \
--start-codons 'ATG' \
--chr-start-codons 'chrM:ATG,ATA,ATT'

or

--reference-source ENSEMBL \
--codon-table Standard \
--chr-codon-table 'MT:SGC1' \
--start-codons 'ATG' \
--chr-start-codons 'MT:ATG,ATA,ATT'
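The only difference between the two default sets is the name of the mitochondrial chromosome. A small sketch makes that explicit; the helper name is hypothetical, and the values simply mirror the two argument sets listed in this section.

```python
# Mitochondrial chromosome name used by each reference source.
MITO_CHROMOSOME = {"GENCODE": "chrM", "ENSEMBL": "MT"}


def default_codon_args(reference_source):
    """Spell out the codon defaults described above as explicit arguments."""
    mito = MITO_CHROMOSOME[reference_source]
    return [
        "--reference-source", reference_source,
        "--codon-table", "Standard",
        "--chr-codon-table", f"{mito}:SGC1",
        "--start-codons", "ATG",
        "--chr-start-codons", f"{mito}:ATG,ATA,ATT",
    ]


for source in ("GENCODE", "ENSEMBL"):
    print(" ".join(default_codon_args(source)))
```

Passing a chromosome name from the wrong naming scheme would leave the mitochondrial genes on the standard table, so it helps to derive the name from the reference source rather than typing it by hand.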

Usage

usage: moPepGen callVariant [-h] -i ['<files>'] [['<files>'] ...] -o <file>
                            [--graph-output-dir <file>]
                            [--max-adjacent-as-mnv MAX_ADJACENT_AS_MNV]
                            [--selenocysteine-termination]
                            [--w2f-reassignment] [--backsplicing-only]
                            [--coding-novel-orf]
                            [--max-variants-per-node <number> [<number> ...]]
                            [--additional-variants-per-misc <number> [<number> ...]]
                            [--in-bubble-cap-step-down <number>]
                            [--min-nodes-to-collapse <number>]
                            [--naa-to-collapse <number>]
                            [--noncanonical-transcripts]
                            [--timeout-seconds TIMEOUT_SECONDS]
                            [--threads <number>] [--skip-failed] [-g <file>]
                            [-a <file>] [--reference-source {GENCODE,ENSEMBL}]
                            [--codon-table {SGC5,Hexamita Nuclear,Euplotid Nuclear,Ascidian Mitochondrial,Thraustochytrium Mitochondrial,Standard,Gracilibacteria,SGC8,Condylostoma Nuclear,Flatworm Mitochondrial,SGC1,Protozoan Mitochondrial,SGC3,Pachysolen tannophilus Nuclear,Archaeal,Chlorophycean Mitochondrial,Pterobranchia Mitochondrial,Echinoderm Mitochondrial,Blastocrithidia Nuclear,Invertebrate Mitochondrial,SGC0,Mycoplasma,Yeast Mitochondrial,Mesodinium Nuclear,Vertebrate Mitochondrial,Bacterial,Trematode Mitochondrial,SGC4,Balanophoraceae Plastid,Blepharisma Macronuclear,Mold Mitochondrial,Karyorelict Nuclear,Spiroplasma,SGC9,Scenedesmus obliquus Mitochondrial,Plant Plastid,Coelenterate Mitochondrial,Alternative Yeast Nuclear,Dasycladacean Nuclear,Candidate Division SR1,SGC2,Cephalodiscidae Mitochondrial,Peritrich Nuclear,Ciliate Nuclear,Alternative Flatworm Mitochondrial}]
                            [--chr-codon-table [CHR_CODON_TABLE [CHR_CODON_TABLE ...]]]
                            [--start-codons [START_CODONS [START_CODONS ...]]]
                            [--chr-start-codons [CHR_START_CODONS [CHR_START_CODONS ...]]]
                            [-p <file>] [--invalid-protein-as-noncoding]
                            [--index-dir [<file>]] [-c <value>]
                            [--cleavage-exception <value>] [-m <number>]
                            [-w <number>] [-l <number>] [-x <number>]
                            [--debug-level <value|number>] [-q]

Genomic variant data must be generated by one of the moPepGen parser. See
moPepGen --help

optional arguments:
  -h, --help            show this help message and exit
  -i ['<files>'] [['<files>'] ...], --input-path ['<files>'] [['<files>'] ...]
                        File path to GVF files. Must be generated by any of
                        the moPepGen parsers. Can take multiple files. Valid
                        formats: ['.gvf'] (default: None)
  -o <file>, --output-path <file>
                        File path to the output file. Valid formats:
                        ['.fasta', '.fa'] (default: None)
  --graph-output-dir <file>
                        Directory path that graph data are saved to. Graph
                        data are not saved if this is not given. (default:
                        None)
  --max-adjacent-as-mnv MAX_ADJACENT_AS_MNV
                        Max number of adjacent variants that should be merged.
                        (default: 2)
  --selenocysteine-termination
                        Include peptides of selenoprotiens that the UGA is
                        treated as termination instead of Sec. (default:
                        False)
  --w2f-reassignment    Include peptides with W > F (Tryptophan to
                        Phenylalanine) reassignment. (default: False)
  --backsplicing-only   For circRNA, only keep noncanonical peptides spaning
                        the backsplicing site. (default: False)
  --coding-novel-orf    Find alternative start site for coding transcripts.
                        (default: False)
  --max-variants-per-node <number> [<number> ...]
                        Maximal number of variants per node. This argument can
                        be useful when there are local regions that are
                        heavily mutated. When creating the cleavage graph,
                        nodes containing variants larger than this value are
                        skipped. Setting to -1 will avoid checking for this.
                        (default: (7,))
  --additional-variants-per-misc <number> [<number> ...]
                        Additional variants allowed for every miscleavage.
                        This argument is used together with --max-variants-
                        per-node to handle hypermutated regions. Setting to -1
                        will avoid checking for this. (default: (2,))
  --in-bubble-cap-step-down <number>
                        In bubble variant caps default step down. (default: 0)
  --min-nodes-to-collapse <number>
                        When making the cleavage graph, the minimal number of
                        nodes to trigger pop collapse. (default: 30)
  --naa-to-collapse <number>
                        The number of bases used for pop collapse. (default:
                        5)
  --noncanonical-transcripts
                        Process only noncanonical transcripts of fusion
                        transcripts and circRNA. Canonical transcripts are
                        skipped. (default: False)
  --timeout-seconds TIMEOUT_SECONDS
                        Time out in seconds for each transcript. (default:
                        1800)
  --threads <number>    Set number of threads to be used. (default: 1)
  --skip-failed         When set, the failed records will be skipped.
                        (default: False)
  --debug-level <value|number>
                        Debug level. (default: INFO)
  -q, --quiet           Quiet (default: False)

Reference Files:
  -g <file>, --genome-fasta <file>
                        Path to the genome assembly FASTA file. Only ENSEMBL
                        and GENCODE are supported. Its version must be the
                        same as the annotation GTF and proteome FASTA
                        (default: None)
  -a <file>, --annotation-gtf <file>
                        Path to the annotation GTF file. Only ENSEMBL and
                        GENCODE are supported. Its version must be the same as
                        the genome and proteome FASTA. (default: None)
  --reference-source {GENCODE,ENSEMBL}
                        Source of reference genome and annotation. (default:
                        None)
  --codon-table {SGC5,Hexamita Nuclear,Euplotid Nuclear,Ascidian Mitochondrial,Thraustochytrium Mitochondrial,Standard,Gracilibacteria,SGC8,Condylostoma Nuclear,Flatworm Mitochondrial,SGC1,Protozoan Mitochondrial,SGC3,Pachysolen tannophilus Nuclear,Archaeal,Chlorophycean Mitochondrial,Pterobranchia Mitochondrial,Echinoderm Mitochondrial,Blastocrithidia Nuclear,Invertebrate Mitochondrial,SGC0,Mycoplasma,Yeast Mitochondrial,Mesodinium Nuclear,Vertebrate Mitochondrial,Bacterial,Trematode Mitochondrial,SGC4,Balanophoraceae Plastid,Blepharisma Macronuclear,Mold Mitochondrial,Karyorelict Nuclear,Spiroplasma,SGC9,Scenedesmus obliquus Mitochondrial,Plant Plastid,Coelenterate Mitochondrial,Alternative Yeast Nuclear,Dasycladacean Nuclear,Candidate Division SR1,SGC2,Cephalodiscidae Mitochondrial,Peritrich Nuclear,Ciliate Nuclear,Alternative Flatworm Mitochondrial}
                        Codon table. Defaults to "Standard". Supported codon
                        tables: {'SGC5', 'Hexamita Nuclear', 'Euplotid
                        Nuclear', 'Ascidian Mitochondrial', 'Thraustochytrium
                        Mitochondrial', 'Standard', 'Gracilibacteria', 'SGC8',
                        'Condylostoma Nuclear', 'Flatworm Mitochondrial',
                        'SGC1', 'Protozoan Mitochondrial', 'SGC3', 'Pachysolen
                        tannophilus Nuclear', 'Archaeal', 'Chlorophycean
                        Mitochondrial', 'Pterobranchia Mitochondrial',
                        'Echinoderm Mitochondrial', 'Blastocrithidia Nuclear',
                        'Invertebrate Mitochondrial', 'SGC0', 'Mycoplasma',
                        'Yeast Mitochondrial', 'Mesodinium Nuclear',
                        'Vertebrate Mitochondrial', 'Bacterial', 'Trematode
                        Mitochondrial', 'SGC4', 'Balanophoraceae Plastid',
                        'Blepharisma Macronuclear', 'Mold Mitochondrial',
                        'Karyorelict Nuclear', 'Spiroplasma', 'SGC9',
                        'Scenedesmus obliquus Mitochondrial', 'Plant Plastid',
                        'Coelenterate Mitochondrial', 'Alternative Yeast
                        Nuclear', 'Dasycladacean Nuclear', 'Candidate Division
                        SR1', 'SGC2', 'Cephalodiscidae Mitochondrial',
                        'Peritrich Nuclear', 'Ciliate Nuclear', 'Alternative
                        Flatworm Mitochondrial'} (default: Standard)
  --chr-codon-table [CHR_CODON_TABLE [CHR_CODON_TABLE ...]]
                        Chromosome specific codon table. Must be specified in
                        the format of "chrM:SGC1", where "chrM" is the
                        chromosome name and "SGC1" is the codon table to use
                        to translate genes on chrM. Supported codon tables:
                        {'SGC5', 'Hexamita Nuclear', 'Euplotid Nuclear',
                        'Ascidian Mitochondrial', 'Thraustochytrium
                        Mitochondrial', 'Standard', 'Gracilibacteria', 'SGC8',
                        'Condylostoma Nuclear', 'Flatworm Mitochondrial',
                        'SGC1', 'Protozoan Mitochondrial', 'SGC3', 'Pachysolen
                        tannophilus Nuclear', 'Archaeal', 'Chlorophycean
                        Mitochondrial', 'Pterobranchia Mitochondrial',
                        'Echinoderm Mitochondrial', 'Blastocrithidia Nuclear',
                        'Invertebrate Mitochondrial', 'SGC0', 'Mycoplasma',
                        'Yeast Mitochondrial', 'Mesodinium Nuclear',
                        'Vertebrate Mitochondrial', 'Bacterial', 'Trematode
                        Mitochondrial', 'SGC4', 'Balanophoraceae Plastid',
                        'Blepharisma Macronuclear', 'Mold Mitochondrial',
                        'Karyorelict Nuclear', 'Spiroplasma', 'SGC9',
                        'Scenedesmus obliquus Mitochondrial', 'Plant Plastid',
                        'Coelenterate Mitochondrial', 'Alternative Yeast
                        Nuclear', 'Dasycladacean Nuclear', 'Candidate Division
                        SR1', 'SGC2', 'Cephalodiscidae Mitochondrial',
                        'Peritrich Nuclear', 'Ciliate Nuclear', 'Alternative
                        Flatworm Mitochondrial'}. By default, "SGC1" is
                        assigned to mitochondrial chromosomes. (default: [])
  --start-codons [START_CODONS [START_CODONS ...]]
                        Default start codon(s) to use for novel ORF
                        translation. Defaults to ["ATG"]. (default: ['ATG'])
  --chr-start-codons [CHR_START_CODONS [CHR_START_CODONS ...]]
                        Chromosome specific start codon(s). For example,
                        "chrM:ATG,ATA,ATT".By defualt, mitochondrial
                        chromosome name is automatically inferred andstart
                        codon "ATG", "ATA", "ATT", "ATC" and "GTG" are
                        assigned to it. (default: [])
  -p <file>, --proteome-fasta <file>
                        Path to the translated protein sequence FASTA file.
                        Only ENSEMBL and GENCODE are supported. Its version
                        must be the same as genome FASTA and annotation GTF.
                        (default: None)
  --invalid-protein-as-noncoding
                        Treat any transcript that the protein sequence is
                        invalid ( contains the * symbol) as noncoding.
                        (default: False)
  --index-dir [<file>]  Path to the directory of index files generated by
                        moPepGen generateIndex. If given, --genome-fasta,
                        --proteome-fasta and --anntotation-gtf will be
                        ignored. (default: None)

Cleavage Parameters:
  -c <value>, --cleavage-rule <value>
                        Enzymatic cleavage rule. (default: trypsin)
  --cleavage-exception <value>
                        Enzymatic cleavage exception. (default: auto)
  -m <number>, --miscleavage <number>
                        Number of cleavages to allow per non-canonical
                        peptide. (default: 2)
  -w <number>, --min-mw <number>
                        The minimal molecular weight of the non-canonical
                        peptides. (default: 500.0)
  -l <number>, --min-length <number>
                        The minimal length of non-canonical peptides,
                        inclusive. (default: 7)
  -x <number>, --max-length <number>
                        The maximum length of non-canonical peptides,
                        inclusive. (default: 25)

Arguments

-h, --help

show this help message and exit

-i, --input-path ['<files>'] Path

File path to GVF files. Must be generated by any of the moPepGen parsers. Can take multiple files. Valid formats: ['.gvf']

-o, --output-path <file> Path

File path to the output file. Valid formats: ['.fasta', '.fa']

--graph-output-dir <file> Path

Directory path that graph data are saved to. Graph data are not saved if this is not given.

--max-adjacent-as-mnv int

Maximum number of adjacent variants that should be merged.
Default: 2

--selenocysteine-termination

Include peptides from selenoproteins in which UGA is treated as a termination codon instead of Sec.
Default: False

--w2f-reassignment

Include peptides with W > F (Tryptophan to Phenylalanine) reassignment.
Default: False

--backsplicing-only

For circRNA, only keep noncanonical peptides spanning the backsplicing site.
Default: False

--coding-novel-orf

Find alternative start site for coding transcripts.
Default: False

--max-variants-per-node <number> int

Maximal number of variants per node. This argument can be useful when there are local regions that are heavily mutated. When creating the cleavage graph, nodes containing more variants than this value are skipped. Setting to -1 disables this check.
Default: (7,)

--additional-variants-per-misc <number> int

Additional variants allowed for every miscleavage. This argument is used together with --max-variants-per-node to handle hypermutated regions. Setting to -1 disables this check.
Default: (2,)

--in-bubble-cap-step-down <number> int

In bubble variant caps default step down.
Default: 0

--min-nodes-to-collapse <number> int

When making the cleavage graph, the minimal number of nodes to trigger pop collapse.
Default: 30

--naa-to-collapse <number> int

The number of bases used for pop collapse.
Default: 5

--noncanonical-transcripts

Process only noncanonical transcripts of fusion transcripts and circRNA. Canonical transcripts are skipped.
Default: False

--timeout-seconds int

Timeout in seconds for each transcript.
Default: 1800

--threads <number> int

Set number of threads to be used.
Default: 1

--skip-failed

When set, the failed records will be skipped.
Default: False

-g, --genome-fasta <file> Path

Path to the genome assembly FASTA file. Only ENSEMBL and GENCODE are supported. Its version must be the same as the annotation GTF and proteome FASTA.

-a, --annotation-gtf <file> Path

Path to the annotation GTF file. Only ENSEMBL and GENCODE are supported. Its version must be the same as the genome and proteome FASTA.

--reference-source str

Source of reference genome and annotation.
Choices: ['GENCODE', 'ENSEMBL']

--codon-table str

Codon table.
Default: Standard
Choices: {'SGC5', 'Hexamita Nuclear', 'Euplotid Nuclear', 'Ascidian Mitochondrial', 'Thraustochytrium Mitochondrial', 'Standard', 'Gracilibacteria', 'SGC8', 'Condylostoma Nuclear', 'Flatworm Mitochondrial', 'SGC1', 'Protozoan Mitochondrial', 'SGC3', 'Pachysolen tannophilus Nuclear', 'Archaeal', 'Chlorophycean Mitochondrial', 'Pterobranchia Mitochondrial', 'Echinoderm Mitochondrial', 'Blastocrithidia Nuclear', 'Invertebrate Mitochondrial', 'SGC0', 'Mycoplasma', 'Yeast Mitochondrial', 'Mesodinium Nuclear', 'Vertebrate Mitochondrial', 'Bacterial', 'Trematode Mitochondrial', 'SGC4', 'Balanophoraceae Plastid', 'Blepharisma Macronuclear', 'Mold Mitochondrial', 'Karyorelict Nuclear', 'Spiroplasma', 'SGC9', 'Scenedesmus obliquus Mitochondrial', 'Plant Plastid', 'Coelenterate Mitochondrial', 'Alternative Yeast Nuclear', 'Dasycladacean Nuclear', 'Candidate Division SR1', 'SGC2', 'Cephalodiscidae Mitochondrial', 'Peritrich Nuclear', 'Ciliate Nuclear', 'Alternative Flatworm Mitochondrial'}

--chr-codon-table str

Chromosome-specific codon table. Must be specified in the format "chrM:SGC1", where "chrM" is the chromosome name and "SGC1" is the codon table used to translate genes on chrM. The supported codon tables are the same as the choices listed for --codon-table. By default, "SGC1" is assigned to mitochondrial chromosomes.
Default: []

--start-codons str

Default start codon(s) to use for novel ORF translation.
Default: ['ATG']

--chr-start-codons str

Chromosome-specific start codon(s). For example, "chrM:ATG,ATA,ATT". By default, the mitochondrial chromosome name is automatically inferred and start codons "ATG", "ATA", "ATT", "ATC" and "GTG" are assigned to it.
Default: []

-p, --proteome-fasta <file> Path

Path to the translated protein sequence FASTA file. Only ENSEMBL and GENCODE are supported. Its version must be the same as genome FASTA and annotation GTF.

--invalid-protein-as-noncoding

Treat any transcript whose protein sequence is invalid (contains the * symbol) as noncoding.
Default: False

--index-dir <file> Path

Path to the directory of index files generated by moPepGen generateIndex. If given, --genome-fasta, --proteome-fasta and --annotation-gtf will be ignored.

-c, --cleavage-rule <value> str

Enzymatic cleavage rule.
Default: trypsin
Choices: ['arg-c', 'asp-n', 'bnps-skatole', 'caspase 1', 'caspase 2', 'caspase 3', 'caspase 4', 'caspase 5', 'caspase 6', 'caspase 7', 'caspase 8', 'caspase 9', 'caspase 10', 'chymotrypsin high specificity', 'chymotrypsin low specificity', 'clostripain', 'cnbr', 'enterokinase', 'factor xa', 'formic acid', 'glutamyl endopeptidase', 'granzyme b', 'hydroxylamine', 'iodosobenzoic acid', 'lysc', 'lysn', 'ntcb', 'pepsin ph1.3', 'pepsin ph2.0', 'proline endopeptidase', 'proteinase k', 'staphylococcal peptidase i', 'thermolysin', 'thrombin', 'trypsin', 'trypsin_exception']

--cleavage-exception <value> str

Enzymatic cleavage exception.
Default: auto

-m, --miscleavage <number> int

Number of miscleavages to allow per non-canonical peptide.
Default: 2

-w, --min-mw <number> float

The minimal molecular weight of the non-canonical peptides.
Default: 500.0

-l, --min-length <number> int

The minimal length of non-canonical peptides, inclusive.
Default: 7

-x, --max-length <number> int

The maximum length of non-canonical peptides, inclusive.
Default: 25
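How the length and molecular weight limits combine can be sketched as a simple filter. This is an illustration under stated assumptions, not moPepGen's implementation: it uses monoisotopic residue masses plus one water, and treats the mass threshold as inclusive; the tool may compute molecular weight differently. The function names are hypothetical.

```python
# Monoisotopic residue masses in daltons (assumption: the tool may instead
# use average masses or another convention).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056


def peptide_mass(peptide):
    """Sum the residue masses and add one water for the free termini."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def passes_filters(peptide, min_mw=500.0, min_length=7, max_length=25):
    """Apply the length limits (inclusive) and the minimal mass."""
    if not min_length <= len(peptide) <= max_length:
        return False
    return peptide_mass(peptide) >= min_mw


for peptide in ("GGGGGGG", "LVVVGADGVGK", "MTEYK"):
    print(peptide, passes_filters(peptide))
```

The first peptide meets the length limit but is too light, the second passes both checks, and the third is shorter than the minimal length.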

--debug-level <value|number> str

Debug level.
Default: INFO

-q, --quiet

Quiet
Default: False