callVariant
callVariant is the core of moPepGen. It takes multiple GVF files, generated
by any moPepGen parser, and calls variant peptides caused by genomic variants
using a graph-based algorithm. For each transcript, it creates a three-frame
transcript variant graph by incorporating all variants from all sources (SNV,
INDEL, fusion, alternative splicing, RNA editing, and circRNA). The transcript
variant graph is then translated into a peptide variant graph, which is in turn
converted to a cleavage graph based on the enzymatic cleavage rule. The
resulting graph is then used to call variant peptides that contain at least
one variant and are not present in the canonical peptide pool.
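The idea of "variant peptides not present in the canonical pool" can be illustrated with a deliberately simplified sketch. It is not the graph-based algorithm moPepGen uses: it handles a single SNV in a single reading frame by brute force, uses a simplified trypsin rule (cleave after K or R, no exceptions, no miscleavages), and all function names and the toy sequence are hypothetical.

```python
from itertools import product

BASES = "TCAG"
# NCBI standard codon table, with codons enumerated in TCAG order.
STANDARD_AA = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), STANDARD_AA)}


def translate(seq):
    """Translate from position 0 and stop at the first stop codon."""
    protein = []
    for i in range(0, len(seq) - 2, 3):
        aa = STANDARD[seq[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


def trypsin_digest(protein):
    """Simplified trypsin rule: cleave after every K or R."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        if aa in "KR":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])
    return peptides


def apply_snv(seq, pos, alt):
    """Substitute a single base at a 0-based position."""
    return seq[:pos] + alt + seq[pos + 1:]


def call_variant_peptides(reference, snv_pos, snv_alt):
    """Return peptides of the mutated transcript that are absent from
    the canonical (reference) peptide pool."""
    canonical = set(trypsin_digest(translate(reference)))
    variant = trypsin_digest(translate(apply_snv(reference, snv_pos, snv_alt)))
    return {pep for pep in variant if pep not in canonical}


# Toy transcript encoding M-A-K-G-E-R-W-L-K followed by a stop codon.
reference = "ATGGCTAAAGGTGAACGTTGGCTGAAATAA"
variant_peptides = call_variant_peptides(reference, 10, "A")
```

Working on graphs rather than enumerating mutated sequences is what allows many variants from many sources to be combined per transcript; the sketch only conveys the final filtering step.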
Reference
Reference data, including the reference genome, genome annotation, and protein coding translation, are required. There are two ways of specifying reference data:
1. Using the index directory created by the generateIndex command.
2. Specifying each reference file needed.
Option 1 is highly recommended as it is faster and helps ensure that the same reference data are used across the project.
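As an illustration of the two options, the sketch below assembles a callVariant argument list in Python using only flags documented on this page. The helper name and every file path are hypothetical, and the helper only builds the command; it does not run moPepGen.

```python
import shlex


def build_call_variant_cmd(gvf_files, output_fasta, index_dir=None,
                           genome=None, annotation=None, proteome=None):
    """Build a `moPepGen callVariant` argument list (hypothetical helper).

    Either `index_dir` or all three reference files must be given.
    """
    cmd = ["moPepGen", "callVariant",
           "--input-path", *gvf_files,
           "--output-path", output_fasta]
    if index_dir is not None:
        # Option 1: reuse the index directory created by generateIndex.
        cmd += ["--index-dir", index_dir]
    elif genome and annotation and proteome:
        # Option 2: pass each reference file explicitly.
        cmd += ["--genome-fasta", genome,
                "--annotation-gtf", annotation,
                "--proteome-fasta", proteome]
    else:
        raise ValueError("give index_dir, or genome, annotation and proteome")
    return cmd


# Hypothetical paths, for illustration only.
cmd = build_call_variant_cmd(["snv.gvf", "indel.gvf"], "variant.fasta",
                             index_dir="path/to/index")
print(shlex.join(cmd))
```

The resulting list could be passed to `subprocess.run` as-is, which avoids shell quoting issues with arguments that contain spaces.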
Codon Table & Start Codon
Codon Table
The NCBI standard codon table is used by default, as it applies to the majority of nuclear gene translation in eukaryotic cells. See the NCBI genetic codes page for a complete list of NCBI codon tables.
The default codon table can be overridden using --codon-table. For example:
--codon-table 'Ciliate Nuclear'
The --chr-codon-table option can be used to specify the codon table used for a specific chromosome. The example below uses the 'Vertebrate Mitochondrial' (SGC1) codon table for genes on the mitochondrial chromosome, and the standard codon table otherwise.
--codon-table Standard \
--chr-codon-table 'chrM:SGC1'
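To make the effect of a chromosome-specific codon table concrete, here is a minimal sketch of how a "chrM:SGC1" style specification could be resolved to a table. It is an illustration under stated assumptions, not moPepGen's code: the function names are hypothetical, and only the two tables used in the example above are defined. The vertebrate mitochondrial table differs from the standard one at four codons (AGA and AGG are stops, ATA encodes Methionine, TGA encodes Tryptophan).

```python
from itertools import product

BASES = "TCAG"
# NCBI standard codon table, with codons enumerated in TCAG order.
STANDARD_AA = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), STANDARD_AA)}

# Vertebrate Mitochondrial (SGC1): four codons differ from Standard.
SGC1 = {**STANDARD, "AGA": "*", "AGG": "*", "ATA": "M", "TGA": "W"}

CODON_TABLES = {"Standard": STANDARD, "SGC1": SGC1}


def parse_chr_codon_table(specs):
    """Turn ['chrM:SGC1', ...] into {'chrM': 'SGC1', ...}."""
    mapping = {}
    for spec in specs:
        chrom, table = spec.split(":", 1)
        mapping[chrom] = table
    return mapping


def codon_table_for(chrom, default, chr_tables):
    """Pick the chromosome-specific table if one is set, else the default."""
    return CODON_TABLES[chr_tables.get(chrom, default)]


chr_tables = parse_chr_codon_table(["chrM:SGC1"])
mito = codon_table_for("chrM", "Standard", chr_tables)
nuclear = codon_table_for("chr1", "Standard", chr_tables)
```

With these settings, TGA reads through as Tryptophan on chrM but terminates translation on every other chromosome.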
Start Codons
Start codons usually do not need to be specified. The standard start codon ATG is used by default, and it is translated as Methionine both as a start codon and in elongation. However, in some cases, for example in mitochondria, ATA and ATT may also be used as start codons. While ATT is translated into Isoleucine during elongation, it is still translated into Methionine when used as a start codon.
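The distinction between initiation and elongation can be sketched as follows. This is an illustrative toy, assuming the input sequence begins exactly at the start codon; the function name and the example sequence are hypothetical, and the codon tables are built inline so the block is self-contained.

```python
from itertools import product

BASES = "TCAG"
# NCBI standard codon table, with codons enumerated in TCAG order.
STANDARD_AA = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), STANDARD_AA)}
# Vertebrate Mitochondrial (SGC1): four codons differ from Standard.
SGC1 = {**STANDARD, "AGA": "*", "AGG": "*", "ATA": "M", "TGA": "W"}


def translate_orf(seq, start_codons, codon_table):
    """Translate an ORF whose first codon must be an allowed start codon.

    The start codon always yields Methionine; every later codon is
    looked up in the codon table as usual.
    """
    if seq[:3] not in start_codons:
        raise ValueError(f"{seq[:3]} is not an allowed start codon")
    protein = ["M"]
    for i in range(3, len(seq) - 2, 3):
        aa = codon_table[seq[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


# ATT initiates (Methionine) and later elongates (Isoleucine).
peptide = translate_orf("ATTGCAATTTAA", {"ATG", "ATA", "ATT"}, SGC1)
```

The same ATT codon therefore contributes two different residues depending on its position in the ORF.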
Similar to the codon table, the default start codons can be overridden using --start-codons.
--start-codons ATG
The --chr-start-codons option can also be used to assign start codons to a specific chromosome. The example below assigns ATG, ATA, and ATT to the mitochondrial chromosome.
--chr-start-codons 'chrM:ATG,ATA,ATT'
Default
The chromosome names must be specified exactly as they appear in the genome FASTA and annotation GTF files. By default, moPepGen infers the reference source of the annotation (i.e., GENCODE or ENSEMBL), and uses the 'SGC1' codon table for the mitochondrial chromosome. So the default is equivalent to:
--reference-source GENCODE \
--codon-table Standard \
--chr-codon-table 'chrM:SGC1' \
--start-codons 'ATG' \
--chr-start-codons 'chrM:ATG,ATA,ATT'
or
--reference-source ENSEMBL \
--codon-table Standard \
--chr-codon-table 'MT:SGC1' \
--start-codons 'ATG' \
--chr-start-codons 'MT:ATG,ATA,ATT'
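The two equivalent option sets differ only in the mitochondrial chromosome name. The sketch below captures that relationship as a small lookup; it mirrors the option sets listed above rather than moPepGen's internal logic, and the function name and dictionary keys are hypothetical.

```python
# Mitochondrial chromosome name used by each reference source,
# as shown in the two option sets above.
MITO_CHROM = {"GENCODE": "chrM", "ENSEMBL": "MT"}


def default_codon_settings(reference_source):
    """Return the default codon-related options for a reference source."""
    mito = MITO_CHROM[reference_source]
    return {
        "reference_source": reference_source,
        "codon_table": "Standard",
        "chr_codon_table": [f"{mito}:SGC1"],
        "start_codons": ["ATG"],
        "chr_start_codons": [f"{mito}:ATG,ATA,ATT"],
    }


gencode_defaults = default_codon_settings("GENCODE")
ensembl_defaults = default_codon_settings("ENSEMBL")
```

Passing a chromosome name that matches neither the FASTA nor the GTF would leave the mitochondrial settings unused, which is why the names must match the reference files exactly.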
Usage
usage: moPepGen callVariant [-h] -i ['<files>'] [['<files>'] ...] -o <file>
[--graph-output-dir <file>]
[--max-adjacent-as-mnv MAX_ADJACENT_AS_MNV]
[--selenocysteine-termination]
[--w2f-reassignment] [--backsplicing-only]
[--coding-novel-orf]
[--max-variants-per-node <number> [<number> ...]]
[--additional-variants-per-misc <number> [<number> ...]]
[--in-bubble-cap-step-down <number>]
[--min-nodes-to-collapse <number>]
[--naa-to-collapse <number>]
[--noncanonical-transcripts]
[--timeout-seconds TIMEOUT_SECONDS]
[--threads <number>] [--skip-failed] [-g <file>]
[-a <file>] [--reference-source {GENCODE,ENSEMBL}]
[--codon-table {Alternative Yeast Nuclear,Protozoan Mitochondrial,Vertebrate Mitochondrial,Blepharisma Macronuclear,Chlorophycean Mitochondrial,Ascidian Mitochondrial,Ciliate Nuclear,Mesodinium Nuclear,Balanophoraceae Plastid,SGC9,Cephalodiscidae Mitochondrial,Trematode Mitochondrial,Pachysolen tannophilus Nuclear,SGC2,Yeast Mitochondrial,SGC5,Euplotid Nuclear,Scenedesmus obliquus Mitochondrial,Peritrich Nuclear,Archaeal,Coelenterate Mitochondrial,Bacterial,Mold Mitochondrial,SGC3,Hexamita Nuclear,Pterobranchia Mitochondrial,Plant Plastid,Condylostoma Nuclear,Blastocrithidia Nuclear,Gracilibacteria,Alternative Flatworm Mitochondrial,Echinoderm Mitochondrial,Invertebrate Mitochondrial,SGC0,Candidate Division SR1,Dasycladacean Nuclear,SGC4,Flatworm Mitochondrial,SGC8,Thraustochytrium Mitochondrial,SGC1,Spiroplasma,Mycoplasma,Standard,Karyorelict Nuclear}]
[--chr-codon-table [CHR_CODON_TABLE [CHR_CODON_TABLE ...]]]
[--start-codons [START_CODONS [START_CODONS ...]]]
[--chr-start-codons [CHR_START_CODONS [CHR_START_CODONS ...]]]
[-p <file>] [--invalid-protein-as-noncoding]
[--index-dir [<file>]] [-c <value>]
[--cleavage-exception <value>] [-m <number>]
[-w <number>] [-l <number>] [-x <number>]
[--debug-level <value|number>] [-q]
Genomic variant data must be generated by one of the moPepGen parser. See
moPepGen --help
optional arguments:
-h, --help show this help message and exit
-i ['<files>'] [['<files>'] ...], --input-path ['<files>'] [['<files>'] ...]
File path to GVF files. Must be generated by any of
the moPepGen parsers. Can take multiple files. Valid
formats: ['.gvf'] (default: None)
-o <file>, --output-path <file>
File path to the output file. Valid formats:
['.fasta', '.fa'] (default: None)
--graph-output-dir <file>
Directory path that graph data are saved to. Graph
data are not saved if this is not given. (default:
None)
--max-adjacent-as-mnv MAX_ADJACENT_AS_MNV
Max number of adjacent variants that should be merged.
(default: 2)
--selenocysteine-termination
Include peptides of selenoprotiens that the UGA is
treated as termination instead of Sec. (default:
False)
--w2f-reassignment Include peptides with W > F (Tryptophan to
Phenylalanine) reassignment. (default: False)
--backsplicing-only For circRNA, only keep noncanonical peptides spaning
the backsplicing site. (default: False)
--coding-novel-orf Find alternative start site for coding transcripts.
(default: False)
--max-variants-per-node <number> [<number> ...]
Maximal number of variants per node. This argument can
be useful when there are local regions that are
heavily mutated. When creating the cleavage graph,
nodes containing variants larger than this value are
skipped. Setting to -1 will avoid checking for this.
(default: (7,))
--additional-variants-per-misc <number> [<number> ...]
Additional variants allowed for every miscleavage.
This argument is used together with --max-variants-
per-node to handle hypermutated regions. Setting to -1
will avoid checking for this. (default: (2,))
--in-bubble-cap-step-down <number>
In bubble variant caps default step down. (default: 0)
--min-nodes-to-collapse <number>
When making the cleavage graph, the minimal number of
nodes to trigger pop collapse. (default: 30)
--naa-to-collapse <number>
The number of bases used for pop collapse. (default:
5)
--noncanonical-transcripts
Process only noncanonical transcripts of fusion
transcripts and circRNA. Canonical transcripts are
skipped. (default: False)
--timeout-seconds TIMEOUT_SECONDS
Time out in seconds for each transcript. (default:
1800)
--threads <number> Set number of threads to be used. (default: 1)
--skip-failed When set, the failed records will be skipped.
(default: False)
--debug-level <value|number>
Debug level. (default: INFO)
-q, --quiet Quiet (default: False)
Reference Files:
-g <file>, --genome-fasta <file>
Path to the genome assembly FASTA file. Only ENSEMBL
and GENCODE are supported. Its version must be the
same as the annotation GTF and proteome FASTA
(default: None)
-a <file>, --annotation-gtf <file>
Path to the annotation GTF file. Only ENSEMBL and
GENCODE are supported. Its version must be the same as
the genome and proteome FASTA. (default: None)
--reference-source {GENCODE,ENSEMBL}
Source of reference genome and annotation. (default:
None)
--codon-table {Alternative Yeast Nuclear,Protozoan Mitochondrial,Vertebrate Mitochondrial,Blepharisma Macronuclear,Chlorophycean Mitochondrial,Ascidian Mitochondrial,Ciliate Nuclear,Mesodinium Nuclear,Balanophoraceae Plastid,SGC9,Cephalodiscidae Mitochondrial,Trematode Mitochondrial,Pachysolen tannophilus Nuclear,SGC2,Yeast Mitochondrial,SGC5,Euplotid Nuclear,Scenedesmus obliquus Mitochondrial,Peritrich Nuclear,Archaeal,Coelenterate Mitochondrial,Bacterial,Mold Mitochondrial,SGC3,Hexamita Nuclear,Pterobranchia Mitochondrial,Plant Plastid,Condylostoma Nuclear,Blastocrithidia Nuclear,Gracilibacteria,Alternative Flatworm Mitochondrial,Echinoderm Mitochondrial,Invertebrate Mitochondrial,SGC0,Candidate Division SR1,Dasycladacean Nuclear,SGC4,Flatworm Mitochondrial,SGC8,Thraustochytrium Mitochondrial,SGC1,Spiroplasma,Mycoplasma,Standard,Karyorelict Nuclear}
Codon table. Defaults to "Standard". Supported codon
tables: {'Alternative Yeast Nuclear', 'Protozoan
Mitochondrial', 'Vertebrate Mitochondrial',
'Blepharisma Macronuclear', 'Chlorophycean
Mitochondrial', 'Ascidian Mitochondrial', 'Ciliate
Nuclear', 'Mesodinium Nuclear', 'Balanophoraceae
Plastid', 'SGC9', 'Cephalodiscidae Mitochondrial',
'Trematode Mitochondrial', 'Pachysolen tannophilus
Nuclear', 'SGC2', 'Yeast Mitochondrial', 'SGC5',
'Euplotid Nuclear', 'Scenedesmus obliquus
Mitochondrial', 'Peritrich Nuclear', 'Archaeal',
'Coelenterate Mitochondrial', 'Bacterial', 'Mold
Mitochondrial', 'SGC3', 'Hexamita Nuclear',
'Pterobranchia Mitochondrial', 'Plant Plastid',
'Condylostoma Nuclear', 'Blastocrithidia Nuclear',
'Gracilibacteria', 'Alternative Flatworm
Mitochondrial', 'Echinoderm Mitochondrial',
'Invertebrate Mitochondrial', 'SGC0', 'Candidate
Division SR1', 'Dasycladacean Nuclear', 'SGC4',
'Flatworm Mitochondrial', 'SGC8', 'Thraustochytrium
Mitochondrial', 'SGC1', 'Spiroplasma', 'Mycoplasma',
'Standard', 'Karyorelict Nuclear'} (default: Standard)
--chr-codon-table [CHR_CODON_TABLE [CHR_CODON_TABLE ...]]
Chromosome specific codon table. Must be specified in
the format of "chrM:SGC1", where "chrM" is the
chromosome name and "SGC1" is the codon table to use
to translate genes on chrM. Supported codon tables:
{'Alternative Yeast Nuclear', 'Protozoan
Mitochondrial', 'Vertebrate Mitochondrial',
'Blepharisma Macronuclear', 'Chlorophycean
Mitochondrial', 'Ascidian Mitochondrial', 'Ciliate
Nuclear', 'Mesodinium Nuclear', 'Balanophoraceae
Plastid', 'SGC9', 'Cephalodiscidae Mitochondrial',
'Trematode Mitochondrial', 'Pachysolen tannophilus
Nuclear', 'SGC2', 'Yeast Mitochondrial', 'SGC5',
'Euplotid Nuclear', 'Scenedesmus obliquus
Mitochondrial', 'Peritrich Nuclear', 'Archaeal',
'Coelenterate Mitochondrial', 'Bacterial', 'Mold
Mitochondrial', 'SGC3', 'Hexamita Nuclear',
'Pterobranchia Mitochondrial', 'Plant Plastid',
'Condylostoma Nuclear', 'Blastocrithidia Nuclear',
'Gracilibacteria', 'Alternative Flatworm
Mitochondrial', 'Echinoderm Mitochondrial',
'Invertebrate Mitochondrial', 'SGC0', 'Candidate
Division SR1', 'Dasycladacean Nuclear', 'SGC4',
'Flatworm Mitochondrial', 'SGC8', 'Thraustochytrium
Mitochondrial', 'SGC1', 'Spiroplasma', 'Mycoplasma',
'Standard', 'Karyorelict Nuclear'}. By default, "SGC1"
is assigned to mitochondrial chromosomes. (default:
[])
--start-codons [START_CODONS [START_CODONS ...]]
Default start codon(s) to use for novel ORF
translation. Defaults to ["ATG"]. (default: ['ATG'])
--chr-start-codons [CHR_START_CODONS [CHR_START_CODONS ...]]
Chromosome specific start codon(s). For example,
"chrM:ATG,ATA,ATT".By defualt, mitochondrial
chromosome name is automatically inferred andstart
codon "ATG", "ATA", "ATT", "ATC" and "GTG" are
assigned to it. (default: [])
-p <file>, --proteome-fasta <file>
Path to the translated protein sequence FASTA file.
Only ENSEMBL and GENCODE are supported. Its version
must be the same as genome FASTA and annotation GTF.
(default: None)
--invalid-protein-as-noncoding
Treat any transcript that the protein sequence is
invalid ( contains the * symbol) as noncoding.
(default: False)
--index-dir [<file>] Path to the directory of index files generated by
moPepGen generateIndex. If given, --genome-fasta,
--proteome-fasta and --anntotation-gtf will be
ignored. (default: None)
Cleavage Parameters:
-c <value>, --cleavage-rule <value>
Enzymatic cleavage rule. (default: trypsin)
--cleavage-exception <value>
Enzymatic cleavage exception. (default: auto)
-m <number>, --miscleavage <number>
Number of cleavages to allow per non-canonical
peptide. (default: 2)
-w <number>, --min-mw <number>
The minimal molecular weight of the non-canonical
peptides. (default: 500.0)
-l <number>, --min-length <number>
The minimal length of non-canonical peptides,
inclusive. (default: 7)
-x <number>, --max-length <number>
The maximum length of non-canonical peptides,
inclusive. (default: 25)
Arguments
-h, --help
show this help message and exit
-i, --input-path ['<files>'] Path
File path to GVF files. Must be generated by any of the moPepGen parsers. Can take multiple files. Valid formats: ['.gvf']
-o, --output-path <file> Path
File path to the output file. Valid formats: ['.fasta', '.fa']
--graph-output-dir <file> Path
Directory path that graph data are saved to. Graph data are not saved if this is not given.
--max-adjacent-as-mnv int
Max number of adjacent variants that should be merged.
Default: 2
--selenocysteine-termination
Include peptides of selenoproteins in which UGA is treated as a termination codon instead of Sec.
Default: False
--w2f-reassignment
Include peptides with W > F (Tryptophan to Phenylalanine) reassignment.
Default: False
--backsplicing-only
For circRNA, only keep noncanonical peptides spanning the backsplicing site.
Default: False
--coding-novel-orf
Find alternative start site for coding transcripts.
Default: False
--max-variants-per-node <number> int
Maximal number of variants per node. This argument can be useful when there are local regions that are heavily mutated. When creating the cleavage graph, nodes containing more variants than this value are skipped. Setting it to -1 disables this check.
Default: (7,)
--additional-variants-per-misc <number> int
Additional variants allowed for every miscleavage. This argument is used together with --max-variants-per-node to handle hypermutated regions. Setting to -1 will avoid checking for this.
Default: (2,)
--in-bubble-cap-step-down <number> int
In bubble variant caps default step down.
Default: 0
--min-nodes-to-collapse <number> int
When making the cleavage graph, the minimal number of nodes to trigger pop collapse.
Default: 30
--naa-to-collapse <number> int
The number of bases used for pop collapse.
Default: 5
--noncanonical-transcripts
Process only noncanonical transcripts of fusion transcripts and circRNA. Canonical transcripts are skipped.
Default: False
--timeout-seconds int
Time out in seconds for each transcript.
Default: 1800
--threads <number> int
Set number of threads to be used.
Default: 1
--skip-failed
When set, the failed records will be skipped.
Default: False
-g, --genome-fasta <file> Path
Path to the genome assembly FASTA file. Only ENSEMBL and GENCODE are supported. Its version must be the same as the annotation GTF and proteome FASTA
-a, --annotation-gtf <file> Path
Path to the annotation GTF file. Only ENSEMBL and GENCODE are supported. Its version must be the same as the genome and proteome FASTA.
--reference-source str
Source of reference genome and annotation.
Choices: ['GENCODE', 'ENSEMBL']
--codon-table str
Codon table. Defaults to "Standard". The supported codon tables are listed under Choices.
Default: Standard
Choices: {'Alternative Yeast Nuclear', 'Protozoan Mitochondrial', 'Vertebrate Mitochondrial', 'Blepharisma Macronuclear', 'Chlorophycean Mitochondrial', 'Ascidian Mitochondrial', 'Ciliate Nuclear', 'Mesodinium Nuclear', 'Balanophoraceae Plastid', 'SGC9', 'Cephalodiscidae Mitochondrial', 'Trematode Mitochondrial', 'Pachysolen tannophilus Nuclear', 'SGC2', 'Yeast Mitochondrial', 'SGC5', 'Euplotid Nuclear', 'Scenedesmus obliquus Mitochondrial', 'Peritrich Nuclear', 'Archaeal', 'Coelenterate Mitochondrial', 'Bacterial', 'Mold Mitochondrial', 'SGC3', 'Hexamita Nuclear', 'Pterobranchia Mitochondrial', 'Plant Plastid', 'Condylostoma Nuclear', 'Blastocrithidia Nuclear', 'Gracilibacteria', 'Alternative Flatworm Mitochondrial', 'Echinoderm Mitochondrial', 'Invertebrate Mitochondrial', 'SGC0', 'Candidate Division SR1', 'Dasycladacean Nuclear', 'SGC4', 'Flatworm Mitochondrial', 'SGC8', 'Thraustochytrium Mitochondrial', 'SGC1', 'Spiroplasma', 'Mycoplasma', 'Standard', 'Karyorelict Nuclear'}
--chr-codon-table str
Chromosome specific codon table. Must be specified in the format of "chrM:SGC1", where "chrM" is the chromosome name and "SGC1" is the codon table to use to translate genes on chrM. Supported codon tables: {'Alternative Yeast Nuclear', 'Protozoan Mitochondrial', 'Vertebrate Mitochondrial', 'Blepharisma Macronuclear', 'Chlorophycean Mitochondrial', 'Ascidian Mitochondrial', 'Ciliate Nuclear', 'Mesodinium Nuclear', 'Balanophoraceae Plastid', 'SGC9', 'Cephalodiscidae Mitochondrial', 'Trematode Mitochondrial', 'Pachysolen tannophilus Nuclear', 'SGC2', 'Yeast Mitochondrial', 'SGC5', 'Euplotid Nuclear', 'Scenedesmus obliquus Mitochondrial', 'Peritrich Nuclear', 'Archaeal', 'Coelenterate Mitochondrial', 'Bacterial', 'Mold Mitochondrial', 'SGC3', 'Hexamita Nuclear', 'Pterobranchia Mitochondrial', 'Plant Plastid', 'Condylostoma Nuclear', 'Blastocrithidia Nuclear', 'Gracilibacteria', 'Alternative Flatworm Mitochondrial', 'Echinoderm Mitochondrial', 'Invertebrate Mitochondrial', 'SGC0', 'Candidate Division SR1', 'Dasycladacean Nuclear', 'SGC4', 'Flatworm Mitochondrial', 'SGC8', 'Thraustochytrium Mitochondrial', 'SGC1', 'Spiroplasma', 'Mycoplasma', 'Standard', 'Karyorelict Nuclear'}. By default, "SGC1" is assigned to mitochondrial chromosomes.
Default: []
--start-codons str
Default start codon(s) to use for novel ORF translation. Defaults to ["ATG"].
Default: ['ATG']
--chr-start-codons str
Chromosome specific start codon(s). For example, "chrM:ATG,ATA,ATT". By default, the mitochondrial chromosome name is automatically inferred and start codons "ATG", "ATA", "ATT", "ATC" and "GTG" are assigned to it.
Default: []
-p, --proteome-fasta <file> Path
Path to the translated protein sequence FASTA file. Only ENSEMBL and GENCODE are supported. Its version must be the same as genome FASTA and annotation GTF.
--invalid-protein-as-noncoding
Treat any transcript whose protein sequence is invalid (contains the * symbol) as noncoding.
Default: False
--index-dir <file> Path
Path to the directory of index files generated by moPepGen generateIndex. If given, --genome-fasta, --proteome-fasta and --annotation-gtf will be ignored.
-c, --cleavage-rule <value> str
Enzymatic cleavage rule.
Default: trypsin
Choices: ['arg-c', 'asp-n', 'bnps-skatole', 'caspase 1', 'caspase 2', 'caspase 3', 'caspase 4', 'caspase 5', 'caspase 6', 'caspase 7', 'caspase 8', 'caspase 9', 'caspase 10', 'chymotrypsin high specificity', 'chymotrypsin low specificity', 'clostripain', 'cnbr', 'enterokinase', 'factor xa', 'formic acid', 'glutamyl endopeptidase', 'granzyme b', 'hydroxylamine', 'iodosobenzoic acid', 'lysc', 'lysn', 'ntcb', 'pepsin ph1.3', 'pepsin ph2.0', 'proline endopeptidase', 'proteinase k', 'staphylococcal peptidase i', 'thermolysin', 'thrombin', 'trypsin', 'trypsin_exception']
--cleavage-exception <value> str
Enzymatic cleavage exception.
Default: auto
-m, --miscleavage <number> int
Number of miscleavages to allow per non-canonical peptide.
Default: 2
-w, --min-mw <number> float
The minimal molecular weight of the non-canonical peptides.
Default: 500.0
-l, --min-length <number> int
The minimal length of non-canonical peptides, inclusive.
Default: 7
-x, --max-length <number> int
The maximum length of non-canonical peptides, inclusive.
Default: 25
--debug-level <value|number> str
Debug level.
Default: INFO
-q, --quiet
Quiet
Default: False
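The interaction of the cleavage parameters (--miscleavage, --min-length, --max-length) can be illustrated with a small digestion sketch. It assumes a common simplified trypsin rule (cleave after K or R unless the next residue is P), which may differ from the rule set moPepGen applies for 'trypsin', and it omits the --min-mw filter. The function names and the example protein are hypothetical.

```python
def cleavage_sites(protein):
    """Positions at which the protein is cut (simplified trypsin rule):
    after K or R, unless the following residue is P."""
    return [i + 1 for i, aa in enumerate(protein[:-1])
            if aa in "KR" and protein[i + 1] != "P"]


def digest(protein, miscleavage=2, min_length=7, max_length=25):
    """Enumerate peptides with up to `miscleavage` missed cleavage sites,
    keeping those whose length is within [min_length, max_length]."""
    bounds = [0, *cleavage_sites(protein), len(protein)]
    peptides = set()
    for i in range(len(bounds) - 1):
        # A peptide spanning k + 1 fragments has k missed cleavages.
        last = min(i + 2 + miscleavage, len(bounds))
        for j in range(i + 1, last):
            peptide = protein[bounds[i]:bounds[j]]
            if min_length <= len(peptide) <= max_length:
                peptides.add(peptide)
    return peptides


protein = "MAGSWKLPEPTIDERAAK"
fully_cleaved = digest(protein, miscleavage=0)
one_missed = digest(protein, miscleavage=1)
```

Raising the miscleavage allowance increases the number of candidate peptides, while the length bounds remove fragments that are too short or too long to be useful in a search database.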