callNovelORF

callNovelORF calls noncanonical peptide sequences from novel ORFs. It finds every start codon in each novel ORF gene.
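To make "finds every start codon" concrete, here is a minimal Python sketch. It is an illustration only, not moPepGen's implementation: the function names `find_start_codons` and `orf_end` and the toy transcript are hypothetical, and it assumes ATG is the only start codon and TAA/TAG/TGA are the stop codons.

```python
from typing import List, Optional

STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_start_codons(tx_seq: str) -> List[int]:
    """Return the 0-based position of every ATG in the transcript, in any frame."""
    seq = tx_seq.upper()
    return [i for i in range(len(seq) - 2) if seq[i:i + 3] == "ATG"]


def orf_end(tx_seq: str, start: int) -> Optional[int]:
    """Walk codon by codon from `start`.

    Returns the index just past the first in-frame stop codon,
    or None if the transcript ends before a stop codon is reached.
    """
    seq = tx_seq.upper()
    for i in range(start, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            return i + 3
    return None


# Hypothetical toy transcript with three start codons.
tx = "CCATGGCATGAAATGCTTTAAGG"
starts = find_start_codons(tx)
for s in starts:
    print(s, orf_end(tx, s))
```

Each start codon defines its own candidate ORF, which is why a single transcript can yield several ORFs in different reading frames.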

Reference Version

The versions of the reference genome FASTA, proteome FASTA, and annotation GTF MUST be consistent across all analyses.

Usage

usage: moPepGen callNovelORF [-h] -o <file> [--output-orf <file>]
                             [--min-tx-length <number>]
                             [--orf-assignment <choice>] [--coding-novel-orf]
                             [--w2f-reassignment]
                             [--inclusion-biotypes <file>]
                             [--exclusion-biotypes <file>] [-g <file>]
                             [-a <file>]
                             [--reference-source {GENCODE,ENSEMBL}]
                             [-p <file>] [--invalid-protein-as-noncoding]
                             [--index-dir [<file>]] [-c <value>]
                             [--cleavage-exception <value>] [-m <number>]
                             [-w <number>] [-l <number>] [-x <number>]
                             [--debug-level <value|number>] [-q]

optional arguments:
  -h, --help            show this help message and exit
  -o <file>, --output-path <file>
                        Output path to the novel ORF peptide FASTA. Valid
                        formats: ['.fa', '.fasta'] (default: None)
  --output-orf <file>   Output path to the FASTA file with novel ORF
                        sequences. Valid formats: ['.fa', '.fasta'] (default:
                        None)
  --min-tx-length <number>
                        Minimal transcript length. (default: 21)
  --orf-assignment <choice>
                        Defines how ORF assignment should be done. The last
                        ORF upstream to the peptide is used for `max` and the
                        first (most upstream) one is used for `min` (default:
                        max)
  --coding-novel-orf    Include coding transcripts to find alternative ORFs.
                        (default: False)
  --w2f-reassignment    Include peptides with W > F (Tryptophan to
                        Phenylalanine) reassignment. (default: False)
  --inclusion-biotypes <file>
                        Inclusion biotype list. (default: None)
  --exclusion-biotypes <file>
                        Exclusion biotype list. (default: None)
  --debug-level <value|number>
                        Debug level. (default: INFO)
  -q, --quiet           Quiet (default: False)

Reference Files:
  -g <file>, --genome-fasta <file>
                        Path to the genome assembly FASTA file. Only ENSEMBL
                        and GENCODE are supported. Its version must be the
                        same as the annotation GTF and proteome FASTA
                        (default: None)
  -a <file>, --annotation-gtf <file>
                        Path to the annotation GTF file. Only ENSEMBL and
                        GENCODE are supported. Its version must be the same as
                        the genome and proteome FASTA. (default: None)
  --reference-source {GENCODE,ENSEMBL}
                        Source of reference genome and annotation. (default:
                        None)
  -p <file>, --proteome-fasta <file>
                        Path to the translated protein sequence FASTA file.
                        Only ENSEMBL and GENCODE are supported. Its version
                        must be the same as genome FASTA and annotation GTF.
                        (default: None)
  --invalid-protein-as-noncoding
                        Treat any transcript that the protein sequence is
                        invalid ( contains the * symbol) as noncoding.
                        (default: False)
  --index-dir [<file>]  Path to the directory of index files generated by
                        moPepGen generateIndex. If given, --genome-fasta,
                        --proteome-fasta and --annotation-gtf will be
                        ignored. (default: None)

Cleavage Parameters:
  -c <value>, --cleavage-rule <value>
                        Enzymatic cleavage rule. (default: trypsin)
  --cleavage-exception <value>
                        Enzymatic cleavage exception. (default: auto)
  -m <number>, --miscleavage <number>
                        Number of cleavages to allow per non-canonical
                        peptide. (default: 2)
  -w <number>, --min-mw <number>
                        The minimal molecular weight of the non-canonical
                        peptides. (default: 500.0)
  -l <number>, --min-length <number>
                        The minimal length of non-canonical peptides,
                        inclusive. (default: 7)
  -x <number>, --max-length <number>
                        The maximum length of non-canonical peptides,
                        inclusive. (default: 25)
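As a rough illustration of how the options above combine, the following Python sketch assembles an argument list for `moPepGen callNovelORF` using only flags shown in the help text. The helper name `build_call_novel_orf_cmd` and the file paths are hypothetical placeholders; the sketch only prints the command and does not run it.

```python
import shlex
from typing import List


def build_call_novel_orf_cmd(
    output_path: str,
    index_dir: str,
    min_tx_length: int = 21,
    orf_assignment: str = "max",
) -> List[str]:
    """Assemble a callNovelORF command line from flags listed in the help text."""
    # The help text lists '.fa' and '.fasta' as the valid output formats.
    if not output_path.endswith((".fa", ".fasta")):
        raise ValueError("output path must end with .fa or .fasta")
    if orf_assignment not in ("max", "min"):
        raise ValueError("orf_assignment must be 'max' or 'min'")
    return [
        "moPepGen", "callNovelORF",
        "--index-dir", index_dir,
        "--output-path", output_path,
        "--min-tx-length", str(min_tx_length),
        "--orf-assignment", orf_assignment,
    ]


# Placeholder paths; replace with real locations.
cmd = build_call_novel_orf_cmd("novel_orf_peptides.fasta", "path/to/index")
print(shlex.join(cmd))
# To execute it, pass the list to subprocess.run(cmd, check=True).
```

Because `--index-dir` is given, the separate genome, proteome, and annotation paths are left out, in line with the help text stating that they would be ignored.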

Arguments

-h, --help

show this help message and exit

-o, --output-path <file> Path

Output path to the novel ORF peptide FASTA. Valid formats: ['.fa', '.fasta']

--output-orf <file> Path

Output path to the FASTA file with novel ORF sequences. Valid formats: ['.fa', '.fasta']

--min-tx-length <number> int

Minimal transcript length.
Default: 21

--orf-assignment <choice> str

Defines how ORF assignment should be done. With `max`, the last ORF upstream of the peptide is used; with `min`, the first (most upstream) one is used.
Default: max
Choices: ['max', 'min']
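The difference between the two choices can be sketched in Python as follows. This is an illustration under stated assumptions, not the tool's code: the function `assign_orf` and the coordinates are hypothetical, positions are 0-based nucleotide offsets, an ORF is treated as a candidate only when its start is upstream of and in frame with the peptide, and intervening stop codons are ignored for brevity.

```python
from typing import Iterable, Optional


def assign_orf(orf_starts: Iterable[int], peptide_start: int,
               mode: str = "max") -> Optional[int]:
    """Pick the ORF start to report for a peptide.

    'max' returns the last candidate start upstream of the peptide,
    'min' returns the first (most upstream) candidate start.
    """
    if mode not in ("max", "min"):
        raise ValueError("mode must be 'max' or 'min'")
    # Assumption: candidates must be upstream of and in frame with the peptide.
    candidates = sorted(
        s for s in orf_starts
        if s <= peptide_start and (peptide_start - s) % 3 == 0
    )
    if not candidates:
        return None
    return candidates[-1] if mode == "max" else candidates[0]


orf_starts = [2, 7, 12, 20]  # hypothetical start codon positions
print(assign_orf(orf_starts, 26, "max"))
print(assign_orf(orf_starts, 26, "min"))
```

In this toy case only the starts at 2 and 20 are in frame with the peptide at position 26, so `max` selects 20 and `min` selects 2.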

--coding-novel-orf

Include coding transcripts to find alternative ORFs.
Default: False

--w2f-reassignment

Include peptides with W > F (Tryptophan to Phenylalanine) reassignment.
Default: False
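A W > F reassignment replaces a tryptophan with a phenylalanine in the peptide. The Python sketch below enumerates such variants for one peptide. It assumes one substitution per variant, which may differ from how moPepGen enumerates them; the function name `w2f_variants` and the example peptide are hypothetical.

```python
from typing import List


def w2f_variants(peptide: str) -> List[str]:
    """Return copies of `peptide` in which exactly one W is replaced by F."""
    return [
        peptide[:i] + "F" + peptide[i + 1:]
        for i, aa in enumerate(peptide)
        if aa == "W"
    ]


print(w2f_variants("SWAWK"))
```

A peptide without tryptophan yields an empty list, so enabling the option can only add peptides for sequences that contain at least one W.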

--inclusion-biotypes <file> Path

Inclusion biotype list.

--exclusion-biotypes <file> Path

Exclusion biotype list.

-g, --genome-fasta <file> Path

Path to the genome assembly FASTA file. Only ENSEMBL and GENCODE are supported. Its version must be the same as the annotation GTF and proteome FASTA.

-a, --annotation-gtf <file> Path

Path to the annotation GTF file. Only ENSEMBL and GENCODE are supported. Its version must be the same as the genome and proteome FASTA.

--reference-source str

Source of reference genome and annotation.
Choices: ['GENCODE', 'ENSEMBL']

-p, --proteome-fasta <file> Path

Path to the translated protein sequence FASTA file. Only ENSEMBL and GENCODE are supported. Its version must be the same as genome FASTA and annotation GTF.

--invalid-protein-as-noncoding

Treat any transcript whose protein sequence is invalid (contains the * symbol) as noncoding.
Default: False

--index-dir <file> Path

Path to the directory of index files generated by moPepGen generateIndex. If given, --genome-fasta, --proteome-fasta and --annotation-gtf will be ignored.

-c, --cleavage-rule <value> str

Enzymatic cleavage rule.
Default: trypsin
Choices: ['arg-c', 'asp-n', 'bnps-skatole', 'caspase 1', 'caspase 2', 'caspase 3', 'caspase 4', 'caspase 5', 'caspase 6', 'caspase 7', 'caspase 8', 'caspase 9', 'caspase 10', 'chymotrypsin high specificity', 'chymotrypsin low specificity', 'clostripain', 'cnbr', 'enterokinase', 'factor xa', 'formic acid', 'glutamyl endopeptidase', 'granzyme b', 'hydroxylamine', 'iodosobenzoic acid', 'lysc', 'lysn', 'ntcb', 'pepsin ph1.3', 'pepsin ph2.0', 'proline endopeptidase', 'proteinase k', 'staphylococcal peptidase i', 'thermolysin', 'thrombin', 'trypsin', 'trypsin_exception']

--cleavage-exception <value> str

Enzymatic cleavage exception.
Default: auto

-m, --miscleavage <number> int

Number of missed cleavages to allow per non-canonical peptide.
Default: 2

-w, --min-mw <number> float

The minimal molecular weight of the non-canonical peptides.
Default: 500.0

-l, --min-length <number> int

The minimal length of non-canonical peptides, inclusive.
Default: 7

-x, --max-length <number> int

The maximum length of non-canonical peptides, inclusive.
Default: 25
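The cleavage parameters interact as sketched below in Python. This is a simplified illustration, not moPepGen's digestion code: it assumes a basic trypsin rule (cleave after K or R unless the next residue is P), uses the documented defaults for miscleavages and length limits, and omits the molecular weight filter. The function name `tryptic_peptides` and the example protein are hypothetical.

```python
from typing import List


def tryptic_peptides(protein: str, miscleavage: int = 2,
                     min_length: int = 7, max_length: int = 25) -> List[str]:
    """Digest `protein` and keep peptides within the inclusive length limits."""
    # Cleavage sites: positions right after K or R, unless followed by P.
    sites = [
        i + 1
        for i, aa in enumerate(protein[:-1])
        if aa in "KR" and protein[i + 1] != "P"
    ]
    bounds = [0] + sites + [len(protein)]
    peptides = []
    for i in range(len(bounds) - 1):
        # A peptide from bounds[i] to bounds[j] skips j - i - 1 cleavage sites.
        last_j = min(i + miscleavage + 1, len(bounds) - 1)
        for j in range(i + 1, last_j + 1):
            pep = protein[bounds[i]:bounds[j]]
            if min_length <= len(pep) <= max_length:
                peptides.append(pep)
    return peptides


protein = "AAAAAAAKCCCCCCCRDDDDDDD"  # toy sequence with two cleavage sites
print(tryptic_peptides(protein, miscleavage=0))
print(len(tryptic_peptides(protein)))
```

Raising the miscleavage count adds longer peptides that span skipped cleavage sites, while the length limits trim the result from both ends.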

--debug-level <value|number> str

Debug level.
Default: INFO

-q, --quiet

Quiet
Default: False