generateIndex

generateIndex takes the reference genome FASTA, the annotation GTF, and the translated proteome FASTA, and processes them so that subsequent moPepGen commands can read them quickly. The output index files also contain the canonical peptide pool and can be used by any moPepGen command. We recommend running generateIndex before any moPepGen analysis, so that the reference files are not reprocessed on every run, which saves considerable time.

Reference Version

The versions of the reference genome FASTA, proteome FASTA, and annotation GTF MUST be consistent across all analyses.

Usage

usage: moPepGen generateIndex [-h] -o <file> [--gtf-symlink] [-f] [-g <file>]
                              [-a <file>]
                              [--reference-source {GENCODE,ENSEMBL}]
                              [--codon-table {Alternative Yeast Nuclear,Protozoan Mitochondrial,Vertebrate Mitochondrial,Blepharisma Macronuclear,Chlorophycean Mitochondrial,Ascidian Mitochondrial,Ciliate Nuclear,Mesodinium Nuclear,Balanophoraceae Plastid,SGC9,Cephalodiscidae Mitochondrial,Trematode Mitochondrial,Pachysolen tannophilus Nuclear,SGC2,Yeast Mitochondrial,SGC5,Euplotid Nuclear,Scenedesmus obliquus Mitochondrial,Peritrich Nuclear,Archaeal,Coelenterate Mitochondrial,Bacterial,Mold Mitochondrial,SGC3,Hexamita Nuclear,Pterobranchia Mitochondrial,Plant Plastid,Condylostoma Nuclear,Blastocrithidia Nuclear,Gracilibacteria,Alternative Flatworm Mitochondrial,Echinoderm Mitochondrial,Invertebrate Mitochondrial,SGC0,Candidate Division SR1,Dasycladacean Nuclear,SGC4,Flatworm Mitochondrial,SGC8,Thraustochytrium Mitochondrial,SGC1,Spiroplasma,Mycoplasma,Standard,Karyorelict Nuclear}]
                              [--chr-codon-table [CHR_CODON_TABLE [CHR_CODON_TABLE ...]]]
                              [--start-codons [START_CODONS [START_CODONS ...]]]
                              [--chr-start-codons [CHR_START_CODONS [CHR_START_CODONS ...]]]
                              [-p <file>] [--invalid-protein-as-noncoding]
                              [-c <value>] [--cleavage-exception <value>]
                              [-m <number>] [-w <number>] [-l <number>]
                              [-x <number>] [--debug-level <value|number>]
                              [-q]

Generate genome and proteome index files for moPepGen parsers and peptide
caller.

optional arguments:
  -h, --help            show this help message and exit
  -o <file>, --output-dir <file>
                        Output directory for index files. (default: None)
  --gtf-symlink         Create a symlink of the GTF file instead of copying
                        it. (default: False)
  -f, --force           Force write data to index dir. (default: False)
  --debug-level <value|number>
                        Debug level. (default: INFO)
  -q, --quiet           Quiet (default: False)

Reference Files:
  -g <file>, --genome-fasta <file>
                        Path to the genome assembly FASTA file. Only ENSEMBL
                        and GENCODE are supported. Its version must be the
                        same as the annotation GTF and proteome FASTA
                        (default: None)
  -a <file>, --annotation-gtf <file>
                        Path to the annotation GTF file. Only ENSEMBL and
                        GENCODE are supported. Its version must be the same as
                        the genome and proteome FASTA. (default: None)
  --reference-source {GENCODE,ENSEMBL}
                        Source of reference genome and annotation. (default:
                        None)
  --codon-table {Alternative Yeast Nuclear,Protozoan Mitochondrial,Vertebrate Mitochondrial,Blepharisma Macronuclear,Chlorophycean Mitochondrial,Ascidian Mitochondrial,Ciliate Nuclear,Mesodinium Nuclear,Balanophoraceae Plastid,SGC9,Cephalodiscidae Mitochondrial,Trematode Mitochondrial,Pachysolen tannophilus Nuclear,SGC2,Yeast Mitochondrial,SGC5,Euplotid Nuclear,Scenedesmus obliquus Mitochondrial,Peritrich Nuclear,Archaeal,Coelenterate Mitochondrial,Bacterial,Mold Mitochondrial,SGC3,Hexamita Nuclear,Pterobranchia Mitochondrial,Plant Plastid,Condylostoma Nuclear,Blastocrithidia Nuclear,Gracilibacteria,Alternative Flatworm Mitochondrial,Echinoderm Mitochondrial,Invertebrate Mitochondrial,SGC0,Candidate Division SR1,Dasycladacean Nuclear,SGC4,Flatworm Mitochondrial,SGC8,Thraustochytrium Mitochondrial,SGC1,Spiroplasma,Mycoplasma,Standard,Karyorelict Nuclear}
                        Codon table. Defaults to "Standard". Supported codon
                        tables: {'Alternative Yeast Nuclear', 'Protozoan
                        Mitochondrial', 'Vertebrate Mitochondrial',
                        'Blepharisma Macronuclear', 'Chlorophycean
                        Mitochondrial', 'Ascidian Mitochondrial', 'Ciliate
                        Nuclear', 'Mesodinium Nuclear', 'Balanophoraceae
                        Plastid', 'SGC9', 'Cephalodiscidae Mitochondrial',
                        'Trematode Mitochondrial', 'Pachysolen tannophilus
                        Nuclear', 'SGC2', 'Yeast Mitochondrial', 'SGC5',
                        'Euplotid Nuclear', 'Scenedesmus obliquus
                        Mitochondrial', 'Peritrich Nuclear', 'Archaeal',
                        'Coelenterate Mitochondrial', 'Bacterial', 'Mold
                        Mitochondrial', 'SGC3', 'Hexamita Nuclear',
                        'Pterobranchia Mitochondrial', 'Plant Plastid',
                        'Condylostoma Nuclear', 'Blastocrithidia Nuclear',
                        'Gracilibacteria', 'Alternative Flatworm
                        Mitochondrial', 'Echinoderm Mitochondrial',
                        'Invertebrate Mitochondrial', 'SGC0', 'Candidate
                        Division SR1', 'Dasycladacean Nuclear', 'SGC4',
                        'Flatworm Mitochondrial', 'SGC8', 'Thraustochytrium
                        Mitochondrial', 'SGC1', 'Spiroplasma', 'Mycoplasma',
                        'Standard', 'Karyorelict Nuclear'} (default: Standard)
  --chr-codon-table [CHR_CODON_TABLE [CHR_CODON_TABLE ...]]
                        Chromosome specific codon table. Must be specified in
                        the format of "chrM:SGC1", where "chrM" is the
                        chromosome name and "SGC1" is the codon table to use
                        to translate genes on chrM. Supported codon tables:
                        {'Alternative Yeast Nuclear', 'Protozoan
                        Mitochondrial', 'Vertebrate Mitochondrial',
                        'Blepharisma Macronuclear', 'Chlorophycean
                        Mitochondrial', 'Ascidian Mitochondrial', 'Ciliate
                        Nuclear', 'Mesodinium Nuclear', 'Balanophoraceae
                        Plastid', 'SGC9', 'Cephalodiscidae Mitochondrial',
                        'Trematode Mitochondrial', 'Pachysolen tannophilus
                        Nuclear', 'SGC2', 'Yeast Mitochondrial', 'SGC5',
                        'Euplotid Nuclear', 'Scenedesmus obliquus
                        Mitochondrial', 'Peritrich Nuclear', 'Archaeal',
                        'Coelenterate Mitochondrial', 'Bacterial', 'Mold
                        Mitochondrial', 'SGC3', 'Hexamita Nuclear',
                        'Pterobranchia Mitochondrial', 'Plant Plastid',
                        'Condylostoma Nuclear', 'Blastocrithidia Nuclear',
                        'Gracilibacteria', 'Alternative Flatworm
                        Mitochondrial', 'Echinoderm Mitochondrial',
                        'Invertebrate Mitochondrial', 'SGC0', 'Candidate
                        Division SR1', 'Dasycladacean Nuclear', 'SGC4',
                        'Flatworm Mitochondrial', 'SGC8', 'Thraustochytrium
                        Mitochondrial', 'SGC1', 'Spiroplasma', 'Mycoplasma',
                        'Standard', 'Karyorelict Nuclear'}. By default, "SGC1"
                        is assigned to mitochondrial chromosomes. (default:
                        [])
  --start-codons [START_CODONS [START_CODONS ...]]
                        Default start codon(s) to use for novel ORF
                        translation. Defaults to ["ATG"]. (default: ['ATG'])
  --chr-start-codons [CHR_START_CODONS [CHR_START_CODONS ...]]
                        Chromosome specific start codon(s). For example,
                        "chrM:ATG,ATA,ATT". By default, mitochondrial
                        chromosome name is automatically inferred and start
                        codon "ATG", "ATA", "ATT", "ATC" and "GTG" are
                        assigned to it. (default: [])
  -p <file>, --proteome-fasta <file>
                        Path to the translated protein sequence FASTA file.
                        Only ENSEMBL and GENCODE are supported. Its version
                        must be the same as genome FASTA and annotation GTF.
                        (default: None)
  --invalid-protein-as-noncoding
                        Treat any transcript that the protein sequence is
                        invalid ( contains the * symbol) as noncoding.
                        (default: False)

Cleavage Parameters:
  -c <value>, --cleavage-rule <value>
                        Enzymatic cleavage rule. (default: trypsin)
  --cleavage-exception <value>
                        Enzymatic cleavage exception. (default: auto)
  -m <number>, --miscleavage <number>
                        Number of cleavages to allow per non-canonical
                        peptide. (default: 2)
  -w <number>, --min-mw <number>
                        The minimal molecular weight of the non-canonical
                        peptides. (default: 500.0)
  -l <number>, --min-length <number>
                        The minimal length of non-canonical peptides,
                        inclusive. (default: 7)
  -x <number>, --max-length <number>
                        The maximum length of non-canonical peptides,
                        inclusive. (default: 25)

Arguments

-h, --help

show this help message and exit

-o, --output-dir <file> Path

Output directory for index files.

--gtf-symlink

Create a symlink of the GTF file instead of copying it.
Default: False

-f, --force

Force write data to index dir.
Default: False

-g, --genome-fasta <file> Path

Path to the genome assembly FASTA file. Only ENSEMBL and GENCODE are supported. Its version must be the same as the annotation GTF and proteome FASTA.

-a, --annotation-gtf <file> Path

Path to the annotation GTF file. Only ENSEMBL and GENCODE are supported. Its version must be the same as the genome and proteome FASTA.

--reference-source str

Source of reference genome and annotation.
Choices: ['GENCODE', 'ENSEMBL']

--codon-table str

Codon table. Defaults to "Standard".
Default: Standard
Choices: {'Alternative Yeast Nuclear', 'Protozoan Mitochondrial', 'Vertebrate Mitochondrial', 'Blepharisma Macronuclear', 'Chlorophycean Mitochondrial', 'Ascidian Mitochondrial', 'Ciliate Nuclear', 'Mesodinium Nuclear', 'Balanophoraceae Plastid', 'SGC9', 'Cephalodiscidae Mitochondrial', 'Trematode Mitochondrial', 'Pachysolen tannophilus Nuclear', 'SGC2', 'Yeast Mitochondrial', 'SGC5', 'Euplotid Nuclear', 'Scenedesmus obliquus Mitochondrial', 'Peritrich Nuclear', 'Archaeal', 'Coelenterate Mitochondrial', 'Bacterial', 'Mold Mitochondrial', 'SGC3', 'Hexamita Nuclear', 'Pterobranchia Mitochondrial', 'Plant Plastid', 'Condylostoma Nuclear', 'Blastocrithidia Nuclear', 'Gracilibacteria', 'Alternative Flatworm Mitochondrial', 'Echinoderm Mitochondrial', 'Invertebrate Mitochondrial', 'SGC0', 'Candidate Division SR1', 'Dasycladacean Nuclear', 'SGC4', 'Flatworm Mitochondrial', 'SGC8', 'Thraustochytrium Mitochondrial', 'SGC1', 'Spiroplasma', 'Mycoplasma', 'Standard', 'Karyorelict Nuclear'}

--chr-codon-table str

Chromosome-specific codon table. Must be specified in the format "chrM:SGC1", where "chrM" is the chromosome name and "SGC1" is the codon table used to translate genes on chrM. The supported codon tables are the same as the choices listed for --codon-table. By default, "SGC1" is assigned to mitochondrial chromosomes.
Default: []

--start-codons str

Default start codon(s) to use for novel ORF translation.
Default: ['ATG']

--chr-start-codons str

Chromosome-specific start codon(s). For example, "chrM:ATG,ATA,ATT". By default, the mitochondrial chromosome name is inferred automatically and the start codons "ATG", "ATA", "ATT", "ATC" and "GTG" are assigned to it.
Default: []
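
The two chromosome-specific options above share a "chrom:value" format. The following is a minimal sketch of how such strings can be split and checked. The function names are hypothetical and this is not moPepGen's own parser; it only illustrates the documented format.

```python
def split_chrom_spec(spec):
    """Split 'chrom:value' on the first colon and reject empty parts."""
    chrom, sep, value = spec.partition(":")
    if not sep or not chrom or not value:
        raise ValueError(f"expected 'chrom:value', got {spec!r}")
    return chrom, value


def parse_chr_codon_tables(specs):
    """Map each chromosome to a codon table name, e.g. 'chrM:SGC1'."""
    return dict(split_chrom_spec(spec) for spec in specs)


def parse_chr_start_codons(specs):
    """Map each chromosome to a list of start codons, e.g. 'chrM:ATG,ATA'."""
    mapping = {}
    for spec in specs:
        chrom, value = split_chrom_spec(spec)
        codons = [codon.strip().upper() for codon in value.split(",")]
        for codon in codons:
            # A codon is exactly three DNA bases.
            if len(codon) != 3 or set(codon) - set("ACGT"):
                raise ValueError(f"invalid codon {codon!r} in {spec!r}")
        mapping[chrom] = codons
    return mapping


print(parse_chr_codon_tables(["chrM:SGC1"]))         # {'chrM': 'SGC1'}
print(parse_chr_start_codons(["chrM:ATG,ATA,ATT"]))  # {'chrM': ['ATG', 'ATA', 'ATT']}
```

Checking the strings before launching a long indexing run catches malformed values such as a two-letter codon early.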

-p, --proteome-fasta <file> Path

Path to the translated protein sequence FASTA file. Only ENSEMBL and GENCODE are supported. Its version must be the same as genome FASTA and annotation GTF.

--invalid-protein-as-noncoding

Treat any transcript whose protein sequence is invalid (contains the * symbol) as noncoding.
Default: False

-c, --cleavage-rule <value> str

Enzymatic cleavage rule.
Default: trypsin
Choices: ['arg-c', 'asp-n', 'bnps-skatole', 'caspase 1', 'caspase 2', 'caspase 3', 'caspase 4', 'caspase 5', 'caspase 6', 'caspase 7', 'caspase 8', 'caspase 9', 'caspase 10', 'chymotrypsin high specificity', 'chymotrypsin low specificity', 'clostripain', 'cnbr', 'enterokinase', 'factor xa', 'formic acid', 'glutamyl endopeptidase', 'granzyme b', 'hydroxylamine', 'iodosobenzoic acid', 'lysc', 'lysn', 'ntcb', 'pepsin ph1.3', 'pepsin ph2.0', 'proline endopeptidase', 'proteinase k', 'staphylococcal peptidase i', 'thermolysin', 'thrombin', 'trypsin', 'trypsin_exception']

--cleavage-exception <value> str

Enzymatic cleavage exception.
Default: auto

-m, --miscleavage <number> int

Number of missed cleavages to allow per non-canonical peptide.
Default: 2

-w, --min-mw <number> float

The minimum molecular weight of the non-canonical peptides.
Default: 500.0

-l, --min-length <number> int

The minimum length of non-canonical peptides, inclusive.
Default: 7

-x, --max-length <number> int

The maximum length of non-canonical peptides, inclusive.
Default: 25
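
To make the cleavage parameters above concrete, here is a minimal sketch of an in-silico digest. It assumes a simplified trypsin rule (cleave after K or R unless the next residue is P) and applies only the miscleavage and inclusive length limits. The function name is hypothetical, the molecular weight filter is omitted, and moPepGen's actual cleavage rules and exceptions may differ.

```python
import re


def digest(protein, miscleavage=2, min_length=7, max_length=25):
    """Return the set of peptides from a simplified tryptic digest."""
    # Cleavage sites: after K or R unless the next residue is P.
    sites = [m.end() for m in re.finditer(r"[KR](?!P)", protein)]
    # Fragment boundaries, including both protein termini.
    bounds = [0] + [s for s in sites if s < len(protein)] + [len(protein)]
    peptides = set()
    for i in range(len(bounds) - 1):
        # A peptide from bounds[i] to bounds[j] skips j - i - 1 sites.
        for j in range(i + 1, min(i + 2 + miscleavage, len(bounds))):
            peptide = protein[bounds[i]:bounds[j]]
            # Both length limits are inclusive.
            if min_length <= len(peptide) <= max_length:
                peptides.add(peptide)
    return peptides


# Toy sequence with two cleavage sites (after K and after R).
toy = "MAAAAAAKGGGGGGGRLLLLLLLL"
print(sorted(digest(toy, miscleavage=0)))
print(len(digest(toy, miscleavage=2)))
```

With no missed cleavages the toy sequence yields three peptides of length 8. Allowing up to two missed cleavages adds the two 16-residue peptides and the full 24-residue sequence, all of which fall within the default length limits.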

--debug-level <value|number> str

Debug level.
Default: INFO

-q, --quiet

Quiet
Default: False

Input

Three files are required for this command:

  1. Reference genome FASTA file.
  2. Genome annotation GTF file.
  3. Protein sequence FASTA file.

All three files must be downloaded from the same release version of either GENCODE or ENSEMBL. moPepGen does not support reference files from other databases (e.g. RefSeq) at the moment.

For GENCODE, the primary assembly is recommended (e.g. 'GRCh38.primary_assembly.genome.fa'). For annotation, we recommend using the comprehensive gene annotation file for the primary assembly, which matches the genome FASTA (e.g. 'gencode.vXX.primary_assembly.annotation.gtf'). The version-matched protein sequence should also be downloaded from GENCODE (e.g. 'gencode.vXX.pc_translations.fa').
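
Because GENCODE file names carry the release number, a quick sanity check can catch a mismatched GTF and proteome FASTA before indexing. The sketch below is a heuristic only: the helper name and the example file names are hypothetical, and the genome FASTA name carries no release number, so it cannot be checked this way.

```python
import re


def gencode_release(filename):
    """Return the release tag from a GENCODE-style file name, or None."""
    match = re.search(r"gencode\.v([A-Za-z]?\d+)\.", filename)
    return match.group(1) if match else None


# Hypothetical file names, used for illustration only.
gtf = "gencode.v44.primary_assembly.annotation.gtf"
proteome = "gencode.v44.pc_translations.fa"

releases = {gencode_release(gtf), gencode_release(proteome)}
# Fail if a release tag is missing or if the two tags differ.
if None in releases or len(releases) != 1:
    raise ValueError("GTF and proteome FASTA releases do not match")
print("release:", gencode_release(gtf))  # release: 44
```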

Similarly, for ENSEMBL, we recommend using the primary genome assembly and its annotation. The genome FASTA file should resemble 'Homo_sapiens.GRCh38.dna.primary_assembly.fa', the GTF should look like 'Homo_sapiens.GRCh38.XX.chr_patch_hapl_scaff.gtf', and the protein sequence file should look like 'Homo_sapiens.GRCh38.pep.all.fa'.
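
Putting the three inputs together, a generateIndex call can be assembled from the options listed in the usage section. This is a minimal sketch rather than an official recipe: the paths reuse the ENSEMBL file name patterns above as placeholders, the output directory name is hypothetical, and the command is only printed for review, not executed.

```python
import shlex

# Placeholder paths following the ENSEMBL naming patterns above.
genome = "Homo_sapiens.GRCh38.dna.primary_assembly.fa"
gtf = "Homo_sapiens.GRCh38.XX.chr_patch_hapl_scaff.gtf"
proteome = "Homo_sapiens.GRCh38.pep.all.fa"
index_dir = "index"  # hypothetical output directory

# Only options that appear in the usage section are used here.
cmd = [
    "moPepGen", "generateIndex",
    "--genome-fasta", genome,
    "--annotation-gtf", gtf,
    "--proteome-fasta", proteome,
    "--reference-source", "ENSEMBL",
    "--output-dir", index_dir,
]

# Print the shell-quoted command; once the paths are verified it can be
# run, for example with subprocess.run(cmd, check=True).
print(shlex.join(cmd))
```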

Output

Users usually don't need to worry about the output files of this command. As long as the correct path is provided to subsequent moPepGen commands, the correct index files will be recognized.

The following files are created by this command:

File Name                   Description
genome.pkl                  The entire reference genome.
annotation.gtf              A copy of the input annotation GTF file. If --gtf-symlink is used (default: False), this is a symlink pointing to the input file.
annotation_gene.idx         A text file with the location of each gene in the GTF file.
annotation_tx.idx           A text file with the location of each transcript in the GTF file.
proteome.pkl                All protein sequences.
canonical_peptides_001.pkl  Nonredundant canonical peptides from the input proteome.
coding_transcripts.pkl      All coding transcripts.
metadata.json               Metadata, including software versions and the enzyme and cleavage parameters used to generate the canonical peptide pool.
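
If a downstream command cannot find the index, it can help to check that the directory holds every file listed above. The following is a minimal sketch with a hypothetical helper name; the expected file names are copied from the list in this section and may differ between moPepGen versions.

```python
from pathlib import Path

# File names copied from the output file list in this section.
EXPECTED_FILES = [
    "genome.pkl",
    "annotation.gtf",
    "annotation_gene.idx",
    "annotation_tx.idx",
    "proteome.pkl",
    "canonical_peptides_001.pkl",
    "coding_transcripts.pkl",
    "metadata.json",
]


def missing_index_files(index_dir):
    """Return the expected index files that are absent from index_dir."""
    root = Path(index_dir)
    # Path.exists() follows symlinks, so a broken annotation.gtf symlink
    # (possible with --gtf-symlink) is also reported as missing.
    return [name for name in EXPECTED_FILES if not (root / name).exists()]


# "index" is a hypothetical output directory name.
print(missing_index_files("index"))
```

An empty list means every listed file is present; it does not verify the contents of the files.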