generateIndex

generateIndex takes the reference genome FASTA, the annotation GTF, and the translated proteome FASTA, and processes them so that subsequent moPepGen commands can read them quickly. The output index files also contain the canonical peptide pool and can be used by any moPepGen command. We recommend running generateIndex before any moPepGen analysis: it avoids reprocessing the reference files on every run and saves substantial time.

Reference Version

The versions of the reference genome FASTA, proteome FASTA, and annotation GTF MUST be consistent across all analyses.

Usage

usage: moPepGen generateIndex [-h] -o <file> [--gtf-symlink] [-f] [-g <file>]
                              [-a <file>]
                              [--reference-source {GENCODE,ENSEMBL}]
                              [-p <file>] [--invalid-protein-as-noncoding]
                              [-c <value>] [--cleavage-exception <value>]
                              [-m <number>] [-w <number>] [-l <number>]
                              [-x <number>] [--debug-level <value|number>]
                              [-q]

Generate genome and proteome index files for moPepGen parsers and peptide
caller.

optional arguments:
  -h, --help            show this help message and exit
  -o <file>, --output-dir <file>
                        Ouput directory for index files. (default: None)
  --gtf-symlink         Create a symlink of the GTF file instead of copying
                        it. (default: False)
  -f, --force           Force write data to index dir. (default: False)
  --debug-level <value|number>
                        Debug level. (default: INFO)
  -q, --quiet           Quiet (default: False)

Reference Files:
  -g <file>, --genome-fasta <file>
                        Path to the genome assembly FASTA file. Only ENSEMBL
                        and GENCODE are supported. Its version must be the
                        same as the annotation GTF and proteome FASTA
                        (default: None)
  -a <file>, --annotation-gtf <file>
                        Path to the annotation GTF file. Only ENSEMBL and
                        GENCODE are supported. Its version must be the same as
                        the genome and proteome FASTA. (default: None)
  --reference-source {GENCODE,ENSEMBL}
                        Source of reference genome and annotation. (default:
                        None)
  -p <file>, --proteome-fasta <file>
                        Path to the translated protein sequence FASTA file.
                        Only ENSEMBL and GENCODE are supported. Its version
                        must be the same as genome FASTA and annotation GTF.
                        (default: None)
  --invalid-protein-as-noncoding
                        Treat any transcript that the protein sequence is
                        invalid ( contains the * symbol) as noncoding.
                        (default: False)

Cleavage Parameters:
  -c <value>, --cleavage-rule <value>
                        Enzymatic cleavage rule. (default: trypsin)
  --cleavage-exception <value>
                        Enzymatic cleavage exception. (default: auto)
  -m <number>, --miscleavage <number>
                        Number of cleavages to allow per non-canonical
                        peptide. (default: 2)
  -w <number>, --min-mw <number>
                        The minimal molecular weight of the non-canonical
                        peptides. (default: 500.0)
  -l <number>, --min-length <number>
                        The minimal length of non-canonical peptides,
                        inclusive. (default: 7)
  -x <number>, --max-length <number>
                        The maximum length of non-canonical peptides,
                        inclusive. (default: 25)
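
As an illustration of the usage above, the following is a minimal Python sketch that assembles a generateIndex call from the documented long options. The helper name build_generate_index_cmd and the file names are hypothetical; only the flags come from the help text, and the command is printed rather than executed.

```python
import shlex

VALID_SOURCES = ("GENCODE", "ENSEMBL")  # choices listed for --reference-source


def build_generate_index_cmd(genome_fasta, annotation_gtf, proteome_fasta,
                             output_dir, reference_source=None,
                             gtf_symlink=False):
    """Assemble the argument list for a `moPepGen generateIndex` call.

    Hypothetical helper: it only strings together flags shown in the help
    text and does not run anything.
    """
    if reference_source is not None and reference_source not in VALID_SOURCES:
        raise ValueError(f"reference_source must be one of {VALID_SOURCES}")
    cmd = [
        "moPepGen", "generateIndex",
        "--genome-fasta", str(genome_fasta),
        "--annotation-gtf", str(annotation_gtf),
        "--proteome-fasta", str(proteome_fasta),
        "--output-dir", str(output_dir),
    ]
    if reference_source is not None:
        cmd += ["--reference-source", reference_source]
    if gtf_symlink:
        cmd.append("--gtf-symlink")
    return cmd


# Placeholder file names; substitute the version-matched reference files.
cmd = build_generate_index_cmd(
    "genome.fa", "annotation.gtf", "proteome.fa", "index",
    reference_source="GENCODE",
)
print(shlex.join(cmd))
```

To run the command instead of printing it, the list could be passed to subprocess.run(cmd, check=True), assuming moPepGen is installed and on the PATH.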

Arguments

-h, --help

show this help message and exit

-o, --output-dir <file> Path

Output directory for index files.

--gtf-symlink

Create a symlink of the GTF file instead of copying it.
Default: False

-f, --force

Force write data to index dir.
Default: False

-g, --genome-fasta <file> Path

Path to the genome assembly FASTA file. Only ENSEMBL and GENCODE are supported. Its version must be the same as the annotation GTF and proteome FASTA.

-a, --annotation-gtf <file> Path

Path to the annotation GTF file. Only ENSEMBL and GENCODE are supported. Its version must be the same as the genome and proteome FASTA.

--reference-source str

Source of reference genome and annotation.
Choices: ['GENCODE', 'ENSEMBL']

-p, --proteome-fasta <file> Path

Path to the translated protein sequence FASTA file. Only ENSEMBL and GENCODE are supported. Its version must be the same as genome FASTA and annotation GTF.

--invalid-protein-as-noncoding

Treat any transcript whose protein sequence is invalid (contains the * symbol) as noncoding.
Default: False

-c, --cleavage-rule <value> str

Enzymatic cleavage rule.
Default: trypsin
Choices: ['arg-c', 'asp-n', 'bnps-skatole', 'caspase 1', 'caspase 2', 'caspase 3', 'caspase 4', 'caspase 5', 'caspase 6', 'caspase 7', 'caspase 8', 'caspase 9', 'caspase 10', 'chymotrypsin high specificity', 'chymotrypsin low specificity', 'clostripain', 'cnbr', 'enterokinase', 'factor xa', 'formic acid', 'glutamyl endopeptidase', 'granzyme b', 'hydroxylamine', 'iodosobenzoic acid', 'lysc', 'lysn', 'ntcb', 'pepsin ph1.3', 'pepsin ph2.0', 'proline endopeptidase', 'proteinase k', 'staphylococcal peptidase i', 'thermolysin', 'thrombin', 'trypsin', 'trypsin_exception']

--cleavage-exception <value> str

Enzymatic cleavage exception.
Default: auto

-m, --miscleavage <number> int

Number of missed cleavages to allow per non-canonical peptide.
Default: 2

-w, --min-mw <number> float

The minimal molecular weight of the non-canonical peptides.
Default: 500.0

-l, --min-length <number> int

The minimal length of non-canonical peptides, inclusive.
Default: 7

-x, --max-length <number> int

The maximum length of non-canonical peptides, inclusive.
Default: 25

--debug-level <value|number> str

Debug level.
Default: INFO

-q, --quiet

Quiet
Default: False

Input

Three files are required for this command:

  1. Reference genome FASTA file.
  2. Genome annotation GTF file.
  3. Protein sequence FASTA file.

All three files must be downloaded from the same release version of either GENCODE or ENSEMBL. moPepGen does not support reference files from other databases (e.g. RefSeq) at the moment.

For GENCODE, the primary assembly is recommended (i.e. 'GRCh38.primary_assembly.genome.fa'). For annotation, we recommend the comprehensive gene annotation file for the primary assembly, which matches the genome FASTA (e.g. 'gencode.vXX.primary_assembly.annotation.gtf'). The version-matched protein sequences should also be downloaded from GENCODE (e.g. 'gencode.vXX.pc_translations.fa').

Similarly, for ENSEMBL, we recommend using the primary genome assembly and its annotation. The genome FASTA file should resemble 'Homo_sapiens.GRCh38.dna.primary_assembly.fa', the GTF should look like 'Homo_sapiens.GRCh38.XX.chr_patch_hapl_scaff.gtf', and the protein sequence file should look like 'Homo_sapiens.GRCh38.pep.all.fa'.
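
Because a release mismatch is easy to miss, a quick sanity check on file names can help. The sketch below is a hypothetical helper, not part of moPepGen: it assumes GENCODE-style names of the form 'gencode.vXX.…' and compares the release token. It cannot check files whose names carry no release number, such as the genome FASTA, and the release numbers shown are arbitrary examples.

```python
import os
import re


def gencode_version(path):
    """Extract the release token from a GENCODE-style file name, or None."""
    match = re.match(r"gencode\.v([^.]+)\.", os.path.basename(path))
    return match.group(1) if match else None


def gencode_versions_match(*paths):
    """True only if every path has a release token and all tokens agree."""
    versions = [gencode_version(p) for p in paths]
    return None not in versions and len(set(versions)) == 1


# Arbitrary release numbers, used only to illustrate a match and a mismatch.
gtf = "refs/gencode.v34.primary_assembly.annotation.gtf"
print(gencode_versions_match(gtf, "refs/gencode.v34.pc_translations.fa"))
print(gencode_versions_match(gtf, "refs/gencode.v35.pc_translations.fa"))
```

A file name check is only a heuristic; the download source and release notes remain the authoritative record of which release a file belongs to.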

Output

Users usually don't need to worry about the output files of this command. As long as the correct path is provided to subsequent moPepGen commands, the correct index files will be recognized.

The following files are created by this command:

File Name                   Description
genome.pkl                  This file contains the entire reference genome.
annotation.gtf              A copy of the input annotation GTF file. If --gtf-symlink is used (default = False), this will be a symlink pointing to the input file.
annotation_gene.idx         A text file with the location of each gene in the GTF file.
annotation_tx.idx           A text file with the location of each transcript in the GTF file.
proteome.pkl                Contains all protein sequences.
canonical_peptides_001.pkl  Contains nonredundant canonical peptides from the input proteome.
coding_transcripts.pkl      Contains all coding transcripts.
metadata.json               Metadata including software versions as well as enzymes and cleavage parameters used to generate the canonical peptide pool.
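
If an index directory needs to be checked before launching a long analysis, something like the following Python sketch could be used. It assumes the file names listed above are exactly what generateIndex wrote (runs with other settings may produce differently numbered peptide files), and it makes no assumption about the keys inside metadata.json. The helper names are hypothetical.

```python
import json
from pathlib import Path

# File names taken from the list above; assumed complete for a default run.
EXPECTED_FILES = [
    "genome.pkl",
    "annotation.gtf",
    "annotation_gene.idx",
    "annotation_tx.idx",
    "proteome.pkl",
    "canonical_peptides_001.pkl",
    "coding_transcripts.pkl",
    "metadata.json",
]


def missing_index_files(index_dir):
    """List expected index files that are absent from index_dir.

    Path.exists() follows symlinks, so a dangling annotation.gtf symlink
    (possible with --gtf-symlink if the source GTF was moved) counts as missing.
    """
    index_dir = Path(index_dir)
    return [name for name in EXPECTED_FILES if not (index_dir / name).exists()]


def load_metadata(index_dir):
    """Parse metadata.json without assuming anything about its structure."""
    with open(Path(index_dir) / "metadata.json") as handle:
        return json.load(handle)
```

Calling missing_index_files on the directory passed to --output-dir returns an empty list when every listed file is present.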