generateIndex
generateIndex takes the reference genome FASTA, the annotation GTF, and the translated proteome FASTA file, and processes them so that subsequent moPepGen commands can read them quickly. The output index files also contain the canonical peptide pool and can be used by any moPepGen command. We recommend running generateIndex before any analysis with moPepGen, so that the reference files are not reprocessed on every run, which saves substantial time.
Reference Version
The versions of the reference genome FASTA, proteome FASTA, and annotation GTF MUST be consistent across all analyses.
Usage
usage: moPepGen generateIndex [-h] -o <file> [--gtf-symlink] [-f] [-g <file>]
[-a <file>]
[--reference-source {GENCODE,ENSEMBL}]
[-p <file>] [--invalid-protein-as-noncoding]
[-c <value>] [--cleavage-exception <value>]
[-m <number>] [-w <number>] [-l <number>]
[-x <number>] [--debug-level <value|number>]
[-q]
Generate genome and proteome index files for moPepGen parsers and peptide
caller.
optional arguments:
-h, --help show this help message and exit
-o <file>, --output-dir <file>
Output directory for index files. (default: None)
--gtf-symlink Create a symlink of the GTF file instead of copying
it. (default: False)
-f, --force Force write data to index dir. (default: False)
--debug-level <value|number>
Debug level. (default: INFO)
-q, --quiet Quiet (default: False)
Reference Files:
-g <file>, --genome-fasta <file>
Path to the genome assembly FASTA file. Only ENSEMBL
and GENCODE are supported. Its version must be the
same as the annotation GTF and proteome FASTA
(default: None)
-a <file>, --annotation-gtf <file>
Path to the annotation GTF file. Only ENSEMBL and
GENCODE are supported. Its version must be the same as
the genome and proteome FASTA. (default: None)
--reference-source {GENCODE,ENSEMBL}
Source of reference genome and annotation. (default:
None)
-p <file>, --proteome-fasta <file>
Path to the translated protein sequence FASTA file.
Only ENSEMBL and GENCODE are supported. Its version
must be the same as genome FASTA and annotation GTF.
(default: None)
--invalid-protein-as-noncoding
Treat any transcript that the protein sequence is
invalid ( contains the * symbol) as noncoding.
(default: False)
Cleavage Parameters:
-c <value>, --cleavage-rule <value>
Enzymatic cleavage rule. (default: trypsin)
--cleavage-exception <value>
Enzymatic cleavage exception. (default: auto)
-m <number>, --miscleavage <number>
Number of cleavages to allow per non-canonical
peptide. (default: 2)
-w <number>, --min-mw <number>
The minimal molecular weight of the non-canonical
peptides. (default: 500.0)
-l <number>, --min-length <number>
The minimal length of non-canonical peptides,
inclusive. (default: 7)
-x <number>, --max-length <number>
The maximum length of non-canonical peptides,
inclusive. (default: 25)
Arguments
-h, --help
show this help message and exit
-o, --output-dir <file> Path
Output directory for index files.
--gtf-symlink
Create a symlink of the GTF file instead of copying it.
Default: False
-f, --force
Force write data to index dir.
Default: False
-g, --genome-fasta <file> Path
Path to the genome assembly FASTA file. Only ENSEMBL and GENCODE are supported. Its version must be the same as the annotation GTF and proteome FASTA.
-a, --annotation-gtf <file> Path
Path to the annotation GTF file. Only ENSEMBL and GENCODE are supported. Its version must be the same as the genome and proteome FASTA.
--reference-source str
Source of reference genome and annotation.
Choices: ['GENCODE', 'ENSEMBL']
-p, --proteome-fasta <file> Path
Path to the translated protein sequence FASTA file. Only ENSEMBL and GENCODE are supported. Its version must be the same as genome FASTA and annotation GTF.
--invalid-protein-as-noncoding
Treat any transcript whose protein sequence is invalid (contains the * symbol) as noncoding.
Default: False
-c, --cleavage-rule <value> str
Enzymatic cleavage rule.
Default: trypsin
Choices: ['arg-c', 'asp-n', 'bnps-skatole', 'caspase 1', 'caspase 2', 'caspase 3', 'caspase 4', 'caspase 5', 'caspase 6', 'caspase 7', 'caspase 8', 'caspase 9', 'caspase 10', 'chymotrypsin high specificity', 'chymotrypsin low specificity', 'clostripain', 'cnbr', 'enterokinase', 'factor xa', 'formic acid', 'glutamyl endopeptidase', 'granzyme b', 'hydroxylamine', 'iodosobenzoic acid', 'lysc', 'lysn', 'ntcb', 'pepsin ph1.3', 'pepsin ph2.0', 'proline endopeptidase', 'proteinase k', 'staphylococcal peptidase i', 'thermolysin', 'thrombin', 'trypsin', 'trypsin_exception']
--cleavage-exception <value> str
Enzymatic cleavage exception.
Default: auto
-m, --miscleavage <number> int
Number of missed cleavages to allow per non-canonical peptide.
Default: 2
-w, --min-mw <number> float
The minimal molecular weight of the non-canonical peptides.
Default: 500.0
-l, --min-length <number> int
The minimal length of non-canonical peptides, inclusive.
Default: 7
-x, --max-length <number> int
The maximum length of non-canonical peptides, inclusive.
Default: 25
--debug-level <value|number> str
Debug level.
Default: INFO
-q, --quiet
Quiet
Default: False
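As a sketch of how the arguments above fit together, the script below assembles a generateIndex call using only flags listed in the usage text. The input paths and the 'index' output directory are placeholders (assumptions, not files shipped with moPepGen), and the cleavage values are simply the documented defaults written out explicitly. The script prints the command instead of running it unless moPepGen is on the PATH and all three inputs exist.

```shell
#!/bin/sh
# Placeholder paths (assumptions): replace with your own version-matched files.
GENOME_FASTA="${GENOME_FASTA:-genome.fa}"
ANNOTATION_GTF="${ANNOTATION_GTF:-annotation.gtf}"
PROTEOME_FASTA="${PROTEOME_FASTA:-proteome.fa}"
INDEX_DIR="${INDEX_DIR:-index}"

# Assemble the command from flags documented in the usage text.
# The cleavage values below are the documented defaults, spelled out.
set -- moPepGen generateIndex \
    --genome-fasta "$GENOME_FASTA" \
    --annotation-gtf "$ANNOTATION_GTF" \
    --proteome-fasta "$PROTEOME_FASTA" \
    --cleavage-rule trypsin \
    --miscleavage 2 \
    --min-mw 500.0 \
    --min-length 7 \
    --max-length 25 \
    --output-dir "$INDEX_DIR"

# Keep a printable copy of the assembled command.
INDEX_CMD="$*"

# Run only when moPepGen is installed and every input exists; otherwise print.
if command -v moPepGen >/dev/null 2>&1 \
    && [ -f "$GENOME_FASTA" ] \
    && [ -f "$ANNOTATION_GTF" ] \
    && [ -f "$PROTEOME_FASTA" ]; then
    "$@"
else
    echo "Would run: $INDEX_CMD"
fi
```

Writing the defaults out makes the cleavage settings used to build the canonical peptide pool visible in the script itself; omitting those flags should give the same result according to the defaults listed above.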
Input
Three files are required for this command:
- Reference genome FASTA file.
- Genome annotation GTF file.
- Protein sequence FASTA file.
All three files must be downloaded from the same release version of either GENCODE or ENSEMBL. moPepGen does not support reference files from other databases (e.g. RefSeq) at the moment.
For GENCODE, the primary assembly is recommended (e.g. 'GRCh38.primary_assembly.genome.fa'). For annotation, we recommend the comprehensive gene annotation file for the primary assembly, which matches the genome FASTA (e.g. 'gencode.vXX.primary_assembly.annotation.gtf'). The version-matched protein sequences should also be downloaded from GENCODE (e.g. 'gencode.vXX.pc_translations.fa').
Similarly, for ENSEMBL, we recommend using the primary genome assembly and its annotation. The genome FASTA file should resemble 'Homo_sapiens.GRCh38.dna.primary_assembly.fa', the GTF should look like 'Homo_sapiens.GRCh38.XX.chr_patch_hapl_scaff.gtf', and the protein sequence file should look like 'Homo_sapiens.GRCh38.pep.all.fa'.
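Because mismatched releases are an easy mistake to make, a quick check on the file names can help before indexing. The sketch below is a heuristic only, not part of moPepGen: the helper names `gencode_release` and `same_gencode_release` are hypothetical, the release numbers are made up for illustration, and the check assumes GENCODE-style names of the form 'gencode.vXX.…'. The GENCODE genome FASTA name carries no release tag, so it cannot be checked this way.

```shell
#!/bin/sh
# Hypothetical helper: print the GENCODE release tag (e.g. "v44" or "vM33")
# found in a file name, or nothing if the name carries no such tag.
gencode_release() {
    basename "$1" | sed -n 's/^gencode\.\(vM\{0,1\}[0-9][0-9]*\)\..*/\1/p'
}

# Succeeds only when both names carry the same, non-empty release tag.
same_gencode_release() {
    gtf_release=$(gencode_release "$1")
    proteome_release=$(gencode_release "$2")
    [ -n "$gtf_release" ] && [ "$gtf_release" = "$proteome_release" ]
}

# Example with made-up release numbers.
if same_gencode_release gencode.v44.primary_assembly.annotation.gtf \
                        gencode.v44.pc_translations.fa; then
    echo "release tags match"
else
    echo "release tags differ or are missing" >&2
fi
```

A name-based check cannot prove that file contents match, so it is best treated as a guard against obvious mix-ups rather than a guarantee.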
Output
Users usually don't need to worry about the output files of this command. As long as the correct path is provided to subsequent moPepGen commands, the correct index files will be recognized.
The following files are created by this command:
File Name | Description |
---|---|
genome.pkl | Contains the entire reference genome. |
annotation.gtf | A copy of the input annotation GTF file. If --gtf-symlink is used (default: False), this is a symlink pointing to the input file. |
annotation_gene.idx | A text file with the location of each gene in the GTF file. |
annotation_tx.idx | A text file with the location of each transcript in the GTF file. |
proteome.pkl | Contains all protein sequences. |
canonical_peptides_001.pkl | Contains nonredundant canonical peptides from the input proteome. |
coding_transcripts.pkl | Contains all coding transcripts. |
metadata.json | Metadata including software versions as well as the enzymes and cleavage parameters used to generate the canonical peptide pool. |
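Although the output files rarely need attention, a quick completeness check can be useful after a run is interrupted. The sketch below tests only for the file names listed in the table above. The helper `check_index_dir` is hypothetical and not part of moPepGen, 'index' is a placeholder for the directory passed to --output-dir, and the check assumes an index built with different settings uses the same file names.

```shell
#!/bin/sh
# Hypothetical helper: report which of the documented index files are missing
# from a directory. Returns non-zero if any file is missing.
check_index_dir() {
    index_dir=$1
    missing=0
    for name in genome.pkl annotation.gtf annotation_gene.idx \
                annotation_tx.idx proteome.pkl canonical_peptides_001.pkl \
                coding_transcripts.pkl metadata.json; do
        # -e follows symlinks, so a valid --gtf-symlink link counts as present.
        if [ ! -e "$index_dir/$name" ]; then
            echo "missing: $name" >&2
            missing=1
        fi
    done
    return "$missing"
}

# INDEX_DIR is a placeholder for the directory passed to --output-dir.
INDEX_DIR="${INDEX_DIR:-index}"
if check_index_dir "$INDEX_DIR"; then
    echo "index directory looks complete"
else
    echo "index directory is incomplete" >&2
fi
```

The check looks only at file presence, not at file contents, so it will not detect a truncated or corrupted index file.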