parseVEP
parseVEP takes the output of Ensembl's Variant Effect Predictor (VEP) and converts it into the GVF file format that moPepGen uses internally. The resulting GVF file can then be passed to moPepGen's callVariant subcommand to call variant peptide sequences.
Reference Version
The versions of the reference genome FASTA, proteome FASTA, and annotation GTF MUST be consistent across all analyses.
Usage
usage: moPepGen parseVEP [-h] -i ['<files>'] [['<files>'] ...] -o <file>
--source SOURCE [--skip-failed] [-g <file>]
[-a <file>] [--reference-source {GENCODE,ENSEMBL}]
[--codon-table {Alternative Yeast Nuclear,Protozoan Mitochondrial,Vertebrate Mitochondrial,Blepharisma Macronuclear,Chlorophycean Mitochondrial,Ascidian Mitochondrial,Ciliate Nuclear,Mesodinium Nuclear,Balanophoraceae Plastid,SGC9,Cephalodiscidae Mitochondrial,Trematode Mitochondrial,Pachysolen tannophilus Nuclear,SGC2,Yeast Mitochondrial,SGC5,Euplotid Nuclear,Scenedesmus obliquus Mitochondrial,Peritrich Nuclear,Archaeal,Coelenterate Mitochondrial,Bacterial,Mold Mitochondrial,SGC3,Hexamita Nuclear,Pterobranchia Mitochondrial,Plant Plastid,Condylostoma Nuclear,Blastocrithidia Nuclear,Gracilibacteria,Alternative Flatworm Mitochondrial,Echinoderm Mitochondrial,Invertebrate Mitochondrial,SGC0,Candidate Division SR1,Dasycladacean Nuclear,SGC4,Flatworm Mitochondrial,SGC8,Thraustochytrium Mitochondrial,SGC1,Spiroplasma,Mycoplasma,Standard,Karyorelict Nuclear}]
[--chr-codon-table [CHR_CODON_TABLE [CHR_CODON_TABLE ...]]]
[--start-codons [START_CODONS [START_CODONS ...]]]
[--chr-start-codons [CHR_START_CODONS [CHR_START_CODONS ...]]]
[--index-dir [<file>]] [--debug-level <value|number>]
[-q]
Parse VEP output tsv to the GVF format of variant records for moPepGen to call
variant peptides. The genome assembly FASTA and annotation GTF must come from
the same GENCODE/ENSEMBL version, and must the consistent with the VEP output.
optional arguments:
-h, --help show this help message and exit
-i ['<files>'] [['<files>'] ...], --input-path ['<files>'] [['<files>'] ...]
File path to the VEP output TXT file. Can take
multiple files. Valid formats: ['.tsv', '.txt',
'.tsv.gz', '.txt.gz'] (default: None)
-o <file>, --output-path <file>
File path to the output file. Valid formats: ['.gvf']
(default: None)
--source SOURCE Variant source (e.g. gSNP, sSNV, Fusion) (default:
None)
--skip-failed When set, the failed records will be skipped.
(default: False)
--debug-level <value|number>
Debug level. (default: INFO)
-q, --quiet Quiet (default: False)
Reference Files:
-g <file>, --genome-fasta <file>
Path to the genome assembly FASTA file. Only ENSEMBL
and GENCODE are supported. Its version must be the
same as the annotation GTF and proteome FASTA
(default: None)
-a <file>, --annotation-gtf <file>
Path to the annotation GTF file. Only ENSEMBL and
GENCODE are supported. Its version must be the same as
the genome and proteome FASTA. (default: None)
--reference-source {GENCODE,ENSEMBL}
Source of reference genome and annotation. (default:
None)
--codon-table {Alternative Yeast Nuclear,Protozoan Mitochondrial,Vertebrate Mitochondrial,Blepharisma Macronuclear,Chlorophycean Mitochondrial,Ascidian Mitochondrial,Ciliate Nuclear,Mesodinium Nuclear,Balanophoraceae Plastid,SGC9,Cephalodiscidae Mitochondrial,Trematode Mitochondrial,Pachysolen tannophilus Nuclear,SGC2,Yeast Mitochondrial,SGC5,Euplotid Nuclear,Scenedesmus obliquus Mitochondrial,Peritrich Nuclear,Archaeal,Coelenterate Mitochondrial,Bacterial,Mold Mitochondrial,SGC3,Hexamita Nuclear,Pterobranchia Mitochondrial,Plant Plastid,Condylostoma Nuclear,Blastocrithidia Nuclear,Gracilibacteria,Alternative Flatworm Mitochondrial,Echinoderm Mitochondrial,Invertebrate Mitochondrial,SGC0,Candidate Division SR1,Dasycladacean Nuclear,SGC4,Flatworm Mitochondrial,SGC8,Thraustochytrium Mitochondrial,SGC1,Spiroplasma,Mycoplasma,Standard,Karyorelict Nuclear}
Codon table. Defaults to "Standard". Supported codon
tables: {'Alternative Yeast Nuclear', 'Protozoan
Mitochondrial', 'Vertebrate Mitochondrial',
'Blepharisma Macronuclear', 'Chlorophycean
Mitochondrial', 'Ascidian Mitochondrial', 'Ciliate
Nuclear', 'Mesodinium Nuclear', 'Balanophoraceae
Plastid', 'SGC9', 'Cephalodiscidae Mitochondrial',
'Trematode Mitochondrial', 'Pachysolen tannophilus
Nuclear', 'SGC2', 'Yeast Mitochondrial', 'SGC5',
'Euplotid Nuclear', 'Scenedesmus obliquus
Mitochondrial', 'Peritrich Nuclear', 'Archaeal',
'Coelenterate Mitochondrial', 'Bacterial', 'Mold
Mitochondrial', 'SGC3', 'Hexamita Nuclear',
'Pterobranchia Mitochondrial', 'Plant Plastid',
'Condylostoma Nuclear', 'Blastocrithidia Nuclear',
'Gracilibacteria', 'Alternative Flatworm
Mitochondrial', 'Echinoderm Mitochondrial',
'Invertebrate Mitochondrial', 'SGC0', 'Candidate
Division SR1', 'Dasycladacean Nuclear', 'SGC4',
'Flatworm Mitochondrial', 'SGC8', 'Thraustochytrium
Mitochondrial', 'SGC1', 'Spiroplasma', 'Mycoplasma',
'Standard', 'Karyorelict Nuclear'} (default: Standard)
--chr-codon-table [CHR_CODON_TABLE [CHR_CODON_TABLE ...]]
Chromosome specific codon table. Must be specified in
the format of "chrM:SGC1", where "chrM" is the
chromosome name and "SGC1" is the codon table to use
to translate genes on chrM. Supported codon tables:
{'Alternative Yeast Nuclear', 'Protozoan
Mitochondrial', 'Vertebrate Mitochondrial',
'Blepharisma Macronuclear', 'Chlorophycean
Mitochondrial', 'Ascidian Mitochondrial', 'Ciliate
Nuclear', 'Mesodinium Nuclear', 'Balanophoraceae
Plastid', 'SGC9', 'Cephalodiscidae Mitochondrial',
'Trematode Mitochondrial', 'Pachysolen tannophilus
Nuclear', 'SGC2', 'Yeast Mitochondrial', 'SGC5',
'Euplotid Nuclear', 'Scenedesmus obliquus
Mitochondrial', 'Peritrich Nuclear', 'Archaeal',
'Coelenterate Mitochondrial', 'Bacterial', 'Mold
Mitochondrial', 'SGC3', 'Hexamita Nuclear',
'Pterobranchia Mitochondrial', 'Plant Plastid',
'Condylostoma Nuclear', 'Blastocrithidia Nuclear',
'Gracilibacteria', 'Alternative Flatworm
Mitochondrial', 'Echinoderm Mitochondrial',
'Invertebrate Mitochondrial', 'SGC0', 'Candidate
Division SR1', 'Dasycladacean Nuclear', 'SGC4',
'Flatworm Mitochondrial', 'SGC8', 'Thraustochytrium
Mitochondrial', 'SGC1', 'Spiroplasma', 'Mycoplasma',
'Standard', 'Karyorelict Nuclear'}. By default, "SGC1"
is assigned to mitochondrial chromosomes. (default:
[])
--start-codons [START_CODONS [START_CODONS ...]]
Default start codon(s) to use for novel ORF
translation. Defaults to ["ATG"]. (default: ['ATG'])
--chr-start-codons [CHR_START_CODONS [CHR_START_CODONS ...]]
Chromosome specific start codon(s). For example,
"chrM:ATG,ATA,ATT".By defualt, mitochondrial
chromosome name is automatically inferred andstart
codon "ATG", "ATA", "ATT", "ATC" and "GTG" are
assigned to it. (default: [])
--index-dir [<file>] Path to the directory of index files generated by
moPepGen generateIndex. If given, --genome-fasta,
--proteome-fasta and --anntotation-gtf will be
ignored. (default: None)
Arguments
-h, --help
show this help message and exit
-i, --input-path ['<files>'] Path
File path to the VEP output TXT file. Can take multiple files. Valid formats: ['.tsv', '.txt', '.tsv.gz', '.txt.gz']
-o, --output-path <file> Path
File path to the output file. Valid formats: ['.gvf']
--source str
Variant source (e.g. gSNP, sSNV, Fusion)
--skip-failed
When set, the failed records will be skipped.
Default: False
-g, --genome-fasta <file> Path
Path to the genome assembly FASTA file. Only ENSEMBL and GENCODE are supported. Its version must be the same as the annotation GTF and proteome FASTA
-a, --annotation-gtf <file> Path
Path to the annotation GTF file. Only ENSEMBL and GENCODE are supported. Its version must be the same as the genome and proteome FASTA.
--reference-source str
Source of reference genome and annotation.
Choices: ['GENCODE', 'ENSEMBL']
--codon-table str
Codon table used for translation.
Default: Standard
Choices: {'Alternative Yeast Nuclear', 'Protozoan Mitochondrial', 'Vertebrate Mitochondrial', 'Blepharisma Macronuclear', 'Chlorophycean Mitochondrial', 'Ascidian Mitochondrial', 'Ciliate Nuclear', 'Mesodinium Nuclear', 'Balanophoraceae Plastid', 'SGC9', 'Cephalodiscidae Mitochondrial', 'Trematode Mitochondrial', 'Pachysolen tannophilus Nuclear', 'SGC2', 'Yeast Mitochondrial', 'SGC5', 'Euplotid Nuclear', 'Scenedesmus obliquus Mitochondrial', 'Peritrich Nuclear', 'Archaeal', 'Coelenterate Mitochondrial', 'Bacterial', 'Mold Mitochondrial', 'SGC3', 'Hexamita Nuclear', 'Pterobranchia Mitochondrial', 'Plant Plastid', 'Condylostoma Nuclear', 'Blastocrithidia Nuclear', 'Gracilibacteria', 'Alternative Flatworm Mitochondrial', 'Echinoderm Mitochondrial', 'Invertebrate Mitochondrial', 'SGC0', 'Candidate Division SR1', 'Dasycladacean Nuclear', 'SGC4', 'Flatworm Mitochondrial', 'SGC8', 'Thraustochytrium Mitochondrial', 'SGC1', 'Spiroplasma', 'Mycoplasma', 'Standard', 'Karyorelict Nuclear'}
--chr-codon-table str
Chromosome-specific codon table. Must be specified in the format "chrM:SGC1", where "chrM" is the chromosome name and "SGC1" is the codon table used to translate genes on chrM. The supported codon tables are the same as the choices listed for --codon-table. By default, "SGC1" is assigned to mitochondrial chromosomes.
Default: []
--start-codons str
Default start codon(s) to use for novel ORF translation.
Default: ['ATG']
--chr-start-codons str
Chromosome-specific start codon(s), for example "chrM:ATG,ATA,ATT". By default, the mitochondrial chromosome name is inferred automatically and the start codons "ATG", "ATA", "ATT", "ATC" and "GTG" are assigned to it.
Default: []
--index-dir <file> Path
Path to the directory of index files generated by moPepGen generateIndex. If given, --genome-fasta, --proteome-fasta and --annotation-gtf will be ignored.
--debug-level <value|number> str
Debug level.
Default: INFO
-q, --quiet
Quiet
Default: False
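As a rough illustration of how the arguments above fit together, the sketch below assembles a parseVEP call that reads one VEP TSV file and writes a GVF file, taking the references from an index directory. All file paths are hypothetical placeholders, and gSNP is just one of the example source labels mentioned above. The command is only run if moPepGen is installed and the input file exists.

```shell
#!/bin/sh
# All paths below are hypothetical placeholders; substitute your own files.
VEP_TSV="vep/sample.tsv"        # TSV file written by VEP
OUT_GVF="gvf/sample_gSNP.gvf"   # output path, must end in .gvf
INDEX_DIR="index/"              # directory created by moPepGen generateIndex

# Build the argument list once so it can be inspected before running.
set -- parseVEP \
    -i "$VEP_TSV" \
    -o "$OUT_GVF" \
    --source gSNP \
    --index-dir "$INDEX_DIR"

PARSEVEP_CMD="moPepGen $*"
echo "$PARSEVEP_CMD"

# Run only when moPepGen is installed and the input file exists.
if command -v moPepGen >/dev/null 2>&1 && [ -f "$VEP_TSV" ]; then
    moPepGen "$@"
fi
```

If no index directory is available, --index-dir can be swapped for -g, -a and --reference-source, pointing at a genome FASTA and annotation GTF of the same release.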
Running VEP for moPepGen
moPepGen relies on the Ensembl Variant Effect Predictor (VEP) to annotate single nucleotide variants and small insertions and deletions (indels), typically provided as VCF files. Note that VEP also supports the annotation of multi-nucleotide variants and structural variants, but these are not currently supported by moPepGen and will be discarded by parseVEP.
Please refer to the official VEP Tutorial for instructions on downloading, installing and running VEP. The key is to ensure that the annotation version used in VEP is the same as the one you would like to use with moPepGen.
parseVEP currently supports only the TSV output of VEP. Please set the suffix of --output_file to .tsv to ensure that VEP writes the correct format. VEP's default output format is tab-delimited, and the --tab option only adds extra columns that moPepGen does not use, so including or excluding --tab will not affect moPepGen's results.
Using an Ensembl GTF
VEP automatically uses an Ensembl GTF for annotation. Please ensure that the release version and genome build are consistent with the reference files you downloaded. Both appear in the cache name reported in the output headers (for example, 107_GRCh38).
To ensure that the correct version is used, download the cache following the VEP instructions and specify the parameters:
--species homo_sapiens --assembly GRCh38 --cache --cache_version 107
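One way to confirm the cache version after a run is to search the header of the VEP output for the release and assembly token. The sketch below does this on a small stand-in file. The header lines in it are made up for illustration, and the exact wording of real VEP headers may differ, so treat the pattern as an assumption to adapt.

```shell
#!/bin/sh
# Made-up header lines standing in for the top of a real VEP output file.
cat > vep_header_example.tsv <<'EOF'
## ENSEMBL VARIANT EFFECT PREDICTOR v107.0
## Using cache in /home/user/.vep/homo_sapiens/107_GRCh38
EOF

EXPECTED="107_GRCh38"   # release_assembly of the references given to moPepGen

# Look only at '##' header lines and take the first release_assembly token.
FOUND=$(grep '^##' vep_header_example.tsv | grep -oE '[0-9]+_GRCh[0-9]+' | head -n 1)

if [ "$FOUND" = "$EXPECTED" ]; then
    echo "cache version matches: $FOUND"
else
    echo "cache version mismatch: found '$FOUND', expected '$EXPECTED'" >&2
fi
```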
Using a Non-Ensembl GTF
If you have decided to use a set of reference files from GENCODE, it is important to supply VEP with the non-Ensembl GTF during annotation, so that chromosome names, transcript IDs and transcript coordinates match your intended reference files. A custom GTF can be specified in VEP.
As instructed by VEP, your GTF file must be sorted in chromosomal order, compressed with bgzip, and indexed with tabix:
grep -v "#" PATH_TO_GTF | sort -k1,1 -k4,4n -k5,5n -t$'\t' | bgzip -c > PATH_TO_GTF_GZ
tabix -p gff PATH_TO_GTF_GZ # gtf is not a tabix format option, gff works
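To see what the sort keys in the command above do, the following sketch applies them to a toy GTF with three made-up records. It only illustrates the ordering step and leaves out bgzip and tabix, so it runs without either tool installed. The file names and records are hypothetical.

```shell
#!/bin/sh
# Three made-up GTF records, deliberately out of order, plus a header comment.
{
    printf '##description: toy annotation\n'
    printf 'chr2\tTOY\texon\t500\t900\t.\t+\t.\tgene_id "G2";\n'
    printf 'chr1\tTOY\texon\t300\t400\t.\t+\t.\tgene_id "G1";\n'
    printf 'chr1\tTOY\texon\t100\t200\t.\t+\t.\tgene_id "G1";\n'
} > toy.gtf

# Drop header comment lines, then order records by chromosome (column 1),
# start (column 4, numeric) and end (column 5, numeric).
TAB=$(printf '\t')
grep -v '^#' toy.gtf | sort -t "$TAB" -k1,1 -k4,4n -k5,5n > toy.sorted.gtf

# Print chromosome:start for each record to inspect the resulting order.
SORTED_KEYS=$(cut -f1,4 toy.sorted.gtf | tr '\t' ':' | paste -s -d ' ' -)
echo "$SORTED_KEYS"
```

The numeric flag on columns 4 and 5 matters: without it, a start position of 1000 would sort before 200.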
To use the GTF for annotation, run VEP with the following additional parameters, pointing --custom at the bgzipped and indexed file:
--custom PATH_TO_GTF_GZ,GENCODE,gtf --fasta PATH_TO_GENOME_FA
After running VEP, run the filter_vep command to subset the output to only those records annotated using the GENCODE GTF, since VEP automatically outputs both native (Ensembl) and custom annotations.
filter_vep -i VEP_OUTPUT_TSV -o FILTERED_VEP_OUTPUT_TSV --filter "Source = GENCODE"
Tips for Running VEP
We recommend the following parameters to optimize VEP run time. Please select appropriate settings for your system.
--offline --cache
--no_stats
--fork
--buffer_size
--distance 0 - only annotate variants directly overlapping gene coordinates
--chr 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,X,Y,MT - focus on the primary assembly
--no_intergenic - ignore intergenic variants
We do not recommend adding any further annotation flags, since moPepGen does not use the extra information and it increases run time.
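Putting the recommended options together, a VEP call for the Ensembl cache case might look like the sketch below. The input and output file names are hypothetical, and the --fork and --buffer_size values are assumptions that should be tuned to the available cores and memory. The command is only run if vep is on the PATH and the input file exists.

```shell
#!/bin/sh
# Hypothetical inputs and tuning values; adjust to your data and hardware.
INPUT_VCF="variants.vcf"
OUTPUT_TSV="variants.vep.tsv"
FORKS=4            # assumed number of CPU cores to give VEP
BUFFER_SIZE=5000   # assumed number of variants held in memory per batch

# Collect the options in one place so the full command can be reviewed.
set -- \
    --input_file "$INPUT_VCF" \
    --output_file "$OUTPUT_TSV" \
    --species homo_sapiens --assembly GRCh38 \
    --offline --cache --cache_version 107 \
    --no_stats \
    --fork "$FORKS" \
    --buffer_size "$BUFFER_SIZE" \
    --distance 0 \
    --chr 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,X,Y,MT \
    --no_intergenic

VEP_CMD="vep $*"
echo "$VEP_CMD"

# Run only when VEP is installed and the input file exists.
if command -v vep >/dev/null 2>&1 && [ -f "$INPUT_VCF" ]; then
    vep "$@"
fi
```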