parseVEP

parseVEP takes the output of Ensembl's Variant Effect Predictor (VEP) and converts it into the GVF file format that moPepGen uses internally. The resulting GVF file can then be passed to moPepGen's callVariant subcommand to call variant peptide sequences.

Reference Version

The versions of the reference genome FASTA, proteome FASTA and annotation GTF MUST be consistent across all analyses.

Usage

usage: moPepGen parseVEP [-h] -i ['<files>'] [['<files>'] ...] [-o <file>]
                         [--output-prefix <value>]
                         [--samples SAMPLES [SAMPLES ...]] --source SOURCE
                         [--skip-failed] [-g <file>] [-a <file>]
                         [--reference-source {GENCODE,ENSEMBL}]
                         [--codon-table {SGC5,Hexamita Nuclear,Euplotid Nuclear,Ascidian Mitochondrial,Thraustochytrium Mitochondrial,Standard,Gracilibacteria,SGC8,Condylostoma Nuclear,Flatworm Mitochondrial,SGC1,Protozoan Mitochondrial,SGC3,Pachysolen tannophilus Nuclear,Archaeal,Chlorophycean Mitochondrial,Pterobranchia Mitochondrial,Echinoderm Mitochondrial,Blastocrithidia Nuclear,Invertebrate Mitochondrial,SGC0,Mycoplasma,Yeast Mitochondrial,Mesodinium Nuclear,Vertebrate Mitochondrial,Bacterial,Trematode Mitochondrial,SGC4,Balanophoraceae Plastid,Blepharisma Macronuclear,Mold Mitochondrial,Karyorelict Nuclear,Spiroplasma,SGC9,Scenedesmus obliquus Mitochondrial,Plant Plastid,Coelenterate Mitochondrial,Alternative Yeast Nuclear,Dasycladacean Nuclear,Candidate Division SR1,SGC2,Cephalodiscidae Mitochondrial,Peritrich Nuclear,Ciliate Nuclear,Alternative Flatworm Mitochondrial}]
                         [--chr-codon-table [CHR_CODON_TABLE [CHR_CODON_TABLE ...]]]
                         [--start-codons [START_CODONS [START_CODONS ...]]]
                         [--chr-start-codons [CHR_START_CODONS [CHR_START_CODONS ...]]]
                         [--index-dir [<file>]] [--debug-level <value|number>]
                         [-q]

Parse VEP output tsv to the GVF format of variant records for moPepGen to call
variant peptides. The genome assembly FASTA and annotation GTF must come from
the same GENCODE/ENSEMBL version, and must the consistent with the VEP output.

optional arguments:
  -h, --help            show this help message and exit
  -i ['<files>'] [['<files>'] ...], --input-path ['<files>'] [['<files>'] ...]
                        File path to the VEP output TXT file. Can take
                        multiple files. Valid formats: ['.tsv', '.txt',
                        '.tsv.gz', '.txt.gz', '.vcf', '.vcf.gz'] (default:
                        None)
  -o <file>, --output-path <file>
                        File path to the output file. Valid formats: ['.gvf']
                        (default: None)
  --output-prefix <value>
                        Output prefix. Only used when inputs are VCF files.
                        (default: None)
  --samples SAMPLES [SAMPLES ...]
                        Samples to be parsed from the VCF file. If not
                        provided, all samples from the VCF file will be
                        parsed. This option is only used when the inputs are
                        VCF files. (default: [])
  --source SOURCE       Variant source (e.g. gSNP, sSNV, Fusion) (default:
                        None)
  --skip-failed         When set, the failed records will be skipped.
                        (default: False)
  --debug-level <value|number>
                        Debug level. (default: INFO)
  -q, --quiet           Quiet (default: False)

Reference Files:
  -g <file>, --genome-fasta <file>
                        Path to the genome assembly FASTA file. Only ENSEMBL
                        and GENCODE are supported. Its version must be the
                        same as the annotation GTF and proteome FASTA
                        (default: None)
  -a <file>, --annotation-gtf <file>
                        Path to the annotation GTF file. Only ENSEMBL and
                        GENCODE are supported. Its version must be the same as
                        the genome and proteome FASTA. (default: None)
  --reference-source {GENCODE,ENSEMBL}
                        Source of reference genome and annotation. (default:
                        None)
  --codon-table {SGC5,Hexamita Nuclear,Euplotid Nuclear,Ascidian Mitochondrial,Thraustochytrium Mitochondrial,Standard,Gracilibacteria,SGC8,Condylostoma Nuclear,Flatworm Mitochondrial,SGC1,Protozoan Mitochondrial,SGC3,Pachysolen tannophilus Nuclear,Archaeal,Chlorophycean Mitochondrial,Pterobranchia Mitochondrial,Echinoderm Mitochondrial,Blastocrithidia Nuclear,Invertebrate Mitochondrial,SGC0,Mycoplasma,Yeast Mitochondrial,Mesodinium Nuclear,Vertebrate Mitochondrial,Bacterial,Trematode Mitochondrial,SGC4,Balanophoraceae Plastid,Blepharisma Macronuclear,Mold Mitochondrial,Karyorelict Nuclear,Spiroplasma,SGC9,Scenedesmus obliquus Mitochondrial,Plant Plastid,Coelenterate Mitochondrial,Alternative Yeast Nuclear,Dasycladacean Nuclear,Candidate Division SR1,SGC2,Cephalodiscidae Mitochondrial,Peritrich Nuclear,Ciliate Nuclear,Alternative Flatworm Mitochondrial}
                        Codon table. Defaults to "Standard". Supported codon
                        tables: {'SGC5', 'Hexamita Nuclear', 'Euplotid
                        Nuclear', 'Ascidian Mitochondrial', 'Thraustochytrium
                        Mitochondrial', 'Standard', 'Gracilibacteria', 'SGC8',
                        'Condylostoma Nuclear', 'Flatworm Mitochondrial',
                        'SGC1', 'Protozoan Mitochondrial', 'SGC3', 'Pachysolen
                        tannophilus Nuclear', 'Archaeal', 'Chlorophycean
                        Mitochondrial', 'Pterobranchia Mitochondrial',
                        'Echinoderm Mitochondrial', 'Blastocrithidia Nuclear',
                        'Invertebrate Mitochondrial', 'SGC0', 'Mycoplasma',
                        'Yeast Mitochondrial', 'Mesodinium Nuclear',
                        'Vertebrate Mitochondrial', 'Bacterial', 'Trematode
                        Mitochondrial', 'SGC4', 'Balanophoraceae Plastid',
                        'Blepharisma Macronuclear', 'Mold Mitochondrial',
                        'Karyorelict Nuclear', 'Spiroplasma', 'SGC9',
                        'Scenedesmus obliquus Mitochondrial', 'Plant Plastid',
                        'Coelenterate Mitochondrial', 'Alternative Yeast
                        Nuclear', 'Dasycladacean Nuclear', 'Candidate Division
                        SR1', 'SGC2', 'Cephalodiscidae Mitochondrial',
                        'Peritrich Nuclear', 'Ciliate Nuclear', 'Alternative
                        Flatworm Mitochondrial'} (default: Standard)
  --chr-codon-table [CHR_CODON_TABLE [CHR_CODON_TABLE ...]]
                        Chromosome specific codon table. Must be specified in
                        the format of "chrM:SGC1", where "chrM" is the
                        chromosome name and "SGC1" is the codon table to use
                        to translate genes on chrM. Supported codon tables:
                        {'SGC5', 'Hexamita Nuclear', 'Euplotid Nuclear',
                        'Ascidian Mitochondrial', 'Thraustochytrium
                        Mitochondrial', 'Standard', 'Gracilibacteria', 'SGC8',
                        'Condylostoma Nuclear', 'Flatworm Mitochondrial',
                        'SGC1', 'Protozoan Mitochondrial', 'SGC3', 'Pachysolen
                        tannophilus Nuclear', 'Archaeal', 'Chlorophycean
                        Mitochondrial', 'Pterobranchia Mitochondrial',
                        'Echinoderm Mitochondrial', 'Blastocrithidia Nuclear',
                        'Invertebrate Mitochondrial', 'SGC0', 'Mycoplasma',
                        'Yeast Mitochondrial', 'Mesodinium Nuclear',
                        'Vertebrate Mitochondrial', 'Bacterial', 'Trematode
                        Mitochondrial', 'SGC4', 'Balanophoraceae Plastid',
                        'Blepharisma Macronuclear', 'Mold Mitochondrial',
                        'Karyorelict Nuclear', 'Spiroplasma', 'SGC9',
                        'Scenedesmus obliquus Mitochondrial', 'Plant Plastid',
                        'Coelenterate Mitochondrial', 'Alternative Yeast
                        Nuclear', 'Dasycladacean Nuclear', 'Candidate Division
                        SR1', 'SGC2', 'Cephalodiscidae Mitochondrial',
                        'Peritrich Nuclear', 'Ciliate Nuclear', 'Alternative
                        Flatworm Mitochondrial'}. By default, "SGC1" is
                        assigned to mitochondrial chromosomes. (default: [])
  --start-codons [START_CODONS [START_CODONS ...]]
                        Default start codon(s) to use for novel ORF
                        translation. Defaults to ["ATG"]. (default: ['ATG'])
  --chr-start-codons [CHR_START_CODONS [CHR_START_CODONS ...]]
                        Chromosome specific start codon(s). For example,
                        "chrM:ATG,ATA,ATT".By defualt, mitochondrial
                        chromosome name is automatically inferred andstart
                        codon "ATG", "ATA", "ATT", "ATC" and "GTG" are
                        assigned to it. (default: [])
  --index-dir [<file>]  Path to the directory of index files generated by
                        moPepGen generateIndex. If given, --genome-fasta,
                        --proteome-fasta and --anntotation-gtf will be
                        ignored. (default: None)
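
As a minimal sketch of a typical call based on the usage above, the function below converts one VEP TSV file into a GVF file using a prebuilt index directory. The function name, the file paths and the choice of gSNP as the variant source are illustrative assumptions, not defaults of moPepGen.

```shell
# Sketch: convert a single VEP output file to GVF.
# Arguments (all hypothetical paths):
#   $1  VEP output file (.tsv, .txt, optionally gzipped)
#   $2  output GVF file
#   $3  index directory created by moPepGen generateIndex
parse_vep_tsv() {
    vep_out=$1
    gvf_out=$2
    index_dir=$3
    moPepGen parseVEP \
        -i "$vep_out" \
        -o "$gvf_out" \
        --source gSNP \
        --index-dir "$index_dir"
}
```

It could then be called as: parse_vep_tsv sample.vep.tsv sample.gvf path/to/index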

Arguments

-h, --help

show this help message and exit

-i, --input-path ['<files>'] Path

File path to the VEP output TXT file. Can take multiple files. Valid formats: ['.tsv', '.txt', '.tsv.gz', '.txt.gz', '.vcf', '.vcf.gz']

-o, --output-path <file> Path

File path to the output file. Valid formats: ['.gvf']

--output-prefix <value> Path

Output prefix. Only used when inputs are VCF files.

--samples str

Samples to be parsed from the VCF file. If not provided, all samples from the VCF file will be parsed. This option is only used when the inputs are VCF files.
Default: []

--source str

Variant source (e.g. gSNP, sSNV, Fusion)

--skip-failed

When set, the failed records will be skipped.
Default: False

-g, --genome-fasta <file> Path

Path to the genome assembly FASTA file. Only ENSEMBL and GENCODE are supported. Its version must be the same as the annotation GTF and proteome FASTA

-a, --annotation-gtf <file> Path

Path to the annotation GTF file. Only ENSEMBL and GENCODE are supported. Its version must be the same as the genome and proteome FASTA.

--reference-source str

Source of reference genome and annotation.
Choices: ['GENCODE', 'ENSEMBL']

--codon-table str

Codon table. Defaults to "Standard". The supported codon tables are listed under Choices below.
Default: Standard
Choices: {'SGC5', 'Hexamita Nuclear', 'Euplotid Nuclear', 'Ascidian Mitochondrial', 'Thraustochytrium Mitochondrial', 'Standard', 'Gracilibacteria', 'SGC8', 'Condylostoma Nuclear', 'Flatworm Mitochondrial', 'SGC1', 'Protozoan Mitochondrial', 'SGC3', 'Pachysolen tannophilus Nuclear', 'Archaeal', 'Chlorophycean Mitochondrial', 'Pterobranchia Mitochondrial', 'Echinoderm Mitochondrial', 'Blastocrithidia Nuclear', 'Invertebrate Mitochondrial', 'SGC0', 'Mycoplasma', 'Yeast Mitochondrial', 'Mesodinium Nuclear', 'Vertebrate Mitochondrial', 'Bacterial', 'Trematode Mitochondrial', 'SGC4', 'Balanophoraceae Plastid', 'Blepharisma Macronuclear', 'Mold Mitochondrial', 'Karyorelict Nuclear', 'Spiroplasma', 'SGC9', 'Scenedesmus obliquus Mitochondrial', 'Plant Plastid', 'Coelenterate Mitochondrial', 'Alternative Yeast Nuclear', 'Dasycladacean Nuclear', 'Candidate Division SR1', 'SGC2', 'Cephalodiscidae Mitochondrial', 'Peritrich Nuclear', 'Ciliate Nuclear', 'Alternative Flatworm Mitochondrial'}

--chr-codon-table str

Chromosome-specific codon table. Must be specified in the format "chrM:SGC1", where "chrM" is the chromosome name and "SGC1" is the codon table used to translate genes on chrM. The supported codon tables are the same as the choices listed for --codon-table. By default, "SGC1" is assigned to mitochondrial chromosomes.
Default: []

--start-codons str

Default start codon(s) to use for novel ORF translation. Defaults to ["ATG"].
Default: ['ATG']

--chr-start-codons str

Chromosome-specific start codon(s). For example, "chrM:ATG,ATA,ATT". By default, the mitochondrial chromosome name is automatically inferred and the start codons "ATG", "ATA", "ATT", "ATC" and "GTG" are assigned to it.
Default: []

--index-dir <file> Path

Path to the directory of index files generated by moPepGen generateIndex. If given, --genome-fasta, --proteome-fasta and --annotation-gtf will be ignored.

--debug-level <value|number> str

Debug level.
Default: INFO

-q, --quiet

Quiet
Default: False
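
To illustrate how the reference and codon options described above combine, here is a hedged sketch that passes the reference files directly and spells out the chromosome-specific settings. The function name and paths are hypothetical, and it assumes a GENCODE reference in which the mitochondrial chromosome is named chrM.

```shell
# Sketch: parseVEP with explicit reference files and codon settings.
# Arguments (all hypothetical paths):
#   $1  VEP output file    $2  output GVF file
#   $3  genome FASTA       $4  annotation GTF
parse_vep_custom_codons() {
    vep_out=$1
    gvf_out=$2
    genome_fa=$3
    annotation_gtf=$4
    moPepGen parseVEP \
        -i "$vep_out" -o "$gvf_out" --source gSNP \
        -g "$genome_fa" -a "$annotation_gtf" --reference-source GENCODE \
        --codon-table Standard \
        --chr-codon-table chrM:SGC1 \
        --chr-start-codons chrM:ATG,ATA,ATT
}
```

The values shown for --codon-table and --chr-codon-table match the documented defaults, so in practice they only need to be given when a different table is wanted.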

Running VEP for moPepGen

moPepGen relies on Ensembl Variant Effect Predictor (VEP) to annotate single nucleotide and small insertion and deletion (indel) variants, typically presented in VCF files. Note that VEP supports the annotation of multi-nucleotide variants and structural variants, but those are not currently supported by moPepGen and will be discarded by parseVEP.

Please refer to the official VEP Tutorial for instructions on downloading, installing and running VEP. The key is to ensure that the annotation version used in VEP is the same as the one you would like to use with moPepGen.

parseVEP supports both the TSV and VCF output formats of VEP. For TSV, VEP's default output is already tab-delimited; the --tab option only adds extra columns that moPepGen does not use, so including or excluding --tab does not affect the results of moPepGen. For VCF, the CSQ INFO field is parsed and is expected to be in VEP's default format, containing the following fields:

Allele|Consequence|IMPACT|SYMBOL|Gene|Feature_type|Feature|BIOTYPE|EXON|INTRON|HGVSc|HGVSp|cDNA_position|CDS_position|Protein_position|Amino_acids|Codons|Existing_variation|DISTANCE|STRAND|FLAGS|SYMBOL_SOURCE|HGNC_ID
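
One way to confirm that a VEP VCF carries this layout before running parseVEP is to read the field list from its header. The sketch below assumes the header contains a line starting with ##INFO=<ID=CSQ whose description embeds the field list after "Format: "; the function and variable names are illustrative.

```shell
# Field list expected by parseVEP, as given above.
DEFAULT_CSQ_FORMAT='Allele|Consequence|IMPACT|SYMBOL|Gene|Feature_type|Feature|BIOTYPE|EXON|INTRON|HGVSc|HGVSp|cDNA_position|CDS_position|Protein_position|Amino_acids|Codons|Existing_variation|DISTANCE|STRAND|FLAGS|SYMBOL_SOURCE|HGNC_ID'

# Print the CSQ field list found in the header of an uncompressed VCF ($1).
csq_format() {
    sed -n 's/^##INFO=<ID=CSQ,.*Format: \([^"]*\)".*$/\1/p' "$1"
}

# Succeed only if the CSQ field list equals the expected default.
has_default_csq_format() {
    [ "$(csq_format "$1")" = "$DEFAULT_CSQ_FORMAT" ]
}
```

For example: has_default_csq_format sample.vep.vcf || echo "unexpected CSQ format" >&2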

When using VCF format, the --samples argument can be used to specify the columns you would like to parse, and the variant records will be split into sample-specific GVF files. By default, all sample columns will be processed.
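
A minimal sketch of the VCF workflow follows, assuming a VEP-annotated VCF and a prebuilt index directory. The function name, paths, sample names and the gSNP source are placeholders.

```shell
# Sketch: parse a VEP-annotated VCF into sample-specific GVF files.
# Arguments:
#   $1  VEP-annotated VCF (.vcf or .vcf.gz)
#   $2  output prefix for the GVF files
#   $3  index directory created by moPepGen generateIndex
#   $4+ optional sample names; when omitted, all samples are parsed
parse_vep_vcf() {
    vcf_in=$1
    prefix=$2
    index_dir=$3
    shift 3
    if [ "$#" -gt 0 ]; then
        moPepGen parseVEP -i "$vcf_in" --output-prefix "$prefix" \
            --source gSNP --index-dir "$index_dir" --samples "$@"
    else
        moPepGen parseVEP -i "$vcf_in" --output-prefix "$prefix" \
            --source gSNP --index-dir "$index_dir"
    fi
}
```

For example, parse_vep_vcf cohort.vep.vcf.gz out/cohort path/to/index TUMOR NORMAL would restrict parsing to the two named sample columns.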

Using an Ensembl GTF

VEP automatically uses an Ensembl GTF for annotation. Please ensure that the release version and genome build are consistent with the reference files you downloaded. Both are given in the name of the cache reported in the output headers (for example, 107_GRCh38).

One can ensure that the correct version is used by downloading the cache following the VEP instructions and specifying the parameters: --species homo_sapiens --assembly GRCh38 --cache --cache_version 107
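
Put together, a VEP call pinned to a specific cache might look like the sketch below. The function name and file paths are hypothetical, and release 107 on GRCh38 is only the example version used above; substitute the version of your reference files.

```shell
# Sketch: run VEP against a pinned Ensembl cache version.
# Arguments (hypothetical paths):
#   $1  input VCF
#   $2  VEP output file
run_vep_ensembl() {
    vcf_in=$1
    vep_out=$2
    vep -i "$vcf_in" -o "$vep_out" \
        --species homo_sapiens --assembly GRCh38 \
        --cache --cache_version 107
}
```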

Using a Non-Ensembl GTF

If you have decided to use a set of reference files from GENCODE, it is important to supply VEP with the non-Ensembl GTF during annotation, so that chromosome names, transcript IDs and transcript coordinates match your intended reference files. VEP accepts such a GTF as a custom annotation source.

As instructed by VEP, your GTF file must be sorted in chromosomal order and indexed.

grep -v "#" PATH_TO_GTF | sort -k1,1 -k4,4n -k5,5n -t$'\t' | bgzip -c > PATH_TO_GTF_GZ
tabix -p gff PATH_TO_GTF_GZ # gtf is not a tabix format option, gff works

To use the GTF for annotation, run VEP with the following additional parameters, pointing --custom at the sorted, compressed and indexed GTF:

--custom PATH_TO_GTF_GZ,GENCODE,gtf --fasta PATH_TO_GENOME_FA

After running VEP, run the filter_vep command to keep only the records annotated using the GENCODE GTF, since VEP outputs both native (Ensembl) and custom annotations.

filter_vep -i VEP_OUTPUT_TSV -o FILTERED_VEP_OUTPUT_TSV --filter "Source = GENCODE"
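
The annotation and filtering steps can be chained as in the sketch below. The function name and argument order are illustrative assumptions; the GTF is expected to be already sorted, bgzip-compressed and tabix-indexed, and any further VEP options you normally use can be appended after the fifth argument.

```shell
# Sketch: annotate with a GENCODE GTF, then keep only GENCODE records.
# Arguments (hypothetical paths):
#   $1  input VCF               $2  sorted, bgzipped, indexed GTF
#   $3  genome FASTA            $4  raw VEP output
#   $5  filtered VEP output     $6+ extra options passed through to vep
annotate_with_gencode() {
    vcf_in=$1
    gtf_gz=$2
    genome_fa=$3
    vep_out=$4
    filtered_out=$5
    shift 5
    # The filter expression is quoted so that it reaches filter_vep
    # as a single argument.
    vep -i "$vcf_in" -o "$vep_out" \
        --custom "$gtf_gz,GENCODE,gtf" --fasta "$genome_fa" "$@" &&
    filter_vep -i "$vep_out" -o "$filtered_out" --filter "Source = GENCODE"
}
```

Because the two commands are joined with &&, filtering only runs when VEP exits successfully.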

Tips for Running VEP

We recommend the following parameters to reduce VEP run time. Please select values appropriate for your system.

  • --offline --cache
  • --no_stats
  • --fork
  • --buffer_size
  • --distance 0 - only annotate variants directly overlapping gene coordinates
  • --chr 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,X,Y,MT - focus on primary assembly
  • --no_intergenic - ignore intergenic variants

We do not recommend adding any additional annotation flags since the information is not utilized by moPepGen and increases run time.
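
As a rough sketch, the recommended options can be combined into a single call as follows. The function name is hypothetical, and the number of forks and the buffer size are left as arguments because suitable values depend on the available CPUs and memory.

```shell
# Sketch: run VEP with the run time options listed above.
# Arguments:
#   $1  input VCF            $2  VEP output file
#   $3  number of forks      $4  buffer size
run_vep_fast() {
    vcf_in=$1
    vep_out=$2
    n_forks=$3
    buffer_size=$4
    vep -i "$vcf_in" -o "$vep_out" \
        --offline --cache --no_stats \
        --fork "$n_forks" --buffer_size "$buffer_size" \
        --distance 0 --no_intergenic \
        --chr 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,X,Y,MT
}
```

For example: run_vep_fast sample.vcf sample.vep.tsv 4 5000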