Vignette

Welcome to the vignette for moPepGen, a powerful Python package designed for generating custom proteogenomic peptide databases for proteomics library search. This vignette aims to provide you with a comprehensive overview and step-by-step guide on utilizing moPepGen to create custom databases for non-canonical peptide detection in mass spectrometry-based proteomics experiments. A small demo dataset has been provided, and each step should take no more than one minute to run on any computer.

Installation

moPepGen is a command line tool designated to execute in a Unix-like environment. For MacOS and Linux users, moPepGen can be installed using the command below. For Windows users, we recommend installing and running moPepGen from WSL (Windows Subsystem for Linux). Installation should take about one minute to finish.

pip install git+ssh://git@github.com/uclahs-cds/package-moPepGen.git

Install a specific version.

pip install git+ssh://git@github.com/uclahs-cds/package-moPepGen.git@v0.11.3

You can also clone the repo and install it directly from the source code.

git clone git@github.com:uclahs-cds/package-moPepGen.git
cd package-moPepGen
pip install . --use-feature=in-tree-build

Reference Data

moPepGen requires a set of reference files, including the reference genome, its annotation and the translated protein sequences. We currently support reference files downloaded from two sources, ENSEMBL and GENCODE. See here for more details.

A simulated reference set is provided for demonstration. The demo reference set only contains 5 transcripts and is only about 40 KB in size so should run very easily on any computer. The demo reference set can be downloaded with the commands below.

cd ~
mkdir -p moPepGen-demo
cd moPepGen-demo
wget https://github.com/uclahs-cds/package-moPepGen/raw/main/test/files/genome.fasta
wget https://github.com/uclahs-cds/package-moPepGen/raw/main/test/files/annotation.gtf
wget https://github.com/uclahs-cds/package-moPepGen/raw/main/test/files/translate.fasta

Convert reference set into index files for quick access by moPepGen.

moPepGen generateIndex \
    -g genome.fasta \
    -a annotation.gtf \
    -p translate.fasta \
    -o ./index

Parsing

The moPepGen workflow starts with parsing a variety of variant files into GVF, a TSV format derived from VCF, to be used by moPepGen to call variant peptides.

SNP & INDEL

Single nucleotide variants (SNVs/SNPs) and small insertions/deletions (INDELs) called by variant callers (e.g. GATK and Mutect2) must be annotated by the Variant Effect Predictor (VEP) first to get the genes each variant is associated with. Here is an example command with the arguments we commonly use. Please ensure that your ${OUTFILE_FILE} ends in the suffix .tsv or .txt, as they are the formats accepted by parseVEP.

Note that, the VEP cache files must be downloaded prior to running VEP (see here). The VEP developers recommend downloading the VEP cache version which corresponds to the Ensembl VEP installation. We also strongly recommend explicitly providing the exact reference genome and GTF file used in moPepGen to VEP to ensure the consistency of transcript IDs (i.e., using --custom ${ANNOTATION_GTF},${REFERENCE_VERSION},gtf). The exact genome FASTA and annotation GTF files should be used later when calling for variant peptides.

Warning

If you use --chr to limit the chromosomes to annotate, make sure the style matches your VCF/BED file. For example, if the chromosomes have the 'chr' prefix in your VCF file (i.e., chr1, chr2, ...), you must include the prefix as well (i.e., --chr chr1,chr2,chr3).

The example data does not work for VEP.

vep \
    --offline \
    --cache \
    --check_ref \
    --no_stats \
    --fork ${N_THREADS} \
    --buffer_size 10000 \
    --distance 0 \
    --assembly GRCh38 \
    --no_intergenic \
    --chr 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19.20,21,22,X,Y,M \
    -i ${INPUT_GVF_FILE} \
    -o ${OUTPUT_TSV_FILE} \
    --fasta ${GENOME_FASTA} \
    --custom ${ANNOTATION_GTF},${REFERENCE_VERSION},gtf

If you provided a custom GTF file to VEP, please also use a filter command to only keep the annotation records from the custom reference.

filter_vep \
    --force_overwrite \
    -i ${INPUT_FILE} \
    -o ${OUTPUT_FILE} \
    --filter "Source = ${REFERENCE_VERSION}"

For demonstration, we provide the following VEP output file in TSV format, to be used by moPepGen.

wget https://github.com/uclahs-cds/package-moPepGen/raw/main/test/files/vep/vep_snp.txt

The output VEP TSV file must be parsed by parseVEP into GVF format.

moPepGen parseVEP \
    -i vep_snp.txt \
    --index-dir index \
    -o vep_snp.gvf \
    --source SNV

The --source argument is used to specify the type of variants (e.g., SNP, SNV, INDEL) parsed. The source names are used in later post-processing steps, and are required by all moPepGen parsers.

Fusion

moPepGen provides parsers to three fusion callers, STAR-Fusion, Arriba and FusionCatcher.

As an example, we provide a STAR-Fusion TSV output file for demonstration.

wget https://github.com/uclahs-cds/package-moPepGen/raw/main/test/files/fusion/star_fusion.txt

Parse it into GVF format.

moPepGen parseSTARFusion \
    -i star_fusion.txt \
    --index-dir index \
    --source Fusion \
    -o star_fusion.gvf

By default, parseSTARFusion only keeps fusion events with minimal est_J value of 5. This can be altered by the --min-est-j argument.

Alternative Splicing

moPepGen accepts alternative splicing (AS) events estimated by rMATS. RMATS estimates five AS events: SE (skipped exon), A3SS (alternative 3' splicing), A5SS (alternative 5' splicing), MXE (mutually exclusive exons), and RI (retained introns), accepted by parseRMATS as separate input channels. Noted that only the *.JC.txt files are supported.

Example data:

wget https://github.com/uclahs-cds/package-moPepGen/raw/main/test/files/alternative_splicing/rmats_se_case_1.SE.JC.txt
wget https://github.com/uclahs-cds/package-moPepGen/raw/main/test/files/alternative_splicing/rmats_a3ss_case_1.A3SS.JC.txt
wget https://github.com/uclahs-cds/package-moPepGen/raw/main/test/files/alternative_splicing/rmats_a5ss_case_1.A5SS.JC.txt
wget https://github.com/uclahs-cds/package-moPepGen/raw/main/test/files/alternative_splicing/rmats_mxe_case_1.MXE.JC.txt
wget https://github.com/uclahs-cds/package-moPepGen/raw/main/test/files/alternative_splicing/rmats_ri_case_1.RI.JC.txt

Parse AS events output by rMATS into GVF format with parseRMATS. Note that you don't have to provide all 5 AS files to parseRMATS. By default, parseRMATS only accepts AS events with inclusion and exclusion junction counts of at least 1. These cutoffs can be set by --min-ijc and --min-sjc. See here for a complete list of arguments.

RNA Editing Sites

RNA editing sites are specific positions within mRNA molecules where nucleotides undergo post-transcriptional modifications. moPepGen supports RNA editing sites called by REDItools. Note that the REDItools output must be annotated by the AnnotateTable.py from the REDItools package prior to being passed to parseREDItools. Below is the command that can be used to perform the annotation. Note that the ${ANNOTATION_GTF} must be the same file later used in parseREDItools and callVariant. ${PREFIX} is the prefix of column names for gene and transcript IDs.

AnnotateTable.py \
    -i ${INPUT_TXT} \
    -a ${ANNOTATION_GTF} \
    -s 4 \
    -c 1,2,3 \
    -n ${PREFIX} \
    -o ${OUTPUT_TXT}

Example data:

wget https://github.com/uclahs-cds/package-moPepGen/raw/main/test/files/reditools/reditools_annotated.txt

Parse REDItools output to GVF: By default, parseREDItools looks for the transcript ID in column 17. This can be changed with --transcript-id-column, which takes a 1-based column number. See here for a complete list of arguments.

CircRNA

CircRNAs are commonly recognized as noncoding RNAs, but evidence has shown that they are potentially translatable. moPepGen accepts circRNA events called by CIRCexplorer.

Download demo data:

wget https://github.com/uclahs-cds/package-moPepGen/raw/main/test/files/circRNA/CIRCexplorer_circularRNA_known.txt

Parse it into GVF format.

moPepGen parseCIRCexplorer \
    -i CIRCexplorer_circularRNA_known.txt \
    -o circRNA_CIRCexplorer.gvf \
    --index-dir index \
    --source CircRNA

By default parseCIRCexplorer accepts the text file output by CIRCexplorer2, however, CIRCexplorer3 is also supported with the --circexplorer3 flag. We also provide a series of filtering parameters that can be found here.

Non-canonical Peptides Calling

moPepGen provides three commands for non-canonical peptide calling. callVariant for calling peptides from variant GVFs, callNovelORF for performing 3-frame translation on transcripts to find novel ORFs, and callAltTranslation for calling peptides that harbor alternative translation events such as selenocysteine termination and W > F substitutants.

Variant Peptides

Variant peptides, peptides that harbor any variant, can be called using the callVariant command. callVariant must take one or more GVF files produced by moPepGen parsers.

moPepGen callVariant \
    -i vep_snp.gvf star_fusion.gvf alt_splice_rmats.gvf rna_editing_reditools.gvf circRNA_CIRCexplorer.gvf \
    --index-dir index \
    -o variant_peptides.fasta \
    --threads 4

callVariant supports multi-processing and the number of processors to use can be specified with the --threads argument. The --selenocysteine-termination and --w2f-reassignment arguments can be used to call variant peptides that also carry selenocysteine termination and W2F reassignment. By default, callVariant uses trypsin as the enzyme for in silico digestion and allows up to 2 miscleavages, and this can be specified with --cleavage-rule and --miscleavage. See here for a complete list of arguments supported by callVariant.

Novel ORF Peptides

Novel ORF peptides, peptides that could potentially be translated from novel open reading frames, can be called using callNovelORF. By default, callNovelORF only searches for novel ORFs from transcripts canonically annotated as noncoding. Protein coding transcripts can be included using --coding-novel-orf. Note that callNovelORF does not take any variants as input but only works with the reference set of novel ORF transcripts. There is no need to rerun callNovelORF unless you wish to use a different enzyme or reference set.

moPepGen callNovelORF \
    --index-dir index \
    -o novel_orf_peptides.fasta

Similar to callVariant, trypsin is the default enzyme and the default maximum number of miscleavages is 2. These can be specified with --cleavage-rule and --miscleavage. See here for a complete list of arguments supported by callNovelORF.

Alternative Translation Peptides

Alternative translation peptides are those that harbor special events during translation, such as selenocysteine termination and W > F substitutants, where the genetic code is not altered but a different polypeptide is produced (see here for more details). Similar to novel ORF peptides, callAltTranslation only calls peptides using reference transcripts.

moPepGen callAltTranslation \
    --index-dir index \
    -o alt_trans_peptides.fasta

And again, callAltTranslation also uses trypsin as the default enzyme, and up to 2 miscleavages by default. See here for a complete list of arguments.

Post-processing

moPepGen provides a series of post-processing commands that aim to deliver FASTA files ready for database searching. The post-processing tasks include generating summary statistics of a non-canonical database, filtering a database by transcript abundance, splitting a database to separate tiered databases, creating decoy databases, shortening fasta headers for easy handling by search engines, and merging multiple database files for multiplexed proteomic experiments.

Summarizing

The typical first step after generating a variant peptide database is to inspect its contents. summarizeFasta takes a variant peptide fasta file output by callVariant and summarizes the variant peptides by categories based on the value of --source you input when calling the parser commands. summarizeFasta must take all GVFs used as input to callVariant.

moPepGen summarizeFasta \
    --gvf vep_snp.gvf star_fusion.gvf alt_splice_rmats.gvf rna_editing_reditools.gvf circRNA_CIRCexplorer.gvf \
    --variant-peptides variant_peptides.fasta \
    --index-dir index

By default, summarizeFasta outputs to stdout, which can be saved to a text file using the -o argument. In case --output-image is given, a horizontal bar plot will be saved.

summarize-fasta-bar-plot

Because moPepGen calls enzymatically cleaved peptides, there is the possibility that the same peptide is called from multiple transcripts, or the same transcript with different or different combinations of variants. For example, the peptide below is called twice from two separate transcripts with two different variants.

>ENST00000622235.5|SNV-100-G-T|4 ENST00000614167.2|RES-202-G-A|2
HETLFLLTFPR

To resolve the issue of collapsed peptides like the example above, we use the --order-source argument that takes the priority order of sources considered. It takes the source names in a comma-separated format. For example --order-source gSNP,RNAEditing will prioritize gSNP over RNA editing events, thus the example peptide above will be assigned to the gSNP category. Note that the values passed into --order-source must match the values used in --source in the corresponding parser calls. If --order-source is not provided, the source priority order will be inferred from the order of input GVF files.

Besides variant peptides called by callVariant, novel ORF peptides and alternative translation peptides can also be passed to summarizeFasta with --novel-orf-peptides and --alt-translation-peptides.

Filtering

The fitlerFasta module is designed to take an RNA abundance matrix and filter the non-canonical peptides based on the abundances of their corresponding transcripts.

We provide an example RSEM table for demonstration.

wget https://github.com/uclahs-cds/package-moPepGen/raw/main/test/files/rsem/rsem.txt

moPepGen filterFasta \
    -i variant_peptides.fasta \
    -o variant_peptides_filtered.fasta \
    --exprs-table rsem.txt \
    --skip-lines 1 \
    --tx-id-col 1 \
    --quant-col 6 \
    --quant-cutoff 10 \
    --index ./index

--skip-lines indicates the top x number of lines in the abundance matrix that should be skipped. These lines are usually headers or notes output by the quantitation software. --tx-id-col is used to specify the 1-based column number for transcript IDs. Any type of quantitation metrics can be used for filtering (e.g. count, TPM, FPKM) as long as a desired cutoff is given in --quant-cutoff (inclusive). Note that the same reference GTF should be used during quantitation so the transcript IDs can be mapped correctly to variant peptides called by moPepGen.

Splitting

The splitFasta command is provided to split a variant peptide database into several separate databases for tiered database searching, particularly for the purpose of database-specific false discovery rate control.

mkdir -p split
moPepGen splitFasta \
    --variant-peptides variant_peptides.fasta \
    --gvf vep_snp.gvf star_fusion.gvf alt_splice_rmats.gvf rna_editing_reditools.gvf circRNA_CIRCexplorer.gvf \
    --output-prefix split/split \
    --index-dir index

Similar to summarizeFasta, splitFasta also takes a --order-source to specify the priority order of which category a peptide should be assigned to, which will be inferred from the input GVFs if not specified. --group-source is used to group sources as a super category. For example, --group-source Germline:gSNP,gINDEL Somatic:sSNV,sINDEL will group sources of gSNP and gINDEL together as Germline, and sSNV and sINDEL as Somatic.

Note that, when assigning a peptide to a source category, it must carry exclusively the desired type(s) of variants. For example, a peptide of 'ENST00000622235.5|SNV-100-G-T|SNV-110-C-A|2' is assigned to SNV, while a peptide of 'ENST00000622235.5|SNV-100-G-T|RES-110-C-A|2' will be assigned to the category of SNV-RNAEditing but not SNV. --max-source-groups is used to specify the maximum number of source groups that should be split into individual FASTA files. The default value is 1, which means all peptides that contain two or more types of variants will not be written into their own FASTA file, but kept in the '\<prefix>_Remaining.fasta' file.

Similar to summarizeFasta, novel ORF and alternative translation peptides can be passed to splitFasta via --novel-orf-peptides and --alt-translation-peptides.

See here for a complete list of arguments.

Target-Decoy Database

Most search engines expect a target-decoy database as input to estimate the false discovery rate (FDR). We provide a decoyFasta command, that takes a variant peptide database and adds decoy sequences with either the reverse or shuffle algorithm.

moPepGen decoyFasta \
    -i split/split_gSNP_encode.fasta \
    -o split/split_gSNP_decoy.fasta \
    --method reverse

By default, amino acid residues at cleavage sites are unmodified. Trypsin is the default enzyme and can be changed using --enzyme. --non-shuffle-pattern can be used to specify additional amino acid residues to be fixed in the decoy sequence. By default, the N- and C-terminal residues are also fixed, which can be turned off by setting --keep-peptide-nterm or --keep-peptide-cterm to false. See here for a complete list of arguments.

Shortening FASTA headers

Some search engines have limits on how long the database FASTA headers can be. The headers of moPepGen's variant FASTA files could be very long, because the same peptide can be called from different transcripts. We provide a shortening approach by replacing all FASTA headers with UUIDs and storing the mapping information in a .dict file with the same prefix.

moPepGen encodeFasta \
    -i split/split_gSNP.fasta \
    -o split/split_gSNP_encode.fasta

Note that for decoy peptides, the same UUID as in their target sequences will be used, with the decoy prefix/suffix retained. The resulting .dict file can be used for mapping both back to the original FASTA header.