filterFasta

filterFasta takes the FASTA file of variant peptides (output by callVariant) or novel ORF peptides (output by callNovelORF) and filters it based on gene expression data. An expression table must be given as a CSV or TSV file.

Usage

usage: moPepGen filterFasta [-h] -i <file> -o <file> [--denylist <file>]
                            [--exprs-table <file>] [--skip-lines <value>]
                            [--delimiter <value>] [--tx-id-col <number>]
                            [--quant-col <number>] [--quant-cutoff <number>]
                            [--keep-all-coding] [--keep-all-noncoding]
                            [--keep-canonical] [--miscleavages [min]:[max]]
                            [--enzyme {arg-c,asp-n,bnps-skatole,caspase 1,caspase 2,caspase 3,caspase 4,caspase 5,caspase 6,caspase 7,caspase 8,caspase 9,caspase 10,chymotrypsin high specificity,chymotrypsin low specificity,clostripain,cnbr,enterokinase,factor xa,formic acid,glutamyl endopeptidase,granzyme b,hydroxylamine,iodosobenzoic acid,lysc,lysn,ntcb,pepsin ph1.3,pepsin ph2.0,proline endopeptidase,proteinase k,staphylococcal peptidase i,thermolysin,thrombin,trypsin,trypsin_exception}]
                            [-a <file>] [--reference-source {GENCODE,ENSEMBL}]
                            [--codon-table {SGC5,Hexamita Nuclear,Euplotid Nuclear,Ascidian Mitochondrial,Thraustochytrium Mitochondrial,Standard,Gracilibacteria,SGC8,Condylostoma Nuclear,Flatworm Mitochondrial,SGC1,Protozoan Mitochondrial,SGC3,Pachysolen tannophilus Nuclear,Archaeal,Chlorophycean Mitochondrial,Pterobranchia Mitochondrial,Echinoderm Mitochondrial,Blastocrithidia Nuclear,Invertebrate Mitochondrial,SGC0,Mycoplasma,Yeast Mitochondrial,Mesodinium Nuclear,Vertebrate Mitochondrial,Bacterial,Trematode Mitochondrial,SGC4,Balanophoraceae Plastid,Blepharisma Macronuclear,Mold Mitochondrial,Karyorelict Nuclear,Spiroplasma,SGC9,Scenedesmus obliquus Mitochondrial,Plant Plastid,Coelenterate Mitochondrial,Alternative Yeast Nuclear,Dasycladacean Nuclear,Candidate Division SR1,SGC2,Cephalodiscidae Mitochondrial,Peritrich Nuclear,Ciliate Nuclear,Alternative Flatworm Mitochondrial}]
                            [--chr-codon-table [CHR_CODON_TABLE [CHR_CODON_TABLE ...]]]
                            [--start-codons [START_CODONS [START_CODONS ...]]]
                            [--chr-start-codons [CHR_START_CODONS [CHR_START_CODONS ...]]]
                            [--index-dir [<file>]]
                            [--debug-level <value|number>] [-q]

Filter noncanonical peptides according to gene expression or gene biotypes.

optional arguments:
  -h, --help            show this help message and exit
  -i <file>, --input-path <file>
                        Input FASTA file, must be generated by either moPepGen
                        callVariant or callNovelORF. Valid formats: ['.fasta',
                        '.fa'] (default: None)
  -o <file>, --output-path <file>
                        File path to the output file. Valid formats:
                        ['.fasta', '.fa'] (default: None)
  --denylist <file>     Path to the peptide sequence deny list. When using
                        novel ORF peptides as denylist, make sure it is not
                        also passed as an input FASTA file, because all
                        peptide sequences will be removed. Valid formats:
                        [".fasta"] (default: None)
  --exprs-table <file>  Path to the RNAseq quantification results. (default:
                        None)
  --skip-lines <value>  Number of lines to skip when reading the expression
                        table. Defaults to 0 (default: 0)
  --delimiter <value>   Delimiter of the expression table. Defaults to tab.
                        (default: )
  --tx-id-col <number>  The index for transcript ID in the RNAseq
                        quantification results. Index is 1-based. (default:
                        None)
  --quant-col <number>  The column index number for quantification. Index is
                        1-based. (default: None)
  --quant-cutoff <number>
                        Quantification cutoff. (default: None)
  --keep-all-coding     Keep all coding genes, regardless of their expression
                        level. (default: False)
  --keep-all-noncoding  Keep all noncoding genes, regardless of their
                        expression level. (default: False)
  --keep-canonical      Keep peptides called from canonical ORFs. Only useful
                        together with denylist. (default: False)
  --miscleavages [min]:[max]
                        Range of miscleavages per peptide to allow. Min and
                        max are included. For example, "1:2" will keep all
                        peptides with 1 or 2 miscleavages. (default: None)
  --enzyme {arg-c,asp-n,bnps-skatole,caspase 1,caspase 2,caspase 3,caspase 4,caspase 5,caspase 6,caspase 7,caspase 8,caspase 9,caspase 10,chymotrypsin high specificity,chymotrypsin low specificity,clostripain,cnbr,enterokinase,factor xa,formic acid,glutamyl endopeptidase,granzyme b,hydroxylamine,iodosobenzoic acid,lysc,lysn,ntcb,pepsin ph1.3,pepsin ph2.0,proline endopeptidase,proteinase k,staphylococcal peptidase i,thermolysin,thrombin,trypsin,trypsin_exception}
                        Enzyme name. Ignored if --miscleavages is not
                        specified. (default: trypsin)
  --debug-level <value|number>
                        Debug level. (default: INFO)
  -q, --quiet           Quiet (default: False)

Reference Files:
  -a <file>, --annotation-gtf <file>
                        Path to the annotation GTF file. Only ENSEMBL and
                        GENCODE are supported. Its version must be the same as
                        the genome and proteome FASTA. (default: None)
  --reference-source {GENCODE,ENSEMBL}
                        Source of reference genome and annotation. (default:
                        None)
  --codon-table {SGC5,Hexamita Nuclear,Euplotid Nuclear,Ascidian Mitochondrial,Thraustochytrium Mitochondrial,Standard,Gracilibacteria,SGC8,Condylostoma Nuclear,Flatworm Mitochondrial,SGC1,Protozoan Mitochondrial,SGC3,Pachysolen tannophilus Nuclear,Archaeal,Chlorophycean Mitochondrial,Pterobranchia Mitochondrial,Echinoderm Mitochondrial,Blastocrithidia Nuclear,Invertebrate Mitochondrial,SGC0,Mycoplasma,Yeast Mitochondrial,Mesodinium Nuclear,Vertebrate Mitochondrial,Bacterial,Trematode Mitochondrial,SGC4,Balanophoraceae Plastid,Blepharisma Macronuclear,Mold Mitochondrial,Karyorelict Nuclear,Spiroplasma,SGC9,Scenedesmus obliquus Mitochondrial,Plant Plastid,Coelenterate Mitochondrial,Alternative Yeast Nuclear,Dasycladacean Nuclear,Candidate Division SR1,SGC2,Cephalodiscidae Mitochondrial,Peritrich Nuclear,Ciliate Nuclear,Alternative Flatworm Mitochondrial}
                        Codon table. Defaults to "Standard". Supported codon
                        tables: {'SGC5', 'Hexamita Nuclear', 'Euplotid
                        Nuclear', 'Ascidian Mitochondrial', 'Thraustochytrium
                        Mitochondrial', 'Standard', 'Gracilibacteria', 'SGC8',
                        'Condylostoma Nuclear', 'Flatworm Mitochondrial',
                        'SGC1', 'Protozoan Mitochondrial', 'SGC3', 'Pachysolen
                        tannophilus Nuclear', 'Archaeal', 'Chlorophycean
                        Mitochondrial', 'Pterobranchia Mitochondrial',
                        'Echinoderm Mitochondrial', 'Blastocrithidia Nuclear',
                        'Invertebrate Mitochondrial', 'SGC0', 'Mycoplasma',
                        'Yeast Mitochondrial', 'Mesodinium Nuclear',
                        'Vertebrate Mitochondrial', 'Bacterial', 'Trematode
                        Mitochondrial', 'SGC4', 'Balanophoraceae Plastid',
                        'Blepharisma Macronuclear', 'Mold Mitochondrial',
                        'Karyorelict Nuclear', 'Spiroplasma', 'SGC9',
                        'Scenedesmus obliquus Mitochondrial', 'Plant Plastid',
                        'Coelenterate Mitochondrial', 'Alternative Yeast
                        Nuclear', 'Dasycladacean Nuclear', 'Candidate Division
                        SR1', 'SGC2', 'Cephalodiscidae Mitochondrial',
                        'Peritrich Nuclear', 'Ciliate Nuclear', 'Alternative
                        Flatworm Mitochondrial'} (default: Standard)
  --chr-codon-table [CHR_CODON_TABLE [CHR_CODON_TABLE ...]]
                        Chromosome specific codon table. Must be specified in
                        the format of "chrM:SGC1", where "chrM" is the
                        chromosome name and "SGC1" is the codon table to use
                        to translate genes on chrM. Supported codon tables:
                        {'SGC5', 'Hexamita Nuclear', 'Euplotid Nuclear',
                        'Ascidian Mitochondrial', 'Thraustochytrium
                        Mitochondrial', 'Standard', 'Gracilibacteria', 'SGC8',
                        'Condylostoma Nuclear', 'Flatworm Mitochondrial',
                        'SGC1', 'Protozoan Mitochondrial', 'SGC3', 'Pachysolen
                        tannophilus Nuclear', 'Archaeal', 'Chlorophycean
                        Mitochondrial', 'Pterobranchia Mitochondrial',
                        'Echinoderm Mitochondrial', 'Blastocrithidia Nuclear',
                        'Invertebrate Mitochondrial', 'SGC0', 'Mycoplasma',
                        'Yeast Mitochondrial', 'Mesodinium Nuclear',
                        'Vertebrate Mitochondrial', 'Bacterial', 'Trematode
                        Mitochondrial', 'SGC4', 'Balanophoraceae Plastid',
                        'Blepharisma Macronuclear', 'Mold Mitochondrial',
                        'Karyorelict Nuclear', 'Spiroplasma', 'SGC9',
                        'Scenedesmus obliquus Mitochondrial', 'Plant Plastid',
                        'Coelenterate Mitochondrial', 'Alternative Yeast
                        Nuclear', 'Dasycladacean Nuclear', 'Candidate Division
                        SR1', 'SGC2', 'Cephalodiscidae Mitochondrial',
                        'Peritrich Nuclear', 'Ciliate Nuclear', 'Alternative
                        Flatworm Mitochondrial'}. By default, "SGC1" is
                        assigned to mitochondrial chromosomes. (default: [])
  --start-codons [START_CODONS [START_CODONS ...]]
                        Default start codon(s) to use for novel ORF
                        translation. Defaults to ["ATG"]. (default: ['ATG'])
  --chr-start-codons [CHR_START_CODONS [CHR_START_CODONS ...]]
                        Chromosome specific start codon(s). For example,
                        "chrM:ATG,ATA,ATT". By default, mitochondrial
                        chromosome name is automatically inferred and start
                        codon "ATG", "ATA", "ATT", "ATC" and "GTG" are
                        assigned to it. (default: [])
  --index-dir [<file>]  Path to the directory of index files generated by
                        moPepGen generateIndex. If given, --genome-fasta,
                        --proteome-fasta and --annotation-gtf will be
                        ignored. (default: None)

Arguments

-h, --help

show this help message and exit

-i, --input-path <file> Path

Input FASTA file, must be generated by either moPepGen callVariant or callNovelORF. Valid formats: ['.fasta', '.fa']

-o, --output-path <file> Path

File path to the output file. Valid formats: ['.fasta', '.fa']

--denylist <file> Path

Path to the peptide sequence deny list. When using novel ORF peptides as a denylist, make sure the same file is not also passed as an input FASTA file, because all of its peptide sequences will be removed. Valid formats: [".fasta"]

--exprs-table <file> Path

Path to the RNAseq quantification results.

--skip-lines <value> int

Number of lines to skip when reading the expression table.
Default: 0

--delimiter <value> str

Delimiter of the expression table.
Default: tab

--tx-id-col <number> str

The index for transcript ID in the RNAseq quantification results. Index is 1-based.

--quant-col <number> str

The column index number for quantification. Index is 1-based.

--quant-cutoff <number> float

Quantification cutoff.

--keep-all-coding

Keep all coding genes, regardless of their expression level.
Default: False

--keep-all-noncoding

Keep all noncoding genes, regardless of their expression level.
Default: False

--keep-canonical

Keep peptides called from canonical ORFs. Only useful together with denylist.
Default: False

--miscleavages [min]:[max] str

Range of miscleavages per peptide to allow. Min and max are included. For example, "1:2" will keep all peptides with 1 or 2 miscleavages.

--enzyme str

Enzyme name. Ignored if --miscleavages is not specified.
Default: trypsin
Choices: ['arg-c', 'asp-n', 'bnps-skatole', 'caspase 1', 'caspase 2', 'caspase 3', 'caspase 4', 'caspase 5', 'caspase 6', 'caspase 7', 'caspase 8', 'caspase 9', 'caspase 10', 'chymotrypsin high specificity', 'chymotrypsin low specificity', 'clostripain', 'cnbr', 'enterokinase', 'factor xa', 'formic acid', 'glutamyl endopeptidase', 'granzyme b', 'hydroxylamine', 'iodosobenzoic acid', 'lysc', 'lysn', 'ntcb', 'pepsin ph1.3', 'pepsin ph2.0', 'proline endopeptidase', 'proteinase k', 'staphylococcal peptidase i', 'thermolysin', 'thrombin', 'trypsin', 'trypsin_exception']

-a, --annotation-gtf <file> Path

Path to the annotation GTF file. Only ENSEMBL and GENCODE are supported. Its version must be the same as the genome and proteome FASTA.

--reference-source str

Source of reference genome and annotation.
Choices: ['GENCODE', 'ENSEMBL']

--codon-table str

Codon table.
Default: Standard
Choices: {'SGC5', 'Hexamita Nuclear', 'Euplotid Nuclear', 'Ascidian Mitochondrial', 'Thraustochytrium Mitochondrial', 'Standard', 'Gracilibacteria', 'SGC8', 'Condylostoma Nuclear', 'Flatworm Mitochondrial', 'SGC1', 'Protozoan Mitochondrial', 'SGC3', 'Pachysolen tannophilus Nuclear', 'Archaeal', 'Chlorophycean Mitochondrial', 'Pterobranchia Mitochondrial', 'Echinoderm Mitochondrial', 'Blastocrithidia Nuclear', 'Invertebrate Mitochondrial', 'SGC0', 'Mycoplasma', 'Yeast Mitochondrial', 'Mesodinium Nuclear', 'Vertebrate Mitochondrial', 'Bacterial', 'Trematode Mitochondrial', 'SGC4', 'Balanophoraceae Plastid', 'Blepharisma Macronuclear', 'Mold Mitochondrial', 'Karyorelict Nuclear', 'Spiroplasma', 'SGC9', 'Scenedesmus obliquus Mitochondrial', 'Plant Plastid', 'Coelenterate Mitochondrial', 'Alternative Yeast Nuclear', 'Dasycladacean Nuclear', 'Candidate Division SR1', 'SGC2', 'Cephalodiscidae Mitochondrial', 'Peritrich Nuclear', 'Ciliate Nuclear', 'Alternative Flatworm Mitochondrial'}

--chr-codon-table str

Chromosome specific codon table. Must be specified in the format of "chrM:SGC1", where "chrM" is the chromosome name and "SGC1" is the codon table used to translate genes on chrM. The supported codon tables are the choices listed under --codon-table. By default, "SGC1" is assigned to mitochondrial chromosomes.
Default: []

--start-codons str

Default start codon(s) to use for novel ORF translation.
Default: ['ATG']

--chr-start-codons str

Chromosome specific start codon(s). For example, "chrM:ATG,ATA,ATT". By default, the mitochondrial chromosome name is automatically inferred and the start codons "ATG", "ATA", "ATT", "ATC" and "GTG" are assigned to it.
Default: []

--index-dir <file> Path

Path to the directory of index files generated by moPepGen generateIndex. If given, --genome-fasta, --proteome-fasta and --annotation-gtf will be ignored.

--debug-level <value|number> str

Debug level.
Default: INFO

-q, --quiet

Quiet
Default: False

Examples

Filter by Expression

The example below filters the variant peptide sequences by expression level. The expression table is given as a TSV file, with the transcript ID in the first column and the expression level in the fourth column. A peptide is removed if the transcript it is associated with has an expression level lower than 2. Any transcript quantification value can be used, including read count, TPM, and FPKM.

moPepGen filterFasta \
  --input-path path/to/variant_peptides.fasta \
  --output-path path/to/variant_peptides_filter.fasta \
  --exprs-table path/to/expression.tsv \
  --delimiter '\t' \
  --tx-id-col 1 \
  --quant-col 4 \
  --quant-cutoff 2
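Before running the filter, it can help to confirm that the column indices and cutoff select the transcripts you expect. The sketch below imitates the expression criterion with awk on a toy table; the transcript IDs and values are hypothetical, and the rule (keep a transcript when its value is at least the cutoff) is inferred from the description above rather than taken from moPepGen's code. The toy table has a header line, which in a real run would presumably call for --skip-lines 1.

```shell
# Toy expression table (hypothetical values), tab-delimited, with one header line.
exprs=$(mktemp)
printf 'transcript_id\tgene_id\tlength\tTPM\n' > "$exprs"
printf 'ENST0001\tENSG0001\t1500\t5.2\n' >> "$exprs"
printf 'ENST0002\tENSG0001\t900\t0.4\n' >> "$exprs"
printf 'ENST0003\tENSG0002\t2100\t2.0\n' >> "$exprs"

# Mirror the options used above: --tx-id-col 1, --quant-col 4, --quant-cutoff 2.
# skip=1 drops the header line (the counterpart of --skip-lines 1).
kept=$(awk -F '\t' -v tx=1 -v q=4 -v cutoff=2 -v skip=1 \
  'NR > skip && $q + 0 >= cutoff { print $tx }' "$exprs")

# Transcripts that would pass the expression cutoff under this assumed rule.
echo "$kept"
rm -f "$exprs"
```

With these toy values, ENST0001 and ENST0003 pass and ENST0002 falls below the cutoff.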

Filter by Expression and Miscleavages

This example is the same as the one above, with an additional filter on miscleavages. Any peptide with more than 2 miscleavages will be dropped.

moPepGen filterFasta \
  --input-path path/to/variant_peptides.fasta \
  --output-path path/to/variant_peptides_filter.fasta \
  --exprs-table path/to/expression.tsv \
  --delimiter '\t' \
  --tx-id-col 1 \
  --quant-col 4 \
  --quant-cutoff 2 \
  --miscleavages 0:2 \
  --enzyme trypsin
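To make the inclusive min:max range concrete, the following sketch counts miscleavages under a simplified trypsin rule: a cleavage site is any K or R that is not the last residue and is not followed by P. This is only an illustration. The function name and peptides are hypothetical, and moPepGen's own cleavage rules for trypsin and the other enzymes may differ in their exceptions.

```shell
# Count missed cleavage sites under a simplified trypsin rule:
# any K or R that is not the last residue and is not followed by P.
count_miscleavages() {
  printf '%s\n' "$1" | awk '{
    n = 0
    for (i = 1; i < length($0); i++) {
      aa = substr($0, i, 1)
      nxt = substr($0, i + 1, 1)
      if ((aa == "K" || aa == "R") && nxt != "P") n++
    }
    print n
  }'
}

# Hypothetical peptides checked against an inclusive range such as "0:2".
min=0; max=2
for pep in SAMPLER PEPTKIDER AKRKGGR MKPLR; do
  n=$(count_miscleavages "$pep")
  if [ "$n" -ge "$min" ] && [ "$n" -le "$max" ]; then
    verdict=keep
  else
    verdict=drop
  fi
  echo "$pep $n $verdict"
done
```

Under this simplified rule, AKRKGGR has 3 miscleavages and falls outside 0:2, while the other three peptides stay within the range.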

Filter Variant and Novel ORF Peptides

This example takes both the variant peptide FASTA and the novel ORF peptide FASTA and filters the peptides based on the expression level of the transcripts they are associated with.

moPepGen filterFasta \
  --input-path \
    path/to/variant_peptides.fasta \
    path/to/novel_orf_peptides.fasta \
  --output-path path/to/variant_peptides_filter.fasta \
  --exprs-table path/to/expression.tsv \
  --delimiter '\t' \
  --tx-id-col 1 \
  --quant-col 4 \
  --quant-cutoff 2

Filter by Denylist

This example removes any peptide sequences that appear in the given denylist.

Warning: When using novel ORF peptides in a denylist, do not also pass the novel ORF peptide FASTA as an input FASTA, because all novel ORF peptides will be removed from the output.

moPepGen filterFasta \
  --input-path path/to/variant_peptides.fasta \
  --output-path path/to/variant_peptides_filter.fasta \
  --denylist path/to/denylist.fasta

Use the --keep-canonical option to keep peptides that are called from canonical ORFs even if they are in the denylist. Canonical ORFs include coding transcripts with mutation(s) and fusion transcripts where the upstream transcript is coding. Peptides called from circRNAs are considered noncanonical ORFs.

moPepGen filterFasta \
  --input-path path/to/variant_peptides.fasta \
  --output-path path/to/variant_peptides_filter.fasta \
  --denylist path/to/denylist.fasta \
  --keep-canonical
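To estimate how many peptides a denylist would affect, you can list the sequences shared by the two FASTA files with standard tools. The sketch below uses toy files with hypothetical headers and sequences, and it assumes exact sequence matching and one unwrapped sequence per record. How moPepGen matches sequences internally is not shown here.

```shell
# Toy FASTA files with hypothetical headers and sequences.
variant=$(mktemp); denylist=$(mktemp)
printf '>pep1\nSAMPLER\n>pep2\nPEPTIDEK\n>pep3\nAKGGR\n' > "$variant"
printf '>orf1\nPEPTIDEK\n>orf2\nMKLLR\n' > "$denylist"

# Extract unique sequence lines (assumes one sequence per line, no wrapping).
seqs() { grep -v '^>' "$1" | LC_ALL=C sort -u; }

# comm -12 prints only the lines present in both sorted inputs.
seqs "$variant" > "$variant.seqs"
seqs "$denylist" > "$denylist.seqs"
shared=$(LC_ALL=C comm -12 "$variant.seqs" "$denylist.seqs")

echo "$shared"
rm -f "$variant" "$denylist" "$variant.seqs" "$denylist.seqs"
```

Here only PEPTIDEK appears in both files, so it is the one sequence an exact-match denylist would be expected to remove.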

Complex Filtering

Sometimes we want a more complex filtering strategy. In the example below, we want to first remove any variant peptides that overlap with any novel ORF peptides, and then filter again based on the expression level.

Remove variant peptides if they overlap with any novel ORF peptide.

moPepGen filterFasta \
  --input-path variant_peptides.fasta \
  --output-path variant_peptides_filter.fasta \
  --denylist novel_orf_peptide.fasta

Filter again with both filtered variant peptides and novel ORF peptides based on expression level.

moPepGen filterFasta \
  --input-path \
    variant_peptides_filter.fasta \
    novel_orf_peptide.fasta \
  --output-path combined_filter.fasta \
  --exprs-table expression.tsv \
  --delimiter '\t' \
  --tx-id-col 1 \
  --quant-col 4 \
  --quant-cutoff 2
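After a multi-step filter like this, counting FASTA records at each stage is a quick way to see how many peptides each step removed. The helper below is a hypothetical convenience function, not part of moPepGen, and the toy file stands in for a real output.

```shell
# Count FASTA records by counting header lines.
count_records() { grep -c '^>' "$1"; }

# Toy file standing in for a filterFasta output (hypothetical content).
fasta=$(mktemp)
printf '>pep1\nSAMPLER\n>pep2\nPEPTIDEK\n' > "$fasta"

n_records=$(count_records "$fasta")
echo "records: $n_records"
rm -f "$fasta"
```

In practice you would call count_records on variant_peptides.fasta, variant_peptides_filter.fasta, and combined_filter.fasta and compare the numbers.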