filterFasta
filterFasta takes the FASTA file of variant peptides (output by callVariant) or novel ORF peptides (output by callNovelORF) and filters it based on gene expression data. An expression table must be given as a CSV or TSV file.
Usage
usage: moPepGen filterFasta [-h] -i <file> -o <file> [--denylist <file>]
[--exprs-table <file>] [--skip-lines <value>]
[--delimiter <value>] [--tx-id-col <number>]
[--quant-col <number>] [--quant-cutoff <number>]
[--keep-all-coding] [--keep-all-noncoding]
[--keep-canonical] [--miscleavages [min]:[max]]
[--enzyme {arg-c,asp-n,bnps-skatole,caspase 1,caspase 2,caspase 3,caspase 4,caspase 5,caspase 6,caspase 7,caspase 8,caspase 9,caspase 10,chymotrypsin high specificity,chymotrypsin low specificity,clostripain,cnbr,enterokinase,factor xa,formic acid,glutamyl endopeptidase,granzyme b,hydroxylamine,iodosobenzoic acid,lysc,lysn,ntcb,pepsin ph1.3,pepsin ph2.0,proline endopeptidase,proteinase k,staphylococcal peptidase i,thermolysin,thrombin,trypsin,trypsin_exception}]
[-a <file>] [--reference-source {GENCODE,ENSEMBL}]
[--index-dir [<file>]]
[--debug-level <value|number>] [-q]
Filter noncanonical peptides according to gene expression or gene biotypes.
optional arguments:
-h, --help show this help message and exit
-i <file>, --input-path <file>
Input FASTA file, must be generated by either moPepGen
callVariant or callNovelORF. Valid formats: ['.fasta',
'.fa'] (default: None)
-o <file>, --output-path <file>
File path to the output file. Valid formats:
['.fasta', '.fa'] (default: None)
--denylist <file> Path to the peptide sequence deny list. When using
novel ORF peptides as denylist, make sure it is no
also passed as a input FASTA file, because all peptide
sequences will be removed. Valid formats: [".fasta"]
(default: None)
--exprs-table <file> Path to the RNAseq quantification results. (default:
None)
--skip-lines <value> Number of lines to skip when reading the expression
table.Defaults to 0 (default: 0)
--delimiter <value> Delimiter of the expression table. Defaults to tab.
(default: )
--tx-id-col <number> The index for transcript ID in the RNAseq
quantification results. Index is 1-based. (default:
None)
--quant-col <number> The column index number for quantification. Index is
1-based. (default: None)
--quant-cutoff <number>
Quantification cutoff. (default: None)
--keep-all-coding Keep all coding genes, regardless of their expression
level. (default: False)
--keep-all-noncoding Keep all noncoding genes, regardless of their
expression level. (default: False)
--keep-canonical Keep peptides called from canonical ORFs. Only useful
together with denylist. (default: False)
--miscleavages [min]:[max]
Range of miscleavages per peptide to allow. Min and
max are included. For example, "1:2" will keep all
peptides with 1 or 2 miscleavages. (default: None)
--enzyme {arg-c,asp-n,bnps-skatole,caspase 1,caspase 2,caspase 3,caspase 4,caspase 5,caspase 6,caspase 7,caspase 8,caspase 9,caspase 10,chymotrypsin high specificity,chymotrypsin low specificity,clostripain,cnbr,enterokinase,factor xa,formic acid,glutamyl endopeptidase,granzyme b,hydroxylamine,iodosobenzoic acid,lysc,lysn,ntcb,pepsin ph1.3,pepsin ph2.0,proline endopeptidase,proteinase k,staphylococcal peptidase i,thermolysin,thrombin,trypsin,trypsin_exception}
Enzyme name. Ignored if --miscleavages is not
specified. (default: trypsin)
--debug-level <value|number>
Debug level. (default: INFO)
-q, --quiet Quiet (default: False)
Reference Files:
-a <file>, --annotation-gtf <file>
Path to the annotation GTF file. Only ENSEMBL and
GENCODE are supported. Its version must be the same as
the genome and proteome FASTA. (default: None)
--reference-source {GENCODE,ENSEMBL}
Source of reference genome and annotation. (default:
None)
--index-dir [<file>] Path to the directory of index files generated by
moPepGen generateIndex. If given, --genome-fasta,
--proteome-fasta and --anntotation-gtf will be
ignored. (default: None)
Arguments
-h, --help
show this help message and exit
-i, --input-path <file> Path
Input FASTA file, must be generated by either moPepGen callVariant or callNovelORF. Valid formats: ['.fasta', '.fa']
-o, --output-path <file> Path
File path to the output file. Valid formats: ['.fasta', '.fa']
--denylist <file> Path
Path to the peptide sequence deny list. When using novel ORF peptides as the denylist, make sure the same file is not also passed as an input FASTA file, because all of its peptide sequences will be removed. Valid formats: [".fasta"]
--exprs-table <file> Path
Path to the RNAseq quantification results.
--skip-lines <value> int
Number of lines to skip when reading the expression table.
Default: 0
--delimiter <value> str
Delimiter of the expression table.
Default: tab
--tx-id-col <number> str
The index for transcript ID in the RNAseq quantification results. Index is 1-based.
--quant-col <number> str
The column index number for quantification. Index is 1-based.
--quant-cutoff <number> float
Quantification cutoff.
--keep-all-coding
Keep all coding genes, regardless of their expression level.
Default: False
--keep-all-noncoding
Keep all noncoding genes, regardless of their expression level.
Default: False
--keep-canonical
Keep peptides called from canonical ORFs. Only useful together with denylist.
Default: False
--miscleavages [min]:[max] str
Range of miscleavages per peptide to allow. Min and max are included. For example, "1:2" will keep all peptides with 1 or 2 miscleavages.
--enzyme str
Enzyme name. Ignored if --miscleavages is not specified.
Default: trypsin
Choices: ['arg-c', 'asp-n', 'bnps-skatole', 'caspase 1', 'caspase 2', 'caspase 3', 'caspase 4', 'caspase 5', 'caspase 6', 'caspase 7', 'caspase 8', 'caspase 9', 'caspase 10', 'chymotrypsin high specificity', 'chymotrypsin low specificity', 'clostripain', 'cnbr', 'enterokinase', 'factor xa', 'formic acid', 'glutamyl endopeptidase', 'granzyme b', 'hydroxylamine', 'iodosobenzoic acid', 'lysc', 'lysn', 'ntcb', 'pepsin ph1.3', 'pepsin ph2.0', 'proline endopeptidase', 'proteinase k', 'staphylococcal peptidase i', 'thermolysin', 'thrombin', 'trypsin', 'trypsin_exception']
-a, --annotation-gtf <file> Path
Path to the annotation GTF file. Only ENSEMBL and GENCODE are supported. Its version must be the same as the genome and proteome FASTA.
--reference-source str
Source of reference genome and annotation.
Choices: ['GENCODE', 'ENSEMBL']
--index-dir <file> Path
Path to the directory of index files generated by moPepGen generateIndex. If given, --genome-fasta, --proteome-fasta and --annotation-gtf will be ignored.
--debug-level <value|number> str
Debug level.
Default: INFO
-q, --quiet
Quiet
Default: False
Examples
Filter by Expression
The example below filters the variant peptide sequences by expression level. The expression table is given as a TSV file, with the first column being the transcript ID and the fourth column being the expression level. A peptide is removed if the transcript it is associated with has an expression level below 2. Any transcript quantification value can be used, including read count, TPM, and FPKM.
moPepGen filterFasta \
--input-path path/to/variant_peptides.fasta \
--output-path path/to/variant_peptides_filter.fasta \
--exprs-table path/to/expression.tsv \
--delimiter '\t' \
--tx-id-col 1 \
--quant-col 4 \
--quant-cutoff 2
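The cutoff logic described above can be sketched in a few lines of Python. This is only an illustration of the 1-based column indexing and the "remove if below the cutoff" rule, not moPepGen's actual implementation; the function name `transcripts_passing_cutoff`, the transcript IDs, and the table contents are made up for the example. The sample table has a header row, so the sketch skips one line, which corresponds to `--skip-lines 1`.

```python
import csv
import io


def transcripts_passing_cutoff(handle, tx_id_col, quant_col, cutoff,
                               delimiter="\t", skip_lines=0):
    """Return the set of transcript IDs whose quantification is >= cutoff.

    Column indices are 1-based, mirroring --tx-id-col and --quant-col.
    Hypothetical helper for illustration only.
    """
    # Skip leading lines such as a header (compare --skip-lines).
    for _ in range(skip_lines):
        next(handle, None)
    passing = set()
    for row in csv.reader(handle, delimiter=delimiter):
        if not row:
            continue
        tx_id = row[tx_id_col - 1]         # convert 1-based index to 0-based
        quant = float(row[quant_col - 1])
        if quant >= cutoff:                # values below the cutoff are dropped
            passing.add(tx_id)
    return passing


# Made-up expression table: transcript ID in column 1, quantification in column 4.
table = io.StringIO(
    "tx_id\tgene\tlength\ttpm\n"
    "ENST0001\tG1\t1200\t5.3\n"
    "ENST0002\tG2\t800\t1.9\n"
    "ENST0003\tG3\t950\t2.0\n"
)
kept = transcripts_passing_cutoff(table, tx_id_col=1, quant_col=4,
                                  cutoff=2, skip_lines=1)
print(sorted(kept))
```

In this toy table, ENST0002 falls below the cutoff of 2, so peptides associated only with that transcript would be the ones removed.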
Filter by Expression and Miscleavages
This example is the same as above with the addition of a filter by miscleavages. Any peptides with more than 2 miscleavages will be dropped.
moPepGen filterFasta \
--input-path path/to/variant_peptides.fasta \
--output-path path/to/variant_peptides_filter.fasta \
--exprs-table path/to/expression.tsv \
--delimiter '\t' \
--tx-id-col 1 \
--quant-col 4 \
--quant-cutoff 2 \
--miscleavages 0:2 \
--enzyme trypsin
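To make the inclusive `min:max` range concrete, the sketch below counts miscleavages under a simplified trypsin rule and applies the range `0:2`. It assumes cleavage after K or R unless the next residue is P; the rules moPepGen applies for `trypsin` and `trypsin_exception` may differ. Treating an empty bound as open-ended is also an assumption, and the helper names and peptide sequences are hypothetical.

```python
def count_miscleavages(peptide):
    """Count internal sites the enzyme could have cut but did not.

    Simplified trypsin rule (assumption): cut after K or R, not before P.
    The C-terminal residue is not counted because nothing follows it.
    """
    count = 0
    for i in range(len(peptide) - 1):
        if peptide[i] in "KR" and peptide[i + 1] != "P":
            count += 1
    return count


def parse_range(text):
    """Parse a 'min:max' string into (min, max); both ends are inclusive.

    An empty bound is treated as open-ended (assumption for this sketch).
    """
    low, _, high = text.partition(":")
    return (int(low) if low else 0, int(high) if high else None)


def within_range(peptide, text):
    """Return True if the peptide's miscleavage count falls inside the range."""
    low, high = parse_range(text)
    n = count_miscleavages(peptide)
    return n >= low and (high is None or n <= high)


# Made-up peptides with 0, 2, and 3 miscleavages respectively.
peptides = ["PEPTIDEK", "AKRGK", "AKCRDKEK"]
kept = [p for p in peptides if within_range(p, "0:2")]
print(kept)
```

With the range `0:2`, the peptide with three miscleavages is dropped while those with zero or two are kept.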
Filter Variant and Novel ORF Peptides
This example takes both the variant peptide FASTA and the novel ORF peptide FASTA and filters the peptides based on the expression level of the transcripts they are associated with.
moPepGen filterFasta \
--input-path \
path/to/variant_peptides.fasta \
path/to/novel_orf_peptides.fasta \
--output-path path/to/variant_peptides_filter.fasta \
--exprs-table path/to/expression.tsv \
--delimiter '\t' \
--tx-id-col 1 \
--quant-col 4 \
--quant-cutoff 2
Filter by Denylist
This example removes any peptide sequences that appear in the given denylist.
!!! warning: When using novel ORF peptides in a denylist, do not also pass the novel ORF peptide FASTA as an input FASTA, because all novel ORF peptides will be removed from the output.
moPepGen filterFasta \
--input-path path/to/variant_peptides.fasta \
--output-path path/to/variant_peptides_filter.fasta \
--denylist path/to/denylist.fasta
Use the --keep-canonical option to keep peptides that are called from canonical ORFs even if they are in the denylist. Canonical ORFs include coding transcripts with mutation(s) and fusion transcripts where the upstream transcript is coding. Peptides called from circRNAs are considered noncanonical ORFs.
moPepGen filterFasta \
--input-path path/to/variant_peptides.fasta \
--output-path path/to/variant_peptides_filter.fasta \
--denylist path/to/denylist.fasta \
--keep-canonical
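The denylist behaviour, with and without --keep-canonical, can be illustrated with a small Python sketch. It is not moPepGen's implementation: the tool decides whether a peptide comes from a canonical ORF using the reference annotation, whereas here a hypothetical `is_canonical` callback inspects a made-up header suffix. The function names, headers, and sequences are all invented for the example.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def apply_denylist(records, denylist, keep_canonical=False,
                   is_canonical=lambda header: False):
    """Drop records whose sequence is in the denylist.

    With keep_canonical=True, records for which is_canonical(header) is
    true are kept even when denylisted. is_canonical is a hypothetical
    stand-in for the annotation-based check.
    """
    kept = []
    for header, seq in records:
        if seq in denylist and not (keep_canonical and is_canonical(header)):
            continue
        kept.append((header, seq))
    return kept


# Made-up input: the suffix after '|' is a toy label, not moPepGen's header format.
variant_fasta = [
    ">pep1|coding", "MKTAYIAK",
    ">pep2|circRNA", "GLSDGEWQ",
    ">pep3|coding", "VLSPADKT",
]
denylist = {"MKTAYIAK", "GLSDGEWQ"}
records = list(read_fasta(variant_fasta))

plain = apply_denylist(records, denylist)
canonical_kept = apply_denylist(
    records, denylist, keep_canonical=True,
    is_canonical=lambda header: header.endswith("|coding"),
)
print([h for h, _ in plain])
print([h for h, _ in canonical_kept])
```

The sketch also shows why a file should not serve as both input and denylist: every sequence in it would match the denylist and be removed.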
Complex Filtering
Sometimes we want a more complex filtering strategy. In the example below, we want to first remove any variant peptides that overlap with any novel ORF peptides, and then filter again based on the expression level.
Remove variant peptides if they overlap with any novel ORF peptide.
moPepGen filterFasta \
--input-path variant_peptides.fasta \
--output-path variant_peptides_filter.fasta \
--denylist novel_orf_peptide.fasta
Filter again with both filtered variant peptides and novel ORF peptides based on expression level.
moPepGen filterFasta \
--input-path \
variant_peptides_filter.fasta \
novel_orf_peptide.fasta \
--output-path combined_filter.fasta \
--exprs-table expression.tsv \
--delimiter '\t' \
--tx-id-col 1 \
--quant-col 4 \
--quant-cutoff 2