summarizeFasta
summarizeFasta
takes a variant peptide FASTA file output by callVariant
and summarizes the count of variant peptides in each source group. This
summary can then guide database splitting for tiered custom database
searching.
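As an illustration, a typical invocation could be assembled as in the minimal sketch below. Only the flag names are taken from the usage text on this page. The file paths, the index directory name, and the source names passed to --order-source are hypothetical placeholders. The snippet only prints the command; it does not run moPepGen.

```python
import shlex

# Hypothetical input files; replace with GVF files produced by moPepGen parsers.
gvf_files = ["parsed/gSNP.gvf", "parsed/gINDEL.gvf"]

cmd = [
    "moPepGen", "summarizeFasta",
    "--gvf", *gvf_files,
    "--variant-peptides", "variant_peptides.fasta",  # hypothetical callVariant output
    "--index-dir", "index",                          # hypothetical generateIndex output
    "--order-source", "gSNP,gINDEL",                 # hypothetical source names
    "--output-path", "summary.txt",
]

# Join the arguments into a single shell-safe command line.
command_line = shlex.join(cmd)
print(command_line)
```

If --output-path is left out, the summary table is printed to stdout instead, as described in the argument list.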
Usage
usage: moPepGen summarizeFasta [-h] [--gvf <files> [<files> ...]]
[--variant-peptides <file>]
[--novel-orf-peptides <file>]
[--alt-translation-peptides ALT_TRANSLATION_PEPTIDES]
[--order-source <value>] [-o <file>]
[--group-source [<value> [<value> ...]]]
[--output-image <file>]
[--ignore-missing-source]
[--plot-normal-scale | --plot-log-scale]
[-c <value>] [--cleavage-exception <value>]
[-a <file>]
[--reference-source {GENCODE,ENSEMBL}]
[--codon-table {Alternative Yeast Nuclear,Protozoan Mitochondrial,Vertebrate Mitochondrial,Blepharisma Macronuclear,Chlorophycean Mitochondrial,Ascidian Mitochondrial,Ciliate Nuclear,Mesodinium Nuclear,Balanophoraceae Plastid,SGC9,Cephalodiscidae Mitochondrial,Trematode Mitochondrial,Pachysolen tannophilus Nuclear,SGC2,Yeast Mitochondrial,SGC5,Euplotid Nuclear,Scenedesmus obliquus Mitochondrial,Peritrich Nuclear,Archaeal,Coelenterate Mitochondrial,Bacterial,Mold Mitochondrial,SGC3,Hexamita Nuclear,Pterobranchia Mitochondrial,Plant Plastid,Condylostoma Nuclear,Blastocrithidia Nuclear,Gracilibacteria,Alternative Flatworm Mitochondrial,Echinoderm Mitochondrial,Invertebrate Mitochondrial,SGC0,Candidate Division SR1,Dasycladacean Nuclear,SGC4,Flatworm Mitochondrial,SGC8,Thraustochytrium Mitochondrial,SGC1,Spiroplasma,Mycoplasma,Standard,Karyorelict Nuclear}]
[--chr-codon-table [CHR_CODON_TABLE [CHR_CODON_TABLE ...]]]
[--start-codons [START_CODONS [START_CODONS ...]]]
[--chr-start-codons [CHR_START_CODONS [CHR_START_CODONS ...]]]
[-p <file>] [--invalid-protein-as-noncoding]
[--index-dir [<file>]]
[--debug-level <value|number>] [-q]
Summarize the variant peptide calling results
optional arguments:
-h, --help show this help message and exit
--gvf <files> [<files> ...]
File path to GVF files. All GVF files must be
generated by moPepGen parsers. Valid formats: ['.gvf']
(default: None)
--variant-peptides <file>
File path to the variant peptide FASTA database file.
Must be generated by moPepGen callVariant. Valid
formats: ['.fasta', '.fa'] (default: None)
--novel-orf-peptides <file>
File path to the novel ORF peptide FASTA database
file. Must be generated by moPepGen callNovelORF.
Valid formats: ['.fasta', '.fa'] (default: None)
--alt-translation-peptides ALT_TRANSLATION_PEPTIDES
File path to the alt translation peptide FASTA file.
Must be generated by moPepGen callAltTranslation. Valid
formats: ['.fasta', '.fa'] (default: None)
--order-source <value>
Order of sources, separated by commas. E.g.,
SNP,SNV,Fusion (default: None)
-o <file>, --output-path <file>
File path to the output file. If not given, the
summary table is printed to stdout. Valid formats:
['.txt', 'tsv'] (default: None)
--group-source [<value> [<value> ...]]
Group sources. The peptides with sources grouped will
be written to the same FASTA file. E.g.,
"PointMutation:gSNP,sSNV INDEL:gINDEL,sINDEL".
(default: None)
--output-image <file>
File path to the output barplot. Valid formats:
['.pdf', '.jpg', '.jpeg', '.png'] (default: None)
--ignore-missing-source
Ignore the sources missing from input GVF. (default:
False)
--plot-normal-scale Draw the summary bar plot in normal scale. (default:
False)
--plot-log-scale Draw the summary bar plot in log scale. (default:
False)
--debug-level <value|number>
Debug level. (default: INFO)
-q, --quiet Quiet (default: False)
Cleavage Parameters:
-c <value>, --cleavage-rule <value>
Enzymatic cleavage rule. (default: trypsin)
--cleavage-exception <value>
Enzymatic cleavage exception. (default: auto)
Reference Files:
-a <file>, --annotation-gtf <file>
Path to the annotation GTF file. Only ENSEMBL and
GENCODE are supported. Its version must be the same as
the genome and proteome FASTA. (default: None)
--reference-source {GENCODE,ENSEMBL}
Source of reference genome and annotation. (default:
None)
--codon-table {Alternative Yeast Nuclear,Protozoan Mitochondrial,Vertebrate Mitochondrial,Blepharisma Macronuclear,Chlorophycean Mitochondrial,Ascidian Mitochondrial,Ciliate Nuclear,Mesodinium Nuclear,Balanophoraceae Plastid,SGC9,Cephalodiscidae Mitochondrial,Trematode Mitochondrial,Pachysolen tannophilus Nuclear,SGC2,Yeast Mitochondrial,SGC5,Euplotid Nuclear,Scenedesmus obliquus Mitochondrial,Peritrich Nuclear,Archaeal,Coelenterate Mitochondrial,Bacterial,Mold Mitochondrial,SGC3,Hexamita Nuclear,Pterobranchia Mitochondrial,Plant Plastid,Condylostoma Nuclear,Blastocrithidia Nuclear,Gracilibacteria,Alternative Flatworm Mitochondrial,Echinoderm Mitochondrial,Invertebrate Mitochondrial,SGC0,Candidate Division SR1,Dasycladacean Nuclear,SGC4,Flatworm Mitochondrial,SGC8,Thraustochytrium Mitochondrial,SGC1,Spiroplasma,Mycoplasma,Standard,Karyorelict Nuclear}
Codon table. Defaults to "Standard". Supported codon
tables: {'Alternative Yeast Nuclear', 'Protozoan
Mitochondrial', 'Vertebrate Mitochondrial',
'Blepharisma Macronuclear', 'Chlorophycean
Mitochondrial', 'Ascidian Mitochondrial', 'Ciliate
Nuclear', 'Mesodinium Nuclear', 'Balanophoraceae
Plastid', 'SGC9', 'Cephalodiscidae Mitochondrial',
'Trematode Mitochondrial', 'Pachysolen tannophilus
Nuclear', 'SGC2', 'Yeast Mitochondrial', 'SGC5',
'Euplotid Nuclear', 'Scenedesmus obliquus
Mitochondrial', 'Peritrich Nuclear', 'Archaeal',
'Coelenterate Mitochondrial', 'Bacterial', 'Mold
Mitochondrial', 'SGC3', 'Hexamita Nuclear',
'Pterobranchia Mitochondrial', 'Plant Plastid',
'Condylostoma Nuclear', 'Blastocrithidia Nuclear',
'Gracilibacteria', 'Alternative Flatworm
Mitochondrial', 'Echinoderm Mitochondrial',
'Invertebrate Mitochondrial', 'SGC0', 'Candidate
Division SR1', 'Dasycladacean Nuclear', 'SGC4',
'Flatworm Mitochondrial', 'SGC8', 'Thraustochytrium
Mitochondrial', 'SGC1', 'Spiroplasma', 'Mycoplasma',
'Standard', 'Karyorelict Nuclear'} (default: Standard)
--chr-codon-table [CHR_CODON_TABLE [CHR_CODON_TABLE ...]]
Chromosome specific codon table. Must be specified in
the format of "chrM:SGC1", where "chrM" is the
chromosome name and "SGC1" is the codon table to use
to translate genes on chrM. Supported codon tables:
{'Alternative Yeast Nuclear', 'Protozoan
Mitochondrial', 'Vertebrate Mitochondrial',
'Blepharisma Macronuclear', 'Chlorophycean
Mitochondrial', 'Ascidian Mitochondrial', 'Ciliate
Nuclear', 'Mesodinium Nuclear', 'Balanophoraceae
Plastid', 'SGC9', 'Cephalodiscidae Mitochondrial',
'Trematode Mitochondrial', 'Pachysolen tannophilus
Nuclear', 'SGC2', 'Yeast Mitochondrial', 'SGC5',
'Euplotid Nuclear', 'Scenedesmus obliquus
Mitochondrial', 'Peritrich Nuclear', 'Archaeal',
'Coelenterate Mitochondrial', 'Bacterial', 'Mold
Mitochondrial', 'SGC3', 'Hexamita Nuclear',
'Pterobranchia Mitochondrial', 'Plant Plastid',
'Condylostoma Nuclear', 'Blastocrithidia Nuclear',
'Gracilibacteria', 'Alternative Flatworm
Mitochondrial', 'Echinoderm Mitochondrial',
'Invertebrate Mitochondrial', 'SGC0', 'Candidate
Division SR1', 'Dasycladacean Nuclear', 'SGC4',
'Flatworm Mitochondrial', 'SGC8', 'Thraustochytrium
Mitochondrial', 'SGC1', 'Spiroplasma', 'Mycoplasma',
'Standard', 'Karyorelict Nuclear'}. By default, "SGC1"
is assigned to mitochondrial chromosomes. (default:
[])
--start-codons [START_CODONS [START_CODONS ...]]
Default start codon(s) to use for novel ORF
translation. Defaults to ["ATG"]. (default: ['ATG'])
--chr-start-codons [CHR_START_CODONS [CHR_START_CODONS ...]]
Chromosome specific start codon(s). For example,
"chrM:ATG,ATA,ATT". By default, the mitochondrial
chromosome name is automatically inferred and start
codons "ATG", "ATA", "ATT", "ATC" and "GTG" are
assigned to it. (default: [])
-p <file>, --proteome-fasta <file>
Path to the translated protein sequence FASTA file.
Only ENSEMBL and GENCODE are supported. Its version
must be the same as genome FASTA and annotation GTF.
(default: None)
--invalid-protein-as-noncoding
Treat any transcript whose protein sequence is
invalid (contains the * symbol) as noncoding.
(default: False)
--index-dir [<file>] Path to the directory of index files generated by
moPepGen generateIndex. If given, --genome-fasta,
--proteome-fasta and --annotation-gtf will be
ignored. (default: None)
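The --group-source value in the help text above uses a "Group:source1,source2" syntax, with one group per space-separated argument. The minimal sketch below shows one way such values could be read into a source-to-group lookup. The function name parse_group_source is hypothetical and this is not moPepGen's own implementation.

```python
def parse_group_source(values):
    """Map each source name to its group, e.g. 'gSNP' -> 'PointMutation'.

    Hypothetical helper for illustration only.
    """
    source_to_group = {}
    for value in values:
        # Each value looks like "Group:source1,source2".
        group, sources = value.split(":", 1)
        for source in sources.split(","):
            source_to_group[source] = group
    return source_to_group


# The example from the help text, split the way a shell would split it.
values = "PointMutation:gSNP,sSNV INDEL:gINDEL,sINDEL".split()
groups = parse_group_source(values)
print(groups)
```

With this lookup, gSNP and sSNV are both counted under PointMutation, and gINDEL and sINDEL under INDEL.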
Arguments
-h, --help
show this help message and exit
--gvf <files> Path
File path to GVF files. All GVF files must be generated by moPepGen parsers. Valid formats: ['.gvf']
--variant-peptides <file> Path
File path to the variant peptide FASTA database file. Must be generated by moPepGen callVariant. Valid formats: ['.fasta', '.fa']
--novel-orf-peptides <file> Path
File path to the novel ORF peptide FASTA database file. Must be generated by moPepGen callNovelORF. Valid formats: ['.fasta', '.fa']
--alt-translation-peptides Path
File path to the alt translation peptide FASTA file. Must be generated by moPepGen callAltTranslation. Valid formats: ['.fasta', '.fa']
--order-source <value> str
Order of sources, separated by commas. E.g., SNP,SNV,Fusion
-o, --output-path <file> Path
File path to the output file. If not given, the summary table is printed to stdout. Valid formats: ['.txt', 'tsv']
--group-source <value> str
Group sources. The peptides with sources grouped will be written to the same FASTA file. E.g., "PointMutation:gSNP,sSNV INDEL:gINDEL,sINDEL".
--output-image <file> Path
File path to the output barplot. Valid formats: ['.pdf', '.jpg', '.jpeg', '.png']
--ignore-missing-source
Ignore the sources missing from input GVF.
Default: False
--plot-normal-scale
Draw the summary bar plot in normal scale.
Default: False
--plot-log-scale
Draw the summary bar plot in log scale.
Default: False
-c, --cleavage-rule <value> str
Enzymatic cleavage rule.
Default: trypsin
Choices: ['arg-c', 'asp-n', 'bnps-skatole', 'caspase 1', 'caspase 2', 'caspase 3', 'caspase 4', 'caspase 5', 'caspase 6', 'caspase 7', 'caspase 8', 'caspase 9', 'caspase 10', 'chymotrypsin high specificity', 'chymotrypsin low specificity', 'clostripain', 'cnbr', 'enterokinase', 'factor xa', 'formic acid', 'glutamyl endopeptidase', 'granzyme b', 'hydroxylamine', 'iodosobenzoic acid', 'lysc', 'lysn', 'ntcb', 'pepsin ph1.3', 'pepsin ph2.0', 'proline endopeptidase', 'proteinase k', 'staphylococcal peptidase i', 'thermolysin', 'thrombin', 'trypsin', 'trypsin_exception']
--cleavage-exception <value> str
Enzymatic cleavage exception.
Default: auto
-a, --annotation-gtf <file> Path
Path to the annotation GTF file. Only ENSEMBL and GENCODE are supported. Its version must be the same as the genome and proteome FASTA.
--reference-source str
Source of reference genome and annotation.
Choices: ['GENCODE', 'ENSEMBL']
--codon-table str
Codon table. Defaults to "Standard". Supported codon tables: {'Alternative Yeast Nuclear', 'Protozoan Mitochondrial', 'Vertebrate Mitochondrial', 'Blepharisma Macronuclear', 'Chlorophycean Mitochondrial', 'Ascidian Mitochondrial', 'Ciliate Nuclear', 'Mesodinium Nuclear', 'Balanophoraceae Plastid', 'SGC9', 'Cephalodiscidae Mitochondrial', 'Trematode Mitochondrial', 'Pachysolen tannophilus Nuclear', 'SGC2', 'Yeast Mitochondrial', 'SGC5', 'Euplotid Nuclear', 'Scenedesmus obliquus Mitochondrial', 'Peritrich Nuclear', 'Archaeal', 'Coelenterate Mitochondrial', 'Bacterial', 'Mold Mitochondrial', 'SGC3', 'Hexamita Nuclear', 'Pterobranchia Mitochondrial', 'Plant Plastid', 'Condylostoma Nuclear', 'Blastocrithidia Nuclear', 'Gracilibacteria', 'Alternative Flatworm Mitochondrial', 'Echinoderm Mitochondrial', 'Invertebrate Mitochondrial', 'SGC0', 'Candidate Division SR1', 'Dasycladacean Nuclear', 'SGC4', 'Flatworm Mitochondrial', 'SGC8', 'Thraustochytrium Mitochondrial', 'SGC1', 'Spiroplasma', 'Mycoplasma', 'Standard', 'Karyorelict Nuclear'}
Default: Standard
Choices: {'Alternative Yeast Nuclear', 'Protozoan Mitochondrial', 'Vertebrate Mitochondrial', 'Blepharisma Macronuclear', 'Chlorophycean Mitochondrial', 'Ascidian Mitochondrial', 'Ciliate Nuclear', 'Mesodinium Nuclear', 'Balanophoraceae Plastid', 'SGC9', 'Cephalodiscidae Mitochondrial', 'Trematode Mitochondrial', 'Pachysolen tannophilus Nuclear', 'SGC2', 'Yeast Mitochondrial', 'SGC5', 'Euplotid Nuclear', 'Scenedesmus obliquus Mitochondrial', 'Peritrich Nuclear', 'Archaeal', 'Coelenterate Mitochondrial', 'Bacterial', 'Mold Mitochondrial', 'SGC3', 'Hexamita Nuclear', 'Pterobranchia Mitochondrial', 'Plant Plastid', 'Condylostoma Nuclear', 'Blastocrithidia Nuclear', 'Gracilibacteria', 'Alternative Flatworm Mitochondrial', 'Echinoderm Mitochondrial', 'Invertebrate Mitochondrial', 'SGC0', 'Candidate Division SR1', 'Dasycladacean Nuclear', 'SGC4', 'Flatworm Mitochondrial', 'SGC8', 'Thraustochytrium Mitochondrial', 'SGC1', 'Spiroplasma', 'Mycoplasma', 'Standard', 'Karyorelict Nuclear'}
--chr-codon-table str
Chromosome specific codon table. Must be specified in the format of "chrM:SGC1", where "chrM" is the chromosome name and "SGC1" is the codon table to use to translate genes on chrM. Supported codon tables: {'Alternative Yeast Nuclear', 'Protozoan Mitochondrial', 'Vertebrate Mitochondrial', 'Blepharisma Macronuclear', 'Chlorophycean Mitochondrial', 'Ascidian Mitochondrial', 'Ciliate Nuclear', 'Mesodinium Nuclear', 'Balanophoraceae Plastid', 'SGC9', 'Cephalodiscidae Mitochondrial', 'Trematode Mitochondrial', 'Pachysolen tannophilus Nuclear', 'SGC2', 'Yeast Mitochondrial', 'SGC5', 'Euplotid Nuclear', 'Scenedesmus obliquus Mitochondrial', 'Peritrich Nuclear', 'Archaeal', 'Coelenterate Mitochondrial', 'Bacterial', 'Mold Mitochondrial', 'SGC3', 'Hexamita Nuclear', 'Pterobranchia Mitochondrial', 'Plant Plastid', 'Condylostoma Nuclear', 'Blastocrithidia Nuclear', 'Gracilibacteria', 'Alternative Flatworm Mitochondrial', 'Echinoderm Mitochondrial', 'Invertebrate Mitochondrial', 'SGC0', 'Candidate Division SR1', 'Dasycladacean Nuclear', 'SGC4', 'Flatworm Mitochondrial', 'SGC8', 'Thraustochytrium Mitochondrial', 'SGC1', 'Spiroplasma', 'Mycoplasma', 'Standard', 'Karyorelict Nuclear'}. By default, "SGC1" is assigned to mitochondrial chromosomes.
Default: []
--start-codons str
Default start codon(s) to use for novel ORF translation. Defaults to ["ATG"].
Default: ['ATG']
--chr-start-codons str
Chromosome specific start codon(s). For example, "chrM:ATG,ATA,ATT". By default, the mitochondrial chromosome name is automatically inferred and start codons "ATG", "ATA", "ATT", "ATC" and "GTG" are assigned to it.
Default: []
-p, --proteome-fasta <file> Path
Path to the translated protein sequence FASTA file. Only ENSEMBL and GENCODE are supported. Its version must be the same as genome FASTA and annotation GTF.
--invalid-protein-as-noncoding
Treat any transcript whose protein sequence is invalid (contains the * symbol) as noncoding.
Default: False
--index-dir <file> Path
Path to the directory of index files generated by moPepGen generateIndex. If given, --genome-fasta, --proteome-fasta and --annotation-gtf will be ignored.
--debug-level <value|number> str
Debug level.
Default: INFO
-q, --quiet
Quiet
Default: False
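The relationship between --codon-table and --chr-codon-table described above can be pictured as a default with per-chromosome overrides. The sketch below is a minimal illustration under that assumption. The helper names parse_chr_codon_table and codon_table_for are hypothetical, and the automatic assignment of "SGC1" to inferred mitochondrial chromosomes is not modelled here.

```python
def parse_chr_codon_table(values):
    """Turn ['chrM:SGC1', ...] into {'chrM': 'SGC1', ...} (hypothetical helper)."""
    return dict(value.split(":", 1) for value in values)


def codon_table_for(chrom, default, chr_tables):
    """Return the chromosome specific table if one is set, else the default."""
    return chr_tables.get(chrom, default)


# Override taken from the "chrM:SGC1" example in the argument description.
chr_tables = parse_chr_codon_table(["chrM:SGC1"])

print(codon_table_for("chrM", "Standard", chr_tables))  # SGC1
print(codon_table_for("chr1", "Standard", chr_tables))  # Standard
```

The same default-plus-override pattern would apply to --start-codons and --chr-start-codons, where each chromosome maps to a comma-separated list of codons instead of a single table name.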