summarizeFasta

summarizeFasta takes a variant peptide FASTA file output by callVariant and summarizes the number of variant peptides in each source group. This summary can then guide how the database is split for tiered custom database searching.
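To make "count of variant peptides per source group" concrete, here is a toy sketch. It does not read moPepGen files or reproduce the real output table: the `peptide_sources` mapping, the peptide strings, and the `-` separator used to join source names are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical input: each variant peptide mapped to the set of variant
# sources that produced it. Real source labels come from the GVF files.
peptide_sources = {
    "PEPTIDEA": {"gSNP"},
    "PEPTIDEB": {"gSNP"},
    "PEPTIDEC": {"sSNV"},
    "PEPTIDED": {"gSNP", "sSNV"},
}


def count_by_source_group(peptide_sources):
    """Count peptides per unique combination of sources."""
    counts = Counter()
    for sources in peptide_sources.values():
        # Sort so that the same combination always yields the same key.
        key = "-".join(sorted(sources))
        counts[key] += 1
    return dict(counts)


summary = count_by_source_group(peptide_sources)
for group, n in sorted(summary.items()):
    print(f"{group}\t{n}")
```

A table like this shows at a glance which source combinations contribute most peptides, which is the kind of information used when deciding how to tier a custom database.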

Usage

usage: moPepGen summarizeFasta [-h] [--gvf <files> [<files> ...]]
                               [--variant-peptides <file>]
                               [--novel-orf-peptides <file>]
                               [--alt-translation-peptides ALT_TRANSLATION_PEPTIDES]
                               [--order-source <value>] [-o <file>]
                               [--group-source [<value> [<value> ...]]]
                               [--output-image <file>]
                               [--ignore-missing-source]
                               [--plot-normal-scale | --plot-log-scale]
                               [-c <value>] [--cleavage-exception <value>]
                               [-a <file>]
                               [--reference-source {GENCODE,ENSEMBL}]
                               [--codon-table {SGC5,Hexamita Nuclear,Euplotid Nuclear,Ascidian Mitochondrial,Thraustochytrium Mitochondrial,Standard,Gracilibacteria,SGC8,Condylostoma Nuclear,Flatworm Mitochondrial,SGC1,Protozoan Mitochondrial,SGC3,Pachysolen tannophilus Nuclear,Archaeal,Chlorophycean Mitochondrial,Pterobranchia Mitochondrial,Echinoderm Mitochondrial,Blastocrithidia Nuclear,Invertebrate Mitochondrial,SGC0,Mycoplasma,Yeast Mitochondrial,Mesodinium Nuclear,Vertebrate Mitochondrial,Bacterial,Trematode Mitochondrial,SGC4,Balanophoraceae Plastid,Blepharisma Macronuclear,Mold Mitochondrial,Karyorelict Nuclear,Spiroplasma,SGC9,Scenedesmus obliquus Mitochondrial,Plant Plastid,Coelenterate Mitochondrial,Alternative Yeast Nuclear,Dasycladacean Nuclear,Candidate Division SR1,SGC2,Cephalodiscidae Mitochondrial,Peritrich Nuclear,Ciliate Nuclear,Alternative Flatworm Mitochondrial}]
                               [--chr-codon-table [CHR_CODON_TABLE [CHR_CODON_TABLE ...]]]
                               [--start-codons [START_CODONS [START_CODONS ...]]]
                               [--chr-start-codons [CHR_START_CODONS [CHR_START_CODONS ...]]]
                               [-p <file>] [--invalid-protein-as-noncoding]
                               [--index-dir [<file>]]
                               [--debug-level <value|number>] [-q]

Summarize the variant peptide calling results

optional arguments:
  -h, --help            show this help message and exit
  --gvf <files> [<files> ...]
                        File path to GVF files. All GVF files must be
                        generated by moPepGen parsers. Valid formats: ['.gvf']
                        (default: None)
  --variant-peptides <file>
                        File path to the variant peptide FASTA database file.
                        Must be generated by moPepGen callVariant. Valid
                        formats: ['.fasta', '.fa'] (default: None)
  --novel-orf-peptides <file>
                        File path to the novel ORF peptide FASTA database
                        file. Must be generated by moPepGen callNovelORF.
                        Valid formats: ['.fasta', '.fa'] (default: None)
  --alt-translation-peptides ALT_TRANSLATION_PEPTIDES
                        File path to the alt translation peptide FASTA file.
                        Must be generated by moPepGen callAltTranslation. Valid
                        formats: ['.fasta', '.fa'] (default: None)
  --order-source <value>
                        Order of sources, separate by comma. E.g.,
                        SNP,SNV,Fusion (default: None)
  -o <file>, --output-path <file>
                        File path to the output file. If not given, the
                        summary table is printed to stdout. Valid formats:
                        ['.txt', '.tsv'] (default: None)
  --group-source [<value> [<value> ...]]
                        Group sources. The peptides with sources grouped will
                        be written to the same FASTA file. E.g.,
                        "PointMutation:gSNP,sSNV INDEL:gINDEL,sINDEL".
                        (default: None)
  --output-image <file>
                        File path to the output barplot. Valid formats:
                        ['.pdf', '.jpg', '.jpeg', '.png'] (default: None)
  --ignore-missing-source
                        Ignore the sources missing from input GVF. (default:
                        False)
  --plot-normal-scale   Draw the summary bar plot in normal scale. (default:
                        False)
  --plot-log-scale      Draw the summary bar plot in log scale. (default:
                        False)
  --debug-level <value|number>
                        Debug level. (default: INFO)
  -q, --quiet           Quiet (default: False)

Cleavage Parameters:
  -c <value>, --cleavage-rule <value>
                        Enzymatic cleavage rule. (default: trypsin)
  --cleavage-exception <value>
                        Enzymatic cleavage exception. (default: auto)

Reference Files:
  -a <file>, --annotation-gtf <file>
                        Path to the annotation GTF file. Only ENSEMBL and
                        GENCODE are supported. Its version must be the same as
                        the genome and proteome FASTA. (default: None)
  --reference-source {GENCODE,ENSEMBL}
                        Source of reference genome and annotation. (default:
                        None)
  --codon-table {SGC5,Hexamita Nuclear,Euplotid Nuclear,Ascidian Mitochondrial,Thraustochytrium Mitochondrial,Standard,Gracilibacteria,SGC8,Condylostoma Nuclear,Flatworm Mitochondrial,SGC1,Protozoan Mitochondrial,SGC3,Pachysolen tannophilus Nuclear,Archaeal,Chlorophycean Mitochondrial,Pterobranchia Mitochondrial,Echinoderm Mitochondrial,Blastocrithidia Nuclear,Invertebrate Mitochondrial,SGC0,Mycoplasma,Yeast Mitochondrial,Mesodinium Nuclear,Vertebrate Mitochondrial,Bacterial,Trematode Mitochondrial,SGC4,Balanophoraceae Plastid,Blepharisma Macronuclear,Mold Mitochondrial,Karyorelict Nuclear,Spiroplasma,SGC9,Scenedesmus obliquus Mitochondrial,Plant Plastid,Coelenterate Mitochondrial,Alternative Yeast Nuclear,Dasycladacean Nuclear,Candidate Division SR1,SGC2,Cephalodiscidae Mitochondrial,Peritrich Nuclear,Ciliate Nuclear,Alternative Flatworm Mitochondrial}
                        Codon table. Defaults to "Standard". Supported codon
                        tables: {'SGC5', 'Hexamita Nuclear', 'Euplotid
                        Nuclear', 'Ascidian Mitochondrial', 'Thraustochytrium
                        Mitochondrial', 'Standard', 'Gracilibacteria', 'SGC8',
                        'Condylostoma Nuclear', 'Flatworm Mitochondrial',
                        'SGC1', 'Protozoan Mitochondrial', 'SGC3', 'Pachysolen
                        tannophilus Nuclear', 'Archaeal', 'Chlorophycean
                        Mitochondrial', 'Pterobranchia Mitochondrial',
                        'Echinoderm Mitochondrial', 'Blastocrithidia Nuclear',
                        'Invertebrate Mitochondrial', 'SGC0', 'Mycoplasma',
                        'Yeast Mitochondrial', 'Mesodinium Nuclear',
                        'Vertebrate Mitochondrial', 'Bacterial', 'Trematode
                        Mitochondrial', 'SGC4', 'Balanophoraceae Plastid',
                        'Blepharisma Macronuclear', 'Mold Mitochondrial',
                        'Karyorelict Nuclear', 'Spiroplasma', 'SGC9',
                        'Scenedesmus obliquus Mitochondrial', 'Plant Plastid',
                        'Coelenterate Mitochondrial', 'Alternative Yeast
                        Nuclear', 'Dasycladacean Nuclear', 'Candidate Division
                        SR1', 'SGC2', 'Cephalodiscidae Mitochondrial',
                        'Peritrich Nuclear', 'Ciliate Nuclear', 'Alternative
                        Flatworm Mitochondrial'} (default: Standard)
  --chr-codon-table [CHR_CODON_TABLE [CHR_CODON_TABLE ...]]
                        Chromosome specific codon table. Must be specified in
                        the format of "chrM:SGC1", where "chrM" is the
                        chromosome name and "SGC1" is the codon table to use
                        to translate genes on chrM. Supported codon tables:
                        {'SGC5', 'Hexamita Nuclear', 'Euplotid Nuclear',
                        'Ascidian Mitochondrial', 'Thraustochytrium
                        Mitochondrial', 'Standard', 'Gracilibacteria', 'SGC8',
                        'Condylostoma Nuclear', 'Flatworm Mitochondrial',
                        'SGC1', 'Protozoan Mitochondrial', 'SGC3', 'Pachysolen
                        tannophilus Nuclear', 'Archaeal', 'Chlorophycean
                        Mitochondrial', 'Pterobranchia Mitochondrial',
                        'Echinoderm Mitochondrial', 'Blastocrithidia Nuclear',
                        'Invertebrate Mitochondrial', 'SGC0', 'Mycoplasma',
                        'Yeast Mitochondrial', 'Mesodinium Nuclear',
                        'Vertebrate Mitochondrial', 'Bacterial', 'Trematode
                        Mitochondrial', 'SGC4', 'Balanophoraceae Plastid',
                        'Blepharisma Macronuclear', 'Mold Mitochondrial',
                        'Karyorelict Nuclear', 'Spiroplasma', 'SGC9',
                        'Scenedesmus obliquus Mitochondrial', 'Plant Plastid',
                        'Coelenterate Mitochondrial', 'Alternative Yeast
                        Nuclear', 'Dasycladacean Nuclear', 'Candidate Division
                        SR1', 'SGC2', 'Cephalodiscidae Mitochondrial',
                        'Peritrich Nuclear', 'Ciliate Nuclear', 'Alternative
                        Flatworm Mitochondrial'}. By default, "SGC1" is
                        assigned to mitochondrial chromosomes. (default: [])
  --start-codons [START_CODONS [START_CODONS ...]]
                        Default start codon(s) to use for novel ORF
                        translation. Defaults to ["ATG"]. (default: ['ATG'])
  --chr-start-codons [CHR_START_CODONS [CHR_START_CODONS ...]]
                        Chromosome specific start codon(s). For example,
                        "chrM:ATG,ATA,ATT". By default, mitochondrial
                        chromosome name is automatically inferred and start
                        codon "ATG", "ATA", "ATT", "ATC" and "GTG" are
                        assigned to it. (default: [])
  -p <file>, --proteome-fasta <file>
                        Path to the translated protein sequence FASTA file.
                        Only ENSEMBL and GENCODE are supported. Its version
                        must be the same as genome FASTA and annotation GTF.
                        (default: None)
  --invalid-protein-as-noncoding
                        Treat any transcript that the protein sequence is
                        invalid (contains the * symbol) as noncoding.
                        (default: False)
  --index-dir [<file>]  Path to the directory of index files generated by
                        moPepGen generateIndex. If given, --genome-fasta,
                        --proteome-fasta and --annotation-gtf will be
                        ignored. (default: None)
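As a rough illustration of how the flags above combine, the sketch below assembles one possible command line in Python without running it. Every file name, the `index` directory, and the source names `gSNP` and `sSNV` are placeholders, and whether this particular set of options is sufficient for a given dataset is an assumption.

```python
import shlex

# All file and directory names below are placeholders, not files shipped
# with moPepGen. Only flags listed in the usage text above are used.
cmd = [
    "moPepGen", "summarizeFasta",
    "--gvf", "sample_gsnp.gvf", "sample_ssnv.gvf",
    "--variant-peptides", "variant_peptides.fasta",
    "--index-dir", "index",
    "--order-source", "gSNP,sSNV",
    "--output-path", "summary.txt",
    "--output-image", "summary.png",
    "--plot-log-scale",
]

# shlex.join quotes arguments where needed, so the printed line can be
# pasted into a POSIX shell.
command_line = shlex.join(cmd)
print(command_line)
```

Keeping the arguments in a list also makes it straightforward to pass them to `subprocess.run(cmd, check=True)` once real paths are filled in.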

Arguments

-h, --help

show this help message and exit

--gvf <files> Path

File path to GVF files. All GVF files must be generated by moPepGen parsers. Valid formats: ['.gvf']

--variant-peptides <file> Path

File path to the variant peptide FASTA database file. Must be generated by moPepGen callVariant. Valid formats: ['.fasta', '.fa']

--novel-orf-peptides <file> Path

File path to the novel ORF peptide FASTA database file. Must be generated by moPepGen callNovelORF. Valid formats: ['.fasta', '.fa']

--alt-translation-peptides Path

File path to the alt translation peptide FASTA file. Must be generated by moPepGen callAltTranslation. Valid formats: ['.fasta', '.fa']

--order-source <value> str

Order of sources, separated by commas. E.g., SNP,SNV,Fusion
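The value is a single comma-separated string. The sketch below only illustrates that format: `parse_order_source` is a hypothetical helper, and how moPepGen applies the order internally is not described here.

```python
def parse_order_source(value):
    """Split a comma-separated --order-source value into a rank lookup."""
    names = [name.strip() for name in value.split(",") if name.strip()]
    return {name: rank for rank, name in enumerate(names)}


order = parse_order_source("SNP,SNV,Fusion")

# Hypothetical use: sort a set of observed sources by the requested order.
observed = ["Fusion", "SNP", "SNV"]
ranked = sorted(observed, key=order.__getitem__)
print(ranked)
```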

-o, --output-path <file> Path

File path to the output file. If not given, the summary table is printed to stdout. Valid formats: ['.txt', '.tsv']

--group-source <value> str

Group sources. The peptides with sources grouped will be written to the same FASTA file. E.g., "PointMutation:gSNP,sSNV INDEL:gINDEL,sINDEL".
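Each item has the form `Group:source1,source2`, and several items are separated by spaces. A minimal sketch of reading that format into a source-to-group lookup follows; `parse_group_source` is a hypothetical helper written for this page, not moPepGen's own parser.

```python
def parse_group_source(items):
    """Map each source name to its group from 'Group:src1,src2' items."""
    source_to_group = {}
    for item in items:
        group, _, members = item.partition(":")
        if not group or not members:
            raise ValueError(f"Expected 'Group:src1,src2', got {item!r}")
        for source in members.split(","):
            source_to_group[source] = group
    return source_to_group


# The example string from the help text, split on whitespace the way a
# shell would split unquoted arguments.
mapping = parse_group_source(
    "PointMutation:gSNP,sSNV INDEL:gINDEL,sINDEL".split()
)
print(mapping)
```

With this mapping, gSNP and sSNV both resolve to PointMutation, so their peptides are treated as one group.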

--output-image <file> Path

File path to the output barplot. Valid formats: ['.pdf', '.jpg', '.jpeg', '.png']

--ignore-missing-source

Ignore the sources missing from input GVF.
Default: False

--plot-normal-scale

Draw the summary bar plot in normal scale.
Default: False

--plot-log-scale

Draw the summary bar plot in log scale.
Default: False
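The usage line writes these two flags as `[--plot-normal-scale | --plot-log-scale]`, which is how Python's argparse renders a mutually exclusive group. Assuming the flags are defined that way (the help layout suggests argparse, but this is not confirmed here), a standalone demo parser behaves as follows; `plot-scale-demo` and `accepts` are names invented for the sketch.

```python
import argparse

# Standalone demo parser; it is not moPepGen's parser.
parser = argparse.ArgumentParser(prog="plot-scale-demo")
scale = parser.add_mutually_exclusive_group()
scale.add_argument("--plot-normal-scale", action="store_true")
scale.add_argument("--plot-log-scale", action="store_true")

args = parser.parse_args(["--plot-log-scale"])
print(args.plot_log_scale, args.plot_normal_scale)


def accepts(argv):
    """Return True if the demo parser accepts argv, False if it exits."""
    try:
        parser.parse_args(argv)
        return True
    except SystemExit:
        # argparse reports the conflict on stderr and exits with status 2.
        return False


print(accepts(["--plot-normal-scale", "--plot-log-scale"]))
```

In this demo, passing both flags is rejected, while passing neither is allowed and leaves both values at False.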

-c, --cleavage-rule <value> str

Enzymatic cleavage rule.
Default: trypsin
Choices: ['arg-c', 'asp-n', 'bnps-skatole', 'caspase 1', 'caspase 2', 'caspase 3', 'caspase 4', 'caspase 5', 'caspase 6', 'caspase 7', 'caspase 8', 'caspase 9', 'caspase 10', 'chymotrypsin high specificity', 'chymotrypsin low specificity', 'clostripain', 'cnbr', 'enterokinase', 'factor xa', 'formic acid', 'glutamyl endopeptidase', 'granzyme b', 'hydroxylamine', 'iodosobenzoic acid', 'lysc', 'lysn', 'ntcb', 'pepsin ph1.3', 'pepsin ph2.0', 'proline endopeptidase', 'proteinase k', 'staphylococcal peptidase i', 'thermolysin', 'thrombin', 'trypsin', 'trypsin_exception']

--cleavage-exception <value> str

Enzymatic cleavage exception.
Default: auto

-a, --annotation-gtf <file> Path

Path to the annotation GTF file. Only ENSEMBL and GENCODE are supported. Its version must be the same as the genome and proteome FASTA.

--reference-source str

Source of reference genome and annotation.
Choices: ['GENCODE', 'ENSEMBL']

--codon-table str

Codon table.
Default: Standard
Choices: {'SGC5', 'Hexamita Nuclear', 'Euplotid Nuclear', 'Ascidian Mitochondrial', 'Thraustochytrium Mitochondrial', 'Standard', 'Gracilibacteria', 'SGC8', 'Condylostoma Nuclear', 'Flatworm Mitochondrial', 'SGC1', 'Protozoan Mitochondrial', 'SGC3', 'Pachysolen tannophilus Nuclear', 'Archaeal', 'Chlorophycean Mitochondrial', 'Pterobranchia Mitochondrial', 'Echinoderm Mitochondrial', 'Blastocrithidia Nuclear', 'Invertebrate Mitochondrial', 'SGC0', 'Mycoplasma', 'Yeast Mitochondrial', 'Mesodinium Nuclear', 'Vertebrate Mitochondrial', 'Bacterial', 'Trematode Mitochondrial', 'SGC4', 'Balanophoraceae Plastid', 'Blepharisma Macronuclear', 'Mold Mitochondrial', 'Karyorelict Nuclear', 'Spiroplasma', 'SGC9', 'Scenedesmus obliquus Mitochondrial', 'Plant Plastid', 'Coelenterate Mitochondrial', 'Alternative Yeast Nuclear', 'Dasycladacean Nuclear', 'Candidate Division SR1', 'SGC2', 'Cephalodiscidae Mitochondrial', 'Peritrich Nuclear', 'Ciliate Nuclear', 'Alternative Flatworm Mitochondrial'}

--chr-codon-table str

Chromosome-specific codon table. Must be specified in the format "chrM:SGC1", where "chrM" is the chromosome name and "SGC1" is the codon table used to translate genes on chrM. The supported codon tables are the same as the choices listed for --codon-table. By default, "SGC1" is assigned to mitochondrial chromosomes.
Default: []
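The `chromosome:table` format can be read into a per-chromosome lookup that falls back to the global codon table. The sketch below is an illustration under assumptions: `parse_chr_codon_table` is a hypothetical helper, and it does not model the automatic assignment of "SGC1" to mitochondrial chromosomes.

```python
def parse_chr_codon_table(items, default="Standard"):
    """Build a chromosome -> codon table lookup from 'chr:table' items."""
    overrides = {}
    for item in items:
        chrom, sep, table = item.partition(":")
        if not sep or not chrom or not table:
            raise ValueError(f"Expected 'chrom:table', got {item!r}")
        overrides[chrom] = table

    def lookup(chrom):
        # Chromosomes without an override use the global codon table.
        return overrides.get(chrom, default)

    return lookup


table_for = parse_chr_codon_table(["chrM:SGC1"])
print(table_for("chrM"), table_for("chr1"))
```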

--start-codons str

Default start codon(s) to use for novel ORF translation. Defaults to ["ATG"].
Default: ['ATG']
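To show what changing the start codon set means in practice, the toy function below lists the positions in a nucleotide string where any allowed start codon begins. It is only a sketch: `find_start_positions` and the example sequence are invented for illustration, and moPepGen's actual ORF search is not described on this page.

```python
def find_start_positions(sequence, start_codons=("ATG",)):
    """Return 0-based positions where any of the start codons begins."""
    sequence = sequence.upper()
    starts = set(start_codons)
    return [
        i for i in range(len(sequence) - 2)
        if sequence[i:i + 3] in starts
    ]


seq = "CCATGAAATAGGATTTAA"  # made-up sequence

default_hits = find_start_positions(seq)
# The codon set the help text lists for mitochondrial chromosomes.
mito_hits = find_start_positions(seq, ("ATG", "ATA", "ATT", "ATC", "GTG"))
print(default_hits, mito_hits)
```

A larger start codon set yields more candidate start positions in the same sequence.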

--chr-start-codons str

Chromosome-specific start codon(s). For example, "chrM:ATG,ATA,ATT". By default, the mitochondrial chromosome name is automatically inferred and the start codons "ATG", "ATA", "ATT", "ATC" and "GTG" are assigned to it.
Default: []

-p, --proteome-fasta <file> Path

Path to the translated protein sequence FASTA file. Only ENSEMBL and GENCODE are supported. Its version must be the same as genome FASTA and annotation GTF.

--invalid-protein-as-noncoding

Treat any transcript whose protein sequence is invalid (contains the * symbol) as noncoding.
Default: False

--index-dir <file> Path

Path to the directory of index files generated by moPepGen generateIndex. If given, --genome-fasta, --proteome-fasta and --annotation-gtf will be ignored.

--debug-level <value|number> str

Debug level.
Default: INFO

-q, --quiet

Quiet
Default: False