splitFasta

splitFasta takes the variant peptide FASTA file produced by callVariant, optionally together with the novel ORF peptides produced by callNovelORF, and splits the peptide sequences into separate databases. The resulting FASTA files can be used for sequential library searching.

Usage

usage: moPepGen splitFasta [-h] [--gvf <files> [<files> ...]]
                           [--variant-peptides <file>]
                           [--novel-orf-peptides <file>]
                           [--alt-translation-peptides <file>] [-o <value>]
                           [--order-source <value>]
                           [--group-source [<value> ...]]
                           [--max-source-groups <number>]
                           [--additional-split [<value> ...]] [-a <file>]
                           [--reference-source {GENCODE,ENSEMBL}]
                           [--codon-table {Standard,Alternative Flatworm Mitochondrial,Trematode Mitochondrial,SGC8,SGC3,Protozoan Mitochondrial,Ciliate Nuclear,Gracilibacteria,Spiroplasma,Dasycladacean Nuclear,Invertebrate Mitochondrial,Balanophoraceae Plastid,Peritrich Nuclear,Mesodinium Nuclear,SGC5,Candidate Division SR1,Blastocrithidia Nuclear,SGC1,Bacterial,Alternative Yeast Nuclear,Yeast Mitochondrial,Scenedesmus obliquus Mitochondrial,Plant Plastid,Flatworm Mitochondrial,SGC2,Archaeal,Mycoplasma,Euplotid Nuclear,SGC9,Mold Mitochondrial,Thraustochytrium Mitochondrial,Hexamita Nuclear,Coelenterate Mitochondrial,Chlorophycean Mitochondrial,Pachysolen tannophilus Nuclear,Ascidian Mitochondrial,SGC0,Blepharisma Macronuclear,Karyorelict Nuclear,SGC4,Echinoderm Mitochondrial,Condylostoma Nuclear,Vertebrate Mitochondrial,Pterobranchia Mitochondrial,Cephalodiscidae Mitochondrial}]
                           [--chr-codon-table [CHR_CODON_TABLE ...]]
                           [--start-codons [START_CODONS ...]]
                           [--chr-start-codons [CHR_START_CODONS ...]]
                           [-p <file>] [--invalid-protein-as-noncoding]
                           [--index-dir [<file>]]
                           [--debug-level <value|number>] [-q]

Split variant peptide FASTA database generated by moPepGen into separate
files.

options:
  -h, --help            show this help message and exit
  --gvf <files> [<files> ...]
                        File path to GVF files. All GVF files must be
                        generated by moPepGen parsers. Valid formats: ['.gvf']
                        (default: None)
  --variant-peptides <file>
                        File path to the variant peptide FASTA database file.
                        Must be generated by moPepGen callVariant. Valid
                        formats: ['.fasta', '.fa'] (default: None)
  --novel-orf-peptides <file>
                        File path to the novel ORF peptide FASTA database
                        file. Must be generated by moPepGen callNovelORF.
                        Valid formats: ['.fasta', '.fa'] (default: None)
  --alt-translation-peptides <file>
                        File path to the alt translation peptide FASTA file.
                        Must be generated by moPepGen callAltTranslation. Valid
                        formats: ['.fasta', '.fa'] (default: None)
  -o <value>, --output-prefix <value>
                        Output prefix (default: None)
  --order-source <value>
                        Order of sources, separated by commas (e.g.,
                        SNP,SNV,Fusion). Wildcard characters are supported.
                        "SNV-*" will match all peptides with SNV with or
                        without other variant sources. "SNV-+" will match all
                        peptides with SNV with at least another variant
                        source. (default: None)
  --group-source [<value> ...]
                        Group sources. The peptides with sources grouped will
                        be written to the same FASTA file. E.g.,
                        "PointMutation:gSNP,sSNV INDEL:gINDEL,sINDEL".
                        (default: None)
  --max-source-groups <number>
                        Maximal number of different source groups to be
                        separated into individual database FASTA files.
                        Defaults to 1 (default: 1)
  --additional-split [<value> ...]
                        For peptides that were not already split into FASTAs
                        up to max_source_groups, those involving the following
                        source will be split into additional FASTAs with
                        decreasing priority. E.g., 'gSNP-NovelORF', 'gSNP-
                        NovelORF gSNP-gINDEL' (default: None)
  --debug-level <value|number>
                        Debug level. (default: INFO)
  -q, --quiet           Quiet (default: False)

Reference Files:
  -a <file>, --annotation-gtf <file>
                        Path to the annotation GTF file. Only ENSEMBL and
                        GENCODE are supported. Its version must be the same as
                        the genome and proteome FASTA. (default: None)
  --reference-source {GENCODE,ENSEMBL}
                        Source of reference genome and annotation. (default:
                        None)
  --codon-table {Standard,Alternative Flatworm Mitochondrial,Trematode Mitochondrial,SGC8,SGC3,Protozoan Mitochondrial,Ciliate Nuclear,Gracilibacteria,Spiroplasma,Dasycladacean Nuclear,Invertebrate Mitochondrial,Balanophoraceae Plastid,Peritrich Nuclear,Mesodinium Nuclear,SGC5,Candidate Division SR1,Blastocrithidia Nuclear,SGC1,Bacterial,Alternative Yeast Nuclear,Yeast Mitochondrial,Scenedesmus obliquus Mitochondrial,Plant Plastid,Flatworm Mitochondrial,SGC2,Archaeal,Mycoplasma,Euplotid Nuclear,SGC9,Mold Mitochondrial,Thraustochytrium Mitochondrial,Hexamita Nuclear,Coelenterate Mitochondrial,Chlorophycean Mitochondrial,Pachysolen tannophilus Nuclear,Ascidian Mitochondrial,SGC0,Blepharisma Macronuclear,Karyorelict Nuclear,SGC4,Echinoderm Mitochondrial,Condylostoma Nuclear,Vertebrate Mitochondrial,Pterobranchia Mitochondrial,Cephalodiscidae Mitochondrial}
                        Codon table. Defaults to "Standard". Supported codon
                        tables: {'Standard', 'Alternative Flatworm
                        Mitochondrial', 'Trematode Mitochondrial', 'SGC8',
                        'SGC3', 'Protozoan Mitochondrial', 'Ciliate Nuclear',
                        'Gracilibacteria', 'Spiroplasma', 'Dasycladacean
                        Nuclear', 'Invertebrate Mitochondrial',
                        'Balanophoraceae Plastid', 'Peritrich Nuclear',
                        'Mesodinium Nuclear', 'SGC5', 'Candidate Division
                        SR1', 'Blastocrithidia Nuclear', 'SGC1', 'Bacterial',
                        'Alternative Yeast Nuclear', 'Yeast Mitochondrial',
                        'Scenedesmus obliquus Mitochondrial', 'Plant Plastid',
                        'Flatworm Mitochondrial', 'SGC2', 'Archaeal',
                        'Mycoplasma', 'Euplotid Nuclear', 'SGC9', 'Mold
                        Mitochondrial', 'Thraustochytrium Mitochondrial',
                        'Hexamita Nuclear', 'Coelenterate Mitochondrial',
                        'Chlorophycean Mitochondrial', 'Pachysolen tannophilus
                        Nuclear', 'Ascidian Mitochondrial', 'SGC0',
                        'Blepharisma Macronuclear', 'Karyorelict Nuclear',
                        'SGC4', 'Echinoderm Mitochondrial', 'Condylostoma
                        Nuclear', 'Vertebrate Mitochondrial', 'Pterobranchia
                        Mitochondrial', 'Cephalodiscidae Mitochondrial'}
                        (default: Standard)
  --chr-codon-table [CHR_CODON_TABLE ...]
                        Chromosome specific codon table. Must be specified in
                        the format of "chrM:SGC1", where "chrM" is the
                        chromosome name and "SGC1" is the codon table to use
                        to translate genes on chrM. Supported codon tables:
                        {'Standard', 'Alternative Flatworm Mitochondrial',
                        'Trematode Mitochondrial', 'SGC8', 'SGC3', 'Protozoan
                        Mitochondrial', 'Ciliate Nuclear', 'Gracilibacteria',
                        'Spiroplasma', 'Dasycladacean Nuclear', 'Invertebrate
                        Mitochondrial', 'Balanophoraceae Plastid', 'Peritrich
                        Nuclear', 'Mesodinium Nuclear', 'SGC5', 'Candidate
                        Division SR1', 'Blastocrithidia Nuclear', 'SGC1',
                        'Bacterial', 'Alternative Yeast Nuclear', 'Yeast
                        Mitochondrial', 'Scenedesmus obliquus Mitochondrial',
                        'Plant Plastid', 'Flatworm Mitochondrial', 'SGC2',
                        'Archaeal', 'Mycoplasma', 'Euplotid Nuclear', 'SGC9',
                        'Mold Mitochondrial', 'Thraustochytrium
                        Mitochondrial', 'Hexamita Nuclear', 'Coelenterate
                        Mitochondrial', 'Chlorophycean Mitochondrial',
                        'Pachysolen tannophilus Nuclear', 'Ascidian
                        Mitochondrial', 'SGC0', 'Blepharisma Macronuclear',
                        'Karyorelict Nuclear', 'SGC4', 'Echinoderm
                        Mitochondrial', 'Condylostoma Nuclear', 'Vertebrate
                        Mitochondrial', 'Pterobranchia Mitochondrial',
                        'Cephalodiscidae Mitochondrial'}. By default, "SGC1"
                        is assigned to mitochondrial chromosomes. (default:
                        [])
  --start-codons [START_CODONS ...]
                        Default start codon(s) to use for novel ORF
                        translation. Defaults to ["ATG"]. (default: ['ATG'])
  --chr-start-codons [CHR_START_CODONS ...]
                        Chromosome specific start codon(s). For example,
                        "chrM:ATG,ATA,ATT". By default, the mitochondrial
                        chromosome name is automatically inferred and start
                        codons "ATG", "ATA", "ATT", "ATC" and "GTG" are
                        assigned to it. (default: [])
  -p <file>, --proteome-fasta <file>
                        Path to the translated protein sequence FASTA file.
                        Only ENSEMBL and GENCODE are supported. Its version
                        must be the same as genome FASTA and annotation GTF.
                        (default: None)
  --invalid-protein-as-noncoding
                        Treat any transcript whose protein sequence is
                        invalid (contains the * symbol) as noncoding.
                        (default: False)
  --index-dir [<file>]  Path to the directory of index files generated by
                        moPepGen generateIndex. If given, --genome-fasta,
                        --proteome-fasta and --annotation-gtf will be
                        ignored. (default: None)

Examples

Basic usage

A basic invocation is shown below.

moPepGen splitFasta \
  --gvf \
    path/to/gSNP.gvf \
    path/to/gINDEL.gvf \
    path/to/reditools.gvf \
  --variant-peptides path/to/variant.fasta \
  --index-dir path/to/index \
  --max-source-groups 1 \
  --output-prefix path/to/split

The example above splits the variant peptide sequence database into four FASTA files: split_gSNP.fasta, split_gINDEL.fasta, split_RNAEditing.fasta, and split_Remaining.fasta. Each variant peptide is written to the FASTA file of its variant source group, based on the order of the GVF files. Peptides with more than one variant source are written to *_Remaining.fasta because --max-source-groups is set to 1.
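
The assignment rule described above can be sketched in a few lines of Python. This is only an illustration of the documented behaviour, not moPepGen's implementation. The function name `database_label` and the representation of a peptide as a plain list of source names are assumptions made for this example.

```python
def database_label(sources, order, max_source_groups=1):
    """Return the label of the split database for one peptide (illustrative only)."""
    # Deduplicate the peptide's sources and sort them by priority in `order`.
    ranked = sorted(set(sources), key=order.index)
    # Peptides carrying more source groups than allowed fall into "Remaining".
    if len(ranked) > max_source_groups:
        return "Remaining"
    return "-".join(ranked)


# The source order mirrors the GVF order in the example above.
order = ["gSNP", "gINDEL", "RNAEditing"]

print(database_label(["gINDEL"], order))          # gINDEL
print(database_label(["gINDEL", "gSNP"], order))  # Remaining
```

Under this sketch, raising `max_source_groups` to 2 makes the second call return `gSNP-gINDEL` instead of `Remaining`.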

Group sources

Variant sources can also be grouped so that their peptides are written to the same FASTA file. See the example below.

moPepGen splitFasta \
  --gvf \
    path/to/gSNP.gvf \
    path/to/gINDEL.gvf \
    path/to/reditools.gvf \
  --variant-peptides path/to/variant.fasta \
  --index-dir path/to/index \
  --group-source Coding:gSNP,gINDEL \
  --order-source Coding,RNAEditing \
  --max-source-groups 1 \
  --output-prefix path/to/split

This example outputs three split FASTA files, split_Coding.fasta, split_RNAEditing.fasta, and split_Remaining.fasta.
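
One way to picture the effect of --group-source is as a mapping from each source to its group name, applied before source groups are counted. The sketch below assumes that interpretation. The helper names `parse_group_source` and `source_groups` are hypothetical and do not come from moPepGen.

```python
def parse_group_source(values):
    """Map each source to its group name, e.g. 'Coding:gSNP,gINDEL'."""
    mapping = {}
    for value in values:
        group, members = value.split(":", 1)
        for source in members.split(","):
            mapping[source] = group
    return mapping


def source_groups(sources, mapping):
    """Replace each source by its group; ungrouped sources keep their own name."""
    return {mapping.get(source, source) for source in sources}


mapping = parse_group_source(["Coding:gSNP,gINDEL"])
print(mapping)                                                  # {'gSNP': 'Coding', 'gINDEL': 'Coding'}
print(source_groups(["gSNP", "gINDEL"], mapping))               # {'Coding'}
print(sorted(source_groups(["gSNP", "RNAEditing"], mapping)))   # ['Coding', 'RNAEditing']
```

Under this reading, a peptide carrying both gSNP and gINDEL counts as a single source group (Coding), while a peptide carrying gSNP and RNAEditing still counts as two.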

Additional split

--additional-split writes peptides with the specified source combinations to their own FASTA files. Without it, these peptides would exceed --max-source-groups and be placed in *_Remaining.fasta along with all other such records. See the example below.

moPepGen splitFasta \
  --gvf \
    path/to/gSNP.gvf \
    path/to/gINDEL.gvf \
    path/to/reditools.gvf \
  --variant-peptides path/to/variant.fasta \
  --index-dir path/to/index \
  --max-source-groups 1 \
  --additional-split gSNP-gINDEL gSNP-RNAEditing \
  --output-prefix path/to/split

As a result, split_gSNP.fasta and split_gINDEL.fasta will be written. split_gSNP-gINDEL.fasta and split_gSNP-RNAEditing.fasta will also be written, even though their number of variant sources (2) is larger than the value specified through --max-source-groups.
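
The "decreasing priority" of the --additional-split entries can be illustrated with a small lookup. This sketch rests on an assumption: an entry is taken to match a peptide when every source named in the entry is present in the peptide. The exact matching rule in moPepGen may differ, and the function name `additional_split_label` is hypothetical.

```python
def additional_split_label(sources, additional_split):
    """Return the first additional-split entry matched by the peptide, or None."""
    present = set(sources)
    for entry in additional_split:  # earlier entries have higher priority
        required = set(entry.split("-"))
        # Assumption: an entry matches when all of its sources are present.
        if required <= present:
            return entry
    return None


additional = ["gSNP-gINDEL", "gSNP-RNAEditing"]

print(additional_split_label(["gSNP", "RNAEditing"], additional))            # gSNP-RNAEditing
print(additional_split_label(["gSNP", "gINDEL", "RNAEditing"], additional))  # gSNP-gINDEL
print(additional_split_label(["gINDEL", "RNAEditing"], additional))          # None
```

A result of `None` corresponds to a peptide that stays in the remaining database.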

Arguments

-h, --help

show this help message and exit

--gvf <files> Path

File path to GVF files. All GVF files must be generated by moPepGen parsers. Valid formats: ['.gvf']

--variant-peptides <file> Path

File path to the variant peptide FASTA database file. Must be generated by moPepGen callVariant. Valid formats: ['.fasta', '.fa']

--novel-orf-peptides <file> Path

File path to the novel ORF peptide FASTA database file. Must be generated by moPepGen callNovelORF. Valid formats: ['.fasta', '.fa']

--alt-translation-peptides <file> Path

File path to the alt translation peptide FASTA file. Must be generated by moPepGen callAltTranslation. Valid formats: ['.fasta', '.fa']

-o, --output-prefix <value> Path

Output prefix

--order-source <value> str

Order of sources, separated by commas (e.g., SNP,SNV,Fusion). Wildcard characters are supported. "SNV-*" will match all peptides with SNV, with or without other variant sources. "SNV-+" will match all peptides with SNV and at least one other variant source.
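
The two wildcard forms can be illustrated with a small matcher. This is a sketch of the semantics described above, not moPepGen's own code. The function name `matches` is hypothetical, and treating a pattern without a wildcard as an exact match is an assumption.

```python
def matches(pattern, sources):
    """Check whether a peptide's sources satisfy an --order-source pattern."""
    parts = pattern.split("-")
    wildcard = parts[-1] if parts[-1] in ("*", "+") else None
    required = set(parts[:-1] if wildcard else parts)
    present = set(sources)
    # Every named source must be present on the peptide.
    if not required <= present:
        return False
    extra = len(present - required)
    if wildcard == "*":
        return True        # with or without other sources
    if wildcard == "+":
        return extra >= 1  # at least one other source
    return extra == 0      # assumption: no wildcard means exact match


print(matches("SNV-*", ["SNV"]))            # True
print(matches("SNV-+", ["SNV"]))            # False
print(matches("SNV-+", ["SNV", "Fusion"]))  # True
```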

--group-source <value> str

Group sources. The peptides with sources grouped will be written to the same FASTA file. E.g., "PointMutation:gSNP,sSNV INDEL:gINDEL,sINDEL".

--max-source-groups <number> int

Maximal number of different source groups to be separated into individual database FASTA files.
Default: 1

--additional-split <value> str

For peptides that were not already split into FASTAs up to --max-source-groups, those involving the following sources will be split into additional FASTAs with decreasing priority. E.g., 'gSNP-NovelORF', 'gSNP-NovelORF gSNP-gINDEL'

-a, --annotation-gtf <file> Path

Path to the annotation GTF file. Only ENSEMBL and GENCODE are supported. Its version must be the same as the genome and proteome FASTA.

--reference-source str

Source of reference genome and annotation.
Choices: ['GENCODE', 'ENSEMBL']

--codon-table str

Codon table. The supported codon tables are listed under Choices below.
Default: Standard
Choices: {'Standard', 'Alternative Flatworm Mitochondrial', 'Trematode Mitochondrial', 'SGC8', 'SGC3', 'Protozoan Mitochondrial', 'Ciliate Nuclear', 'Gracilibacteria', 'Spiroplasma', 'Dasycladacean Nuclear', 'Invertebrate Mitochondrial', 'Balanophoraceae Plastid', 'Peritrich Nuclear', 'Mesodinium Nuclear', 'SGC5', 'Candidate Division SR1', 'Blastocrithidia Nuclear', 'SGC1', 'Bacterial', 'Alternative Yeast Nuclear', 'Yeast Mitochondrial', 'Scenedesmus obliquus Mitochondrial', 'Plant Plastid', 'Flatworm Mitochondrial', 'SGC2', 'Archaeal', 'Mycoplasma', 'Euplotid Nuclear', 'SGC9', 'Mold Mitochondrial', 'Thraustochytrium Mitochondrial', 'Hexamita Nuclear', 'Coelenterate Mitochondrial', 'Chlorophycean Mitochondrial', 'Pachysolen tannophilus Nuclear', 'Ascidian Mitochondrial', 'SGC0', 'Blepharisma Macronuclear', 'Karyorelict Nuclear', 'SGC4', 'Echinoderm Mitochondrial', 'Condylostoma Nuclear', 'Vertebrate Mitochondrial', 'Pterobranchia Mitochondrial', 'Cephalodiscidae Mitochondrial'}

--chr-codon-table str

Chromosome-specific codon table. Must be specified in the format "chrM:SGC1", where "chrM" is the chromosome name and "SGC1" is the codon table used to translate genes on chrM. The supported codon tables are the same as the choices listed for --codon-table. By default, "SGC1" is assigned to mitochondrial chromosomes.
Default: []

--start-codons str

Default start codon(s) to use for novel ORF translation.
Default: ['ATG']

--chr-start-codons str

Chromosome-specific start codon(s). For example, "chrM:ATG,ATA,ATT". By default, the mitochondrial chromosome name is automatically inferred and the start codons "ATG", "ATA", "ATT", "ATC" and "GTG" are assigned to it.
Default: []
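
The "chromosome:value" syntax shared by --chr-codon-table and --chr-start-codons is simple to parse. The sketch below shows one way such values could be read into a dictionary. The function name `parse_chr_values` is hypothetical and this is not how moPepGen is known to parse them.

```python
def parse_chr_values(values):
    """Parse entries such as 'chrM:ATG,ATA,ATT' into {'chrM': ['ATG', 'ATA', 'ATT']}."""
    parsed = {}
    for value in values:
        # Split on the first colon: chromosome name on the left, values on the right.
        chrom, _, items = value.partition(":")
        if not chrom or not items:
            raise ValueError(
                f"expected '<chromosome>:<value>[,<value>...]', got {value!r}"
            )
        parsed[chrom] = items.split(",")
    return parsed


print(parse_chr_values(["chrM:ATG,ATA,ATT"]))  # {'chrM': ['ATG', 'ATA', 'ATT']}
print(parse_chr_values(["chrM:SGC1"]))         # {'chrM': ['SGC1']}
```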

-p, --proteome-fasta <file> Path

Path to the translated protein sequence FASTA file. Only ENSEMBL and GENCODE are supported. Its version must be the same as genome FASTA and annotation GTF.

--invalid-protein-as-noncoding

Treat any transcript whose protein sequence is invalid (contains the * symbol) as noncoding.
Default: False

--index-dir <file> Path

Path to the directory of index files generated by moPepGen generateIndex. If given, --genome-fasta, --proteome-fasta and --annotation-gtf will be ignored.

--debug-level <value|number> str

Debug level.
Default: INFO

-q, --quiet

Quiet
Default: False