splitFasta
splitFasta takes the FASTA file with variant peptide sequences called by callVariant, with or without novel ORF peptides called by callNovelORF, and splits the peptide sequences into separate databases.
The split database FASTA files can be used for sequential library searching.
Usage
usage: moPepGen splitFasta [-h] [--gvf <files> [<files> ...]]
[--variant-peptides <file>]
[--novel-orf-peptides <file>]
[--alt-translation-peptides <file>] [-o <value>]
[--order-source <value>]
[--group-source [<value> [<value> ...]]]
[--max-source-groups <number>]
[--additional-split [<value> [<value> ...]]]
[-a <file>] [--reference-source {GENCODE,ENSEMBL}]
[--codon-table {SGC5,Hexamita Nuclear,Euplotid Nuclear,Ascidian Mitochondrial,Thraustochytrium Mitochondrial,Standard,Gracilibacteria,SGC8,Condylostoma Nuclear,Flatworm Mitochondrial,SGC1,Protozoan Mitochondrial,SGC3,Pachysolen tannophilus Nuclear,Archaeal,Chlorophycean Mitochondrial,Pterobranchia Mitochondrial,Echinoderm Mitochondrial,Blastocrithidia Nuclear,Invertebrate Mitochondrial,SGC0,Mycoplasma,Yeast Mitochondrial,Mesodinium Nuclear,Vertebrate Mitochondrial,Bacterial,Trematode Mitochondrial,SGC4,Balanophoraceae Plastid,Blepharisma Macronuclear,Mold Mitochondrial,Karyorelict Nuclear,Spiroplasma,SGC9,Scenedesmus obliquus Mitochondrial,Plant Plastid,Coelenterate Mitochondrial,Alternative Yeast Nuclear,Dasycladacean Nuclear,Candidate Division SR1,SGC2,Cephalodiscidae Mitochondrial,Peritrich Nuclear,Ciliate Nuclear,Alternative Flatworm Mitochondrial}]
[--chr-codon-table [CHR_CODON_TABLE [CHR_CODON_TABLE ...]]]
[--start-codons [START_CODONS [START_CODONS ...]]]
[--chr-start-codons [CHR_START_CODONS [CHR_START_CODONS ...]]]
[-p <file>] [--invalid-protein-as-noncoding]
[--index-dir [<file>]]
[--debug-level <value|number>] [-q]
Split variant peptide FASTA database generated by moPepGen into separate
files.
optional arguments:
-h, --help show this help message and exit
--gvf <files> [<files> ...]
File path to GVF files. All GVF files must be
generated by moPepGen parsers. Valid formats: ['.gvf']
(default: None)
--variant-peptides <file>
File path to the variant peptide FASTA database file.
Must be generated by moPepGen callVariant. Valid
formats: ['.fasta', '.fa'] (default: None)
--novel-orf-peptides <file>
File path to the novel ORF peptide FASTA database
file. Must be generated by moPepGen callNovelORF.
Valid formats: ['.fasta', '.fa'] (default: None)
--alt-translation-peptides <file>
File path to the alt translation peptide FASTA file.
Must be generated by moPepGen callAltTranslation. Valid
formats: ['.fasta', '.fa'] (default: None)
-o <value>, --output-prefix <value>
Output prefix (default: None)
--order-source <value>
Order of sources, separated by commas (e.g.,
SNP,SNV,Fusion). Wildcard characters are supported.
"SNV-*" will match all peptides with SNV with or
without other variant sources. "SNV-+" will match all
peptides with SNV with at least another variant
source. (default: None)
--group-source [<value> [<value> ...]]
Group sources. The peptides with sources grouped will
be written to the same FASTA file. E.g.,
"PointMutation:gSNP,sSNV INDEL:gINDEL,sINDEL".
(default: None)
--max-source-groups <number>
Maximal number of different source groups to be
separated into individual database FASTA files.
Defaults to 1 (default: 1)
--additional-split [<value> [<value> ...]]
For peptides that were not already split into FASTAs
up to max_source_groups, those involving the following
source will be split into additional FASTAs with
decreasing priority. E.g., 'gSNP-NovelORF', 'gSNP-
NovelORF gSNP-gINDEL' (default: None)
--debug-level <value|number>
Debug level. (default: INFO)
-q, --quiet Quiet (default: False)
Reference Files:
-a <file>, --annotation-gtf <file>
Path to the annotation GTF file. Only ENSEMBL and
GENCODE are supported. Its version must be the same as
the genome and proteome FASTA. (default: None)
--reference-source {GENCODE,ENSEMBL}
Source of reference genome and annotation. (default:
None)
--codon-table {SGC5,Hexamita Nuclear,Euplotid Nuclear,Ascidian Mitochondrial,Thraustochytrium Mitochondrial,Standard,Gracilibacteria,SGC8,Condylostoma Nuclear,Flatworm Mitochondrial,SGC1,Protozoan Mitochondrial,SGC3,Pachysolen tannophilus Nuclear,Archaeal,Chlorophycean Mitochondrial,Pterobranchia Mitochondrial,Echinoderm Mitochondrial,Blastocrithidia Nuclear,Invertebrate Mitochondrial,SGC0,Mycoplasma,Yeast Mitochondrial,Mesodinium Nuclear,Vertebrate Mitochondrial,Bacterial,Trematode Mitochondrial,SGC4,Balanophoraceae Plastid,Blepharisma Macronuclear,Mold Mitochondrial,Karyorelict Nuclear,Spiroplasma,SGC9,Scenedesmus obliquus Mitochondrial,Plant Plastid,Coelenterate Mitochondrial,Alternative Yeast Nuclear,Dasycladacean Nuclear,Candidate Division SR1,SGC2,Cephalodiscidae Mitochondrial,Peritrich Nuclear,Ciliate Nuclear,Alternative Flatworm Mitochondrial}
Codon table. Defaults to "Standard". Supported codon
tables: {'SGC5', 'Hexamita Nuclear', 'Euplotid
Nuclear', 'Ascidian Mitochondrial', 'Thraustochytrium
Mitochondrial', 'Standard', 'Gracilibacteria', 'SGC8',
'Condylostoma Nuclear', 'Flatworm Mitochondrial',
'SGC1', 'Protozoan Mitochondrial', 'SGC3', 'Pachysolen
tannophilus Nuclear', 'Archaeal', 'Chlorophycean
Mitochondrial', 'Pterobranchia Mitochondrial',
'Echinoderm Mitochondrial', 'Blastocrithidia Nuclear',
'Invertebrate Mitochondrial', 'SGC0', 'Mycoplasma',
'Yeast Mitochondrial', 'Mesodinium Nuclear',
'Vertebrate Mitochondrial', 'Bacterial', 'Trematode
Mitochondrial', 'SGC4', 'Balanophoraceae Plastid',
'Blepharisma Macronuclear', 'Mold Mitochondrial',
'Karyorelict Nuclear', 'Spiroplasma', 'SGC9',
'Scenedesmus obliquus Mitochondrial', 'Plant Plastid',
'Coelenterate Mitochondrial', 'Alternative Yeast
Nuclear', 'Dasycladacean Nuclear', 'Candidate Division
SR1', 'SGC2', 'Cephalodiscidae Mitochondrial',
'Peritrich Nuclear', 'Ciliate Nuclear', 'Alternative
Flatworm Mitochondrial'} (default: Standard)
--chr-codon-table [CHR_CODON_TABLE [CHR_CODON_TABLE ...]]
Chromosome specific codon table. Must be specified in
the format of "chrM:SGC1", where "chrM" is the
chromosome name and "SGC1" is the codon table to use
to translate genes on chrM. Supported codon tables:
{'SGC5', 'Hexamita Nuclear', 'Euplotid Nuclear',
'Ascidian Mitochondrial', 'Thraustochytrium
Mitochondrial', 'Standard', 'Gracilibacteria', 'SGC8',
'Condylostoma Nuclear', 'Flatworm Mitochondrial',
'SGC1', 'Protozoan Mitochondrial', 'SGC3', 'Pachysolen
tannophilus Nuclear', 'Archaeal', 'Chlorophycean
Mitochondrial', 'Pterobranchia Mitochondrial',
'Echinoderm Mitochondrial', 'Blastocrithidia Nuclear',
'Invertebrate Mitochondrial', 'SGC0', 'Mycoplasma',
'Yeast Mitochondrial', 'Mesodinium Nuclear',
'Vertebrate Mitochondrial', 'Bacterial', 'Trematode
Mitochondrial', 'SGC4', 'Balanophoraceae Plastid',
'Blepharisma Macronuclear', 'Mold Mitochondrial',
'Karyorelict Nuclear', 'Spiroplasma', 'SGC9',
'Scenedesmus obliquus Mitochondrial', 'Plant Plastid',
'Coelenterate Mitochondrial', 'Alternative Yeast
Nuclear', 'Dasycladacean Nuclear', 'Candidate Division
SR1', 'SGC2', 'Cephalodiscidae Mitochondrial',
'Peritrich Nuclear', 'Ciliate Nuclear', 'Alternative
Flatworm Mitochondrial'}. By default, "SGC1" is
assigned to mitochondrial chromosomes. (default: [])
--start-codons [START_CODONS [START_CODONS ...]]
Default start codon(s) to use for novel ORF
translation. Defaults to ["ATG"]. (default: ['ATG'])
--chr-start-codons [CHR_START_CODONS [CHR_START_CODONS ...]]
Chromosome specific start codon(s). For example,
"chrM:ATG,ATA,ATT". By default, mitochondrial
chromosome name is automatically inferred and start
codons "ATG", "ATA", "ATT", "ATC" and "GTG" are
assigned to it. (default: [])
-p <file>, --proteome-fasta <file>
Path to the translated protein sequence FASTA file.
Only ENSEMBL and GENCODE are supported. Its version
must be the same as genome FASTA and annotation GTF.
(default: None)
--invalid-protein-as-noncoding
Treat any transcript whose protein sequence is
invalid (contains the * symbol) as noncoding.
(default: False)
--index-dir [<file>] Path to the directory of index files generated by
moPepGen generateIndex. If given, --genome-fasta,
--proteome-fasta and --annotation-gtf will be
ignored. (default: None)
Examples
Basic usage
A basic usage example is shown below.
moPepGen splitFasta \
--gvf \
path/to/gSNP.gvf \
path/to/gINDEL.gvf \
path/to/reditools.gvf \
--variant-peptides path/to/variant.fasta \
--index-dir path/to/index \
--max-source-groups 1 \
--output-prefix path/to/split
The example above splits the variant peptide sequence database into four FASTA files: split_gSNP.fasta, split_gINDEL.fasta, split_RNAEditing.fasta, and split_Remaining.fasta. Each variant peptide is written to the FASTA file of its variant source group, with priority following the order of the GVF files. Peptides with more than one variant source are written to *_Remaining.fasta because --max-source-groups is set to 1.
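The assignment rule described above can be sketched in a few lines of Python. This is only an illustration of the documented behaviour, not moPepGen's implementation: the function name `assign_file` and the representation of a peptide as a plain list of source names are assumptions made for this example.

```python
def assign_file(sources, order, max_source_groups=1):
    """Return the file suffix for a peptide carrying the given variant sources.

    Hypothetical helper for illustration only; not part of moPepGen.
    """
    # Deduplicate and rank the peptide's sources by their position in `order`.
    ranked = sorted(set(sources), key=order.index)
    if len(ranked) > max_source_groups:
        # Too many distinct sources: the peptide goes to the catch-all file.
        return "Remaining"
    return "-".join(ranked)


# Source order implied by the GVF files in the example above.
order = ["gSNP", "gINDEL", "RNAEditing"]

print(assign_file(["gSNP"], order))            # gSNP
print(assign_file(["RNAEditing"], order))      # RNAEditing
print(assign_file(["gINDEL", "gSNP"], order))  # Remaining
```

Under this reading, raising `max_source_groups` to 2 would place the last peptide under a combined `gSNP-gINDEL` suffix instead of `Remaining`.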
Group sources
Sometimes we want to group variant sources together. See the example below.
moPepGen splitFasta \
--gvf \
path/to/gSNP.gvf \
path/to/gINDEL.gvf \
path/to/reditools.gvf \
--variant-peptides path/to/variant.fasta \
--index-dir path/to/index \
--group-source Coding:gSNP,gINDEL \
--order-source Coding,RNAEditing \
--max-source-groups 1 \
--output-prefix path/to/split
This example outputs three split FASTA files: split_Coding.fasta, split_RNAEditing.fasta, and split_Remaining.fasta.
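To make the grouping concrete, the following Python sketch parses a --group-source value of the form shown above and maps a peptide's sources to group names. It is an assumption-laden illustration, not moPepGen code: `parse_group_source` and `to_groups` are hypothetical helpers, and the sketch assumes that grouping is applied before source groups are counted against --max-source-groups.

```python
def parse_group_source(values):
    """Turn ["Coding:gSNP,gINDEL"] into {"gSNP": "Coding", "gINDEL": "Coding"}.

    Hypothetical helper for illustration only; not part of moPepGen.
    """
    mapping = {}
    for value in values:
        group, members = value.split(":", 1)
        for member in members.split(","):
            mapping[member] = group
    return mapping


def to_groups(sources, mapping):
    """Replace each source by its group name; ungrouped sources keep their name."""
    return {mapping.get(source, source) for source in sources}


mapping = parse_group_source(["Coding:gSNP,gINDEL"])
print(mapping)                                             # {'gSNP': 'Coding', 'gINDEL': 'Coding'}
print(to_groups(["gSNP", "gINDEL"], mapping))              # {'Coding'}
print(sorted(to_groups(["gSNP", "RNAEditing"], mapping)))  # ['Coding', 'RNAEditing']
```

Under that assumption, a peptide carrying both gSNP and gINDEL counts as a single source group (Coding), while one carrying gSNP and RNAEditing counts as two and would exceed --max-source-groups 1.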
Additional split
Additional split lets you write records with a specified combination of source groups to their own FASTA file, instead of placing them in *_Remaining.fasta along with all other records that exceed --max-source-groups. See the example below.
moPepGen splitFasta \
--gvf \
path/to/gSNP.gvf \
path/to/gINDEL.gvf \
path/to/reditools.gvf \
--variant-peptides path/to/variant.fasta \
--index-dir path/to/index \
--max-source-groups 1 \
--additional-split gSNP-gINDEL gSNP-RNAEditing \
--output-prefix path/to/split
As a result, split_gSNP.fasta and split_gINDEL.fasta will be written. split_gSNP-gINDEL.fasta and split_gSNP-RNAEditing.fasta will also be written, although their number of variant sources (2) is larger than the value specified through --max-source-groups.
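The interaction between --max-source-groups and --additional-split can be sketched as follows. This is an illustrative reading of the help text, not the tool's implementation: the function `assign_with_additional_split` is hypothetical, and treating "involving" as a subset test on the peptide's sources is an assumption.

```python
def assign_with_additional_split(sources, order, max_source_groups, additional_split):
    """Return the file suffix for a peptide, honouring extra split combinations.

    Hypothetical helper for illustration only; not part of moPepGen.
    """
    ranked = sorted(set(sources), key=order.index)
    if len(ranked) <= max_source_groups:
        return "-".join(ranked)
    # Try the extra combinations in the order given (decreasing priority).
    # Assumption: a combination applies if all of its sources are present.
    for combo in additional_split:
        if set(combo.split("-")) <= set(ranked):
            return combo
    return "Remaining"


order = ["gSNP", "gINDEL", "RNAEditing"]
extra = ["gSNP-gINDEL", "gSNP-RNAEditing"]

print(assign_with_additional_split(["gSNP"], order, 1, extra))                  # gSNP
print(assign_with_additional_split(["gSNP", "gINDEL"], order, 1, extra))        # gSNP-gINDEL
print(assign_with_additional_split(["gSNP", "RNAEditing"], order, 1, extra))    # gSNP-RNAEditing
print(assign_with_additional_split(["gINDEL", "RNAEditing"], order, 1, extra))  # Remaining
```

The last peptide matches neither extra combination, so it still ends up in the remaining file.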
Arguments
-h, --help
show this help message and exit
--gvf <files> Path
File path to GVF files. All GVF files must be generated by moPepGen parsers. Valid formats: ['.gvf']
--variant-peptides <file> Path
File path to the variant peptide FASTA database file. Must be generated by moPepGen callVariant. Valid formats: ['.fasta', '.fa']
--novel-orf-peptides <file> Path
File path to the novel ORF peptide FASTA database file. Must be generated by moPepGen callNovelORF. Valid formats: ['.fasta', '.fa']
--alt-translation-peptides <file> Path
File path to the alt translation peptide FASTA file. Must be generated by moPepGen callAltTranslation. Valid formats: ['.fasta', '.fa']
-o, --output-prefix <value> Path
Output prefix
--order-source <value> str
Order of sources, separated by commas (e.g., SNP,SNV,Fusion). Wildcard characters are supported. "SNV-*" will match all peptides with SNV with or without other variant sources. "SNV-+" will match all peptides with SNV and at least one other variant source.
--group-source <value> str
Group sources. The peptides with sources grouped will be written to the same FASTA file. E.g., "PointMutation:gSNP,sSNV INDEL:gINDEL,sINDEL".
--max-source-groups <number> int
Maximal number of different source groups to be separated into individual database FASTA files.
Default: 1
--additional-split <value> str
For peptides that were not already split into FASTAs up to max_source_groups, those involving the following source will be split into additional FASTAs with decreasing priority. E.g., 'gSNP-NovelORF', 'gSNP-NovelORF gSNP-gINDEL'
-a, --annotation-gtf <file> Path
Path to the annotation GTF file. Only ENSEMBL and GENCODE are supported. Its version must be the same as the genome and proteome FASTA.
--reference-source str
Source of reference genome and annotation.
Choices: ['GENCODE', 'ENSEMBL']
--codon-table str
Codon table. The supported codon tables are listed under Choices.
Default: Standard
Choices: {'SGC5', 'Hexamita Nuclear', 'Euplotid Nuclear', 'Ascidian Mitochondrial', 'Thraustochytrium Mitochondrial', 'Standard', 'Gracilibacteria', 'SGC8', 'Condylostoma Nuclear', 'Flatworm Mitochondrial', 'SGC1', 'Protozoan Mitochondrial', 'SGC3', 'Pachysolen tannophilus Nuclear', 'Archaeal', 'Chlorophycean Mitochondrial', 'Pterobranchia Mitochondrial', 'Echinoderm Mitochondrial', 'Blastocrithidia Nuclear', 'Invertebrate Mitochondrial', 'SGC0', 'Mycoplasma', 'Yeast Mitochondrial', 'Mesodinium Nuclear', 'Vertebrate Mitochondrial', 'Bacterial', 'Trematode Mitochondrial', 'SGC4', 'Balanophoraceae Plastid', 'Blepharisma Macronuclear', 'Mold Mitochondrial', 'Karyorelict Nuclear', 'Spiroplasma', 'SGC9', 'Scenedesmus obliquus Mitochondrial', 'Plant Plastid', 'Coelenterate Mitochondrial', 'Alternative Yeast Nuclear', 'Dasycladacean Nuclear', 'Candidate Division SR1', 'SGC2', 'Cephalodiscidae Mitochondrial', 'Peritrich Nuclear', 'Ciliate Nuclear', 'Alternative Flatworm Mitochondrial'}
--chr-codon-table str
Chromosome specific codon table. Must be specified in the format of "chrM:SGC1", where "chrM" is the chromosome name and "SGC1" is the codon table to use to translate genes on chrM. Supported codon tables: {'SGC5', 'Hexamita Nuclear', 'Euplotid Nuclear', 'Ascidian Mitochondrial', 'Thraustochytrium Mitochondrial', 'Standard', 'Gracilibacteria', 'SGC8', 'Condylostoma Nuclear', 'Flatworm Mitochondrial', 'SGC1', 'Protozoan Mitochondrial', 'SGC3', 'Pachysolen tannophilus Nuclear', 'Archaeal', 'Chlorophycean Mitochondrial', 'Pterobranchia Mitochondrial', 'Echinoderm Mitochondrial', 'Blastocrithidia Nuclear', 'Invertebrate Mitochondrial', 'SGC0', 'Mycoplasma', 'Yeast Mitochondrial', 'Mesodinium Nuclear', 'Vertebrate Mitochondrial', 'Bacterial', 'Trematode Mitochondrial', 'SGC4', 'Balanophoraceae Plastid', 'Blepharisma Macronuclear', 'Mold Mitochondrial', 'Karyorelict Nuclear', 'Spiroplasma', 'SGC9', 'Scenedesmus obliquus Mitochondrial', 'Plant Plastid', 'Coelenterate Mitochondrial', 'Alternative Yeast Nuclear', 'Dasycladacean Nuclear', 'Candidate Division SR1', 'SGC2', 'Cephalodiscidae Mitochondrial', 'Peritrich Nuclear', 'Ciliate Nuclear', 'Alternative Flatworm Mitochondrial'}. By default, "SGC1" is assigned to mitochondrial chromosomes.
Default: []
--start-codons str
Default start codon(s) to use for novel ORF translation. Defaults to ["ATG"].
Default: ['ATG']
--chr-start-codons str
Chromosome specific start codon(s). For example, "chrM:ATG,ATA,ATT". By default, the mitochondrial chromosome name is automatically inferred and start codons "ATG", "ATA", "ATT", "ATC" and "GTG" are assigned to it.
Default: []
-p, --proteome-fasta <file> Path
Path to the translated protein sequence FASTA file. Only ENSEMBL and GENCODE are supported. Its version must be the same as genome FASTA and annotation GTF.
--invalid-protein-as-noncoding
Treat any transcript whose protein sequence is invalid (contains the * symbol) as noncoding.
Default: False
--index-dir <file> Path
Path to the directory of index files generated by moPepGen generateIndex. If given, --genome-fasta, --proteome-fasta and --annotation-gtf will be ignored.
--debug-level <value|number> str
Debug level.
Default: INFO
-q, --quiet
Quiet
Default: False
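The wildcard patterns described under --order-source can be read as simple predicates on a peptide's set of variant sources. The Python sketch below illustrates that reading only. The function `matches` is hypothetical, and the exact-match behaviour for patterns without a wildcard is an assumption rather than something stated in the help text.

```python
def matches(pattern, sources):
    """Check whether a peptide's sources satisfy an --order-source style pattern.

    Hypothetical helper for illustration only; not part of moPepGen.
    """
    sources = set(sources)
    if pattern.endswith("-*"):
        # Base source present, alone or together with other sources.
        return pattern[:-2] in sources
    if pattern.endswith("-+"):
        # Base source present together with at least one other source.
        return pattern[:-2] in sources and len(sources) >= 2
    # Assumption: a plain pattern must match the set of sources exactly.
    return sources == set(pattern.split("-"))


print(matches("SNV-*", ["SNV"]))            # True
print(matches("SNV-*", ["SNV", "Fusion"]))  # True
print(matches("SNV-+", ["SNV"]))            # False
print(matches("SNV-+", ["SNV", "Fusion"]))  # True
print(matches("SNV-*", ["Fusion"]))         # False
```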