splitFasta
splitFasta takes the FASTA file with variant peptide sequences called by callVariant, with or without novel ORF peptides called by callNovelORF, and splits the peptide sequences into databases. The split database FASTA files can be used for sequential library searching.
Usage
usage: moPepGen splitFasta [-h] [--gvf <files> [<files> ...]]
[--variant-peptides <file>]
[--novel-orf-peptides <file>]
[--alt-translation-peptides <file>] [-o <value>]
[--order-source <value>]
[--group-source [<value> [<value> ...]]]
[--max-source-groups <number>]
[--additional-split [<value> [<value> ...]]]
[-a <file>] [--reference-source {GENCODE,ENSEMBL}]
[-p <file>] [--invalid-protein-as-noncoding]
[--index-dir [<file>]]
[--debug-level <value|number>] [-q]
Split variant peptide FASTA database generated by moPepGen into separate
files.
optional arguments:
-h, --help show this help message and exit
--gvf <files> [<files> ...]
File path to GVF files. All GVF files must be
generated by moPepGen parsers. Valid formats: ['.gvf']
(default: None)
--variant-peptides <file>
File path to the variant peptide FASTA database file.
Must be generated by moPepGen callVariant. Valid
formats: ['.fasta', '.fa'] (default: None)
--novel-orf-peptides <file>
File path to the novel ORF peptide FASTA database
file. Must be generated by moPepGen callNovelORF.
Valid formats: ['.fasta', '.fa'] (default: None)
--alt-translation-peptides <file>
File path to the alt translation peptide FASTA file.
Must be generated by moPepGen callAltTranslation. Valid
formats: ['.fasta', '.fa'] (default: None)
-o <value>, --output-prefix <value>
Output prefix (default: None)
--order-source <value>
Order of sources, separated by commas. E.g.,
SNP,SNV,Fusion (default: None)
--group-source [<value> [<value> ...]]
Group sources. The peptides with sources grouped will
be written to the same FASTA file. E.g.,
"PointMutation:gSNP,sSNV INDEL:gINDEL,sINDEL".
(default: None)
--max-source-groups <number>
Maximal number of different source groups to be
separated into individual database FASTA files.
(default: 1)
--additional-split [<value> [<value> ...]]
For peptides that were not already split into FASTAs
up to max_source_groups, those involving the following
source will be split into additional FASTAs with
decreasing priority. E.g., 'gSNP-NovelORF', 'gSNP-
NovelORF gSNP-gINDEL' (default: None)
--debug-level <value|number>
Debug level. (default: INFO)
-q, --quiet Quiet (default: False)
Reference Files:
-a <file>, --annotation-gtf <file>
Path to the annotation GTF file. Only ENSEMBL and
GENCODE are supported. Its version must be the same as
the genome and proteome FASTA. (default: None)
--reference-source {GENCODE,ENSEMBL}
Source of reference genome and annotation. (default:
None)
-p <file>, --proteome-fasta <file>
Path to the translated protein sequence FASTA file.
Only ENSEMBL and GENCODE are supported. Its version
must be the same as genome FASTA and annotation GTF.
(default: None)
--invalid-protein-as-noncoding
Treat any transcript whose protein sequence is
invalid (contains the * symbol) as noncoding.
(default: False)
--index-dir [<file>] Path to the directory of index files generated by
moPepGen generateIndex. If given, --genome-fasta,
--proteome-fasta and --annotation-gtf will be
ignored. (default: None)
Examples
Basic usage
A basic usage example is shown below.
moPepGen splitFasta \
--gvf \
path/to/gSNP.gvf \
path/to/gINDEL.gvf \
path/to/reditools.gvf \
--variant-peptides path/to/variant.fasta \
--index-dir path/to/index \
--max-source-groups 1 \
--output-prefix path/to/split
The example above splits the variant peptide sequence database into four FASTA files: split_gSNP.fasta, split_gINDEL.fasta, split_RNAEditing.fasta, and split_Remaining.fasta. Variant peptides are split into the individual FASTA file of their variant source group, based on the order of the GVF files. Peptides with more than one variant source are written to *_Remaining.fasta because --max-source-groups is set to 1.
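The assignment rule described above can be sketched in Python. This is a simplified illustration, not moPepGen's implementation: the function name `split_target`, the `ORDER` list, and the assumption that each peptide carries a single set of variant sources are hypothetical, and `--group-source` and `--additional-split` are ignored.

```python
# Hypothetical sketch of the splitting rule; not moPepGen's actual code.
# Source order, assumed to follow the order of the GVF files in the example.
ORDER = ["gSNP", "gINDEL", "RNAEditing"]


def split_target(sources, max_source_groups=1, prefix="split"):
    """Return the output file name assumed for a peptide with these sources."""
    # Deduplicate the sources and sort them by their priority in ORDER.
    groups = sorted(set(sources), key=ORDER.index)
    if len(groups) > max_source_groups:
        # Too many source groups: the peptide goes to the catch-all file.
        return f"{prefix}_Remaining.fasta"
    return f"{prefix}_{'-'.join(groups)}.fasta"


print(split_target({"gSNP"}))                # split_gSNP.fasta
print(split_target({"gSNP", "RNAEditing"}))  # split_Remaining.fasta
```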
Group sources
Sometimes we want to group variant sources together so that their peptides are written to the same FASTA file. See the example below.
moPepGen splitFasta \
--gvf \
path/to/gSNP.gvf \
path/to/gINDEL.gvf \
path/to/reditools.gvf \
--variant-peptides path/to/variant.fasta \
--index-dir path/to/index \
--group-source Coding:gSNP,gINDEL \
--order-source Coding,RNAEditing \
--max-source-groups 1 \
--output-prefix path/to/split
This example outputs three split FASTA files: split_Coding.fasta, split_RNAEditing.fasta, and split_Remaining.fasta.
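As a rough illustration of how a `--group-source` value such as `Coding:gSNP,gINDEL` might be interpreted, the sketch below parses the `Group:source1,source2` syntax into a lookup table and maps a peptide's sources to groups. The helper names `parse_group_source` and `to_groups` are hypothetical, and this is only an assumed reading of the option, not moPepGen's own parser.

```python
# Hypothetical sketch of --group-source parsing; not moPepGen's actual code.
def parse_group_source(specs):
    """Map each source to its group, e.g. 'Coding:gSNP,gINDEL'."""
    mapping = {}
    for spec in specs:
        group, members = spec.split(":", 1)
        for source in members.split(","):
            mapping[source] = group
    return mapping


def to_groups(sources, mapping):
    """Replace grouped sources by their group name; keep others unchanged."""
    return {mapping.get(source, source) for source in sources}


mapping = parse_group_source(["Coding:gSNP,gINDEL"])
# gSNP and gINDEL collapse into a single group, so this peptide counts as
# one source group under --max-source-groups 1.
print(to_groups({"gSNP", "gINDEL"}, mapping))  # {'Coding'}
```

Under this reading, a peptide carrying both gSNP and gINDEL belongs to one source group (Coding), while a peptide carrying gSNP and RNAEditing still spans two groups.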
Additional split
The --additional-split option lets you write records with the specified source group combinations to their own FASTA files. Without it, these records would be placed in *_Remaining.fasta along with all other records exceeding --max-source-groups. See the example below.
moPepGen splitFasta \
--gvf \
path/to/gSNP.gvf \
path/to/gINDEL.gvf \
path/to/reditools.gvf \
--variant-peptides path/to/variant.fasta \
--index-dir path/to/index \
--max-source-groups 1 \
--additional-split gSNP-gINDEL gSNP-RNAEditing \
--output-prefix path/to/split
As a result, split_gSNP.fasta and split_gINDEL.fasta will be written. split_gSNP-gINDEL.fasta and split_gSNP-RNAEditing.fasta will also be written, although the number of variant sources (2) is larger than the value specified through --max-source-groups.
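The effect of `--additional-split` can be sketched as an extra lookup that runs before a record falls through to the remaining file. This is a simplified illustration under stated assumptions, not moPepGen's implementation: the function `split_target` and the `ORDER` list are hypothetical, and a combination is assumed to match only when it names exactly the peptide's source groups.

```python
# Hypothetical sketch of --additional-split; not moPepGen's actual code.
ORDER = ["gSNP", "gINDEL", "RNAEditing"]


def split_target(sources, additional_split=(), max_source_groups=1,
                 prefix="split"):
    """Return the output file name assumed for a peptide with these sources."""
    groups = sorted(set(sources), key=ORDER.index)
    if len(groups) <= max_source_groups:
        return f"{prefix}_{'-'.join(groups)}.fasta"
    # Earlier entries are checked first, mirroring "decreasing priority".
    for combo in additional_split:
        if set(combo.split("-")) == set(groups):
            return f"{prefix}_{combo}.fasta"
    return f"{prefix}_Remaining.fasta"


extra = ["gSNP-gINDEL", "gSNP-RNAEditing"]
print(split_target({"gSNP", "gINDEL"}, extra))        # split_gSNP-gINDEL.fasta
print(split_target({"gINDEL", "RNAEditing"}, extra))  # split_Remaining.fasta
```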
Arguments
-h, --help
show this help message and exit
--gvf <files> Path
File path to GVF files. All GVF files must be generated by moPepGen parsers. Valid formats: ['.gvf']
--variant-peptides <file> Path
File path to the variant peptide FASTA database file. Must be generated by moPepGen callVariant. Valid formats: ['.fasta', '.fa']
--novel-orf-peptides <file> Path
File path to the novel ORF peptide FASTA database file. Must be generated by moPepGen callNovelORF. Valid formats: ['.fasta', '.fa']
--alt-translation-peptides <file> Path
File path to the alt translation peptide FASTA file. Must be generated by moPepGen callAltTranslation. Valid formats: ['.fasta', '.fa']
-o, --output-prefix <value> Path
Output prefix
--order-source <value> str
Order of sources, separated by commas. E.g., SNP,SNV,Fusion
--group-source <value> str
Group sources. The peptides with sources grouped will be written to the same FASTA file. E.g., "PointMutation:gSNP,sSNV INDEL:gINDEL,sINDEL".
--max-source-groups <number> int
Maximal number of different source groups to be separated into individual database FASTA files.
Default: 1
--additional-split <value> str
For peptides that were not already split into FASTAs up to max_source_groups, those involving the following source will be split into additional FASTAs with decreasing priority. E.g., 'gSNP-NovelORF', 'gSNP-NovelORF gSNP-gINDEL'
-a, --annotation-gtf <file> Path
Path to the annotation GTF file. Only ENSEMBL and GENCODE are supported. Its version must be the same as the genome and proteome FASTA.
--reference-source str
Source of reference genome and annotation.
Choices: ['GENCODE', 'ENSEMBL']
-p, --proteome-fasta <file> Path
Path to the translated protein sequence FASTA file. Only ENSEMBL and GENCODE are supported. Its version must be the same as genome FASTA and annotation GTF.
--invalid-protein-as-noncoding
Treat any transcript whose protein sequence is invalid (contains the * symbol) as noncoding.
Default: False
--index-dir <file> Path
Path to the directory of index files generated by moPepGen generateIndex. If given, --genome-fasta, --proteome-fasta and --annotation-gtf will be ignored.
--debug-level <value|number> str
Debug level.
Default: INFO
-q, --quiet
Quiet
Default: False