splitFasta

splitFasta takes the FASTA file of variant peptide sequences called by callVariant, optionally together with the novel ORF peptides called by callNovelORF and the alternative translation peptides called by callAltTranslation, and splits the peptide sequences into separate database files by variant source. The split FASTA files can then be used for sequential library searching.

Usage

usage: moPepGen splitFasta [-h] [--gvf <files> [<files> ...]]
                           [--variant-peptides <file>]
                           [--novel-orf-peptides <file>]
                           [--alt-translation-peptides <file>] [-o <value>]
                           [--order-source <value>]
                           [--group-source [<value> [<value> ...]]]
                           [--max-source-groups <number>]
                           [--additional-split [<value> [<value> ...]]]
                           [-a <file>] [--reference-source {GENCODE,ENSEMBL}]
                           [-p <file>] [--invalid-protein-as-noncoding]
                           [--index-dir [<file>]]
                           [--debug-level <value|number>] [-q]

Split variant peptide FASTA database generated by moPepGen into separate
files.

optional arguments:
  -h, --help            show this help message and exit
  --gvf <files> [<files> ...]
                        File path to GVF files. All GVF files must be
                        generated by moPepGen parsers. Valid formats: ['.gvf']
                        (default: None)
  --variant-peptides <file>
                        File path to the variant peptide FASTA database file.
                        Must be generated by moPepGen callVariant. Valid
                        formats: ['.fasta', '.fa'] (default: None)
  --novel-orf-peptides <file>
                        File path to the novel ORF peptide FASTA database
                        file. Must be generated by moPepGen callNovelORF.
                        Valid formats: ['.fasta', '.fa'] (default: None)
  --alt-translation-peptides <file>
                        File path to the alt translation peptide FASTA file.
                        Must be generated by moPepGen callAltTranslation. Valid
                        formats: ['.fasta', '.fa'] (default: None)
  -o <value>, --output-prefix <value>
                        Output prefix (default: None)
  --order-source <value>
                        Order of sources, separate by comma. E.g.,
                        SNP,SNV,Fusion (default: None)
  --group-source [<value> [<value> ...]]
                        Group sources. The peptides with sources grouped will
                        be written to the same FASTA file. E.g.,
                        "PointMutation:gSNP,sSNV INDEL:gINDEL,sINDEL".
                        (default: None)
  --max-source-groups <number>
                        Maximal number of different source groups to be
                        separate into individual database FASTA files.
                        Defaults to 1 (default: 1)
  --additional-split [<value> [<value> ...]]
                        For peptides that were not already split into FASTAs
                        up to max_source_groups, those involving the following
                        source will be split into additional FASTAs with
                        decreasing priority. E.g., 'gSNP-NovelORF', 'gSNP-
                        NovelORF gSNP-gINDEL' (default: None)
  --debug-level <value|number>
                        Debug level. (default: INFO)
  -q, --quiet           Quiet (default: False)

Reference Files:
  -a <file>, --annotation-gtf <file>
                        Path to the annotation GTF file. Only ENSEMBL and
                        GENCODE are supported. Its version must be the same as
                        the genome and proteome FASTA. (default: None)
  --reference-source {GENCODE,ENSEMBL}
                        Source of reference genome and annotation. (default:
                        None)
  -p <file>, --proteome-fasta <file>
                        Path to the translated protein sequence FASTA file.
                        Only ENSEMBL and GENCODE are supported. Its version
                        must be the same as genome FASTA and annotation GTF.
                        (default: None)
  --invalid-protein-as-noncoding
                        Treat any transcript that the protein sequence is
                        invalid ( contains the * symbol) as noncoding.
                        (default: False)
  --index-dir [<file>]  Path to the directory of index files generated by
                        moPepGen generateIndex. If given, --genome-fasta,
                        --proteome-fasta and --annotation-gtf will be
                        ignored. (default: None)

Examples

Basic usage

The most basic usage is shown below.

moPepGen splitFasta \
  --gvf \
    path/to/gSNP.gvf \
    path/to/gINDEL.gvf \
    path/to/reditools.gvf \
  --variant-peptides path/to/variant.fasta \
  --index-dir path/to/index \
  --max-source-groups 1 \
  --output-prefix path/to/split

The example above splits the variant peptide sequence database into four FASTA files: split_gSNP.fasta, split_gINDEL.fasta, split_RNAEditing.fasta, and split_Remaining.fasta. Each variant peptide is written to the FASTA file of its variant source group, and the priority of the source groups follows the order of the GVF files. Peptides with more than one variant source are written to *_Remaining.fasta because --max-source-groups is set to 1.
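The assignment rule described above can be sketched in a few lines of Python. This is only an illustration of the documented behaviour, not moPepGen's actual implementation: the function name `assign_split`, the representation of a peptide as a plain set of source names, and the hyphen-joined file name for multi-source peptides are all assumptions made for the example.

```python
def assign_split(sources, order, max_source_groups=1, prefix="split"):
    """Return the output file name for a peptide carrying `sources`.

    Hypothetical helper, not part of moPepGen.
    """
    # Rank the peptide's sources by their position in the priority order
    # (the order of the GVF files, or --order-source if given).
    ranked = sorted(set(sources), key=order.index)
    # Too many distinct sources: the peptide goes to the catch-all file.
    if len(ranked) > max_source_groups:
        return f"{prefix}_Remaining.fasta"
    # Assumption: combined sources are joined with '-' in the file name.
    return f"{prefix}_" + "-".join(ranked) + ".fasta"


order = ["gSNP", "gINDEL", "RNAEditing"]
print(assign_split({"gSNP"}, order))                # split_gSNP.fasta
print(assign_split({"gSNP", "RNAEditing"}, order))  # split_Remaining.fasta
```

With `max_source_groups=1`, any peptide that carries two or more sources ends up in the Remaining file, which matches the four files produced by the command above.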

Group sources

Sometimes it is useful to group several variant sources together so that their peptides are written to the same file, as in the example below.

moPepGen splitFasta \
  --gvf \
    path/to/gSNP.gvf \
    path/to/gINDEL.gvf \
    path/to/reditools.gvf \
  --variant-peptides path/to/variant.fasta \
  --index-dir path/to/index \
  --group-source Coding:gSNP,gINDEL \
  --order-source Coding,RNAEditing \
  --max-source-groups 1 \
  --output-prefix path/to/split

This example outputs three split FASTA files: split_Coding.fasta, split_RNAEditing.fasta, and split_Remaining.fasta. Peptides whose source is gSNP or gINDEL are written together to split_Coding.fasta.
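To make the `--group-source` syntax concrete, the sketch below parses values of the form `Group:source1,source2` and maps a peptide's sources to their groups. The helper names `parse_group_source` and `to_groups` are hypothetical, and the assumption that ungrouped sources keep their own name is made for the example only; moPepGen's internal handling may differ.

```python
def parse_group_source(values):
    """Map each source name to its group, e.g. 'Coding:gSNP,gINDEL'.

    Hypothetical helper, not part of moPepGen.
    """
    mapping = {}
    for value in values:
        group, members = value.split(":", 1)
        for source in members.split(","):
            mapping[source] = group
    return mapping


def to_groups(sources, mapping):
    # Assumption: a source without a group keeps its own name.
    return {mapping.get(source, source) for source in sources}


mapping = parse_group_source(["Coding:gSNP,gINDEL"])
print(sorted(to_groups({"gSNP", "gINDEL"}, mapping)))      # ['Coding']
print(sorted(to_groups({"gSNP", "RNAEditing"}, mapping)))  # ['Coding', 'RNAEditing']
```

Under this reading, a peptide with both gSNP and gINDEL counts as a single source group (Coding), whereas a peptide with gSNP and RNAEditing still spans two groups.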

Additional split

The --additional-split option lets you write peptides with a specified combination of sources to their own FASTA file. Without it, these peptides would be placed in *_Remaining.fasta together with all other records that exceed --max-source-groups. See the example below.

moPepGen splitFasta \
  --gvf \
    path/to/gSNP.gvf \
    path/to/gINDEL.gvf \
    path/to/reditools.gvf \
  --variant-peptides path/to/variant.fasta \
  --index-dir path/to/index \
  --max-source-groups 1 \
  --additional-split gSNP-gINDEL gSNP-RNAEditing \
  --output-prefix path/to/split

As a result, split_gSNP.fasta and split_gINDEL.fasta will be written as before. split_gSNP-gINDEL.fasta and split_gSNP-RNAEditing.fasta will also be written, even though their peptides carry two variant sources, which is more than the value given to --max-source-groups.
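The effect of `--additional-split` can be illustrated with the following sketch. It is not moPepGen's implementation: the function name `assign_with_additional_split` is hypothetical, and the sketch assumes that a peptide "involves" a combination when it carries every source named in it, with earlier combinations taking priority.

```python
def assign_with_additional_split(sources, order, additional_split,
                                 max_source_groups=1, prefix="split"):
    """Return the output file name, honouring additional split combinations.

    Hypothetical helper, not part of moPepGen.
    """
    ranked = sorted(set(sources), key=order.index)
    # Peptides within the limit are split as usual.
    if len(ranked) <= max_source_groups:
        return f"{prefix}_" + "-".join(ranked) + ".fasta"
    # Assumption: a combination matches when all of its sources are present.
    # Combinations are tried in the order given (decreasing priority).
    for combo in additional_split:
        if set(combo.split("-")) <= set(ranked):
            return f"{prefix}_{combo}.fasta"
    return f"{prefix}_Remaining.fasta"


order = ["gSNP", "gINDEL", "RNAEditing"]
extra = ["gSNP-gINDEL", "gSNP-RNAEditing"]
print(assign_with_additional_split({"gSNP", "gINDEL"}, order, extra))
# split_gSNP-gINDEL.fasta
print(assign_with_additional_split({"gINDEL", "RNAEditing"}, order, extra))
# split_Remaining.fasta
```

Only the combinations listed are rescued from the Remaining file; any other multi-source peptide is still written there.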

Arguments

-h, --help

show this help message and exit

--gvf <files> Path

File path to GVF files. All GVF files must be generated by moPepGen parsers. Valid formats: ['.gvf']

--variant-peptides <file> Path

File path to the variant peptide FASTA database file. Must be generated by moPepGen callVariant. Valid formats: ['.fasta', '.fa']

--novel-orf-peptides <file> Path

File path to the novel ORF peptide FASTA database file. Must be generated by moPepGen callNovelORF. Valid formats: ['.fasta', '.fa']

--alt-translation-peptides <file> Path

File path to the alt translation peptide FASTA file. Must be generated by moPepGen callAltTranslation. Valid formats: ['.fasta', '.fa']

-o, --output-prefix <value> Path

Output prefix

--order-source <value> str

Order of sources, separated by commas. E.g., SNP,SNV,Fusion

--group-source <value> str

Group sources. The peptides with sources grouped will be written to the same FASTA file. E.g., "PointMutation:gSNP,sSNV INDEL:gINDEL,sINDEL".

--max-source-groups <number> int

Maximal number of different source groups to be separated into individual database FASTA files.
Default: 1

--additional-split <value> str

For peptides that were not already split into FASTAs up to --max-source-groups, those involving the following sources will be split into additional FASTAs with decreasing priority. E.g., 'gSNP-NovelORF', 'gSNP-NovelORF gSNP-gINDEL'

-a, --annotation-gtf <file> Path

Path to the annotation GTF file. Only ENSEMBL and GENCODE are supported. Its version must be the same as the genome and proteome FASTA.

--reference-source str

Source of reference genome and annotation.
Choices: ['GENCODE', 'ENSEMBL']

-p, --proteome-fasta <file> Path

Path to the translated protein sequence FASTA file. Only ENSEMBL and GENCODE are supported. Its version must be the same as genome FASTA and annotation GTF.

--invalid-protein-as-noncoding

Treat any transcript whose protein sequence is invalid (contains the * symbol) as noncoding.
Default: False

--index-dir <file> Path

Path to the directory of index files generated by moPepGen generateIndex. If given, --genome-fasta, --proteome-fasta and --annotation-gtf will be ignored.

--debug-level <value|number> str

Debug level.
Default: INFO

-q, --quiet

Quiet
Default: False