parseRMATS

parseRMATS takes the alternative splicing events called by rMATS and converts them to a GVF file. All five alternative splicing event types are supported: skipped exons (SE), alternative 5' splice sites (A5SS), alternative 3' splice sites (A3SS), mutually exclusive exons (MXE), and retained introns (RI). Both the JC and JCEC output files are accepted. The resulting GVF file can then be passed to callVariant to call variant peptides.

Reference Version

The versions of the reference genome FASTA, proteome FASTA, and annotation GTF MUST be consistent across all analyses.

Usage

usage: moPepGen parseRMATS [-h] [--se <file>] [--a5ss <file>] [--a3ss <file>]
                           [--mxe <file>] [--ri <file>] [--min-ijc MIN_IJC]
                           [--min-sjc MIN_SJC] -o <file> --source SOURCE
                           [-g <file>] [-a <file>]
                           [--reference-source {GENCODE,ENSEMBL}]
                           [--codon-table {Alternative Yeast Nuclear,Protozoan Mitochondrial,Vertebrate Mitochondrial,Blepharisma Macronuclear,Chlorophycean Mitochondrial,Ascidian Mitochondrial,Ciliate Nuclear,Mesodinium Nuclear,Balanophoraceae Plastid,SGC9,Cephalodiscidae Mitochondrial,Trematode Mitochondrial,Pachysolen tannophilus Nuclear,SGC2,Yeast Mitochondrial,SGC5,Euplotid Nuclear,Scenedesmus obliquus Mitochondrial,Peritrich Nuclear,Archaeal,Coelenterate Mitochondrial,Bacterial,Mold Mitochondrial,SGC3,Hexamita Nuclear,Pterobranchia Mitochondrial,Plant Plastid,Condylostoma Nuclear,Blastocrithidia Nuclear,Gracilibacteria,Alternative Flatworm Mitochondrial,Echinoderm Mitochondrial,Invertebrate Mitochondrial,SGC0,Candidate Division SR1,Dasycladacean Nuclear,SGC4,Flatworm Mitochondrial,SGC8,Thraustochytrium Mitochondrial,SGC1,Spiroplasma,Mycoplasma,Standard,Karyorelict Nuclear}]
                           [--chr-codon-table [CHR_CODON_TABLE [CHR_CODON_TABLE ...]]]
                           [--start-codons [START_CODONS [START_CODONS ...]]]
                           [--chr-start-codons [CHR_START_CODONS [CHR_START_CODONS ...]]]
                           [--index-dir [<file>]]
                           [--debug-level <value|number>] [-q]

Parse the rMATS result to GVF format of variant records for moPepGen to call
variant peptides.

optional arguments:
  -h, --help            show this help message and exit
  --se <file>           File path to the SE (skipped exons) junction count
                        file output by rMATS. The file name should look like
                        '*_SE.MATS.JC.txt' or '*_SE.MATS.JCEC.txt'. Valid
                        formats: ['.tsv', '.txt'] (default: None)
  --a5ss <file>         File path to the A5SS (alternative 5' splice site)
                        junction count file output by rMATS. The file name
                        should look like '*_A5SS.MATS.JC.txt' or
                        '*_A5SS.MATS.JCEC.txt'. Valid formats: ['.tsv',
                        '.txt'] (default: None)
  --a3ss <file>         File path to the A3SS (alternative 3' splice site)
                        junction count file output by rMATS. The file name
                        should look like '*_A3SS.MATS.JC.txt' or
                        '*_A3SS.MATS.JCEC.txt'. Valid formats: ['.tsv',
                        '.txt'] (default: None)
  --mxe <file>          File path to the MXE (mutually exclusive exons)
                        junction count file output by rMATS. The file name
                        should look like '*_MXE.MATS.JC.txt' or
                        '*_MXE.MATS.JCEC.txt'. Valid formats: ['.tsv', '.txt']
                        (default: None)
  --ri <file>           File path to the RI (retained intron) junction count
                        file output by rMATS. The file name should look like
                        '*_RI.MATS.JC.txt' or '*_RI.MATS.JCEC.txt'. Valid
                        formats: ['.tsv', '.txt'] (default: None)
  --min-ijc MIN_IJC     Minimal junction read count for the inclusion version
                        to be analyzed. (default: 1)
  --min-sjc MIN_SJC     Minimal junction read count for the skipped version to
                        be analyzed. (default: 1)
  -o <file>, --output-path <file>
                        File path to the output file. Valid formats: ['.gvf']
                        (default: None)
  --source SOURCE       Variant source (e.g. gSNP, sSNV, Fusion) (default:
                        None)
  --debug-level <value|number>
                        Debug level. (default: INFO)
  -q, --quiet           Quiet (default: False)

Reference Files:
  -g <file>, --genome-fasta <file>
                        Path to the genome assembly FASTA file. Only ENSEMBL
                        and GENCODE are supported. Its version must be the
                        same as the annotation GTF and proteome FASTA
                        (default: None)
  -a <file>, --annotation-gtf <file>
                        Path to the annotation GTF file. Only ENSEMBL and
                        GENCODE are supported. Its version must be the same as
                        the genome and proteome FASTA. (default: None)
  --reference-source {GENCODE,ENSEMBL}
                        Source of reference genome and annotation. (default:
                        None)
  --codon-table {Alternative Yeast Nuclear,Protozoan Mitochondrial,Vertebrate Mitochondrial,Blepharisma Macronuclear,Chlorophycean Mitochondrial,Ascidian Mitochondrial,Ciliate Nuclear,Mesodinium Nuclear,Balanophoraceae Plastid,SGC9,Cephalodiscidae Mitochondrial,Trematode Mitochondrial,Pachysolen tannophilus Nuclear,SGC2,Yeast Mitochondrial,SGC5,Euplotid Nuclear,Scenedesmus obliquus Mitochondrial,Peritrich Nuclear,Archaeal,Coelenterate Mitochondrial,Bacterial,Mold Mitochondrial,SGC3,Hexamita Nuclear,Pterobranchia Mitochondrial,Plant Plastid,Condylostoma Nuclear,Blastocrithidia Nuclear,Gracilibacteria,Alternative Flatworm Mitochondrial,Echinoderm Mitochondrial,Invertebrate Mitochondrial,SGC0,Candidate Division SR1,Dasycladacean Nuclear,SGC4,Flatworm Mitochondrial,SGC8,Thraustochytrium Mitochondrial,SGC1,Spiroplasma,Mycoplasma,Standard,Karyorelict Nuclear}
                        Codon table. Defaults to "Standard". Supported codon
                        tables: {'Alternative Yeast Nuclear', 'Protozoan
                        Mitochondrial', 'Vertebrate Mitochondrial',
                        'Blepharisma Macronuclear', 'Chlorophycean
                        Mitochondrial', 'Ascidian Mitochondrial', 'Ciliate
                        Nuclear', 'Mesodinium Nuclear', 'Balanophoraceae
                        Plastid', 'SGC9', 'Cephalodiscidae Mitochondrial',
                        'Trematode Mitochondrial', 'Pachysolen tannophilus
                        Nuclear', 'SGC2', 'Yeast Mitochondrial', 'SGC5',
                        'Euplotid Nuclear', 'Scenedesmus obliquus
                        Mitochondrial', 'Peritrich Nuclear', 'Archaeal',
                        'Coelenterate Mitochondrial', 'Bacterial', 'Mold
                        Mitochondrial', 'SGC3', 'Hexamita Nuclear',
                        'Pterobranchia Mitochondrial', 'Plant Plastid',
                        'Condylostoma Nuclear', 'Blastocrithidia Nuclear',
                        'Gracilibacteria', 'Alternative Flatworm
                        Mitochondrial', 'Echinoderm Mitochondrial',
                        'Invertebrate Mitochondrial', 'SGC0', 'Candidate
                        Division SR1', 'Dasycladacean Nuclear', 'SGC4',
                        'Flatworm Mitochondrial', 'SGC8', 'Thraustochytrium
                        Mitochondrial', 'SGC1', 'Spiroplasma', 'Mycoplasma',
                        'Standard', 'Karyorelict Nuclear'} (default: Standard)
  --chr-codon-table [CHR_CODON_TABLE [CHR_CODON_TABLE ...]]
                        Chromosome specific codon table. Must be specified in
                        the format of "chrM:SGC1", where "chrM" is the
                        chromosome name and "SGC1" is the codon table to use
                        to translate genes on chrM. Supported codon tables:
                        {'Alternative Yeast Nuclear', 'Protozoan
                        Mitochondrial', 'Vertebrate Mitochondrial',
                        'Blepharisma Macronuclear', 'Chlorophycean
                        Mitochondrial', 'Ascidian Mitochondrial', 'Ciliate
                        Nuclear', 'Mesodinium Nuclear', 'Balanophoraceae
                        Plastid', 'SGC9', 'Cephalodiscidae Mitochondrial',
                        'Trematode Mitochondrial', 'Pachysolen tannophilus
                        Nuclear', 'SGC2', 'Yeast Mitochondrial', 'SGC5',
                        'Euplotid Nuclear', 'Scenedesmus obliquus
                        Mitochondrial', 'Peritrich Nuclear', 'Archaeal',
                        'Coelenterate Mitochondrial', 'Bacterial', 'Mold
                        Mitochondrial', 'SGC3', 'Hexamita Nuclear',
                        'Pterobranchia Mitochondrial', 'Plant Plastid',
                        'Condylostoma Nuclear', 'Blastocrithidia Nuclear',
                        'Gracilibacteria', 'Alternative Flatworm
                        Mitochondrial', 'Echinoderm Mitochondrial',
                        'Invertebrate Mitochondrial', 'SGC0', 'Candidate
                        Division SR1', 'Dasycladacean Nuclear', 'SGC4',
                        'Flatworm Mitochondrial', 'SGC8', 'Thraustochytrium
                        Mitochondrial', 'SGC1', 'Spiroplasma', 'Mycoplasma',
                        'Standard', 'Karyorelict Nuclear'}. By default, "SGC1"
                        is assigned to mitochondrial chromosomes. (default:
                        [])
  --start-codons [START_CODONS [START_CODONS ...]]
                        Default start codon(s) to use for novel ORF
                        translation. Defaults to ["ATG"]. (default: ['ATG'])
  --chr-start-codons [CHR_START_CODONS [CHR_START_CODONS ...]]
                        Chromosome specific start codon(s). For example,
                        "chrM:ATG,ATA,ATT". By default, the mitochondrial
                        chromosome name is automatically inferred and start
                        codons "ATG", "ATA", "ATT", "ATC" and "GTG" are
                        assigned to it. (default: [])
  --index-dir [<file>]  Path to the directory of index files generated by
                        moPepGen generateIndex. If given, --genome-fasta,
                        --proteome-fasta and --annotation-gtf will be
                        ignored. (default: None)
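
As an illustration of how the options above fit together, the following is a minimal Python sketch that assembles a parseRMATS argument list using only flags shown in the help text. The helper name `build_parse_rmats_command`, the file paths, and the `--source` label are hypothetical; the command is only built and printed, not executed.

```python
import shlex

# Event flags documented in the help text above.
EVENT_FLAGS = ("se", "a5ss", "a3ss", "mxe", "ri")


def build_parse_rmats_command(event_files, output_path, source,
                              index_dir=None, min_ijc=1, min_sjc=1):
    """Build an argument list for `moPepGen parseRMATS` (hypothetical helper)."""
    unknown = set(event_files) - set(EVENT_FLAGS)
    if unknown:
        raise ValueError(f"Unsupported event type(s): {sorted(unknown)}")

    cmd = ["moPepGen", "parseRMATS"]
    # Add one flag per rMATS event file that was supplied.
    for event in EVENT_FLAGS:
        if event in event_files:
            cmd += [f"--{event}", str(event_files[event])]
    cmd += ["--min-ijc", str(min_ijc), "--min-sjc", str(min_sjc)]
    cmd += ["--output-path", str(output_path), "--source", source]
    if index_dir is not None:
        cmd += ["--index-dir", str(index_dir)]
    return cmd


# Placeholder paths and source label, chosen only for illustration.
command = build_parse_rmats_command(
    {"se": "rmats/sample_SE.MATS.JC.txt", "ri": "rmats/sample_RI.MATS.JC.txt"},
    output_path="sample_rmats.gvf",
    source="AlternativeSplicing",
    index_dir="index",
)
print(shlex.join(command))
```

The resulting list could be passed to `subprocess.run(command, check=True)` on a system where moPepGen is installed.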

Arguments

-h, --help

show this help message and exit

--se <file> Path

File path to the SE (skipped exons) junction count file output by rMATS. The file name should look like '*_SE.MATS.JC.txt' or '*_SE.MATS.JCEC.txt'. Valid formats: ['.tsv', '.txt']

--a5ss <file> Path

File path to the A5SS (alternative 5' splice site) junction count file output by rMATS. The file name should look like '*_A5SS.MATS.JC.txt' or '*_A5SS.MATS.JCEC.txt'. Valid formats: ['.tsv', '.txt']

--a3ss <file> Path

File path to the A3SS (alternative 3' splice site) junction count file output by rMATS. The file name should look like '*_A3SS.MATS.JC.txt' or '*_A3SS.MATS.JCEC.txt'. Valid formats: ['.tsv', '.txt']

--mxe <file> Path

File path to the MXE (mutually exclusive exons) junction count file output by rMATS. The file name should look like '*_MXE.MATS.JC.txt' or '*_MXE.MATS.JCEC.txt'. Valid formats: ['.tsv', '.txt']

--ri <file> Path

File path to the RI (retained intron) junction count file output by rMATS. The file name should look like '*_RI.MATS.JC.txt' or '*_RI.MATS.JCEC.txt'. Valid formats: ['.tsv', '.txt']
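
The file name patterns described for the five event arguments can be checked before calling parseRMATS. The sketch below maps a file name to the matching flag and counting mode (JC or JCEC). The function `classify_rmats_file` is hypothetical, and treating the `*_` prefix as optional is an assumption made here, not documented behaviour.

```python
import re
from pathlib import PurePath

# Event type, then ".MATS.", then counting mode, then an accepted extension.
# Assumption: the leading "<prefix>_" part is optional.
RMATS_NAME = re.compile(
    r"^(?:.*_)?(SE|A5SS|A3SS|MXE|RI)\.MATS\.(JC|JCEC)\.(?:txt|tsv)$"
)


def classify_rmats_file(path):
    """Return (flag, counting_mode) for an rMATS file name, or None."""
    match = RMATS_NAME.match(PurePath(path).name)
    if match is None:
        return None
    event, counting_mode = match.groups()
    return f"--{event.lower()}", counting_mode


for name in ("out/sample_SE.MATS.JC.txt", "A5SS.MATS.JCEC.txt", "summary.txt"):
    print(name, classify_rmats_file(name))
```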

--min-ijc int

Minimum junction read count required for the inclusion version of an event to be analyzed.
Default: 1

--min-sjc int

Minimum junction read count required for the skipped version of an event to be analyzed.
Default: 1
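
To make the two thresholds concrete, here is a minimal sketch of how they might be applied to a single event. It is not moPepGen's implementation: the function name `versions_passing` is hypothetical, and it assumes an inclusive comparison (a count equal to the threshold passes) on counts that have already been reduced to one integer per event.

```python
def versions_passing(ijc, sjc, min_ijc=1, min_sjc=1):
    """Report which versions of an event meet the read count thresholds.

    ijc: inclusion junction count; sjc: skipping junction count.
    Assumption: the comparison is inclusive (count >= threshold).
    """
    return {
        "inclusion": ijc >= min_ijc,
        "skipped": sjc >= min_sjc,
    }


# With the defaults, an event with no inclusion reads keeps only the skipped version.
print(versions_passing(ijc=0, sjc=5))
# Raising --min-sjc above the observed count drops the skipped version as well.
print(versions_passing(ijc=0, sjc=5, min_sjc=10))
```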

-o, --output-path <file> Path

File path to the output file. Valid formats: ['.gvf']

--source str

Variant source (e.g. gSNP, sSNV, Fusion)

-g, --genome-fasta <file> Path

Path to the genome assembly FASTA file. Only ENSEMBL and GENCODE are supported. Its version must be the same as the annotation GTF and proteome FASTA.

-a, --annotation-gtf <file> Path

Path to the annotation GTF file. Only ENSEMBL and GENCODE are supported. Its version must be the same as the genome and proteome FASTA.

--reference-source str

Source of reference genome and annotation.
Choices: ['GENCODE', 'ENSEMBL']

--codon-table str

Codon table used for translation.
Default: Standard
Choices: {'Alternative Yeast Nuclear', 'Protozoan Mitochondrial', 'Vertebrate Mitochondrial', 'Blepharisma Macronuclear', 'Chlorophycean Mitochondrial', 'Ascidian Mitochondrial', 'Ciliate Nuclear', 'Mesodinium Nuclear', 'Balanophoraceae Plastid', 'SGC9', 'Cephalodiscidae Mitochondrial', 'Trematode Mitochondrial', 'Pachysolen tannophilus Nuclear', 'SGC2', 'Yeast Mitochondrial', 'SGC5', 'Euplotid Nuclear', 'Scenedesmus obliquus Mitochondrial', 'Peritrich Nuclear', 'Archaeal', 'Coelenterate Mitochondrial', 'Bacterial', 'Mold Mitochondrial', 'SGC3', 'Hexamita Nuclear', 'Pterobranchia Mitochondrial', 'Plant Plastid', 'Condylostoma Nuclear', 'Blastocrithidia Nuclear', 'Gracilibacteria', 'Alternative Flatworm Mitochondrial', 'Echinoderm Mitochondrial', 'Invertebrate Mitochondrial', 'SGC0', 'Candidate Division SR1', 'Dasycladacean Nuclear', 'SGC4', 'Flatworm Mitochondrial', 'SGC8', 'Thraustochytrium Mitochondrial', 'SGC1', 'Spiroplasma', 'Mycoplasma', 'Standard', 'Karyorelict Nuclear'}

--chr-codon-table str

Chromosome specific codon table. Must be specified in the format "chrM:SGC1", where "chrM" is the chromosome name and "SGC1" is the codon table used to translate genes on chrM. The supported codon tables are the same as the choices listed for --codon-table. By default, "SGC1" is assigned to mitochondrial chromosomes.
Default: []
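
The "chrM:SGC1" format can be illustrated with a short parsing sketch. The function `parse_chr_codon_tables` is hypothetical and only shows how such values split into a chromosome name and a codon table name; it does not validate the table name against the supported list.

```python
def parse_chr_codon_tables(values):
    """Turn ["chrM:SGC1", ...] into {"chrM": "SGC1", ...}."""
    mapping = {}
    for value in values:
        # Split on the first colon only; table names may contain spaces.
        chrom, sep, table = value.partition(":")
        if not sep or not chrom or not table:
            raise ValueError(f"Expected '<chromosome>:<codon table>', got {value!r}")
        mapping[chrom] = table
    return mapping


print(parse_chr_codon_tables(["chrM:SGC1", "chrX:Standard"]))
```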

--start-codons str

Default start codon(s) to use for novel ORF translation.
Default: ['ATG']

--chr-start-codons str

Chromosome specific start codon(s). For example, "chrM:ATG,ATA,ATT". By default, the mitochondrial chromosome name is automatically inferred and the start codons "ATG", "ATA", "ATT", "ATC" and "GTG" are assigned to it.
Default: []
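
As a sketch of the "chrM:ATG,ATA,ATT" format, the snippet below splits such a value into a chromosome name and a list of codons. The function `parse_chr_start_codons` is hypothetical, and the check that each codon is three nucleotides from A, C, G, T is an assumption added for illustration.

```python
def parse_chr_start_codons(values):
    """Turn ["chrM:ATG,ATA,ATT", ...] into {"chrM": ["ATG", "ATA", "ATT"], ...}."""
    mapping = {}
    for value in values:
        chrom, sep, codon_text = value.partition(":")
        codons = [c.strip().upper() for c in codon_text.split(",") if c.strip()]
        if not sep or not chrom or not codons:
            raise ValueError(f"Expected '<chromosome>:<codon>,<codon>', got {value!r}")
        for codon in codons:
            # Assumed sanity check: a codon is exactly three DNA nucleotides.
            if len(codon) != 3 or set(codon) - set("ACGT"):
                raise ValueError(f"Invalid codon {codon!r} in {value!r}")
        mapping[chrom] = codons
    return mapping


print(parse_chr_start_codons(["chrM:ATG,ATA,ATT"]))
```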

--index-dir <file> Path

Path to the directory of index files generated by moPepGen generateIndex. If given, --genome-fasta, --proteome-fasta and --annotation-gtf will be ignored.

--debug-level <value|number> str

Debug level.
Default: INFO

-q, --quiet

Quiet
Default: False