Proteogenomics is an area of research at the interface of proteomics and genomics. In this approach, customized protein sequence databases generated using genomic and transcriptomic information are used to help identify novel peptides (not present in reference protein sequence databases) from mass spectrometry-based proteomic data; in turn, the proteomic data can be used to provide protein-level evidence of gene expression and to help refine gene models.
Predicts a peptide’s proteotypic propensity based on its physico-chemical properties. PeptideSieve (i) performs an in silico digest of the protein, (ii) converts each of the peptides into chemical property strings, and (iii) computes a likelihood function that scores how likely each peptide is to be proteotypic. The resulting predictors can accurately identify proteotypic peptides from any protein sequence and offer starting points for generating a physical model describing the factors that make a peptide proteotypic.
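The first two steps described above (in silico digestion, then conversion of peptides into property values) can be sketched as follows. This is only an illustration under stated assumptions, not PeptideSieve's implementation: the cleavage rule is the common "after K or R, not before P" trypsin convention, the property set is reduced to length and mean Kyte-Doolittle hydropathy, and the function names `tryptic_digest` and `peptide_features` are hypothetical.

```python
import re

# Kyte-Doolittle hydropathy scale (a standard published scale).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def tryptic_digest(protein, min_len=6):
    """Split after K or R unless the next residue is P (no missed cleavages)."""
    # Zero-width split; requires Python 3.7 or later.
    pieces = re.split(r"(?<=[KR])(?!P)", protein)
    # The length filter also drops the empty string produced when the
    # protein ends in K or R.
    return [p for p in pieces if len(p) >= min_len]


def peptide_features(peptide):
    """Return a small, illustrative property vector for one peptide."""
    # Residues missing from the scale contribute 0.0 (an assumption).
    hydropathy = [KD.get(aa, 0.0) for aa in peptide]
    return {
        "length": len(peptide),
        "gravy": sum(hydropathy) / len(peptide),  # mean hydropathy
    }


def featurize(protein, min_len=6):
    """Digest a protein and attach features to each resulting peptide."""
    return {p: peptide_features(p) for p in tryptic_digest(protein, min_len)}
```

A trained classifier or likelihood model would then take these property vectors as input; that scoring step is not reproduced here.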
Offers an approach for discovering novel peptides. IPAW provides methods dedicated to the curation and validation of novel peptides, including single amino acid variant (SAAV) peptides, using multiple independent sources. It supports several search strategies, such as searching high-resolution isoelectric focusing (HiRIEF) mass spectrometry data against a six-frame translation (6FT) database, or concatenated database searches using databases derived from sequencing data.
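For readers unfamiliar with six-frame translation, the idea is to translate a nucleotide sequence in all three reading frames on both strands so that peptides can be matched without relying on gene annotation. The sketch below uses the standard genetic code and is a generic illustration, not code from IPAW; the frame labels such as `+1` and `-3` are an arbitrary naming choice.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}


def reverse_complement(dna):
    """Return the reverse complement of an uppercase A/C/G/T string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna):
    """Return a dict mapping frame label to translated sequence ('*' = stop)."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames
```

In practice the translated frames are split at stop codons and short fragments are discarded before the database is searched, which is one reason six-frame databases become large.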
Offers a platform for generating annotations for protein-coding sequences. iPtgxDBs is an open source application that merges annotation files, either produced by other modules of the platform or supplied as external files. The program clusters the annotations and integrates them hierarchically to create (i) a FASTA database with informative identifiers and (ii) a file containing all annotations.
A proteogenomic pipeline that delineates true in vivo proteoforms and generates a protein sequence search space for peptide to MS/MS matching. PROTEOFORMER can be combined with canonical protein databases or used independently for identification of novel translation products. The pipeline makes use of the recently developed next generation sequencing strategy termed ribosome profiling (RIBO-seq) that provides genome-wide information on protein synthesis in vivo.
A program that can use data from public repositories of genetic variants or sample-specific next-generation sequencing (NGS) data to automate the generation of variant-containing databases and thereby enable detection of single amino acid polymorphism (SAP)-containing peptides. sapFinder also allows for efficient tandem mass spectrometry (MS/MS) data searching, post-processing and report generation in an HTML-based format.
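A variant-containing database of the kind described above is, at its simplest, a FASTA file in which each known substitution is applied to the reference protein and written as an extra entry. The following sketch assumes variants are already expressed at the protein level as (1-based position, reference residue, alternate residue); the function names and the identifier format `ID_A4V` are hypothetical and do not reflect sapFinder's own output.

```python
def apply_sap(protein, position, ref, alt):
    """Substitute one residue at a 1-based position, checking the reference."""
    if not 1 <= position <= len(protein):
        raise ValueError(f"position {position} is outside the protein")
    if protein[position - 1] != ref:
        raise ValueError(
            f"expected {ref} at position {position}, "
            f"found {protein[position - 1]}"
        )
    return protein[:position - 1] + alt + protein[position:]


def variant_fasta(protein_id, protein, variants):
    """Build FASTA text with the reference entry plus one entry per variant."""
    records = [f">{protein_id}\n{protein}"]
    for position, ref, alt in variants:
        mutated = apply_sap(protein, position, ref, alt)
        # Hypothetical header convention: <id>_<ref><pos><alt>
        records.append(f">{protein_id}_{ref}{position}{alt}\n{mutated}")
    return "\n".join(records) + "\n"
```

Real pipelines usually start from nucleotide-level variant calls and must first project them onto coding sequences; that step is omitted here.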
Allows in-depth visualization of prokaryotic transcriptomic and proteomic data in conjunction with genomics data. MINOMICS generates interactive linear genome maps in which multiple experimental datasets are displayed together with operon, regulatory motif, transcriptional promoter and transcriptional terminator information. The linear chromosome maps created by MINOMICS provide researchers with a tool to comprehensively mine their experimental data. The tool facilitates documenting this procedure and sharing the results by allowing researchers to export currently displayed genome maps to publication grade images.
A tool for improving existing genomic annotations using available proteomics mass spectrometry data. Because most genome annotation pipelines consist of automated gene finding, they lack experimental validation of primary structure and must rely on DNA-centric sources of data such as sequence homology, transcriptome mapping, codon frequency, etc. By incorporating this orthogonal set of data, proteogenomics can discover novel genes and post-translational modifications (PTMs) and correct erroneous primary sequence annotations.
Gathers protein databases for prokaryotic organisms. MSMSpdbb is a free standalone program that compiles and clusters various bacterial strains to obtain their complete genomic translations. It aims to generate protein databases that allow detection of sequence variations between strains, such as single nucleotide polymorphisms (SNPs) and divergent translational start sites (TSSs), and that account for annotation errors and genetic variations among closely related organisms.
Constructs customized transcript databases for tandem mass spectra search. RNA-Seq data is used to generate transcripts and to resolve shared peptide protein inference in a proteogenomic network. MSProGene is independent from existing reference databases or annotated SNPs and avoids large six-frame translated databases by constructing sample-specific transcripts. In addition, it creates a network combining RNA-Seq and peptide information that is optimized by a maximum-flow algorithm. It thereby also allows resolving the ambiguity of shared peptides for protein inference.
Provides a proteogenomic analysis application. pAnno combines several functions: (i) browsing the genome, (ii) constructing a frame-translation protein database for tandem mass spectrometry (MS/MS) identification, and (iii) using MS/MS search results to re-annotate the genome. Registration is required to download the software.
Allows users to identify translated fusions and micro structural variations (microSVs) in matching omics datasets. ProTIE is a standalone program with two main components: (i) MiStrVar, which captures multiple types of microSVs in WGS datasets, and (ii) DeFuse, a fusion detection tool. The application also incorporates RNA-Seq evidence to validate expressed microSVs and to detect translated peptides from genomic and transcriptomic aberrations.
Maps and integrates peptide data to genome structure. PV is an easy-to-use and straightforward tool to visualize and assess the MS2 quality of any proteomic dataset. It can be applied to any dataset, including those not acquired as part of a proteogenomic effort. It was designed to help proteogenomic researchers rank the quality of novel peptide forms through better integration and accessibility of parameters.
Creates multiple databases of DNA polymorphisms, mutations, splice junctions and partially tryptic peptides, as well as protein fragments translated from the whole transcriptome in all six frames after RNA-seq de novo assembly. JUMPg is an effective proteogenomics tool for multi-omics data integration. This pipeline includes customized database construction, tag-based database search, peptide-spectrum match filtering, and data visualization.
A software tool designed to perform every necessary task of proteogenomic searches quickly, accurately and automatically. The software generates a peptide database from a genome, tracks peptide loci, matches peptides to MS/MS spectra and assigns confidence values to those matches. Peppy automatically performs a decoy database generation, search and analysis to return identifications at the desired false discovery rate threshold. Written in Java for cross-platform execution, the software is fully multithreaded for enhanced speed. The program can run on regular desktop computers, opening the doors of proteogenomic searching to a wider audience of proteomics and genomics researchers.
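The decoy-based false discovery rate control mentioned above is commonly done with a target-decoy strategy: decoy sequences (for example, reversed proteins) are searched alongside real ones, and the ratio of decoy to target matches above a score cutoff estimates the FDR. The sketch below illustrates that general strategy only; it is not Peppy's scoring or decoy scheme, it is written in Python rather than Java for brevity, and the function names are hypothetical. It also assumes higher scores are better and that scores are distinct.

```python
def make_decoy(sequence):
    """Return a reversed-sequence decoy (one common decoy construction)."""
    return sequence[::-1]


def score_threshold_at_fdr(psms, max_fdr=0.01):
    """Find the lowest score cutoff whose estimated FDR is within max_fdr.

    psms: iterable of (score, is_decoy) peptide-spectrum matches.
    FDR at a cutoff is estimated as (#decoys) / (#targets) among matches
    scoring at or above that cutoff. Returns None if no cutoff qualifies.
    """
    ranked = sorted(psms, key=lambda psm: psm[0], reverse=True)
    targets = decoys = 0
    threshold = None
    for score, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        if targets and decoys / targets <= max_fdr:
            threshold = score
    return threshold


def accepted_targets(psms, max_fdr=0.01):
    """Return the scores of target matches passing the FDR-derived cutoff."""
    cutoff = score_threshold_at_fdr(psms, max_fdr)
    if cutoff is None:
        return []
    return [s for s, is_decoy in psms if not is_decoy and s >= cutoff]
```

Because proteogenomic databases are much larger than reference proteomes, the same nominal FDR typically demands a stricter score cutoff, which is why an automated decoy search is useful in this setting.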
A Java-based software package that is designed to integrate genomic and transcriptomic data generated from next-generation sequencing with proteomic data generated from protein mass spectrometry. PG Nexus allows users to covisualize peptides in the context of genomes or genomic contigs, along with RNA-seq reads. This is done in the Integrative Genomics Viewer (IGV). A Results Analyzer reports the precise base position where LC-MS/MS-derived peptides cover genes or gene isoforms, on the chromosomes or contigs where this occurs. In prokaryotes, the PG Nexus pipeline facilitates the validation of genes, where annotation or gene prediction is available, or the discovery of genes using a "virtual protein"-based unbiased approach.
An open source software suite for analysis and visualization of proteogenomic data. PGTools is comprised of applications, libraries, customized databases and visualization tools for analysis of mass-spectrometry data using combined proteomic and genomic backgrounds. A single command is sufficient to search databases, calculate false discovery rates, group and annotate proteins, generate peptide databases from RNA-Seq transcripts, identify altered proteins associated with cancer and visualize genome scale peptide datasets using sophisticated visualization tools.
Supports proteogenomic integration of mass spectrometry proteomics data with next-generation sequencing by mapping identified peptides onto their putative genomic coordinates. PGx is a useful addition to the toolset of any proteogenomics practitioner because it does not impose any particular informatics approach on the proteomics branch of the workflow; it simply relies on three files summarizing the results of the intermediate sequencing efforts to establish full data integration: a BED file, a FASTA file, and a set of peptides.
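The core operation described above, placing a peptide on the genome from a protein sequence and its exon coordinates, can be sketched as below. This is a simplified illustration, not PGx's algorithm: it assumes the coding blocks are given as 0-based half-open intervals (as in BED), that they contain only coding sequence starting in frame at the first residue, and that only the first occurrence of the peptide in the protein matters. The function name `peptide_to_genome` is hypothetical.

```python
def peptide_to_genome(peptide, protein, exons, strand="+"):
    """Map a peptide to genomic intervals via its position in the protein.

    exons: list of (start, end) coding blocks, 0-based half-open,
           sorted by ascending genomic coordinate.
    Returns a sorted list of (start, end) genomic intervals; a peptide
    spanning a splice junction yields more than one interval.
    """
    index = protein.find(peptide)
    if index < 0:
        return []
    # Peptide span within the spliced coding sequence (transcript order).
    cds_start = 3 * index
    cds_end = 3 * (index + len(peptide))
    # On the minus strand, transcript order is the reverse of genomic order.
    blocks = exons if strand == "+" else exons[::-1]
    hits = []
    consumed = 0  # coding bases covered by the blocks visited so far
    for start, end in blocks:
        length = end - start
        lo = max(cds_start, consumed)
        hi = min(cds_end, consumed + length)
        if lo < hi:
            if strand == "+":
                hits.append((start + lo - consumed, start + hi - consumed))
            else:
                hits.append((end - (hi - consumed), end - (lo - consumed)))
        consumed += length
    return sorted(hits)
```

For example, with a five-residue protein encoded by two blocks of 6 and 9 bases, a three-residue peptide starting at the second residue spans the junction and maps to one interval in each block.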
A Python-based proteogenomic pipeline providing automated discovery of single-amino-acid polymorphisms (SAPs), indels, and alternatively spliced variants from raw transcriptome and exome sequence data, single-nucleotide polymorphism (SNP) annotation and filtration, and the prediction of proteotypic peptides. PPLine integrates into a single pipeline a set of popular tools: Trimmomatic, Tophat2, SAMtools, GATK, Cufflinks, and Annovar.
A proteogenomic pipeline that is based on a nucleotide exon graph. This pipeline consists of constructing a compact nucleotide exon graph that systematically incorporates novel splice variations and a search tool that identifies peptides by directly searching the nucleotide exon graph against tandem mass spectra. NextSearch outputs the proteome-genome/transcriptome mapping results in a general feature format (GFF) file, which can be visualized by public tools such as the UCSC Genome Browser.
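To make the GFF output concrete, the sketch below formats one peptide-to-genome mapping as a GFF3 line: nine tab-separated columns with 1-based inclusive coordinates. It is a generic illustration rather than NextSearch's writer; the `source` value, the feature type `region`, and the attribute layout are assumptions chosen for the example.

```python
def gff3_line(seqid, start0, end0, strand, peptide,
              source="example", feature="region"):
    """Format one mapping as a GFF3 line.

    start0/end0 are 0-based half-open coordinates; GFF3 uses 1-based
    inclusive coordinates, so only the start is shifted by one.
    """
    attributes = f"ID={peptide};Name={peptide}"
    columns = [
        seqid, source, feature,
        str(start0 + 1), str(end0),
        ".",      # score: not provided
        strand,
        ".",      # phase: not provided
        attributes,
    ]
    return "\t".join(columns)


def gff3_document(lines):
    """Prefix the required GFF3 version pragma to a list of feature lines."""
    return "\n".join(["##gff-version 3", *lines]) + "\n"
```

A file built this way can be loaded as a custom track in a genome browser, provided the sequence names match those of the browser's assembly.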