Protein sequence databases

One of the essential requirements of the proteomics community is a high-quality, annotated, non-redundant protein sequence database with an archival service and stable identifiers to enable protein identification and characterization.

EBI / EMBL-EBI - The European Bioinformatics Institute
Provides access to several biological data resources and bioinformatics services. EBI is a platform that covers the entire range of biological sciences, from raw DNA sequences to curated proteins, chemicals, structures, systems, pathways, ontologies and literature. Databases, tools and web services are provided for sharing data, performing queries and analyzing results. Users can also deposit their data through a data submission page. With few exceptions, all the resources are freely available without restriction.
NeXtProt
Offers seamless integration of, and navigation through, protein-related data. NeXtProt contains proteomics data for over 85% of human proteins. It also includes over 8000 phenotypic observations for over 4000 variations in a number of genes involved in hereditary cancers and channelopathies. All of the data are available via a user interface and an FTP site. API access and a SPARQL endpoint are also provided for more technical applications.
UniProt / Universal Protein Resource
A comprehensive resource for protein sequence and annotation data. The UniProt databases are the UniProt Knowledgebase (UniProtKB), the UniProt Reference Clusters (UniRef), the UniProt Archive (UniParc) and the UniProt Metagenomic and Environmental Sequences (UniMES) database. UniProt is a collaboration between the European Bioinformatics Institute (EMBL-EBI), the SIB Swiss Institute of Bioinformatics and the Protein Information Resource (PIR). EMBL-EBI and SIB together used to produce Swiss-Prot and TrEMBL, while PIR produced the Protein Sequence Database (PIR-PSD).
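Sequences downloaded from UniProtKB in FASTA format carry a header line that encodes the source database (`sp` for Swiss-Prot, `tr` for TrEMBL), the accession, the entry name and a few `KEY=value` fields. The following is a minimal sketch of how such a header could be split into its parts. The function name `parse_uniprot_header` and the example header (including its accession `P00000`) are hypothetical placeholders for illustration, not a real entry.

```python
import re

# Matches ">db|accession|entry_name rest-of-line", where db is "sp" or "tr".
HEADER_RE = re.compile(r"^>(sp|tr)\|([^|]+)\|(\S+)\s+(.*)$")
# The KEY=value fields that follow the protein name.
FIELD_RE = re.compile(r"\b(OS|OX|GN|PE|SV)=")


def parse_uniprot_header(header: str) -> dict:
    """Split a UniProtKB-style FASTA header into a dictionary of fields."""
    match = HEADER_RE.match(header.strip())
    if match is None:
        raise ValueError(f"not a UniProtKB-style header: {header!r}")
    db, accession, entry_name, rest = match.groups()
    # re.split with a capturing group yields:
    # [protein name, key1, value1, key2, value2, ...]
    parts = FIELD_RE.split(rest)
    result = {
        "database": "Swiss-Prot" if db == "sp" else "TrEMBL",
        "accession": accession,
        "entry_name": entry_name,
        "protein_name": parts[0].strip(),
    }
    for key, value in zip(parts[1::2], parts[2::2]):
        result[key] = value.strip()
    return result


# Hypothetical header used only to exercise the parser.
example_header = (
    ">sp|P00000|EXMP_HUMAN Example protein "
    "OS=Homo sapiens OX=9606 GN=EXMP PE=1 SV=1"
)
parsed = parse_uniprot_header(example_header)
```

Keeping the optional fields in a plain dictionary means a header without a `GN=` field parses without special handling.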
CDD / Conserved Domain Database
Provides a public repository for annotation of proteins. CDD includes more than 56 000 records from all source databases, grouped into 5 600 multi-model superfamilies. The database incorporates biomolecular sequences annotated with the location of evolutionarily conserved protein domain footprints and links to the functional sites related to them. Searches can be made by protein or nucleotide query, accession, GI number, or a sequence in FASTA format.
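Since a query can be supplied as a sequence in FASTA format, it may help to see what that format looks like when read programmatically. The sketch below is a small, generic FASTA reader and is not part of CDD. The function name `parse_fasta` and the two example records are made up for illustration.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return a mapping of record identifier -> concatenated sequence."""
    records: Dict[str, str] = {}
    current_id = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # The identifier is the first whitespace-delimited token after '>'.
            tokens = line[1:].split()
            current_id = tokens[0] if tokens else f"record_{len(records) + 1}"
            records[current_id] = ""
        elif current_id is not None:
            # Sequence lines may be wrapped; join them back together.
            records[current_id] += line
    return records


# Made-up example input with one wrapped and one single-line record.
example = """>query1 hypothetical example protein
MKTAYIAKQR
QISFVKSHFS
>query2
MADEEKLPPG
"""
records = parse_fasta(example.splitlines())
```

Each header line starts with `>`, and everything up to the next header belongs to the same sequence, which is why wrapped lines are concatenated.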
PATRIC / Pathosystems Resource Integration Center
Aims to assist scientists in infectious-disease research. PATRIC is a National Institutes of Health (NIH) supported bioinformatics resource center that has been built to enable comparative genomic analysis of bacterial pathogens. The database provides researchers with an online resource that stores and integrates a variety of data types (e.g. genomics, transcriptomics, protein-protein interactions (PPIs), three-dimensional protein structures and sequence typing data) and associated metadata. Tools and services for bacterial infectious disease research are also available.
PED / Pancreatic Expression database
Provides functions to extract, analyze, and integrate publicly available multi-omics datasets. PED is developed as a data repository that gives researchers a single entry point from which to manipulate, mine and integrate heterogeneous and isolated findings into their own research. It incorporates published findings on pancreatic precursor lesions, including pancreatic intraepithelial neoplasias (PanINs), intraductal papillary mucinous neoplasms (IPMNs) and mucinous cystic neoplasms (MCNs).
HPRD / Human Protein Reference Database
Provides access to experimentally derived information about the human proteome, including protein–protein interactions (PPIs), post-translational modifications (PTMs) and tissue expression. HPRD is an integrated knowledgebase for genomic and proteomic investigators. The database also includes (i) PhosphoMotif Finder, which contains known kinase/phosphatase substrate and binding motifs, (ii) links to a signaling pathway resource called NetPath, and (iii) a distributed annotation system called Human Proteinpedia for enhanced community participation. It also allows the use of BLAST for querying mRNA/protein data.
LIS / Legume Information System
A genomic data portal (GDP) for the legume family. LIS provides access to genetic and genomic information for major crop and model legumes. With more than two dozen domesticated legume species, there are numerous specialists working on particular species, and also numerous GDPs for these species. LIS has been redesigned both to better integrate data sets across the crop and model legumes, and to better accommodate specialized GDPs that serve particular legume species. To integrate data sets, LIS provides genome and map viewers, holds synteny mappings among all sequenced legume species and provides a set of gene families to allow traversal among orthologous and paralogous sequences across the legumes.
NCBI Protein
Contains amino acid sequences created from the translations of coding regions provided on nucleotide records. The Protein database is an online resource that provides related nucleotide sequences that originate from comparative studies: phylogenetic, population, environmental, and mutational. Each record in the database is a set of nucleotide sequences representing the same molecule from the same species, different identifiable species, or anonymous species from the same biological community. The database is a collection of sequences from several sources, including translations from annotated coding regions in GenBank, RefSeq and TPA, as well as records from SwissProt, PIR, PRF, and PDB.
RNaseP Database / Ribonuclease P Database
Provides a compilation of ribonuclease P sequences, sequence alignments, secondary structures, three-dimensional models and accessory information. RNaseP Database contains information on bacterial, archaeal and eukaryal (including organellar) RNase P, including both RNA and protein subunits. The data are available from lists arranged phylogenetically as individual sequences or as alignments in GenBank format. Organism names, accession numbers and citations are linked to the relevant NCBI/Entrez database records.
LEGER
Supports functional Listeria genome analyses by combining information obtained from bioinformatics methods and from public databases to improve the original annotations. LEGER offers three key features: (i) it is the first comprehensive information system focusing on the functional assignment of genes and proteins; (ii) its integrated visualization tools, KEGG pathway and Genome Viewer, ease the functional exploration of complex data; and (iii) it presents results of systematic post-genome studies, thus facilitating analyses that combine computational and experimental results.
CIPRO / Ciona Intestinalis PROtein database
An integrated protein database for the tunicate species C. intestinalis. The database is unique in two respects: first, because of its phylogenetic position, Ciona is a suitable model for understanding vertebrate evolution; and second, the database includes original large-scale transcriptomic and proteomic data. Ciona intestinalis has also been a favorite of developmental biologists, so large amounts of data exist on its development and morphology, along with a recent genome sequence and gene expression data. The CIPRO database aims to collect those published data and to provide unique information from unpublished experimental data, such as 3D expression profiling, 2D-PAGE and mass spectrometry-based large-scale analyses at various developmental stages, curated annotation data and various bioinformatic data, to facilitate research in diverse areas, including developmental, comparative and evolutionary biology.
ProtClustDB / Protein Clusters Database
Provides a range of information about proteins. ProtClustDB is a resource with two functions: (1) to update RefSeq genomes with curated gene and protein information; and (2) to provide a central aggregation source for information collected from a wide variety of sources that would be useful for scientists studying protein-level or genomic-level molecular functions. Information contained in this database is stored in four sets or groups: prokaryotes, phages, chloroplasts and mitochondria.
ChromDB / The chromatin database
Compiles information about chromatin-related proteins. ChromDB includes 7474 proteins: 3328 from plants, 1779 from animals, 2143 from fungi, 167 from stramenopiles, and 57 from protists. The sequences in the database are split into two categories: (i) genomic-based (limited to plant genomes) and (ii) transcript-based, derived from expressed sequence tag (EST) contigs or singlets. It also provides users with a variety of tools to visualize sequence information and to extract data by way of user-generated customized reports.
Uniclust
Supports protein sequence analysis, function prediction and sequence searching. Uniclust databases cluster UniProtKB sequences at the level of 90%, 50% and 30% pairwise sequence identity. The sequences in the database are annotated with matches to Pfam, SCOP domains, and proteins in the Protein Data Bank (PDB), using the HHblits homology detection tool. The database contains 17% more Pfam domain annotations than UniProt. The Uniclust server provides access to the Uniclust databases and deep HHblits domain annotations.
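To illustrate what clustering at 90%, 50% and 30% pairwise sequence identity means, here is a deliberately naive sketch. It is not the method used to build the Uniclust databases. It assumes short, ungapped, pre-aligned toy sequences, measures identity position by position, and applies a simple greedy scheme in which each sequence joins the first cluster representative it matches at or above the threshold. The names `identity` and `greedy_cluster` and the four toy sequences are invented for illustration.

```python
from typing import Dict, List


def identity(a: str, b: str) -> float:
    """Fraction of identical positions, relative to the longer sequence.

    Assumes the sequences are already aligned without gaps (toy simplification).
    """
    if not a or not b:
        return 0.0
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def greedy_cluster(seqs: Dict[str, str], threshold: float) -> Dict[str, List[str]]:
    """Greedy clustering: map representative id -> member ids."""
    clusters: Dict[str, List[str]] = {}
    # Process longest sequences first; break ties by identifier for determinism.
    for seq_id in sorted(seqs, key=lambda s: (-len(seqs[s]), s)):
        for rep in clusters:
            if identity(seqs[seq_id], seqs[rep]) >= threshold:
                clusters[rep].append(seq_id)
                break
        else:
            # No representative is similar enough: start a new cluster.
            clusters[seq_id] = [seq_id]
    return clusters


# Invented toy sequences of equal length.
toy = {
    "seqA": "MKTAYIAKQR",
    "seqB": "MKTAYIAKQL",  # 9 of 10 positions match seqA
    "seqC": "MKTAYGGGGG",  # 5 of 10 positions match seqA
    "seqD": "MKTWWWWWWW",  # 3 of 10 positions match seqA
}
clusters_90 = greedy_cluster(toy, 0.9)
clusters_50 = greedy_cluster(toy, 0.5)
clusters_30 = greedy_cluster(toy, 0.3)
```

Lowering the threshold merges more sequences into fewer, broader clusters, which is the practical difference between the three clustering levels.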
SIFGD / Setaria italica Functional Genomics Database
Provides search and analysis tools for bioinformatics analyses of gene function or regulatory modules. SIFGD was designed to integrate existing data from publications, to improve the proportion of gene annotation, and to provide popular functional analysis tools in a convenient format for use by Setaria researchers. Functional analysis modules, major components of SIFGD, are useful for studying biological processes, such as regulation, signaling, and metabolism.
MisPred / Miss Predict Protein Database
Allows the identification of erroneous (abnormal, incomplete and mispredicted) protein sequences in public databases. MisPred is a database that contains more than 80800 erroneous sequences identified in 19 metazoan species. For each entry, it provides the protein ID, the protein description, the database source, the species name and the type of sequence error(s) identified. Users can also analyse protein sequences for possible sequence errors using the MisPred quality control tools.
HSPIR / Heat Shock Protein Information Resource
Delivers information on six major heat shock protein (HSP) families across various genomes. HSPIR is organized according to a detailed sub-classification based on the domains, structural organization and localization of HSPs. It contains about 10 000 manually curated entries from six kingdoms, covering all the major model organisms, and about 300 3D structures. This tool can be useful for comparative analysis and for exploring additional physiological functions of HSPs in different species.
RiboDB
Provides a user-friendly tool allowing the rapid retrieval of ribosomal protein (r-protein) sequences for user-defined sets of prokaryotic species. The current version of RiboDB contains 90 r-proteins from 3,750 complete prokaryotic genomes encompassing 38 phyla/major classes and 1,759 different species. RiboDB bypasses the main step limiting the use of r-proteins to study prokaryotic systematics. Beyond systematics, RiboDB represents a valuable resource for proteomics studies.
SmProt / Small Proteins database
Contains a range of information about small proteins. SmProt is a database that provides a user-friendly website for users to submit, browse, search, BLAST, download or export data about small proteins. This database includes a BLAST alignment search service and an integrated local UCSC Genome Browser service, which allows the genomic locations of small proteins to be visualized. It predicts the functions of the small proteins curated from ribosome profiling calculation and literature mining, and describes a high-confidence set of small proteins.
JCDB / Jatropha curcas DataBase
A database that offers gene annotation of Jatropha curcas, also known as Barbados nut, purging nut or physic nut. Jatropha curcas, a member of the Euphorbiaceae family, is currently attracting much attention as a plant with high potential for biofuel plantations because of characteristics such as high seed oil content and easy propagation. With the rapid advance in molecular research on Jatropha curcas, a large body of genome, transcriptome and proteome analyses has emerged and been applied to decoding its molecular network.
hivmut / HIV Mutation
A database of mutagenesis and mutation information on Human Immunodeficiency Virus (HIV). Hivmut describes the phenotypes of 7,608 unique mutations at 2,520 sites in the HIV proteome, resulting from the analysis of 120,899 papers. The mutation information for each protein is organised in a residue-centric manner, and each residue is linked to the relevant experimental literature. The importance of HIV as a global health burden calls for extensive effort to maximise the efficiency of HIV research. The HIV mutation browser provides a valuable new resource for the research community.
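A residue-centric organisation can be pictured as an index keyed by protein and residue position, with every mutation reported at that site listed underneath together with its literature reference. The sketch below is an assumption about how such an index might look, not the database's actual schema. It uses the common substitution notation (wild-type residue, position, mutant residue, e.g. `K103N`). The function names, the protein label `RT` and the reference labels are placeholders for illustration only.

```python
import re
from collections import defaultdict
from typing import Dict, List, Tuple

# Substitution notation: one-letter wild type, position, one-letter mutant.
MUTATION_RE = re.compile(r"^([A-Z])(\d+)([A-Z])$")


def parse_mutation(code: str) -> Tuple[str, int, str]:
    """Split a code such as 'K103N' into (wild_type, position, mutant)."""
    match = MUTATION_RE.match(code)
    if match is None:
        raise ValueError(f"unrecognised mutation code: {code!r}")
    wild_type, position, mutant = match.groups()
    return wild_type, int(position), mutant


def build_residue_index(records) -> Dict[Tuple[str, int], List[dict]]:
    """Group (protein, mutation code, reference) records by residue site."""
    index = defaultdict(list)
    for protein, code, reference in records:
        wild_type, position, mutant = parse_mutation(code)
        index[(protein, position)].append(
            {"wild_type": wild_type, "mutant": mutant, "reference": reference}
        )
    return dict(index)


# Placeholder records; the reference labels are not real citations.
example_records = [
    ("RT", "K103N", "ref-1"),
    ("RT", "K103R", "ref-2"),
    ("RT", "M184V", "ref-3"),
]
residue_index = build_residue_index(example_records)
```

Looking up `residue_index[("RT", 103)]` then returns every substitution recorded at that site in the placeholder data, which is the kind of per-residue view the description refers to.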
A mass spectral reference database. The database consists of tryptic peptide fragmentation mass spectra derived from plants. Release 2.12/2013 contains 116,364 tryptic peptide product ion spectra entries for 48,218 different peptide sequences from Medicago truncatula, Chlamydomonas reinhardtii, Bradyrhizobium japonicum, Arabidopsis thaliana, Phaseolus vulgaris, Lotus japonicus, Lotus corniculatus, Lycopersicon esculentum, Solanum tuberosum, Nicotiana tabacum, Sinorhizobium meliloti, Glycine max, and Zea mays.
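Tryptic peptides are the fragments obtained when trypsin cleaves a protein. As a rough illustration of where such peptide sequences come from, the sketch below performs an in silico digestion using the commonly cited rule that trypsin cleaves after lysine (K) or arginine (R) unless the next residue is proline (P). It does not describe how this database was produced. The function name `tryptic_digest`, the `missed_cleavages` parameter and the example sequence are assumptions made for illustration.

```python
from typing import List


def tryptic_digest(protein: str, missed_cleavages: int = 0) -> List[str]:
    """Return peptides from cleaving after K/R, except when followed by P."""
    if not protein:
        return []
    # Collect cut positions: start of protein, each cleavage site, end of protein.
    sites = [0]
    for i in range(len(protein) - 1):
        if protein[i] in "KR" and protein[i + 1] != "P":
            sites.append(i + 1)
    sites.append(len(protein))

    peptides = []
    for start in range(len(sites) - 1):
        # Allow up to `missed_cleavages` uncut sites inside one peptide.
        for extra in range(missed_cleavages + 1):
            end = start + 1 + extra
            if end >= len(sites):
                break
            peptides.append(protein[sites[start]:sites[end]])
    return peptides


# Made-up example sequence; the R followed by P is not cleaved.
example_protein = "MKWVTFISLLRPAAKGG"
fully_cleaved = tryptic_digest(example_protein)
one_missed = tryptic_digest(example_protein, missed_cleavages=1)
```

Allowing missed cleavages produces longer overlapping peptides in addition to the fully cleaved ones, which is a common option when peptide lists are compared against measured spectra.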
