What is protein function prediction?


Identifying the function of a previously uncharacterized protein is a difficult task. Experimental techniques such as RNA interference can be used to assess the function of a protein, but they are often labor-intensive and cannot keep pace with the rate at which high-throughput sequencing identifies new sequences and genes.

 

Computational prediction has thus become a preferred way of assigning functions to uncharacterized proteins. Here, we briefly present recent methods and resources in computational protein function prediction.

 

Protein functions

 

Protein function is a broad term that covers several aspects of a protein's activity. Having a common vocabulary to describe protein functions is essential for function prediction and annotation. The Gene Ontology (GO) Consortium has established a classification of protein functions that uses a defined vocabulary for annotation. It divides protein functions into three main non-exclusive aspects:

 

  • Molecular function

Molecular function describes activities, such as direct physical interactions with other molecular entities, at the molecular level. GO molecular function terms represent activities and actions rather than the entities that perform the actions, and do not specify where or when, or in what context, the action takes place. Example: Transporter activity.

 

  • Biological process

A biological process is a series of events accomplished by one or more ordered assemblies of molecular functions. Example: Signal transduction.

 

  • Cellular component

A cellular component is a component of a cell that is part of some larger object, such as a macromolecular machine. It can be an anatomical structure (plasma membrane, mitochondrion) or a gene product group (ribosome).

 

Note that gene products can be characterized by all three categories. For example, “cytochrome c” can be described by the Molecular Function term “oxidoreductase activity”, the Biological Process term “oxidative phosphorylation”, and the Cellular Component terms “mitochondrial matrix” and “mitochondrial inner membrane”.
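To make the three-aspect vocabulary concrete, here is a minimal sketch of how an annotation record could be organized. The class name, field names and aspect keys are hypothetical choices for illustration, not the GO Consortium's data format; the terms are the cytochrome c example given above.

```python
from dataclasses import dataclass, field

# The three GO aspects named in the text (key spelling is our own choice).
ASPECTS = ("molecular_function", "biological_process", "cellular_component")


@dataclass
class GeneProductAnnotation:
    """Hypothetical container: GO-style terms for one gene product, by aspect."""
    name: str
    terms: dict = field(default_factory=lambda: {a: set() for a in ASPECTS})

    def add(self, aspect: str, term: str) -> None:
        # Reject anything that is not one of the three aspects.
        if aspect not in ASPECTS:
            raise ValueError(f"unknown GO aspect: {aspect}")
        self.terms[aspect].add(term)


# The cytochrome c example from the text, one entry per term.
cyt_c = GeneProductAnnotation("cytochrome c")
cyt_c.add("molecular_function", "oxidoreductase activity")
cyt_c.add("biological_process", "oxidative phosphorylation")
cyt_c.add("cellular_component", "mitochondrial matrix")
cyt_c.add("cellular_component", "mitochondrial inner membrane")

for aspect in ASPECTS:
    print(aspect, sorted(cyt_c.terms[aspect]))
```

Storing terms as one set per aspect reflects the fact that the aspects are non-exclusive: a gene product can carry several terms in each of them.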

 

Protein function prediction methods and tools

 

Protein function prediction uses various bioinformatics approaches to assign biological or biochemical roles to newly discovered proteins or predicted gene products.

Leading techniques rely on sequence homology, structure and motif similarity, genomic context, or a combination of these.

 

  • Homology-based function prediction

 

The function of a protein is encoded in its amino-acid sequence. Hence, proteins with similar sequences often share similar functions. In homology-based protein function prediction, new protein sequences are aligned and compared to annotated proteins using dedicated algorithms, and functions are predicted based on similarity. While a high percentage of sequence similarity often correlates with similar function, exceptions exist: some proteins with similar functions have entirely different sequences, and proteins with similar sequences can have different functions. Modern methods of homology-based function prediction therefore often use data beyond sequence similarity, such as types of evolutionary relationships, protein structure, protein-protein interactions or gene expression data.
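The basic idea of annotation transfer by similarity can be sketched in a few lines. This is an illustrative toy, not the method used by any tool listed here: it assumes the pairwise alignments have already been computed (gaps written as "-"), and the sequences, protein identifiers and the 40% identity cutoff are made-up values.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over alignment columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    matches = sum(a == b for a, b in pairs)
    return 100.0 * matches / len(pairs)


def transfer_annotation(query_alignments, annotations, min_identity=40.0):
    """Return (hit_id, identity, terms) for the most similar annotated hit.

    query_alignments: {hit_id: (aligned_query, aligned_hit)}
    annotations:      {hit_id: set of function terms}
    Returns None when no hit reaches min_identity (an arbitrary cutoff here).
    """
    best = None
    for hit_id, (aligned_query, aligned_hit) in query_alignments.items():
        identity = percent_identity(aligned_query, aligned_hit)
        if best is None or identity > best[1]:
            best = (hit_id, identity)
    if best is None or best[1] < min_identity:
        return None
    return best[0], best[1], annotations.get(best[0], set())


# Toy, made-up alignments of one query against two annotated proteins.
alignments = {
    "protA": ("MKT-LLVA", "MKTALLVA"),
    "protB": ("MKTLLVA", "MRSLIVG"),
}
annotations = {
    "protA": {"transporter activity"},
    "protB": {"signal transduction"},
}

print(transfer_annotation(alignments, annotations))
```

Because similar sequences do not always imply similar function, a real pipeline would treat the transferred terms as hypotheses and weigh them against additional evidence.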

 

Top tools for homology-based function prediction:

Argot, a combined approach that clusters GO terms according to their semantic similarity and applies a weighting scheme to retrieved hits that share biological features with the sequence to annotate.

CombFunc, an algorithm that identifies conserved residues present in alignments of proteins with the same GO annotations and uses them to assign function to a query sequence.

 

  • Structure-based function prediction

 

Protein 3D structures play a fundamental role in protein function and are generally better conserved than protein sequences. Therefore, structural similarity is often a good indicator of shared function between two or more proteins. Structure-based function prediction algorithms compare the structures of uncharacterized proteins to structure databases, or first predict 3D models from sequences and then compare them. This approach is often applied to particular structural motifs, such as active or binding sites, rather than to the whole protein.
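As a rough illustration of comparing a small structural motif, the sketch below computes a distance RMSD (dRMSD): the root-mean-square difference between the internal pairwise distances of two sets of atoms. It is not the algorithm used by the tools listed here. It assumes the atoms of the two motifs are already in one-to-one correspondence, and the coordinates are invented.

```python
import math


def distance_matrix(coords):
    """All pairwise Euclidean distances within one set of 3D points."""
    return [[math.dist(p, q) for q in coords] for p in coords]


def distance_rmsd(coords_a, coords_b):
    """dRMSD between two equally sized, index-matched point sets.

    Only internal distances are compared, so the value does not change
    when either structure is rotated or translated.
    """
    if len(coords_a) != len(coords_b):
        raise ValueError("structures must have the same number of points")
    n = len(coords_a)
    if n < 2:
        raise ValueError("need at least two points")
    da, db = distance_matrix(coords_a), distance_matrix(coords_b)
    total = sum(
        (da[i][j] - db[i][j]) ** 2 for i in range(n) for j in range(i + 1, n)
    )
    return math.sqrt(total / (n * (n - 1) / 2))


# Made-up three-atom "motif", a translated copy, and a stretched copy.
motif = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (0.0, 4.0, 0.0)]
shifted = [(x + 1.0, y + 1.0, z + 1.0) for x, y, z in motif]
stretched = [(2 * x, 2 * y, 2 * z) for x, y, z in motif]

print(distance_rmsd(motif, shifted))    # same geometry, different position
print(distance_rmsd(motif, stretched))  # different geometry
```

A low dRMSD suggests that two sites share the same local geometry. One known limitation of this measure is that it cannot tell a structure from its mirror image, which superposition-based scores can.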

 

Top tools for structure-based function prediction:

I-TASSER, a platform for automated protein structure prediction and structure-based function annotation.

COFACTOR, a structure-based method for biological function annotation of protein molecules. COFACTOR threads the query structure through the BioLiP protein function database, using local and global structure matches to identify functional sites and homologies.

 

  • Genomic context-based function prediction

 

Because proteins are encoded by genes, proteins encoded by genes that are similar in sequence or evolutionarily related are often assumed to have similar functions. Taking advantage of the increasing number of sequenced genomes, genomic context-based function prediction relies on preexisting information such as a gene's chromosomal position relative to other genes and its evolutionary record across genomes. This provides clues to the biological processes in which the uncharacterized protein may be involved.
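One simple form of genomic context is conserved gene neighborhood: genes that sit next to the query gene in many genomes are candidates for taking part in the same biological process. The sketch below counts such neighbors. It is a toy under stated assumptions: each genome is reduced to an ordered list of gene names, orthologs share the same name across genomes, and all gene names and gene orders are invented.

```python
from collections import Counter


def conserved_neighbors(genomes, query, window=1):
    """Count, for each gene, the number of genomes in which it lies
    within `window` positions of `query` on the chromosome."""
    counts = Counter()
    for gene_order in genomes.values():
        neighbors = set()
        for i, gene in enumerate(gene_order):
            if gene == query:
                start = max(0, i - window)
                neighbors.update(gene_order[start:i])               # upstream
                neighbors.update(gene_order[i + 1:i + 1 + window])  # downstream
        neighbors.discard(query)
        counts.update(neighbors)  # each genome contributes at most 1 per gene
    return counts


# Invented gene orders for three genomes; "unkX" is the uncharacterized gene.
genomes = {
    "genome1": ["geneA", "geneB", "unkX", "geneC", "geneD"],
    "genome2": ["geneD", "geneE", "geneB", "unkX", "geneC"],
    "genome3": ["unkX", "geneC", "geneB", "geneF"],
}

for gene, n in conserved_neighbors(genomes, "unkX").most_common():
    print(gene, n)
```

In this toy data, a gene that neighbors the query in every genome would be the strongest hint about the process the query is involved in. Phylogeny-based tools go further and use the evolutionary relationships between sequences.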

 

Top tools for genomic context-based function prediction:

PhyloPFP, a tool that exploits phylogenetics to establish the evolutionary distance of sequences retrieved from database searches.

SIFTER, a phylogeny-based protein function prediction algorithm.

 

Find more tools for protein function prediction on omicX.