Computational protocol: Computational disease gene identification: a concert of methods prioritizes type 2 diabetes and obesity candidate genes

Protocol publication

[…] The researchers contributing to this collaboration have previously developed, tested and published their methodologies. Here, we summarize these methods, with reference to seminal publications and further information for each method.

GeneSeeker: GeneSeeker is a web tool that filters positional candidate disease genes based on expression and phenotypic data from both human and mouse. It queries several online databases directly through the web, which ensures that the most recent data are used at all times and removes the need for local repositories. In a test using 10 syndromes, GeneSeeker reduced the candidate gene lists from an average of 163 position-based candidate genes to an average of 22 candidates based on position and expression or phenotype. Though particularly well suited for syndromes in which the disease gene shows altered expression patterns in the affected tissues, it can also be applied to more complex diseases.

Analysis of candidate gene expression using eVOC annotation: This method performs candidate disease gene selection using the eVOC (a controlled vocabulary for unifying gene expression data) anatomy ontology. It selects candidate disease genes according to their expression profiles, using the eVOC anatomical system ontology as a bridging vocabulary to integrate clinical and molecular data through a combination of text- and data-mining. The method first associates each eVOC anatomy term with the disease name according to their co-occurrence in PubMed abstracts, then ranks the identified anatomy terms and selects candidate genes annotated with the top-ranking terms. Candidate disease genes are thus selected according to their expression profiles within tissues associated with the disease of interest.
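
The two steps just described (ranking anatomy terms by co-occurrence with the disease name, then keeping genes annotated with the top-ranking terms) can be sketched as follows. This is a minimal illustration, not the published eVOC pipeline: the function names, the plain substring matching and the use of a raw co-occurrence count as the ranking score are all assumptions made for the example.

```python
from collections import Counter


def rank_anatomy_terms(abstracts, disease, anatomy_terms):
    """Rank anatomy terms by the number of abstracts that mention
    both the term and the disease name (hypothetical scoring)."""
    counts = Counter()
    disease_lc = disease.lower()
    for text in abstracts:
        text_lc = text.lower()
        if disease_lc not in text_lc:
            continue  # abstract does not mention the disease
        for term in anatomy_terms:
            if term.lower() in text_lc:
                counts[term] += 1
    # Highest count first; ties broken alphabetically for a stable order.
    return sorted(counts, key=lambda term: (-counts[term], term))


def select_candidates(gene_annotations, ranked_terms, top_n):
    """Keep genes annotated with at least one of the top_n ranked terms."""
    top_terms = set(ranked_terms[:top_n])
    return sorted(
        gene for gene, terms in gene_annotations.items()
        if top_terms & set(terms)
    )


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    abstracts = [
        "Type 2 diabetes impairs insulin secretion in the pancreas.",
        "Obesity and type 2 diabetes alter liver and pancreas metabolism.",
        "Heart development in zebrafish.",
    ]
    ranked = rank_anatomy_terms(
        abstracts, "type 2 diabetes", ["pancreas", "liver", "heart"]
    )
    annotations = {
        "GENE_A": {"pancreas"},
        "GENE_B": {"heart"},
        "GENE_C": {"liver", "heart"},
    }
    print(ranked)
    print(select_candidates(annotations, ranked, top_n=1))
```

Terms that never co-occur with the disease name are dropped from the ranking rather than ranked last, which keeps the candidate set tied to tissues with at least some literature support.
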
In a test of 20 known disease-associated genes, the gene was present in the selected subset of candidate genes in 19 of 20 cases (95%), and the candidate gene set was reduced on average to 64.2% (±10.7%) of its original size.

Disease Gene Prediction (DGP): Genes already known to be involved in monogenic hereditary disease have been shown to follow specific sequence property patterns that make them more likely to suffer pathogenic mutations. Based on these patterns, DGP assigns each gene a probability that indicates its likelihood of mutating, based solely on its sequence properties. In particular, the properties analysed by DGP are protein length, degree of conservation, phylogenetic extent and paralogy pattern. The performance of this method was assessed previously by building a model with part of the data (learning set: 75%) and testing it with the rest (test set: 25%). On average, 70% of the disease genes in the test set were predicted correctly, with 67% precision. Genes involved in complex diseases, like monogenic disease genes, need to have mutations or variations in the gene sequence that impair or modify the function or expression of the protein they encode, leading to a disease phenotype. Thus, we believe that, although DGP was designed for the prediction of genes involved in Mendelian diseases, it can also be useful for identifying complex-disease genes, as it will identify those genes with a higher likelihood of suffering mutations.

PROSPECTR and SUSPECTS: Genes implicated in disease can be shown to share certain patterns of sequence-based features, such as larger gene lengths and broader conservation through evolution. PROSPECTR is an alternating decision tree which has been trained to differentiate between genes likely to be involved in disease and genes unlikely to be involved in disease.
By using sequence-based features such as gene length, protein length and the percent identity of homologs in other species as input, a score (ranging from 0 to 1) can be obtained for any gene of interest. Genes with scores over a threshold of 0.5 are classified as likely to be involved in some form of human hereditary disease, while genes with scores under that threshold are classified as unlikely to be involved in disease. The score itself is a measure of confidence in the classification. PROSPECTR requires only basic sequence information to classify genes as likely or unlikely to be involved in disease.

SUSPECTS builds on this by incorporating annotation data from Gene Ontology (GO), InterPro and expression libraries. Candidate genes are scored using PROSPECTR and also on how significantly similar their annotation is to that of a set of genes already implicated in the same disorder (the ‘training set’). This enables SUSPECTS to rank genes according to the likelihood that they are involved in a particular disorder rather than in human hereditary disease in general. SUSPECTS leverages the structure of the GO, requiring GO terms to be sufficiently closely related semantically to be considered significant. As a rank-based system, it requires potential candidates to share GO terms with other disease genes to a greater extent than the other genes in the same region of interest.

Performance of both PROSPECTR and SUSPECTS was tested separately with a set of oligogenic and complex disorders including Alzheimer's disease, hypertension, autism and systemic lupus erythematosus. At least two implicated genes for each disease were available. For each implicated gene, a region of interest was created containing the implicated gene itself (the ‘target gene’) and every gene within 7.5 Mb on either side. On average each region of interest contained 155 genes.
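
The region-of-interest construction and the 0.5 score threshold described above can be sketched as follows. This is a minimal illustration rather than the PROSPECTR or SUSPECTS implementation: the `Gene` record, the helper names and the choice to measure the 7.5 Mb window from the target gene's start and end coordinates are assumptions made for the example.

```python
from dataclasses import dataclass

WINDOW_BP = 7_500_000      # 7.5 Mb on either side of the target gene
SCORE_THRESHOLD = 0.5      # threshold stated in the text


@dataclass(frozen=True)
class Gene:
    """Hypothetical gene record: name, chromosome and base-pair coordinates."""
    name: str
    chrom: str
    start: int
    end: int


def region_of_interest(target, genes, window_bp=WINDOW_BP):
    """Return the genes on the target's chromosome that overlap the window
    [target.start - window_bp, target.end + window_bp] (assumed definition)."""
    lo = target.start - window_bp
    hi = target.end + window_bp
    return [
        g for g in genes
        if g.chrom == target.chrom and g.end >= lo and g.start <= hi
    ]


def classify(score, threshold=SCORE_THRESHOLD):
    """Label a score in [0, 1] using the threshold from the text."""
    return "likely" if score > threshold else "unlikely"


if __name__ == "__main__":
    # Toy coordinates, invented for illustration only.
    genes = [
        Gene("TARGET", "1", 10_000_000, 10_050_000),
        Gene("NEAR_LEFT", "1", 2_600_000, 2_650_000),
        Gene("TOO_FAR", "1", 2_000_000, 2_400_000),
        Gene("OTHER_CHROM", "2", 10_000_000, 10_010_000),
        Gene("NEAR_RIGHT", "1", 17_500_000, 17_600_000),
    ]
    print([g.name for g in region_of_interest(genes[0], genes)])
    print(classify(0.73), classify(0.5))
```

A score of exactly 0.5 is labelled "unlikely" here because the text speaks of scores over the threshold; how the original tool treats that boundary is not stated.
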
Associated training sets were then created for SUSPECTS containing the remaining implicated genes for each disorder.

Using PROSPECTR, on average the target gene was in the top 31.23% of the resulting ranked lists of candidates and in the top 5% of those lists 20 times out of 156 (13%). In comparison, on average the target gene was in the top 12.93% of the ranked list from SUSPECTS, which took both the region of interest and the relevant training set as input in each case. The target gene was in the top 5% of the ranked list 87 times out of 156 (56%).

G2D: This system scores all terms in GO according to their relevance to each disease, starting from MEDLINE queries featuring the name of the disease. This is done by relating symptoms to GO terms through chemical compounds, combining fuzzy binary relations between them previously inferred from the whole MEDLINE and RefSeq databases. Then, to identify candidate genes in a given chromosomal region, G2D (genes to diseases) performs BLASTX searches of the region against all the (GO-annotated) genes in RefSeq. All hits in the region with an E-value <10e−10 are registered and sorted according to the GO-score of the RefSeq gene they hit (the average of the scores of its GO annotations). Note that hits in the genome might correspond to known or unknown genes, or to a pseudogene. In a test with 100 diseases chosen at random from OMIM (Online Mendelian Inheritance in Man), using bands of 30 Mb [the average size of linkage regions], G2D detected the disease gene in 87 cases. In 39% of these it was among the best three candidates, and in 47% among the best eight candidates.

POCUS: POCUS exploits the tendency for genes predisposing to the same disease to have identifiable similarities, such as shared GO annotation, shared InterPro domains or a similar expression profile.
Therefore, where genes within different susceptibility regions for the same disease share GO or InterPro annotation and/or are co-expressed, these genes may be considered good candidates. Although genes may be selected as candidates on the basis of sharing only a single GO term or InterPro domain, genes lacking this annotation completely will not be selected. Some polygenic/complex diseases may be caused by different genes that are not functionally related. In such cases this method would not be expected to select the disease genes as candidates, but may still, by chance, find functional similarities between some other genes in the regions (especially where there are many regions or the regions contain many genes). Each observed similarity between genes in different regions is given a score. The score is based on the probability of seeing such specific (or more specific) similarities between genes in different randomly chosen regions of the genome containing as many genes. Where such a specific (or more specific) similarity would not be seen by chance in >5% of the sets of randomly chosen regions analysed, the similar genes are considered to be good candidates. Therefore, in cases where disease genes are not functionally related (or where there are no data to suggest the disease genes are functionally related), POCUS will select no candidate genes in 95% of cases. This means that POCUS is far more conservative than the other methods discussed. Where many large regions are analysed, almost any similarity between genes in different regions will have a considerable probability of being seen by chance. This method is therefore unlikely to be successful when many large regions are analysed, so analysis should be restricted to the most tightly defined and best-supported regions available.

The performance of POCUS was tested by using it to look for known disease genes. Test susceptibility regions were created containing known disease genes and the surrounding genes.
Test susceptibility regions were created for 120 diseases for which more than one associated gene appears in the OMIM database. POCUS was then used to analyse the set of test regions corresponding to each disease. The performance was measured by the percentage of known disease genes selected as candidates from the test regions. The enrichment for disease genes in the selected genes compared to the whole susceptibility region was also considered. Enrichment was calculated as:

Enrichment = (disease genes selected / non-disease genes selected) / (disease genes in region / all genes in region)

Where the test regions contained 20 genes in total, the percentage of disease genes found was 41.7% and enrichment was 10.5-fold. For 100 genes the equivalent figures were 25.8% and 36.9-fold, respectively, and for 200 genes 14.9% and 46.3-fold. It is important to note that these results were obtained with no prior knowledge of disease pathogenesis. However, POCUS can also take into account prior knowledge of the disease, either in the form of known disease genes or preferred genes that are weighted during the analysis. Preferred genes could be genes expressed in the affected tissue or genes selected by other programs as being likely candidates. […]
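
As a worked illustration of the enrichment measure, the short sketch below evaluates the formula exactly as it is written in the text. The function name, the argument names and the example counts are hypothetical, and the error raised for zero denominators is a choice made for this sketch, not something the source specifies.

```python
def enrichment(disease_selected, non_disease_selected,
               disease_in_region, all_in_region):
    """Enrichment as written in the text:
    (disease genes selected / non-disease genes selected)
    / (disease genes in region / all genes in region)."""
    if non_disease_selected == 0 or disease_in_region == 0 or all_in_region == 0:
        # The ratio is undefined here; this sketch refuses rather than guessing.
        raise ValueError("enrichment is undefined when a denominator is zero")
    selected_ratio = disease_selected / non_disease_selected
    region_ratio = disease_in_region / all_in_region
    return selected_ratio / region_ratio


if __name__ == "__main__":
    # Invented counts: 3 disease and 6 non-disease genes selected,
    # from regions holding 4 disease genes among 32 genes in total.
    print(enrichment(3, 6, 4, 32))
```

With these invented counts the selected set has a disease to non-disease ratio of 0.5 against a regional disease fraction of 0.125, so the selection is four times richer in disease genes than the regions it was drawn from.
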

Pipeline specifications

Software tools: GeneSeeker, PROSPECTR, G2D, BLASTX, POCUS
Application: Phylogenetics
Diseases: Diabetes Mellitus, Type 2; Obesity