Computational protocol: MeSH ORA framework: R/Bioconductor packages to support MeSH over-representation analysis

Protocol publication

[…] org.MeSH.XXX.db-type packages provide the correspondence between NCBI Entrez Gene IDs and NLM MeSH IDs for each organism. Only data with sufficiently high correspondence between the Entrez Gene and MeSH IDs were selected.

First, we focused on the organisms used in at least one of five available genome-wide tools: Affymetrix GeneChip [], Gene Ontology [], Bioconductor [], UCSC Genome Browser [] and Gendoo [] (Figure 2). Overall, 168 organisms were selected by this criterion (Figure 3).

Next, we focused on the organisms that had at least 1000 Entrez Gene IDs. Some organisms have only Ensembl Gene IDs but no Entrez Gene IDs, and because our framework uses only Entrez Gene IDs, such organisms were rejected. After this step, 115 organisms remained.

Finally, we classified the remaining organisms as major or minor according to whether they have at least 100 PubMed IDs. We defined 15 well-annotated organisms as “major organisms”: Arabidopsis thaliana, Bacillus subtilis subsp. spizizenii str. 168, Bos taurus, Caenorhabditis elegans, Drosophila melanogaster, Danio rerio, Escherichia coli str. K-12 substr. MG1655, Gallus gallus, Homo sapiens, Mus musculus, Rattus norvegicus, Saccharomyces cerevisiae S288c, Schizosaccharomyces pombe 972h-, Sus scrofa, and Xenopus laevis. In some cases, thousands of Entrez Gene IDs are assigned in a single publication. Most such papers describe the determination of genome sequences or genome projects/databases rather than specific gene functions. Therefore, we omitted papers that were assigned more than 1000 Entrez Gene IDs.

Three ways of linking the Entrez Gene and MeSH IDs were used for these organisms: Gendoo (http://gendoo.dbcls.jp/data/), gene2pubmed (ftp://ftp.ncbi.nih.gov/gene/DATA/), and RBBH (reciprocal BLAST best hit). gene2pubmed was used for the 15 major and 100 minor organisms, and RBBH was used only for the 100 minor organisms.
Gendoo was used only for the 9 organisms included in the Gendoo website (http://gendoo.dbcls.jp/). Figure 4 shows the 120 organisms and their data sources.

1) gene2pubmed

gene2pubmed provides the correspondence between Entrez Gene IDs and PubMed IDs assigned by NCBI manual curation. We converted these PubMed IDs to their corresponding MeSH terms using data licensed by PubMed (http://www.nlm.nih.gov/databases/license/license.html), and then converted these MeSH terms to MeSH IDs using data from NLM MeSH (Figure 5).

2) RBBH

Because of the lack of relevant literature, minor organisms were poorly annotated by the gene2pubmed approach described in 1). Because many gene and protein annotations are based on sequence similarity against other organisms, as in the FANTOM project [,], GO [], blast2go [], InterProScan [], and RAPSearch [], we conducted a sequence homology search of these minor organisms against the major organisms and then applied the MeSH IDs of the major organisms to these minor organisms (Figure 6). We performed a reciprocal BLAST [] best-hit search among all possible combinations of the 15 major organisms and 100 minor organisms (Figure 7). A total of 3000 (100 minor organisms × 15 major organisms × 2 directions) blastp runs were performed (with the E-value set to 50).

3) Gendoo

The Entrez Gene ID to MeSH ID relationship in Gendoo was assigned using a text-mining technique. The degree of relevance between the Entrez Gene and MeSH IDs was assessed by the Kullback-Leibler divergence [].

A summary of the 120 organisms in each MeSH category is given in Additional file . […]
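
To make the RBBH step in 2) more concrete, the following Python sketch shows one way a reciprocal best-hit search and the subsequent transfer of MeSH IDs could be implemented. It is an illustration under stated assumptions, not the authors' pipeline: the function names, the tuple layout of the BLAST hits, the use of bit score to pick the best hit, and all gene and MeSH identifiers in the toy data are hypothetical.

```python
# Minimal sketch of reciprocal BLAST best hit (RBBH) and MeSH ID transfer.
# Assumptions (not from the source): hits are (query, subject, evalue, bitscore)
# tuples, and the "best" hit per query is the one with the highest bit score.

def best_hits(blast_rows):
    """Return {query_id: subject_id}, keeping the highest-bit-score hit per query.

    On ties, the first hit encountered is kept.
    """
    best = {}
    for query, subject, _evalue, bitscore in blast_rows:
        current = best.get(query)
        if current is None or bitscore > current[1]:
            best[query] = (subject, bitscore)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(minor_vs_major, major_vs_minor):
    """Return {minor_gene: major_gene} for pairs that are each other's best hit."""
    forward = best_hits(minor_vs_major)   # minor organism queried against major
    backward = best_hits(major_vs_minor)  # major organism queried against minor
    return {
        minor_gene: major_gene
        for minor_gene, major_gene in forward.items()
        if backward.get(major_gene) == minor_gene
    }


def transfer_mesh(rbbh_pairs, major_gene2mesh):
    """Copy the MeSH IDs of each major-organism gene to its RBBH partner."""
    transferred = {}
    for minor_gene, major_gene in rbbh_pairs.items():
        mesh_ids = major_gene2mesh.get(major_gene)
        if mesh_ids:
            transferred[minor_gene] = set(mesh_ids)
    return transferred


# Toy data with made-up identifiers, for illustration only.
minor_vs_major = [
    ("m1", "H1", 1e-50, 200.0),
    ("m1", "H2", 1e-10, 80.0),
    ("m2", "H2", 1e-30, 150.0),
    ("m3", "H1", 1e-5, 40.0),
]
major_vs_minor = [
    ("H1", "m1", 1e-48, 195.0),
    ("H1", "m3", 1e-4, 38.0),
    ("H2", "m1", 1e-12, 85.0),
    ("H2", "m2", 1e-9, 60.0),
]
major_gene2mesh = {"H1": {"D000001", "D000002"}, "H2": {"D000003"}}

pairs = reciprocal_best_hits(minor_vs_major, major_vs_minor)
minor_gene2mesh = transfer_mesh(pairs, major_gene2mesh)
print(pairs)
print(minor_gene2mesh)
```

In this toy example only m1 and H1 are each other's best hit, so only m1 inherits MeSH IDs. m2 and m3 each have a best hit in the major organism, but that hit points back to a different gene, so they receive no annotation. In a setting like the one described, such a comparison would be repeated for each minor/major organism pair and in both search directions.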

Pipeline specifications