Computational protocol: A realistic assessment of methods for extracting gene/protein interactions from free text

Protocol publication

[…] The five GPI corpora used in this evaluation were: the AIMed corpus [], the BioInfer corpus [], the HPRD50 corpus [], the IEPA corpus [], and the LLL training corpus, a GPI corpus produced for the LLL challenge []. Here we provide a short summary of the five corpora. For a more detailed comparison, see [].

The AIMed corpus contains 225 abstracts manually annotated for interactions between human genes and proteins. Most of the abstracts contain interactions, but around 10% do not; these were deliberately included to provide negative examples.

The HPRD50 corpus contains 50 abstracts in which human gene and protein names were automatically identified using the ProMiner protein and gene name tagger [].

The IEPA (Interaction Extraction Performance Assessment) corpus contains 303 abstracts from PubMed, each containing a specific pair of co-occurring chemicals obtained using 10 queries chosen to represent diverse biological research topics.

The LLL corpus was created as the shared dataset for the Learning Language in Logic 2005 (LLL05) challenge and contains 77 sentences. The domain of LLL is gene interactions of Bacillus subtilis.

The BioInfer corpus consists of 1100 sentences from PubMed abstracts that contain at least one pair of interacting genes or proteins. All protein, gene and RNA entities were manually annotated, together with all interactions between these entities, including static relations. Each interaction is mapped to the BioInfer relationship ontology, defined especially for this purpose. BioInfer permits the annotation of relationships with a complex structure, such as relationships between relationships, or relationships among more than two entities.

These corpora differ significantly in their working definitions of the concept "gene/protein interaction".
For example, in the IEPA corpus an interaction is a "direct or indirect influence of one on the quantity or activity of the other" [], whereas BioInfer additionally contains so-called "static" entity relationships, such as family membership. Nevertheless, an analysis by Pyysalo and co-workers has shown that "a clear majority of all interactions [in these corpora]... correspond to events occurring as part of biochemical processes in living cells", as opposed to static relationships []. A more recent paper by Pyysalo and a different set of co-workers advocates addressing the extraction of static relationships as a distinct subtask [], but this is not tackled by existing publicly available tools.

For our analysis we converted all five corpora to a unified format using the conversion software provided by Pyysalo and co-workers []. To simplify our analysis, we discarded all 68 sentences in the BioInfer corpus that contain at least one discontinuous entity, that is, an annotated entity that does not appear as a continuous string in the surface text. For example, in the phrase 'myosin heavy chain and light chains', the annotated entities are 'myosin heavy chain' and 'myosin light chains', and the latter is discontinuous.

[...]

In two earlier papers we concluded that the version of ABNER [] trained on the BioCreAtIvE corpus [] was the best-performing tagger on a range of biomedical corpora [] and on a new corpus, ImmunoTome, consisting of ten full-text immunological articles [].

However, since the publication of those papers, we have evaluated BANNER, a new biomedical named-entity recognition system implemented using conditional random fields []. BANNER exploits a range of orthographic, morphological and shallow syntax features, such as part-of-speech tags, capitalisation, letter/digit combinations, prefixes, suffixes and Greek letters.
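
To make the kinds of token-level features listed above concrete, the following is a minimal Python sketch. It is not BANNER's actual feature set or code: the function name `token_features`, the `GREEK_LETTERS` set, the feature names and the affix length of 3 are all assumptions chosen for illustration, and part-of-speech tags are omitted because they require an external tagger.

```python
# Illustrative only: hypothetical orthographic/morphological features of the
# kind a CRF-based tagger might use. This is NOT BANNER's real feature set.
GREEK_LETTERS = {
    "alpha", "beta", "gamma", "delta", "epsilon", "zeta", "eta", "theta",
    "iota", "kappa", "lambda", "mu", "nu", "xi", "omicron", "pi", "rho",
    "sigma", "tau", "upsilon", "phi", "chi", "psi", "omega",
}


def token_features(token, affix_len=3):
    """Return a dict of simple surface features for a single token."""
    lower = token.lower()
    has_alpha = any(c.isalpha() for c in token)
    has_digit = any(c.isdigit() for c in token)
    return {
        "lower": lower,
        "is_capitalised": token[:1].isupper(),       # first character is upper case
        "all_caps": token.isupper(),                 # every cased character is upper case
        "has_digit": has_digit,
        "letter_digit_mix": has_alpha and has_digit, # e.g. "p53", "IL-2"
        "prefix": lower[:affix_len],
        "suffix": lower[-affix_len:],
        "is_greek": lower in GREEK_LETTERS,
    }


if __name__ == "__main__":
    for tok in ["IL-2", "kinase", "alpha"]:
        print(tok, token_features(tok))
```

In a sequence tagger, a feature dictionary like this would be computed for every token (often together with the features of neighbouring tokens) and passed to the model alongside the gold labels during training.
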
As with the best-performing version of ABNER, BANNER was trained on the BioCreAtIvE corpus.

As shown in tables and , BANNER consistently outperforms ABNER on the same corpora used in our earlier evaluations (Yapex [], GENIA [], ProSpecTome [] and ImmunoTome []), and has therefore been used for the analysis of GPI methods we present here.

[...]

A number of different GPI extraction methods have been published in the literature (for recent discussions of published methods see [] and []), with some 26 teams submitting runs for at least one of the GPI annotation extraction tasks at BioCreAtIvE II [].

However, our purpose here was to evaluate the state of the art in GPI extraction from the perspective of potential non-specialist users. In contrast to entity taggers, a number of which are easy to install locally or can be accessed directly via the Web, none of the GPI extraction methods is trivial to install and use. This is partly a consequence of the complex, modular nature of a typical state-of-the-art GPI method, which combines third-party components (a part-of-speech tagger and one or more parsers) with a machine learning or rule-based algorithm for identifying possible relationships within a given parse. As noted in [], the vast majority of such GPI methods are currently not publicly available.

Here we focus on four GPI methods: AkanePPI, Whatizit, OpenDMAP, and a simple benchmark approach that we developed ourselves using Perl regular expressions. One system we have not evaluated, even though it is designed primarily for non-specialists and is easy to use, is iHOP []. iHOP is a dictionary-based system that uses genes and proteins as hyperlinks between sentences and abstracts in order to navigate information in PubMed. For GPI, every gene detected in a query is linked to sentences (and subsequently abstracts) that describe interactions of that gene with other genes.
However, iHOP does not accept text submitted by the user, making it unsuitable for the analyses we undertook for this paper.

AkanePPI [] is a state-of-the-art GPI method for which the C++ source code is publicly available. AkanePPI combines the version of the deep syntactic parser Enju that has been retrained on the GENIA corpus [] with a shallow dependency parser []. A support vector machine with tree kernels [] is used to extract rules for identifying pairs of interacting genes/proteins from a training corpus. Here we used two versions of AkanePPI: the original, distributed version (AkanePPI(A)), trained on the AIMed corpus, and a second version (AkanePPI(B)) that we retrained ourselves on the BioInfer corpus. The authors report an F-score of 52% for GPI extraction from unseen abstracts [].

OpenDMAP [] is a general-purpose parsing and information extraction platform that provides an Open Source Java API. It was adapted to perform GPI extraction for the Protein Interaction Pairs subtask at BioCreAtIvE II [], where it outperformed other participating systems, achieving precision of 39% and recall of 31% when scores were averaged over articles []. OpenDMAP uses a rule-based approach. For BioCreAtIvE II, patterns were devised manually from the BioCreAtIvE, PICorpus [] and Prodisen [] corpora in consultation with biologists. These patterns have been made available for download together with the main distribution and have been used here.

Whatizit [] is a modular text processing system available through the EBI website. Of the wide range of text mining services on offer, here we focused exclusively on the protein interaction pipeline. (The core pipeline is available separately from Whatizit in the form of the Protein Corral web application []. However, Protein Corral is designed to perform Medline searches and does not accept text submitted by the user.) The pipeline begins by mapping gene/protein names to UniProt identifiers using dictionary look-up.
It then attempts to identify relationships between any successfully mapped names using three approaches of decreasing precision but increasing coverage: natural language processing (Ppi); the co-occurrence of two gene/protein names with an interaction verb (Co3); and the co-occurrence of two names without an interaction verb (Co). The abbreviations here are the ones used on the Protein Corral website.

Finally, we developed our own simple baseline method using Perl regular expressions. Whenever two gene/protein names occur in the same sentence with an interaction keyword between them, they are predicted to be an interacting pair of genes/proteins. A minority of the interaction words were inherited from two earlier projects: GIFT [] and GraphSpider []. The former derived its verb list from FlyBase [] and the latter from the LLL training corpus. The remaining verbs were obtained semi-automatically using the clueType event attribute in the GENIA event corpus []. Our list of interaction keywords and Perl script are available as supplementary material [see Additional file and Additional file respectively].

In addition, we compare the performance of our baseline method with the simpler co-occurrence baseline previously used by Pyysalo and co-workers, which predicts an interaction between every pair of genes/proteins co-occurring in a sentence irrespective of whether an interaction verb is present []. To distinguish between these two baseline methods within this paper, we call our keyword baseline method Baseline(K) and the simple co-occurrence baseline method of Pyysalo et al. Baseline(C). […]
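
The two baselines described above can be sketched as follows. This is an illustrative Python approximation, not the authors' Perl script: the keyword list is a small hypothetical stand-in for the supplementary keyword file, the function names (`find_spans`, `baseline_c`, `baseline_k`) are invented for this example, and entity spans are located by exact string matching rather than by a named-entity tagger.

```python
import re
from itertools import combinations

# Hypothetical stand-in for the authors' interaction keyword list.
INTERACTION_KEYWORDS = ["binds", "interacts", "activates", "inhibits", "phosphorylates"]
KEYWORD_RE = re.compile(
    r"\b(?:" + "|".join(map(re.escape, INTERACTION_KEYWORDS)) + r")\b",
    re.IGNORECASE,
)


def find_spans(sentence, names):
    """Locate each name in the sentence; return sorted (start, end, name) tuples.

    Exact string matching is an assumption made for this sketch; in the paper
    the gene/protein mentions come from a named-entity tagger.
    """
    spans = []
    for name in names:
        for m in re.finditer(re.escape(name), sentence):
            spans.append((m.start(), m.end(), name))
    return sorted(spans)


def baseline_c(spans):
    """Co-occurrence baseline: every pair of entities in the sentence interacts."""
    return [(a[2], b[2]) for a, b in combinations(sorted(spans), 2)]


def baseline_k(sentence, spans):
    """Keyword baseline: a pair interacts only if a keyword lies between the two mentions."""
    pairs = []
    for a, b in combinations(sorted(spans), 2):
        between = sentence[a[1]:b[0]]  # text between end of first and start of second
        if KEYWORD_RE.search(between):
            pairs.append((a[2], b[2]))
    return pairs


if __name__ == "__main__":
    sentence = "RAD51 binds BRCA2 in cells lacking TP53."
    spans = find_spans(sentence, ["RAD51", "BRCA2", "TP53"])
    print("Baseline(C):", baseline_c(spans))
    print("Baseline(K):", baseline_k(sentence, spans))
```

On the example sentence, the co-occurrence sketch predicts all three pairs, whereas the keyword sketch drops the BRCA2/TP53 pair because no keyword lies between those two mentions. It still predicts RAD51/TP53, since "binds" falls between them, which shows the kind of false positive a purely positional keyword rule can produce.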

Pipeline specifications