Rapidly evolving sequencing technologies (Zhang et al., 2011; Capriotti et al., 2012) have led to a dramatic rise in the number of published articles reporting associations between genomic variations and diseases. By one estimate, over 10,000 articles mentioning such associations are published each year (Burger et al., 2014). Collecting this information manually is both expensive and time-consuming. Several text-mining (TM) efforts have been made to assist manual curation, but most are limited to identifying mutation mentions. The majority detect mutations with regular expressions; exceptions include tmVar (Wei et al., 2013) and VTag (McDonald et al., 2004), which use conditional random fields (CRFs), and SETH (Thomas et al., 2014), which implements an Extended Backus-Naur Form (EBNF) grammar. Only a few of these efforts extend mutation detection to associate the mutation with a disease phenotype, and most of those are search-based TM tools that do not automatically extract the mutation-disease relationships expressed in articles.
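To make the regular-expression approach concrete, the following is a minimal sketch of how a pattern-based detector might flag protein-level substitution mentions such as "V600E" or "p.Arg117His". It is not the pattern set of any tool cited here; the names `AA3`, `AA1`, `PROTEIN_SUB`, and `find_mutation_mentions` are hypothetical, and the pattern covers only one simple mention style.

```python
import re

# Standard 20 amino acids in three-letter and one-letter codes.
AA3 = ("Ala|Arg|Asn|Asp|Cys|Gln|Glu|Gly|His|Ile|Leu|Lys|Met|Phe|"
       "Pro|Ser|Thr|Trp|Tyr|Val")
AA1 = "ACDEFGHIKLMNPQRSTVWY"

# Illustrative pattern only (an assumption, not taken from any cited tool):
# an optional "p." prefix, then <residue><position><residue> written with
# either three-letter codes (Arg117His) or one-letter codes (V600E).
PROTEIN_SUB = re.compile(
    rf"\b(?:p\.)?(?:(?:{AA3})\d+(?:{AA3})|[{AA1}]\d+[{AA1}])\b"
)

def find_mutation_mentions(text):
    """Return (start, end, matched_text) for each candidate mention."""
    return [(m.start(), m.end(), m.group(0))
            for m in PROTEIN_SUB.finditer(text)]

if __name__ == "__main__":
    sentence = "The BRAF V600E mutation and p.Arg117His in CFTR."
    for start, end, mention in find_mutation_mentions(sentence):
        print(start, end, mention)
```

A pattern this simple also matches strings that are not mutations, for example the cell-line name "T47D", and it misses free-text descriptions of variants. Limitations of this kind are one reason a tool might prefer a learned sequence model or a formal grammar over hand-written patterns.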
(Mahmood et al., 2016) DiMeX: A Text Mining System for Mutation-Disease Association Extraction. PLoS One.
(Zhang et al., 2011) The impact of next-generation sequencing on genomics. J Genet Genomics.
(Capriotti et al., 2012) Bioinformatics for personal genome interpretation. Brief Bioinform.
(Burger et al., 2014) Hybrid curation of gene-mutation relations combining automated extraction and crowdsourcing. Database.
(Wei et al., 2013) tmVar: a text mining approach for extracting sequence variants in biomedical literature. Bioinformatics.
(McDonald et al., 2004) An entity tagger for recognizing acquired genomic variations in cancer literature. Bioinformatics.
(Thomas et al., 2014) SETH: SNP Extraction Tool for Human Variations.