DataOnTools #5: Tools in the literature

Scientific communications and publications are the best way to promote your work, whether you are a tool developer or a bioinformatician conducting analyses. In this article, we analyze tool publications to explore a new aspect of the evolution of the field of bioinformatics.

Tool publications and citation

In the OMICtools database, nearly eight out of ten tools have been published in a peer-reviewed journal: 41.7% of all tools have been cited at least once, while 36.0% have been published but never cited in the literature (Figure 1a). Among tools with a PMID, about half have never been cited, and the other half have been cited at least once. This matters because citations are the best way for your article to be found and read.

However, getting more citations is not (always) related to the quality of your work; it may instead be a matter of good timing. To test this hypothesis, we analyzed how often tools are cited relative to when the underlying technology emerged, taking into account the age of each publication (Figure 1b).

Figure 1. Bioinformatic tools in the literature. (a) Proportion of tools in the OMICtools database that are not associated with a PMID (gray), that have a PMID but have never been cited (yellow), or that have been cited at least once (blue). (b) Mean citation score of all tools for RNA-sequencing analysis by publication year. For each tool with an associated PMID, the citation score is the number of times the tool has been cited in the literature divided by its publication age (the number of years since its publication, i.e. 2017 minus the year of publication). The bar plot represents the mean citation score for all tools published in a given year + standard error of the mean (SEM). Statistical significance is indicated by *p < 0.05, ****p < 0.0001. A Kruskal-Wallis multiple comparison test was used to compare every group to the “2009” group. ns: not significant.

A clear trend emerged, as seen with the example of RNA-sequencing technology, for which the first papers were published in 2008: tools dedicated to the analysis of RNA-sequencing data that were published in 2009 are, on average, significantly more cited than tools published in subsequent years (p < 0.05), irrespective of the impact factor of the journal of publication (data not shown).

This trend was also observed for other technologies, including WGS, ChIP-seq and CLIP-seq, suggesting that tools that are the first to solve a problem are more likely to become established as gold standards or default methods, and consequently accumulate more citations over time.
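
To make the citation score from the Figure 1 caption concrete, here is a minimal Python sketch. It assumes each tool is represented by a (publication year, citation count) pair; the function names and the example numbers are made up for illustration and are not OMICtools data or the code used for the figure.

```python
from collections import defaultdict
from math import sqrt
from statistics import mean, stdev

REFERENCE_YEAR = 2017  # reference year given in the Figure 1 caption


def citation_score(citations, publication_year, reference_year=REFERENCE_YEAR):
    """Number of citations divided by publication age in years."""
    age = reference_year - publication_year
    if age <= 0:
        raise ValueError("publication age must be at least one year")
    return citations / age


def mean_score_by_year(tools):
    """Return {year: (mean score, SEM)} for (publication_year, citations) pairs."""
    scores = defaultdict(list)
    for year, citations in tools:
        scores[year].append(citation_score(citations, year))
    summary = {}
    for year, values in sorted(scores.items()):
        # SEM = sample standard deviation / sqrt(n); undefined for a single tool
        sem = stdev(values) / sqrt(len(values)) if len(values) > 1 else float("nan")
        summary[year] = (mean(values), sem)
    return summary


# Made-up (publication year, citation count) pairs, for illustration only
example_tools = [(2009, 800), (2009, 400), (2012, 100), (2012, 50), (2015, 10)]
summary = mean_score_by_year(example_tools)
for year, (avg, sem) in summary.items():
    print(year, round(avg, 2), round(sem, 2))
```

Dividing by publication age keeps older tools from looking more cited simply because they have had more years to accumulate citations. The group comparison reported in the caption (Kruskal-Wallis) is not reproduced in this sketch.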

The rise of tool pipelines

There are several ways for a bioinformatics tool publication to be cited: for example, when a new version of the tool is released, or in the materials and methods section of an analysis paper.

For a long time, biological data consisted of a handful of sequences to be analyzed and compared, which could be done in a few computational steps using a single program. Today’s biological datasets, however, are so complex and large that obtaining meaningful results requires multiple analysis steps, often involving a series of different programs that must be run in a specific order.

To verify this, we followed the evolution of co-citations (the number of tools cited per publication) and observed an increase in the number of tools jointly cited in scientific publications over time (Figure 2). While publications in the early 2000s cited at most five tools, the number of tools cited per publication has increased continuously since 2005, with 20% of publications in 2015 citing more than six tools.

Figure 2. Tool pipelines in the literature. Evolution of the number of tools cited per publication, shown as proportions of the publications citing at least one tool, from 2000 to 2015. Individual publications citing at least one tool registered in the OMICtools database were retrieved using the MEDLINE API.
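
The co-citation count behind Figure 2 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the analysis code: the input is assumed to be a list of (publication year, list of cited tool names) records, the retrieval step through the MEDLINE API is not shown, and the function names and example records are hypothetical.

```python
from collections import Counter, defaultdict


def tools_per_publication_by_year(records):
    """Map each year to a Counter {number of tools cited: number of publications}."""
    distribution = defaultdict(Counter)
    for year, cited_tools in records:
        n_tools = len(set(cited_tools))  # count each tool once per publication
        if n_tools >= 1:  # keep only publications citing at least one tool
            distribution[year][n_tools] += 1
    return distribution


def share_citing_more_than(distribution, year, threshold):
    """Fraction of a year's publications that cite more than `threshold` tools."""
    counts = distribution.get(year, Counter())
    total = sum(counts.values())
    if total == 0:
        return 0.0
    above = sum(n_pubs for n_tools, n_pubs in counts.items() if n_tools > threshold)
    return above / total


# Made-up records, for illustration only: (publication year, cited tool names)
example_records = [
    (2002, ["tool_a"]),
    (2002, ["tool_a", "tool_b"]),
    (2015, ["tool_a"]),
    (2015, ["tool_a", "tool_b", "tool_c", "tool_d", "tool_e", "tool_f", "tool_g"]),
    (2015, ["tool_b", "tool_c"]),
    (2015, []),                     # cites no registered tool, so it is excluded
    (2015, ["tool_a", "tool_a"]),   # duplicate mentions count as one tool
]
distribution = tools_per_publication_by_year(example_records)
print(share_citing_more_than(distribution, 2015, 6))
```

Computing the share per year, rather than the raw count, keeps the comparison fair even though the total number of publications grows over time.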


These results indicate a shift in the complexity of biological data, whose analysis now requires pipelines of tools to be effective and productive.