Computational protocol: Finding relevant biomedical datasets: the UC San Diego solution for the bioCADDIE Retrieval Challenge

Protocol publication

[…] The PSD model was derived from the sequential dependence (SD) model developed by Metzler et al. and Bendersky et al. The original SD models rank documents by considering the unigrams (i.e. single words), ordered bigrams (i.e. two consecutive words) and unordered bigrams (i.e. two words not necessarily consecutive) in documents. In our scenario, the ‘documents’ are the metadata of the datasets to be re-ranked. In the experiments, we found that neither ordered nor unordered bigrams improved performance. One possible explanation is that most keywords are independent of each other, and meaningful bigrams (and n-grams) were likely too sparse to occur often in both queries and metadata. For example, ‘chromatin modification’ contains more specific information than ‘chromatin’ and ‘modification’ taken separately, while ‘flu car’ is only as informative as ‘flu’ and ‘car’. Bigrams may help with the former example, but not with the latter. In addition, including bigrams increases computational complexity, making real-time retrieval difficult. Therefore, we removed the bigram components from the original formula and modified the unigram component to suit dataset retrieval tasks, i.e. making ‘whether a word occurs in the metadata’ more important than ‘how many times a word occurs’.

Provided with a query and a list of candidate datasets from Elasticsearch, PSD scores every candidate dataset and re-ranks them all accordingly. The PSD score is defined in Equations (1) and (2), based on Metzler and Croft’s work.

\[ P = \sum_{q_i \in Q} f(q_i, D) \tag{1} \]

\[ f(q_i, D) = \log\left( \frac{I(tf_{q_i,D} > 0)\,(tf_{q_i,D} + \delta) + \mu \dfrac{cf_{q_i}}{|C|}}{|D| + \mu} \right) \tag{2} \]

In Equation (1), $P$ is a sortable measure of relevance, $D$ is a dataset with metadata, $Q$ is an input (e.g. a question), $q_i$ are the words in the input and $f(q_i, D)$ is the weight of $q_i$ in the metadata of dataset $D$.

In Equation (2), $tf_{q_i,D}$ is the number of times word $q_i$ matches the metadata of dataset $D$, $cf_{q_i}$ is the number of times word $q_i$ matches the metadata of the entire collection of datasets, $|D|$ is the number of words in the metadata of dataset $D$, $|C|$ is the total number of words in the collection and $\mu$ is an empirical hyper-parameter that is set to 2500. Unlike the original algorithm, we added a constant $\delta = 5$, an empirical parameter, to $tf_{q_i,D}$ whenever it was $> 0$; $I(tf_{q_i,D} > 0)$ is the indicator function that switches this term on. This modification puts a higher weight on the existence of a word in the metadata than on the number of times the word occurs.

The default version of the PSD model took as input the original $Q$, i.e. the free-text question. Therefore, we named this version ‘PSD-allwords’. We further developed a ‘PSD-keywords’ version that analysed only keywords extracted from $Q$. To identify valuable keywords from free-text questions, PSD-keywords first calls MetaMap, a biomedical named entity recognizer, to identify the UMLS concepts in $Q$ and then uses the UMLS concept set $Q'$ as input to PSD, with the aim of eliminating the impact of less informative words in questions. In the experiments, we used the default setting of MetaMap, collected all recognized UMLS concepts and removed duplicated concepts. […]
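
The PSD scoring described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`term_score`, `psd_score`, `rerank`), the pre-tokenized word lists and the toy collection are all hypothetical. The text does not say how a query word that never occurs in the collection is handled (the logarithm would be undefined), so skipping such words is an assumption made here.

```python
import math
from collections import Counter

MU = 2500.0   # smoothing hyper-parameter mu, value taken from the text
DELTA = 5.0   # constant delta added to tf when the word is present


def term_score(tf, cf, doc_len, coll_len, mu=MU, delta=DELTA):
    """Weight f(q_i, D) of one query word for one dataset (Equation 2)."""
    # I(tf > 0) * (tf + delta): the boost applies only if the word occurs.
    boosted_tf = tf + delta if tf > 0 else 0.0
    return math.log((boosted_tf + mu * cf / coll_len) / (doc_len + mu))


def psd_score(query_words, doc_words, coll_counts, coll_len):
    """Relevance P of one dataset: sum of word weights (Equation 1)."""
    doc_counts = Counter(doc_words)
    score = 0.0
    for q in query_words:
        cf = coll_counts.get(q, 0)
        if cf == 0:
            # Assumption (not in the source): ignore words that never
            # appear in the collection, since log(0) is undefined.
            continue
        score += term_score(doc_counts[q], cf, len(doc_words), coll_len)
    return score


def rerank(query_words, candidates, coll_counts, coll_len):
    """Return candidate dataset ids sorted by descending PSD score."""
    return sorted(
        candidates,
        key=lambda name: psd_score(
            query_words, candidates[name], coll_counts, coll_len
        ),
        reverse=True,
    )


# Toy collection: dataset id -> tokenized metadata (hypothetical data).
candidates = {
    "a": ["chromatin", "modification", "in", "yeast"],
    "b": ["flu", "vaccine", "trial"],
}
coll_counts = Counter(w for words in candidates.values() for w in words)
coll_len = sum(coll_counts.values())

ranking = rerank(["chromatin", "modification"], candidates, coll_counts, coll_len)
print(ranking)
```

In this sketch the jump in weight from a word being absent to occurring once is larger than the jump from one occurrence to two, which is the effect that δ is described as producing. For PSD-keywords, `query_words` would be replaced by the de-duplicated concepts returned by MetaMap.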

Pipeline specifications

Software tools: MetaMap, ABNER
Application: Information extraction