Computational protocol: A Semi-Supervised Learning Approach to Enhance Health Care Community–Based Question Answering: A Case Study in Alcoholism

Protocol publication

[…] Given a question, it was divided into subquestions and matched with its question group using the aforementioned rule-based approach. Then, we computed the semantic distance between the prospective question and all other questions from the training set belonging to the same group. Two distance approaches were used in our work.

1. DTW-based approach: This approach is based on a sequence alignment algorithm known as DTW, which uses efficient dynamic programming to calculate a distance between 2 temporal sequences. It allows us to effectively encode word order without unduly penalizing missing words (such as in a relative clause). Applied in our context, a sentence was treated as a sequence of words, and the distance between each pair of words was computed by the Levenshtein distance at the character level [,]. For any 2 sequences defined as Seq1 = <w11, w21, …, wm1> and Seq2 = <w12, w22, …, wn2>, where m and n are the lengths of the sequences, Liu et al [] defined the distance between 2 sequences (in our case, 2 sentences) by the recursion

f(i, j) = d(wi1, wj2) + min{f(i − 1, j), f(i, j − 1), f(i − 1, j − 1)}

where f(0, 0) = 0 and f(i, 0) = f(0, j) = ∞ for i ∈ (0, m), j ∈ (0, n). Here, d(wi1, wj2) is the distance between 2 words computed by the Levenshtein measure, and the distance between the 2 sentences is f(m, n).

2. Vector space-based approach: An alternative paradigm is to treat the sentences as bags of words, represent them as points in a multidimensional space of individual words, and then calculate the distance between them. We implemented a unigram model with tf-idf weights based on the prospective question and the other questions in the same category and computed the Euclidean distance between them.

We further took into account cases that share similar medical information by multiplying the distances by a given weight parameter, whose best value was selected through extensive experiments. The MetaMap tool was used to recognize UMLS concepts occurring in the questions []. If at least 1 word from the UMLS semantic types "organic chemical" and "pharmacologic substance" occurs in both the prospective question and a training question, we reduce the distance to account for the additional semantic similarity. These UMLS semantic types were specifically selected because we want to give more weight to answers that mention a treatment approach, under the intuitive assumption that most CQA users seek informative advice for their illness. The set of semantic types can be expanded to capture broader concepts if different domains are considered.

The QA pairs in the training set corresponding to the smallest and the second smallest distances were extracted. We thus obtained, for each prospective question, a list of candidate answers, that is, the answers to the questions with the smallest and second smallest distances. These answers were used as the output of the baseline rule-based system. This procedure was repeated so that each question in the training set served, in turn, as the prospective question. At the end of this phase, we had triplets (Qp, Qt, At) over all questions Qp. Note that At is an answer to question Qt with Qt ≠ Qp, and each Qp yielded several such triplets.
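The phase I distance computation can be summarized with a short sketch. The following Python code is a minimal illustration rather than the original implementation: the helper names (levenshtein, dtw_sentence_distance, weighted_distance), the weight value of 0.5, and the assumption that the UMLS concept sets come precomputed from MetaMap output are all hypothetical.

```python
# Sketch of the DTW-based sentence distance with character-level Levenshtein
# word distances, plus the concept-overlap weighting described above.
import math

def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance between two words."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def dtw_sentence_distance(seq1: list[str], seq2: list[str]) -> float:
    """DTW alignment cost f(m, n) between two sentences viewed as word sequences."""
    m, n = len(seq1), len(seq2)
    f = [[math.inf] * (n + 1) for _ in range(m + 1)]
    f[0][0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d = levenshtein(seq1[i - 1], seq2[j - 1])
            f[i][j] = d + min(f[i - 1][j], f[i][j - 1], f[i - 1][j - 1])
    return f[m][n]

def weighted_distance(dist: float, concepts_p: set[str], concepts_t: set[str],
                      weight: float = 0.5) -> float:
    """Shrink the distance when the prospective and training questions share a word
    from the 'organic chemical' or 'pharmacologic substance' UMLS semantic types.
    The value 0.5 is a placeholder for the experimentally tuned weight parameter."""
    return dist * weight if concepts_p & concepts_t else dist
```

A similar sketch of the vector space-based approach, assuming scikit-learn's TfidfVectorizer for the unigram tf-idf model (the original tooling is not specified in the protocol):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.spatial.distance import euclidean

def vector_space_distances(prospective: str, candidates: list[str]) -> list[float]:
    """Unigram tf-idf representation; Euclidean distance from the prospective
    question to each other question in the same category."""
    X = TfidfVectorizer().fit_transform([prospective] + candidates).toarray()
    return [euclidean(X[0], X[i]) for i in range(1, X.shape[0])]
```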
The machine learning phase of answer re-ranking (phase II) is described next. The goal of this phase is to rank the candidate answers from the previous step and select the best answer among them. Each triplet (Qp, Qt, At) is to be labeled "valid" if At is a valid answer to Qp, or "invalid" otherwise.

We describe how the model was trained in this section; detailed explanations (eg, the numbers of labeled and unlabeled triplets) are provided in the "Results" section. We first selected a small random subset of triplets and labeled them manually (there were too many to label them all in this way). Both supervised and semi-supervised learning (EM) models were developed to predict the answerability of newly posted questions and to rank candidate answers. The semi-supervised model was trained on both labeled and unlabeled triplets. Specifically, we first trained a supervised learning algorithm (neural networks with the entropy objective function [NNET], neural networks with the L2-norm or least squares objective function [NNET_L2], a support vector machine [SVM], or logistic regression) on the manually labeled outputs of the aforementioned rule-based answer extraction phase. The trained model was used to classify the unlabeled part of the phase I outputs, and the classifier was then retrained on the original labeled data plus a randomly selected subset of the unlabeled data carrying the labels estimated in the previous iteration. These steps were repeated iteratively to obtain the final estimated labels. The supervised approach, in contrast, simply ran a classifier on the labeled subset. A 10-fold cross validation was implemented for both the semi-supervised and supervised approaches: all labeled observations were partitioned into 10 parts, 1 part was set aside as the test set, and the model was fitted on the remaining 9 parts of the labeled observations (plus the entire unlabeled part for the semi-supervised learning approach). The parameters of the semi-supervised model were obtained using the EM algorithm described previously. The fitted model was then used to predict the responses in the held-out test set, and these steps were repeated with a different part set aside as the test set each time. All features used in the models are illustrated based on the following example, as summarized in . […]
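As an illustration of the iterative scheme and the cross validation described above, the following sketch uses logistic regression as a stand-in for the NNET, NNET_L2, SVM, and logistic regression models; the feature matrices X_lab, y_lab, and X_unlab are assumed to be precomputed from the triplets, and the iteration count and sampled fraction are placeholders rather than values from the study.

```python
# Minimal self-training sketch of the semi-supervised loop: fit on labeled
# triplet features, impute labels for the unlabeled triplets, retrain on the
# labeled data plus a random subset of the newly labeled data, and evaluate
# with 10-fold cross validation on the labeled observations.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score

def self_train(X_lab, y_lab, X_unlab, n_iters=10, sample_frac=0.5, seed=0):
    """Iteratively retrain on labeled data plus a random subset of unlabeled
    data carrying the labels estimated in the previous iteration."""
    rng = np.random.default_rng(seed)
    clf = LogisticRegression(max_iter=1000).fit(X_lab, y_lab)
    for _ in range(n_iters):
        y_hat = clf.predict(X_unlab)                       # impute labels (E-step-like)
        idx = rng.choice(len(X_unlab), int(sample_frac * len(X_unlab)), replace=False)
        X_aug = np.vstack([X_lab, X_unlab[idx]])
        y_aug = np.concatenate([y_lab, y_hat[idx]])
        clf = LogisticRegression(max_iter=1000).fit(X_aug, y_aug)  # refit (M-step-like)
    return clf

def cross_validate(X_lab, y_lab, X_unlab, n_splits=10):
    """10-fold CV: fit on 9 labeled folds (plus all unlabeled data), test on the held-out fold."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    scores = []
    for train_idx, test_idx in skf.split(X_lab, y_lab):
        clf = self_train(X_lab[train_idx], y_lab[train_idx], X_unlab)
        scores.append(accuracy_score(y_lab[test_idx], clf.predict(X_lab[test_idx])))
    return float(np.mean(scores))
```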

Pipeline specifications

Software tools MetaMap
Application Information extraction