Computational protocol: Clinical Natural Language Processing in languages other than English: opportunities and challenges

Protocol publication

[…]

New NLP systems or components

Some of the work in languages other than English addresses core NLP tasks that have been widely studied for English, such as sentence boundary detection [], part-of-speech tagging [–], parsing [, ], or sequence segmentation []. Word segmentation issues are more visible in languages that do not mark word boundaries with clear separators such as white space. This is the case, for instance, in Chinese, Japanese, Vietnamese and Thai. A study of automatic word segmentation in Japanese addressed the lack of spacing between words in this language []. The authors implemented a probabilistic model of word segmentation using dictionaries. Abbreviations are common in clinical text in many languages and require term identification and normalization strategies. These have been studied for Spanish [], Swedish [], German [, ] and Japanese []. More complex semantic parsing tasks have been addressed in Finnish [] through the addition of a PropBank layer [] to clinical Finnish text parsed by a dependency parser [].

Core NLP tasks are sometimes evaluated as part of more complex tasks. For instance, a study on Hebrew medical text shows that segmentation methods accounting for transliterated words yield up to 29% performance improvement in medical term extraction []. Word segmentation was also shown to outperform character segmentation for named entity recognition in Chinese clinical text. In addition, performing segmentation and named entity recognition jointly yielded a 1% improvement for both. The overall performance of named entity recognition using these special features was above 0.90 F1-measure for four entity types, a performance comparable to the English state of the art [, ]. Conversely, in an effort addressing the expansion of English abbreviations in Japanese text [], a study on eight short forms associated with two or more long forms found that character (vs. word) segmentation performed better for the task.
However, it can be argued that in the context of code-switching and transliteration (English abbreviations appeared verbatim in Japanese text, accompanied by an expanded form of the acronym in Japanese), the distribution of words and characters made the text sufficiently different from standard Japanese to warrant specific processing. Cohen et al. [] studied the impact of the high frequency of transliterated terms in Hebrew clinical narratives. They report that the use of a semi-automatically acquired medical dictionary of transliterated terms improves the performance of information extraction. The effect of spelling correction and negation detection on an ICD10 coding system was studied for Danish, and both features were found to yield improved performance [].

Lexicons, terminologies and annotated corpora

While the lack of language-specific resources is sometimes addressed by investigating unsupervised methods [, ], many clinical NLP methods rely on language-specific resources. As a result, the creation of resources such as synonym or abbreviation lexicons [, , ] receives considerable effort, as it serves as the basis for more advanced NLP and text mining work.

Distributional semantics was used to create a semantic space of Japanese patient blogs; seed terms from the categories Medical Finding, Pharmaceutical Drug and Body Part were then used to expand the vocabularies, with promising results [].

There is sustained interest in terminology development and the integration of terminologies and ontologies in the UMLS [], or SNOMED-CT for languages such as Basque []. In other cases, full resource suites including terminologies, NLP modules, and corpora have been developed, such as for Greek [] and German [].

The development of reference corpora is also key for both method development and evaluation.
Recently, researchers produced annotated corpora for tasks such as machine translation [, ], de-identification in French [] and Swedish [], drug-drug interaction in Spanish [], named entity recognition and normalization for French [], and also for linguistic elements such as verbal propositions and arguments for Finnish []. The study of annotation methods and optimal uses of annotated corpora has grown with the rise of statistical NLP methods [, , ].

For some languages, a mixture of Latin and English terminology in addition to the local language is routinely used in clinical practice. This adds a layer of complexity to the task of building resources and exploiting them for downstream applications such as information extraction. For instance, in Bulgarian EHRs medical terminology appears in Cyrillic (Bulgarian terms) and Latin (Latin and English terms). This situation calls for the development of specific resources including corpora annotated for abbreviations and translations of terms in Latin-Bulgarian-English []. The use of terminology originating from Latin and Greek can also influence the local language use in clinical text, such as affix patterns [].

Multilingual corpora are used for terminological resource construction [] with parallel [–] or comparable [, ] corpora, as a contribution to bridging the gap between the scope of resources available in English vs. other languages. More generally, parallel corpora also make possible the transfer of annotations from English to other languages, with applications for terminology development as well as clinical named entity recognition and normalization []. They can also be used for comparative evaluation of methods in different languages [].

A notable use of multilingual corpora is the study of clinical, cultural and linguistic differences across countries. A study of forum corpora showed that breast cancer information supplied to patients differs in Germany vs. the United Kingdom [].
Furthermore, a study of clinical documents in English and Chinese found a lower density of treatment concepts in Chinese documents [], which was interpreted as a reflection of cultural differences between clinical narrative styles and suggests that this needs to be accounted for when designing clinical NLP systems for Chinese. Conversely, a comparative study of intensive care nursing notes in Finnish vs. Swedish hospitals showed that the differences are essentially linguistic, while the content and style of the documents are similar [].

Adapting NLP architectures developed for English

Studying sublanguages, Harris [] observed that “The structure of each science language is found to conform to the information in that science rather than to the grammar of the whole language.” Sager’s LSP system [], developed for the syntactic analysis of medical English, was adapted to French []. Deléger et al. [] also describe how a knowledge-based morphosemantic parser could be ported from French to English.

This shows that adapting systems that work well for English to another language could be a promising path. In practice, it has been carried out with varying levels of success depending on the task, language and system design. The importance of system design was evidenced in a study attempting to adapt a rule-based de-identification method for clinical narratives in English to French []. Language-specific rules were encoded together with de-identification rules. As a result, separating language-specific rules and task-specific rules amounted to re-designing an entirely new system for the new language. This experience suggests that a system designed to be as modular as possible may be more easily adapted to new languages. As a modular system, cTAKES raises interest for adaptation to languages other than English. Initial experiments in Spanish for sentence boundary detection, part-of-speech tagging and chunking yielded promising results [].
Some recent work combining machine translation and language-specific UMLS resources to use cTAKES for clinical concept extraction from German clinical narrative showed moderate performance []. More generally, the use of word clusters as features for machine learning has proven robust for a number of languages across families [].

As with work in English, the methods for Named Entity Recognition (NER) and Information Extraction for other languages are rule-based [, ], statistical, or a combination of both []. With access to large datasets, studies using unsupervised learning methods can be performed irrespective of language, as in Moen et al. [] where such methods were applied for information retrieval of care episodes in Finnish clinical text. Knowledge-based methods can be applied when terminologies are available, e.g. extending information contained in structured data fields with information from Danish clinical free-text with dictionary-based approaches for the study of disease correlations [] or adverse events []. For German, extracting information from clinical narratives for cohort building using simple rules was successful [].

NER essentially focuses on two types of entities: personal health identifiers in the context of clinical document de-identification [, , , , –] and clinical entities such as diseases, signs/symptoms [], procedures or medications [, –], as well as their context of occurrence: negation [], assertions [, ] and experiencer (i.e. whether the entities are relevant to the patient or a third party such as a family member or organ donor).

Systems addressing a task such as negation may be easily adapted between languages of the same family that express negation using similar syntactic structures, as is the case for English and Swedish [, ], English and German [], English and Spanish [, ], or even English, French, German and Swedish [].
However, it can be difficult to pinpoint the reason for differences in success for similar approaches in seemingly close languages such as English and Dutch [].

Another important contextual property of clinical text is temporality. HeidelTime is a rule-based system developed for multiple languages to extract time expressions []. It has been adapted for clinical text in French [] and Swedish [].

Global concept extraction systems for languages other than English are currently still in the making (e.g. for Dutch [], German [] or French [, ]).

The entities extracted can then be used for inferring information at the sentence level [] or record level, such as smoking status [], thromboembolic disease status [], thromboembolic risk [], patient acuity [], diabetes status [], and cardiovascular risk [].

[...] Clinical NLP in any language relies on methods and resources available for general NLP in that language, as well as resources that are specific to the biomedical or clinical domain.

In this respect, English is by far the most resource-rich language, with advanced tools dedicated to the biomedical domain such as part-of-speech taggers (e.g. MedPost []), parsers (e.g. GATE [], Charniak-McClosky [], enju []), and biomedical concept extractors (e.g. MetaMap [], cTAKES [, ], NCBO []). For other languages, data and resources are sometimes scarce.

The UMLS (Unified Medical Language System []) aggregates more than 100 biomedical terminologies and ontologies. In its 2016AA release, the UMLS Metathesaurus comprises 9.1 million terms in English, followed by 1.3 million terms in Spanish. For all other languages, such as Japanese, Dutch or French, the number of terms amounts to less than 5% of what is available for English. Additional resources may be available for these languages outside the UMLS distribution.
Details on terminology resources for some European languages were presented at the CLEF-ER evaluation lab in 2013 [] for Dutch [], French [] and German [].

Medical ethics, translated into privacy rules and regulations, restrict access to and sharing of clinical corpora. Some datasets of biomedical documents annotated with entities of clinical interest may be useful for clinical NLP []. However, there are currently no sharable clinical datasets comparable to the i2b2 datasets [, ], the ShARe corpus [], the THYME corpus [, ] or the MIMIC corpus [] in languages other than English, except the Turku Clinical TreeBank and PropBank [, , ] in Finnish, the small subset of 100 pseudonymized patient records in the Stockholm EPR PHI Pseudo Corpus [] in Swedish, and the clinical examination texts of the MedNLPDoc corpus in Japanese [], albeit only with document-level annotation.

Past experience with shared tasks in English has shown that international community efforts are a useful and efficient channel to benchmark and improve the state of the art []. The NTCIR-11 MedNLP-2 [] and NTCIR-12 MedNLPDoc [] tasks focused on information extraction from Japanese clinical narratives to extract disease names and assign ICD10 codes to a given medical record. The CLEF-ER 2013 evaluation lab [] was the first multi-lingual forum to offer a shared task across languages. It resulted in a small multi-lingual manually-validated reference dataset [] and prompted the development of a large gold-standard annotated corpus of clinical entities for French [], currently in use in a clinical named entity recognition and normalization task in the CLEF eHealth evaluation lab [, ]. Our hope is that this effort will be the first in a series of clinical NLP shared tasks involving languages other than English.
The establishment of the health NLP Center as a data repository for health-related language resources will enable such efforts.

In summary, there is a sharp difference in the availability of language resources for English on one hand, and other languages on the other hand. Corpus and terminology development are key areas of research for languages other than English, as these resources are crucial to make headway in clinical NLP. […]
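
The observation made above, that modular systems and tasks such as negation detection transfer more easily between languages, can be illustrated with a minimal sketch. It is not the method of any cited system: the function names, the four-token window and the tiny English and Swedish trigger lists are assumptions chosen for illustration only. The language-specific part is confined to a lexicon, while the scope logic is shared across languages.

```python
import re

# Hypothetical, deliberately tiny trigger lexicons (illustration only).
# A real system would rely on curated, validated lists per language.
NEGATION_TRIGGERS = {
    "en": ["no", "not", "without", "denies"],
    "sv": ["ingen", "inga", "inte", "ej", "utan"],
}


def tokenize(text):
    """Lowercase the text and split it into word tokens (Unicode-aware)."""
    return re.findall(r"\w+", text.lower())


def is_negated(sentence, concept, lang, window=4):
    """Return True if a negation trigger for `lang` occurs within
    `window` tokens before an occurrence of `concept` in `sentence`.

    The scope rule (look-back window) is language-independent; only the
    trigger lexicon changes from one language to another.
    """
    tokens = tokenize(sentence)
    target = tokenize(concept)
    if not target:
        return False
    triggers = set(NEGATION_TRIGGERS[lang])
    n = len(target)
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == target:
            preceding = tokens[max(0, i - window):i]
            if triggers.intersection(preceding):
                return True
    return False


if __name__ == "__main__":
    print(is_negated("Patient denies chest pain", "chest pain", "en"))
    print(is_negated("Ingen feber idag", "feber", "sv"))
```

Under this design, supporting another language with similar negation syntax would amount to adding one lexicon entry, whereas a language that marks negation after the concept or through morphology would also require changes to the scope rule.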

Pipeline specifications

Software tools: cTAKES, MedPost, MetaMap
Application: Information extraction