5 - GeMTeX
General Information about the Project
Official information about GeMTeX is available on the project website.
Texts are among the most important means of communication in everyday clinical routine: doctors' letters, discharge summaries, and other reports convey valuable information about a patient's treatment and state of health. Privacy risks, a lack of standardization, and limited resources have so far hindered researchers from investigating the composition of German clinical language and from leveraging clinical texts for natural language processing (NLP). NLP is nevertheless a promising field that may assist or simplify clinical routines and relieve hospital medical staff (for a more detailed description, cf. Meineke et al. 2023). To address these shortcomings, the German Medical Text Corpus (GeMTeX) project aims to release the largest German clinical text corpus with semantic medical annotations, drawing on documents from six German university hospitals (Charité Berlin, University Hospital Dresden, University Hospital Erlangen, University Hospital Essen, University Hospital Leipzig, and TUM University Hospital Munich). Other project partners contribute software or medical domain expertise. GeMTeX is part of the German Medical Informatics Initiative (MII).
The Annotation Process
Every clinical document undergoes two major steps before being added to the final corpus: first, the document is de-identified; second, it is semantically annotated with concepts from SNOMED CT. For both steps, we use the platform INCEpTION.
Step 1: The De-Identification
To make clinical texts accessible to the public, all personally identifiable information (PII) must be removed. The project developed the first standardized de-identification guideline for German texts (cf. Lohr et al. 2024). To be as accurate as possible, two people manually de-identify the texts, aided by an automatic recommender system powered by Averbis Health Discovery (AHD).
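As a rough illustration of what de-identification does to a text, the following minimal Python sketch replaces annotated PII spans with bracketed placeholders. Everything in it is an assumption made for illustration: the class `PiiSpan`, the helpers `span_of` and `mask_pii`, the labels `NAME` and `DATE`, and the placeholder format are hypothetical and are not taken from the GeMTeX guideline, INCEpTION, or AHD.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PiiSpan:
    """A PII annotation as character offsets (hypothetical structure)."""
    start: int  # inclusive character offset
    end: int    # exclusive character offset
    label: str  # illustrative category name, e.g. "NAME" or "DATE"


def span_of(text: str, surface: str, label: str) -> PiiSpan:
    """Build a span for the first occurrence of `surface` in `text`."""
    start = text.index(surface)
    return PiiSpan(start, start + len(surface), label)


def mask_pii(text: str, spans: list[PiiSpan]) -> str:
    """Replace every annotated span with a placeholder such as [NAME]."""
    pieces = []
    cursor = 0
    for span in sorted(spans, key=lambda s: s.start):
        if span.start < cursor:
            raise ValueError("overlapping spans are not handled in this sketch")
        pieces.append(text[cursor:span.start])  # keep text before the span
        pieces.append(f"[{span.label}]")        # substitute the PII itself
        cursor = span.end
    pieces.append(text[cursor:])                # keep the remaining text
    return "".join(pieces)


# Invented example sentence; no real patient data.
text = "Herr Max Mustermann wurde am 03.05.2021 aufgenommen."
spans = [
    span_of(text, "Max Mustermann", "NAME"),
    span_of(text, "03.05.2021", "DATE"),
]
masked = mask_pii(text, spans)
print(masked)  # Herr [NAME] wurde am [DATE] aufgenommen.
```

In the project itself, the spans come from two human annotators supported by the AHD recommender; the sketch only shows the final substitution under the stated assumptions.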
Step 2: The Semantic Annotation
Once the PII elements have been removed, the clinical texts are semantically annotated. In GeMTeX, semantic annotation means assigning SNOMED CT concepts to suitable text spans. SNOMED CT is widely regarded as the most comprehensive ontology-based clinical terminology; it provides systematic descriptions of medical procedures, conditions, substances, and so forth. Semantic annotation therefore grounds the information contained in the texts by mapping text spans to existing, traceable SNOMED CT concepts. The annotation is semi-automatic: the texts are pre-annotated by AHD, and a person with a medical background refines the annotations. In addition, a recommender system by ID Logic suggests alternative concepts for each annotated span.
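The mapping from text spans to concepts can be pictured with a small Python sketch. It is not the data format used by INCEpTION, AHD, or ID Logic: the classes `Concept` and `SemanticAnnotation`, the function `refine`, and all field names are hypothetical. The two SNOMED CT identifiers are given for illustration only and should be checked against the current SNOMED CT release before any reuse.

```python
from dataclasses import dataclass, field, replace


@dataclass(frozen=True)
class Concept:
    sctid: str  # SNOMED CT concept identifier
    term: str   # human-readable name of the concept


@dataclass(frozen=True)
class SemanticAnnotation:
    """A text span linked to one concept (hypothetical structure)."""
    start: int                      # inclusive character offset
    end: int                        # exclusive character offset
    concept: Concept                # currently assigned concept
    alternatives: tuple = field(default_factory=tuple)  # suggested alternatives
    reviewed: bool = False          # True once a human has checked the span


def refine(annotation: SemanticAnnotation, chosen: Concept) -> SemanticAnnotation:
    """Return a copy with the concept a human reviewer selected."""
    return replace(annotation, concept=chosen, reviewed=True)


# Invented example sentence; no real patient data.
text = "Patient mit Typ-2-Diabetes."
surface = "Typ-2-Diabetes"
start = text.index(surface)

# Illustrative concepts (verify identifiers against the current release).
diabetes = Concept("73211009", "Diabetes mellitus (disorder)")
diabetes_t2 = Concept("44054006", "Diabetes mellitus type 2 (disorder)")

# Step A: automatic pre-annotation proposes a general concept,
# and a recommender offers a more specific alternative.
pre = SemanticAnnotation(start, start + len(surface), diabetes,
                         alternatives=(diabetes_t2,))

# Step B: a reviewer with a medical background picks the better fit.
refined = refine(pre, pre.alternatives[0])
print(text[refined.start:refined.end], "->", refined.concept.sctid)
```

Keeping the pre-annotation immutable and returning a refined copy makes it easy to compare what the automatic step proposed with what the reviewer finally chose.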
If you're interested in the project, particularly in the semantic annotation guidelines, please contact me.