The GeMTeX Corpus
Take Home Messages
- Problem: There are only four small real-world clinical text corpora in German
- The GeMTeX Corpus:
  - Multi-centric initiative (6 German university hospitals)
  - More than 15k de-identified and 300 semantically annotated documents (as of February 2026)
  - Legal blueprint for similar real-world medical corpus projects under EU law
  - Controlled access to the resource and to structured patient information is possible
Abstract
GeMTeX is a large-scale German Medical Text Corpus project that aims to publish a national clinical reference corpus. The resource is under construction and, as of February 2026, comprises more than 15k clinical documents (20M tokens) from six German university hospitals. In building GeMTeX, care was taken to comply with European regulatory requirements. In phase I, patients were asked to permit the reuse of their clinical documents on the legal basis of informed consent. In phase II, consented documents from six major clinical sites in Germany underwent a thorough de-identification process. In phase III, which is ongoing, we enrich this unlocked dataset with semantic information from the clinical domain. The annotation process is guided by SNOMED CT, which allows expressions in clinical documents to be grounded directly in a globally shared medical documentation and ontology standard. The resource remains under active development and is accessible upon request under controlled access conditions. Interested researchers can visit https://kiinformatik.mri.tum.de/en/gemtex or reach out via gemtex.mi@mh.tum.de.
Cite the paper
Please cite our paper as follows:
@inproceedings{hofenbitzer-etal-2026-developing,
title = {Developing the German Medical Text Corpus (GeMTeX): Legal Compliance and Semantic Enrichment},
author = {Hofenbitzer, Justin and Lohr, Christina and Riedel, Andrea and Kiser, Rebekka and Shutsko, Aliaksandra and Abdelmalak, Abanoub and Klügl, Peter and Romberg, Jutta and Riepenhausen, Sarah and Schechner, Miriam and Faller, Jakob and Meineke, Frank and Modersohn, Luise and Löffler, Markus and Fluck, Juliane and Hahn, Udo and Schulz, Stefan and Boeker, Martin},
booktitle = {Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)},
month = {May},
year = {2026},
pages = {1571--1584},
address = {Palma, Mallorca, Spain},
publisher = {European Language Resources Association (ELRA)},
doi = {10.63317/4eqiegnqbu96}
}