A Transformer-Based Pipeline for German Clinical Document De-Identification

Abstract

Objective: Commercially available large language models such as Chat Generative Pre-Trained Transformer (ChatGPT) cannot be applied to real patient data for data protection reasons. At the same time, manual de-identification of unstructured clinical data is tedious and time-consuming. Since transformer models can efficiently process and analyze large amounts of text, our study explores how a large training dataset affects performance on this task.

Methods: As training data, we used 10,240 German hospital documents from 1,130 patients, created as part of the investigating hospital's routine documentation. Our approach involved fine-tuning and training an ensemble of two transformer-based language models simultaneously to identify sensitive data within these documents. Annotation guidelines with specific annotation categories and types were created for annotator training.

Results: Performance evaluation on a test dataset of 100 manually annotated documents showed that our fine-tuned German ELECTRA (gELECTRA) model achieved a macro-averaged F1 score of 0.95, surpassing the human annotators, who scored 0.93.

Conclusion: We trained and evaluated transformer models to detect sensitive information in real-world German pathology reports and progress notes. Having defined an annotation scheme tailored to the documents of the investigating hospital and created annotation guidelines for staff training, we conducted a further experimental study to compare the models with humans. The best-performing model achieved better overall results than two experienced annotators who manually labeled 100 clinical documents.
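The reported results are macro-averaged F1 scores, meaning F1 is computed per annotation category and then averaged with equal weight per category. The following is a minimal sketch of that calculation at token level, not the evaluation code used in the study. The label names (`NAME`, `DATE`, `O`), the choice to exclude the non-sensitive `O` label from the average, and the helper name `macro_f1` are all assumptions made for illustration.

```python
from collections import Counter


def macro_f1(gold, pred, ignore=("O",)):
    """Macro-averaged F1 over token-level labels.

    `gold` and `pred` are equal-length sequences of label strings.
    Labels listed in `ignore` (here the hypothetical non-sensitive
    label "O") are left out of the average.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")

    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label p where it does not belong
            fn[g] += 1  # missed an occurrence of gold label g

    labels = sorted((set(gold) | set(pred)) - set(ignore))
    scores = []
    for label in labels:
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        denom = precision + recall
        scores.append(2 * precision * recall / denom if denom else 0.0)

    # Every category counts equally, regardless of how often it occurs.
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical toy example: six tokens, two sensitive categories.
gold = ["NAME", "NAME", "O", "DATE", "O", "DATE"]
pred = ["NAME", "O", "O", "DATE", "DATE", "DATE"]
print(round(macro_f1(gold, pred), 3))  # 0.733
```

Because each category contributes equally, a rare category that is detected poorly lowers the macro average as much as a frequent one would, which makes the metric a reasonable fit for de-identification, where missing an uncommon identifier type still matters.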

Publication
Applied Clinical Informatics
Kamyar Arzideh
Associated Researcher

My research interests include NLP.

Giulia Baldini
Associated Researcher

My research interests include NLP.

Felix Nensa
Speaker

My research interests include medical digitalization, computer vision and radiology.

Ahmad Idrissi-Yaghir
Researcher in the first cohort

My research interests include Deep Learning, Natural Language Processing, and Information Retrieval.

René Hosch
Associated Researcher

My research interests include distributed Computer Vision, Generative Adversarial Networks and Image-to-Image translation.
