On the Impact of Cross-Domain Data on German Language Models

Abstract

Traditionally, large language models have been either trained on general web crawls or on domain-specific data. However, recent successes of generative large language models have shed light on the benefits of cross-domain datasets. To examine the significance of prioritizing data diversity over quality, we present a German dataset comprising texts from five domains, along with another dataset aimed at containing high-quality data. Through training a series of models ranging from 122M to 750M parameters on both datasets, we conduct a comprehensive benchmark on multiple downstream tasks. Our findings demonstrate that the models trained on the cross-domain dataset outperform those trained on quality data alone, leading to improvements of up to 4.45% over the previous state-of-the-art.
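
The paper's own data pipeline is not described here, but the core idea of mixing several domains into one pretraining corpus can be illustrated with a minimal sketch. The snippet below assumes five hypothetical domain names and local JSONL files with a "text" field; it uses the Hugging Face `datasets` library to interleave the domains uniformly, so that training batches draw from all five sources rather than from a single "high-quality" one.

```python
# Minimal sketch (not the authors' pipeline): assembling a cross-domain
# German pretraining corpus by interleaving five domain datasets.
from datasets import load_dataset, interleave_datasets

# Hypothetical domain names; the paper's actual five domains and data
# sources are not specified in this sketch.
DOMAINS = ["web", "news", "legal", "medical", "literature"]

def load_domain(name: str):
    # Placeholder loader: assumes a local JSONL file per domain with a
    # "text" field, e.g. web.jsonl, news.jsonl, ...
    return load_dataset("json", data_files=f"{name}.jsonl", split="train")

domain_datasets = [load_domain(d) for d in DOMAINS]

# Interleave the domains with equal sampling probabilities, prioritizing
# data diversity over any single curated source.
cross_domain = interleave_datasets(
    domain_datasets,
    probabilities=[1 / len(DOMAINS)] * len(DOMAINS),
    seed=42,
)

print(cross_domain[0]["text"][:100])
```

Uniform sampling probabilities are only one possible choice; upweighting smaller domains (or downweighting noisy ones) is a common alternative when domain sizes differ substantially.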

Publication
Findings of the Association for Computational Linguistics: EMNLP 2023
Ahmad Idrissi-Yaghir
Researcher in the first cohort

My research interests include Deep Learning, Natural Language Processing, and Information Retrieval.

Christoph M. Friedrich
Co-Speaker

My research interests include Deep Learning, Computer Vision, Radiomics, and Explainable AI.

Jens Kleesiek
Principal Investigator