RoRED: A Romanian Relation Extraction Dataset

George-Andrei Dima; Ilie Cosmin Biltan; Luciana Morogan

doi:10.20944/preprints202605.0578.v1

Submitted: 08 May 2026

Posted: 11 May 2026

Abstract

Relation extraction is an important task for structuring information from unstructured text. However, the Romanian language still lacks dedicated datasets and benchmarks for this task. To address this gap, we introduce RoRED, a Romanian relation extraction dataset built by combining two complementary data construction strategies: translating existing high-quality English resources and applying distant supervision to native Romanian Wikipedia data. For the translated subset, we use an open-source large language model to automatically translate English examples into Romanian. For the native subset, we align Romanian Wikipedia entities with Wikidata relations to obtain naturally occurring Romanian examples. To better reflect real-world relation extraction scenarios, we also introduce synthetic negative examples generated using existing Romanian named entity recognition models. Finally, we validate the dataset by fine-tuning and evaluating multiple baseline models. Our strongest model, LUKE-RoRED, achieves a macro-F1 score of 0.8744 on the RoRED test set, indicating that the dataset can support relation extraction for Romanian. Overall, RoRED provides a strong first native benchmark for Romanian relation extraction.
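As an illustration of the distant supervision idea mentioned in the abstract, the following is a minimal sketch and not the authors' pipeline. It assumes a toy in-memory knowledge base of Wikidata-style facts and plain substring matching for entity mentions; the names `KB` and `distant_label`, the relation labels, and the example sentence are hypothetical.

```python
from itertools import permutations

# Hypothetical toy knowledge base: (head, tail) -> relation label.
# A real setup would draw these facts from Wikidata; these are illustrative only.
KB = {
    ("Mihai Eminescu", "Botoșani"): "place_of_birth",
    ("București", "România"): "country",
}


def distant_label(sentence, entities, kb):
    """Label a sentence with every KB relation whose head and tail both occur in it.

    Assumption: an entity is "mentioned" if its name is a substring of the sentence.
    """
    present = [e for e in entities if e in sentence]
    examples = []
    # Relations are directed, so consider ordered pairs of mentioned entities.
    for head, tail in permutations(present, 2):
        relation = kb.get((head, tail))
        if relation is not None:
            examples.append(
                {"text": sentence, "head": head, "tail": tail, "relation": relation}
            )
    return examples


examples = distant_label(
    "Mihai Eminescu s-a născut în Botoșani.",
    ["Mihai Eminescu", "Botoșani", "România"],
    KB,
)
```

Because the label is inferred from co-occurrence alone, a sentence that mentions both entities without expressing the relation would still be labelled, which is the usual source of noise in distantly supervised data.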

Keywords: relation extraction; Romanian NLP; low-resource languages; dataset construction; distant supervision; machine translation

Subject: Computer Science and Mathematics - Artificial Intelligence and Machine Learning

Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.
