IBERMAT - Corpus of Human and Machine Translated Multi-Domain Content in Basque, Catalan, Galician and Spanish: Description and Exploitation
Abstract
Distinguishing between human- and machine-produced text is crucial for tasks such as authorship verification, content moderation, and quality assessment. We introduce IBERMAT, a novel dataset of human and machine translations across three specialised domains (clinical, legal and literary) and four official languages in Spain (Basque, Catalan, Galician and Spanish), and outline a case study of its exploitation: classifying the origin of a translation. We evaluate three approaches to this task: (1) traditional machine learning pipelines, (2) transformer-based language models fine-tuned with full and low-rank adaptation strategies, and (3) LLMs used for zero-shot classification. The results show that fine-tuned transformers outperform both traditional ML and zero-shot LLMs, although the differences are not substantial. These results highlight both the increasing quality of MT output and the limitations of current models in detecting subtle distinctions, especially when translations may involve post-editing. Our findings also suggest that machine-translated content may be harder to identify than general AI-generated text.


