Discriminative Benchmarking of Spanish Language Models: Findings from the ODESIA Challenge 2024

Alejandro Benito-Santos, Roser Morante, Adrián Ghajari, Iker García-Ferrero, Robiert Sepúlveda-Torres, German Rigau, Rodrigo Agerri, Juan Pablo Consuegra-Ayala, Ernesto L. Estevanell-Valladares, Fabio Yáñez-Romero, Miquel Canal-Esteve, Yoan Gutiérrez, Rafael Muñoz-Guillena, Manuel Palomar, Eva Sánchez Salido, Guillermo Marco, Andrés Fernández García, Víctor Fresno, Enrique Amigó, Laura Plaza, Jorge Carrillo-de-Albornoz, Miguel Lucas, Julio Gonzalo

Abstract

This paper presents the results of the 2024 ODESIA Challenge, a public competition that benchmarks natural language processing (NLP) systems in Spanish across ten discriminative tasks, using a standardized methodology based on private, held-out test sets. The results show that the winning system (Qwen2.5-14B) prevailed because of structural advantages in extractive Question Answering, whereas encoders outperformed LLMs in other tasks such as sequence labeling and soft classification. We conclude that, while generative models may dominate reasoning-heavy tasks involving long contexts, encoder architectures perform on par with them, or even better, in many other discriminative scenarios, challenging the assumption that massive scale universally supersedes specialized architectural design.
