Discriminative Benchmarking of Spanish Language Models: Findings from the ODESIA Challenge 2024
Abstract
This paper presents the results of the 2024 ODESIA Challenge, a public competition that benchmarks natural language processing (NLP) systems in Spanish across ten discriminative tasks, using a standardized methodology based on private, held-out test sets. The results show that the winning system (Qwen2.5-14B) prevailed because of structural advantages in extractive Question Answering, whereas encoders outperformed LLMs in other tasks such as sequence labeling and soft classification. We conclude that, while generative models may dominate reasoning-heavy tasks involving long contexts, encoder architectures achieve comparable or even better performance in many other discriminative scenarios, challenging the assumption that massive scale universally supersedes specialized architectural design.


