Assessing lexical ambiguity resolution in language models with new WiC datasets in Galician and Spanish

Marta Vázquez Abuín, Marcos Garcia

Abstract


Ambiguity resolution, particularly for lexical phenomena such as polysemy, has been a long-standing challenge in NLP. From a computational point of view, this problem has traditionally been addressed through tasks such as word sense disambiguation and, more recently, through Word-in-Context (WiC) datasets, which frame polysemy resolution as a binary classification problem. These datasets play a crucial role in evaluating the lexical capabilities of vector models, but they are available for only a few languages, placing varieties without such resources at a significant disadvantage. This paper introduces WiC datasets for Galician and Spanish, addressing the gap in research on lexical ambiguity resolution for these languages. The datasets comprise a total of 4,300 instances, and their creation follows the guidelines of the original English WiC dataset. Besides introducing the datasets, we present a systematic evaluation of monolingual and multilingual transformer models across layers, exploring aspects such as data overlap, rogue dimensions, and cross-lingual transfer. The results reveal that (i) monolingual and multilingual models achieve comparable accuracy, (ii) vector normalization has little effect on the models’ performance, and (iii) cross-lingual transfer between Galician and Spanish is not effective. Among the evaluated models, Llama 3.2 appears to be the most effective at solving the task.
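To make the evaluation setup concrete, the sketch below illustrates one common way a WiC instance can be scored with a transformer encoder: extract the contextual vector of the target word in each sentence at a chosen layer, optionally normalize it, and compare the two vectors by cosine similarity against a threshold. This is a minimal illustration, not the paper's actual pipeline; the model name, layer index, threshold, and the Galician example sentences are assumptions introduced here for clarity.

```python
"""
Minimal WiC-style scoring sketch (illustrative only, not the authors' setup).
Model name, layer, threshold, and example spans are assumed values.
"""
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "bert-base-multilingual-cased"  # assumed multilingual baseline
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME, output_hidden_states=True)
model.eval()


def word_embedding(sentence, char_start, char_end, layer):
    """Mean-pool the subword vectors covering the target word at a given layer."""
    enc = tokenizer(sentence, return_tensors="pt", return_offsets_mapping=True)
    offsets = enc.pop("offset_mapping")[0]
    with torch.no_grad():
        hidden = model(**enc).hidden_states[layer][0]  # (seq_len, dim)
    # Keep subword tokens whose character span overlaps the target word.
    idx = [i for i, (s, e) in enumerate(offsets.tolist())
           if e > char_start and s < char_end]
    return hidden[idx].mean(dim=0)


def wic_predict(s1, span1, s2, span2, layer=8, threshold=0.6, normalize=True):
    """Return (same_sense, cosine) for a target word in two contexts."""
    v1 = word_embedding(s1, *span1, layer)
    v2 = word_embedding(s2, *span2, layer)
    if normalize:  # unit-length vectors; cosine reduces to a dot product
        v1, v2 = v1 / v1.norm(), v2 / v2.norm()
    sim = torch.dot(v1, v2).item()
    return sim >= threshold, sim


if __name__ == "__main__":
    # Hypothetical Galician pair: "banco" as bench vs. financial institution.
    s1 = "Sentouse no banco do parque."
    s2 = "O banco pechou a miña conta."
    same, sim = wic_predict(s1, (12, 17), s2, (2, 7))
    print(f"same sense: {same} (cosine = {sim:.3f})")
```

In a layer-wise evaluation of the kind described above, the `layer` argument would be swept over all hidden layers and the decision threshold tuned on a development split for each model and layer.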
