Topic Modeling and Word Sense Disambiguation on the Ancora corpus

Ruben Izquierdo, Marten Postma, Piek Vossen


In this paper we present an approach to Word Sense Disambiguation based on Topic Modeling (LDA). Our approach consists of two different steps: first, a binary classifier decides whether the most frequent sense applies; then a second classifier handles the cases where it does not. An exhaustive evaluation is performed on the Spanish corpus Ancora to analyze the performance of our two-step system and the impact of the context and the different system parameters. Our best experiment reaches an accuracy of 74.53, which is 6 points above the highest baseline. All the software developed for these experiments has been made freely available to enable reproducibility and reuse.
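
The two-step decision described in the abstract can be sketched as follows. This is a minimal illustration of the control flow only, not the authors' released software: the function names, the sense labels and the toy classifiers are hypothetical, and the LDA-based models are replaced by trivial stand-ins. The sense list is assumed to be ordered by frequency, most frequent first.

```python
from typing import Callable, Sequence

# Hypothetical representation: a context is the bag of tokens around the target word.
Context = Sequence[str]


def two_step_disambiguate(
    context: Context,
    senses: Sequence[str],
    is_mfs: Callable[[Context], bool],
    pick_non_mfs: Callable[[Context, Sequence[str]], str],
) -> str:
    """Return a sense label; `senses` is assumed sorted by frequency (MFS first)."""
    if not senses:
        raise ValueError("the target word needs at least one candidate sense")
    mfs, rest = senses[0], senses[1:]
    # Step 1: binary decision, most frequent sense or not.
    # A word with a single sense never reaches step 2.
    if not rest or is_mfs(context):
        return mfs
    # Step 2: choose among the remaining (non most frequent) senses only.
    return pick_non_mfs(context, rest)


# Toy stand-ins for the two classifiers (assumptions, not the paper's models).
def toy_is_mfs(context: Context) -> bool:
    return "agua" not in context


def toy_pick_non_mfs(context: Context, candidates: Sequence[str]) -> str:
    return candidates[0]


SENSES = ["banco.finance", "banco.river"]  # hypothetical sense labels
print(two_step_disambiguate(["orilla", "agua"], SENSES, toy_is_mfs, toy_pick_non_mfs))
```

Passing the classifiers in as callables keeps the cascade independent of how each step is trained, so either step can be swapped out when studying its effect on overall accuracy.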
