Lexical Normalization of Spanish Tweets with Rule-Based Components and Language Models

Pablo Ruiz, Montse Cuadros, Thierry Etchegoyhen

Abstract


This paper presents a system for normalizing Spanish tweets. It uses preprocessing rules, a domain-appropriate edit-distance model, and language models to select correction candidates based on context. The system improves on the tool we submitted to the Tweet-Norm 2013 shared task, and its results on the task's test corpus are above average. Additionally, we study the impact on tweet normalization of the system's different components: rule-based, edit-distance-based, and statistical.
