A Part-of-Speech Tag Clustering for a Word Prediction System in Portuguese Language

Daniel Cruz Cavalieri, Teodiano Freire Bastos Filho, Mário Sarcinelli Filho, Sira Elena Palazuelos Cagigas, Javier Macias-Guarasa, José L. Martín Sánchez


This paper presents an automatic method for reducing the part-of-speech tagset used by a word prediction system for Portuguese. The method is based on a similarity measure applied to an association matrix, which is generated by applying an odds ratio association measure to the probability distribution of part-of-speech bigrams (bipos) in a corpus. The results reported in this paper show that the proposed clustering method, with an appropriate threshold on the similarity, has the potential to improve the word prediction system. Moreover, it makes it possible to use new clustering techniques such as fuzzy clustering. The results also show that, when the word prediction system is based on a syntactic model, clustering cannot be performed across the major syntactic categories, even if the resulting clusters seem correct from a linguistic point of view.
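The pipeline described above (odds ratio association over bipos, a similarity measure over the resulting matrix, and a threshold that decides which tags merge) can be illustrated with a minimal Python sketch. Several choices here are assumptions made for illustration and are not taken from the paper: the additive smoothing constant, the use of the logarithm of the odds ratio, cosine similarity as the similarity measure, single-link merging, the threshold value, and the toy tag names and bigram counts.

```python
import math
from collections import Counter


def odds_ratio_matrix(bigrams, tags, smooth=0.5):
    """Log odds ratio for every ordered tag pair (t1 followed by t2).

    The additive smoothing (assumed, not from the paper) avoids division
    by zero when a cell of the 2x2 contingency table is empty.
    """
    counts = Counter(bigrams)
    total = len(bigrams)
    first = Counter(t1 for t1, _ in bigrams)   # how often a tag opens a bigram
    second = Counter(t2 for _, t2 in bigrams)  # how often a tag closes a bigram
    matrix = {}
    for t1 in tags:
        row = []
        for t2 in tags:
            a = counts[(t1, t2)]      # t1 followed by t2
            b = first[t1] - a         # t1 followed by another tag
            c = second[t2] - a        # another tag followed by t2
            d = total - a - b - c     # neither t1 first nor t2 second
            ratio = ((a + smooth) * (d + smooth)) / ((b + smooth) * (c + smooth))
            row.append(math.log(ratio))
        matrix[t1] = row
    return matrix


def cosine(u, v):
    """Cosine similarity between two association rows (assumed measure)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def cluster_tags(matrix, threshold):
    """Merge tags whose row similarity reaches the threshold (single link)."""
    tags = list(matrix)
    parent = {t: t for t in tags}

    def find(t):
        while parent[t] != t:
            parent[t] = parent[parent[t]]  # path compression
            t = parent[t]
        return t

    for i, t1 in enumerate(tags):
        for t2 in tags[i + 1:]:
            if cosine(matrix[t1], matrix[t2]) >= threshold:
                parent[find(t2)] = find(t1)

    clusters = {}
    for t in tags:
        clusters.setdefault(find(t), []).append(t)
    return sorted(sorted(c) for c in clusters.values())


# Hypothetical toy tagset and bipos sample: two noun subtags (N_M, N_F)
# that occur in exactly the same contexts.
TAGS = ["DET", "N_M", "N_F", "V"]
BIGRAMS = (
    [("DET", "N_M")] * 3 + [("DET", "N_F")] * 3
    + [("N_M", "V")] * 3 + [("N_F", "V")] * 3
    + [("V", "DET")] * 3
)

association = odds_ratio_matrix(BIGRAMS, TAGS)
clusters = cluster_tags(association, threshold=0.99)
print(clusters)
```

In this toy sample the two noun subtags have identical association rows, so they fall into a single cluster, while the determiner and verb tags stay separate. Lowering the threshold merges more tags, which is the trade-off that the threshold value controls.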
