Exploring Linguistic Features in a New Readability Corpus for Spanish
Abstract
The reading difficulty of a given text has traditionally been calculated using readability formulas, which measure certain linguistic properties of texts and provide a score. Current methods for automatic readability assessment are mostly based on supervised models that use manually defined linguistic features and are trained on texts classified by readability levels. While reference corpora are available for various languages, existing resources for Spanish are often limited in genre diversity and are primarily designed for tasks such as text simplification or teaching Spanish as a foreign language, which makes them less suitable for training classifiers. This paper presents a new readability corpus for Spanish, which contains 2,563 texts from 11 categories and 68 subcategories, manually classified into four levels of readability. Its compilation and topic selection were specifically designed for adult readers, with a focus on automatic classification tasks. This study also analyzes the most relevant linguistic properties of each level and explores the use of language model surprisal as a readability predictor; its correlation with the levels indicates its usefulness for training automatic classifiers.
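For illustration only, and not the authors' implementation, the sketch below shows one common way to estimate the mean per-token surprisal of a text with a pretrained causal language model via the Hugging Face transformers library; the model name is an assumed example of a Spanish language model.

```python
# Illustrative sketch (assumed setup, not the paper's pipeline):
# mean token surprisal -log2 P(w_i | w_<i) in bits for a Spanish text.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "datificate/gpt2-small-spanish"  # assumed example model
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def mean_surprisal(text: str) -> float:
    """Average per-token surprisal of `text`, in bits."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    # Each position predicts the next token, so shift logits and targets.
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = ids[:, 1:]
    token_log_probs = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    surprisal_bits = -token_log_probs / torch.log(torch.tensor(2.0))
    return surprisal_bits.mean().item()

print(mean_surprisal("El gato duerme en la alfombra."))
```

Under this kind of setup, a text's mean surprisal could be used as a single numeric feature alongside other linguistic features when training a readability classifier.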