Overview of PastReader at IberLEF 2025: Transcribing Texts From the Past

Arturo Montejo-Ráez, Elena Sánchez-Nogales, Gloria Expósito-Álvarez, L. Alfonso Ureña-López, María Teresa Martín-Valdivia, Jaime Collado-Montañez, Manuel Carlos Díaz-Galiano, Isabel Cabrera-de Castro, María Victoria Cantero-Romero, Rocío Ortuño-Casanova

Abstract


The PastReader 2025 task, held within the framework of IberLEF 2025, focuses on the automatic transcription of digitized Spanish historical press. It draws on the Digital Newspaper Library of the National Library of Spain, a collection that is part of the Hispanic Digital Library project and gathers millions of pages of newspapers and magazines representative of the thematic and stylistic diversity of the Hispanic press. Although the documents are available as PDFs with OCR, the quality of the extracted text is often poor because of deteriorated scans, irregular page layouts, archaic spelling, and other visual problems. To further automate transcription, the task proposes two challenges: correcting OCR errors, and generating curated texts from scanned images using multimodal models. The main objective is to reduce the need for human intervention in mass digitization, promoting systems that improve the accessibility, retrieval, and preservation of Spanish newspaper heritage through robust and efficient technological solutions.
