Procesamiento del Lenguaje Natural

Procesamiento del Lenguaje Natural http://journal.sepln.org/sepln/ojs/ojs/index.php/pln <div class="homeText">The main aim of the journal is to give researchers in Natural Language Processing (NLP) an opportunity to present new work, report results, and discuss problems and obstacles encountered in the course of their research. It also provides a forum for exchanging views on the future directions of basic research and applications anticipated by experts, and for contrasting them with the real needs of the market. Finally, it aims to foster in-depth reflection and debate on specific, highly topical subjects such as information extraction, information retrieval, and the evaluation of natural language processing systems. The journal is published twice a year (March and September), with each issue covering the latest advances in NLP. The journal holds the quality seal of the Fundación Española para Ciencia y Tecnología (FECyT), which certifies it as a journal of excellence; it is therefore included in the repository of Spanish scientific journals (RECyT, Repositorio Español de Ciencia y Tecnología) <a href="http://recyt.fecyt.es/index.php/PLN">http://recyt.fecyt.es/index.php/PLN</a>. The journal has also received the ISO9001 quality seal, which accredits it as excellent for a period of three years (March 14, 2012 to March 14, 2015). Procesamiento del Lenguaje Natural (print edition). ISSN: 1135-5948. Procesamiento del Lenguaje Natural (electronic edition). 
ISSN: 1989-7553.</div> Sociedad Española para el Procesamiento del Lenguaje Natural es-ES Procesamiento del Lenguaje Natural 1135-5948 Identification of Complex Words in the Academic Domain in Spanish http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/6851 This document presents a summary of the doctoral thesis conducted by Jenny Alexandra Ortiz Zambrano at the University of Jaén, under the supervision of Dr. Arturo Montejo Ráez. The thesis is framed within the field of Natural Language Processing and addresses the identification of complex words in academic texts in Spanish, a key task for improving reading comprehension and accessibility to information, especially for individuals with reading difficulties. The thesis defense took place on March 12, 2025, at 12:00, in the D1 Graduation Hall of Las Lagunillas Campus of the University of Jaén. The Examination Committee was composed of Dr. Rafael Valencia García (Chair), from the University of Murcia; Dr. Eugenio Martínez Cámara (Secretary), from the University of Jaén; and Dra. Paloma Moreda Pozo (Member), from the University of Alicante. The thesis was awarded the distinction of Cum Laude. Jenny Alexandra Ortiz Zambrano Copyright (c) 2026 Procesamiento del Lenguaje Natural 2026-03-30 2026-03-30 76 323 326 New Avenues in Computational Irony Detection in Social Media http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/6850 PhD thesis in Computer Science focused on irony detection in social media, written by Reynier Ortega Bueno under the supervision of Prof. Paolo Rosso, at the Universitat Politècnica de València. This thesis investigates irony detection as a multifaceted linguistic, computational, and social challenge, addressing multilingual variation, multimodality, and corpus bias. 
The work introduces an attentive LSTM architecture integrating linguistic and deep features for Spanish irony and satire detection, and proposes an end-to-end model combining textual and visual transformers for multimodal irony detection in social media content. This work further analyses topic bias in irony corpora, demonstrating its detrimental impact on model generalisation and showing gains achieved through bias identification and mitigation. The defense took place in Valencia, Spain, on July 25th, 2025. The doctoral committee was composed of Rafael Berlanga Llavori (Universitat Jaume I), Els Lefever (Ghent University), and Tony Veale (University College Dublin). The thesis received an international mention, an excellent qualification, and the distinction of Cum Laude. Reynier Ortega Bueno Copyright (c) 2026 Procesamiento del Lenguaje Natural 2026-03-30 2026-03-30 76 319 322 Lexical borrowing detection as a sequence labeling task: data, models, and evaluation methods for anglicism retrieval in Spanish http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/6849 This article is a summary of the thesis Lexical borrowing detection as a sequence labeling task. Data, modeling and evaluation methods for anglicism retrieval in Spanish, carried out by Elena Álvarez Mellado under the supervision of Julio Gonzalo (UNED) and Constantine Lignos (Brandeis University) in the doctoral program in Intelligent Systems (specialization in Multilingual Information Access) of the Escuela Técnica Superior de Ingeniería Informática at UNED. The thesis was defended on May 27, 2025, in Madrid. The committee was composed of Iria da Cunha (UNED), Javier de la Rosa (Nasjonalbiblioteket) and Mariona Taulé (Universitat de Barcelona). The thesis received the grade of sobresaliente cum laude and was awarded the international mention. 
Elena Álvarez Mellado Copyright (c) 2026 Procesamiento del Lenguaje Natural 2026-03-30 2026-03-30 76 315 318 Computational Approaches to Mental Health Disorders Detection from Social Media Texts, Images and Videos http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/6848 This is the summary of the Ph.D. thesis conducted by Ana-Maria Bucur under the supervision of Prof. Paolo Rosso and Prof. Liviu P. Dinu, developed under cotutelle between the Universitat Politècnica de València and the University of Bucharest. The thesis aimed to use computational models to identify linguistic patterns associated with mental health issues and to contribute to early detection efforts. The thesis defense took place on October 6th, 2025, in Bucharest. The defense committee included Prof. David Enrique Losada Carril (Universidade de Santiago de Compostela), Prof. Arturo Montejo-Ráez (Universidad de Jaén) and Prof. Dragoș Iliescu (University of Bucharest). The work was graded as “Excellent” by both universities. The Universitat Politècnica de València awarded the thesis a “Cum Laude” mention, while the University of Bucharest granted the distinction of “Summa Cum Laude” and international recognition. Ana-Maria Bucur Copyright (c) 2026 Procesamiento del Lenguaje Natural 2026-03-30 2026-03-30 76 311 314 The Many Facets of Hateful Content Detection: From Perspectivism to Bias http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/6847 This is the summary of the PhD thesis in Computer Science written by Giulia Rizzi, under the supervision of Prof. Paolo Rosso and Prof. Elisabetta Fersini. The PhD was conducted under a cotutelle agreement between Universitat Politècnica de València (Spain) and Università degli Studi di Milano-Bicocca (Italy), awarding a double doctoral degree. The thesis defense took place in Milano, Italy, on February 26th, 2025, in the presence of a committee formed by Prof. Alberto Barrón-Cedeño (Università di Bologna, Italy), Prof. 
Craig Macdonald (University of Glasgow, Scotland), Prof. Giacomo Boracchi (Politecnico di Milano, Italy), and Prof. Matteo Palmonari (University of Milan-Bicocca, Italy). The thesis was awarded the distinction of Cum Laude and received the Doctor Europaeus recognition. Giulia Rizzi Copyright (c) 2026 Procesamiento del Lenguaje Natural 2026-03-30 2026-03-30 76 307 310 Just like a woman: A comparative analysis of an LLM and human labels on sexism detection http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/6846 This work presents a comparative analysis of labeling in sexism detection using ambiguous Spanish-language data selected with the Think Twice method. A subset of examples with content related to sexism was annotated by Mexican women from diverse sociocultural backgrounds and contrasted with labels produced by an LLM. The results show low agreement both among humans and between humans and the LLM, reflecting the interpretative variability inherent in subjective tasks. Despite this variability, the model tends to approximate the average human judgment. These findings highlight the need for annotation schemes and classification approaches that account for cultural and linguistic diversity rather than forcing a single correct interpretation in sensitive tasks such as sexism detection. Metztli Ramírez-González Delia Irazú Hernández-Farías Manuel Montes-y-Gómez Copyright (c) 2026 Procesamiento del Lenguaje Natural 2026-03-30 2026-03-30 76 293 303 Wikipedia used as a semantic tagger: some preliminary results in Spanish http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/6845 This paper describes a method based on data from Wikipedia for the automatic semantic tagging of common and proper nouns in context. 
We first predict the semantic category of each Wikipedia entry using a rule-based method that detects definition patterns, and then we generalize from there using a statistical model that associates semantic categories with elements of the entry. The evaluation of proper and common nouns in Spanish reveals a general precision of .82 and a recall of .77. One feature of the method is its conceptual simplicity and computational efficiency. The implementation is offered as open-source code and the data used in the study is in the public domain. Rogelio Nazar Irene Renau Copyright (c) 2026 Procesamiento del Lenguaje Natural 2026-03-30 2026-03-30 76 279 292 TripLegal-CL: A Multi-Jurisdictional Spanish Legal Corpus for Contrastive Training of Dense Retrieval Models http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/6844 Dense legal case retrieval in Spanish requires a structured dataset to train bi-encoder models. However, most existing Spanish legal resources have been designed for classification or entity extraction tasks and do not provide training data tailored to dense retrieval. In this work, we present TripLegal-CL, a multi-jurisdictional corpus of 592,382 contrastive instances structured for contrastive learning, automatically generated from 148,637 publicly available legal documents using an LLM. On this basis, to assess the usefulness of the resource, we fine-tune multilingual bi-encoder models through contrastive learning using the generated data and compare them with their baseline versions. The fine-tuned models achieve improvements of up to +18.2 percentage points in Acc@1 and +15.3 percentage points in MAP@100. These results confirm that the corpus provides effective training data for the contrastive fine-tuning of dense retrievers in the legal domain. 
Wilfredo Ivan Martel Socola Christian Raul Salamea Palacios Copyright (c) 2026 Procesamiento del Lenguaje Natural 2026-03-30 2026-03-30 76 235 277 Exploring the Impact of Linguistic Features on Pre-trained Models for Machine-generated Text Detection in Spanish http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/6843 Detecting machine-generated Spanish text remains challenging across domains and generators. Pre-trained models like RoBERTa provide strong contextual embeddings but often underperform on human-authored texts and are sensitive to domain shifts. In this work, we integrate linguistic features from PUCPMetrix (covering lexical, syntactic, semantic, psycholinguistic, and cohesion properties) with pre-trained models. We evaluate feature-based classifiers, fine-tuned RoBERTa, hybrid models, and ensembles on the AuTexTification dataset. Hybrid models improve human-text detection (F1 65.49 vs. 60.74 for RoBERTa) and machine-text classification (F1 81.76), while a voting ensemble achieves the highest macro-F1 (74.75) and strongest robustness. Analyses indicate linguistic features provide stable, interpretable anchors, reducing overfitting and enhancing generalization across LLM outputs. Results demonstrate that combining linguistic and pre-trained models yields a robust solution for Spanish machine-generated text detection. Javier Alonso Villegas Luis Marco Antonio Sobrevilla Cabezudo Copyright (c) 2026 Procesamiento del Lenguaje Natural 2026-03-30 2026-03-30 76 251 263 Preserving Identity in Speech: Anonymized Transcription for the Colombian Context http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/6842 Speech has motivated the development of Automatic Speech Recognition (ASR) models such as Whisper, which convert speech into written text. 
However, these models require large volumes of data (corpora), which limits their performance on low-resource languages and varieties such as Colombian Spanish, whose accents and regionalisms are underrepresented. In addition, recordings often contain sensitive information, such as names or identification numbers, which hinders the collection and sharing of these corpora. This work proposes a model based on the Whisper architecture and the WhisperX workflow for anonymized speech transcription in Colombian Spanish, with timestamp annotation and speaker diarization. With models that reach a word error rate (WER) of 7.60%, an F1-score of 60.81% for entity recognition, and an F1-score of 76.10% for anonymization, the work helps close the gap between existing models and Colombian dialects, ensuring robust performance even in data-scarce settings. Andrea Juliana Parra Ariza Hoover Rueda-Chacón Copyright (c) 2026 Procesamiento del Lenguaje Natural 2026-03-30 2026-03-30 76 239 250