LLM for Untargeted Adversarial Attack Against Language Models in Spanish

Adrián Moreno-Muñoz, L. Alfonso Ureña-López, Eugenio Martínez-Cámara

Abstract


Language models face inherent security vulnerabilities: even subtle input modifications can manipulate their outputs, and these weaknesses represent a significant concern. This research explores untargeted adversarial attacks against Spanish language models using a two-stage approach: first identifying the words most influential in the model's decision-making process, then replacing them with appropriate synonyms. The evaluation of the attack against pre-trained Spanish language models reveals that generative models, guided by XAI-selected salient words, can significantly alter their predictions.
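The two-stage approach outlined above can be sketched as follows. This is a minimal illustration, not the authors' implementation: a toy keyword classifier stands in for a Spanish language model, leave-one-out scoring stands in for the XAI attribution method, and the synonym table is hypothetical.

```python
# Hypothetical sketch of the two-stage untargeted attack: (1) rank words by
# influence on the model's output, (2) replace the most influential word with
# a synonym. A real attack would query a pre-trained Spanish model and use an
# XAI attribution method; here a toy classifier stands in for both.

def toy_score(text):
    """Stand-in for a sentiment classifier: fraction of 'positive' words."""
    positive = {"excelente", "bueno", "maravilloso"}
    words = text.lower().split()
    return sum(1 for w in words if w in positive) / max(len(words), 1)

def word_importance(text):
    """Stage 1: leave-one-out saliency — drop each word, measure the change."""
    base = toy_score(text)
    words = text.split()
    scores = []
    for i in range(len(words)):
        reduced = " ".join(words[:i] + words[i + 1:])
        scores.append((abs(base - toy_score(reduced)), i))
    return sorted(scores, reverse=True)  # most influential word first

# Hypothetical synonym table; a real system might use a Spanish thesaurus
# or a generative model to propose replacements.
SYNONYMS = {"excelente": "sobresaliente", "bueno": "decente"}

def attack(text):
    """Stage 2: swap the most salient word that has a known synonym."""
    words = text.split()
    for _, i in word_importance(text):
        syn = SYNONYMS.get(words[i].lower())
        if syn is not None:
            words[i] = syn
            break
    return " ".join(words)

print(attack("un producto excelente y fiable"))
# → un producto sobresaliente y fiable
```

In this toy run, "excelente" is identified as the most influential word and replaced, which changes the classifier's score while keeping the sentence fluent; the paper's attack applies the same idea with real saliency scores and model-generated substitutions.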
