LLM for Untargeted Adversarial Attack Against Language Models in Spanish

Adrián Moreno-Muñoz, L. Alfonso Ureña-López, Eugenio Martínez-Cámara

Abstract


Language models face inherent security vulnerabilities: even subtle input modifications can manipulate their outputs, and these weaknesses represent a significant concern. This research explores untargeted adversarial attacks against Spanish language models using a two-stage approach: first identifying the words most influential in the model's decision-making process, then replacing them with appropriate synonyms. The evaluation of the attack against pre-trained Spanish language models reveals that generative models, guided by XAI-selected salient words, can significantly alter their predictions.
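The two-stage approach outlined above can be sketched as follows. This is a minimal illustration, not the authors' implementation: a toy keyword classifier stands in for a Spanish language model, leave-one-out scoring stands in for the XAI attribution method, and the synonym table is hypothetical.

```python
# Hypothetical sketch of the two-stage untargeted attack: (1) rank words by
# influence on the model's output, (2) replace the most influential word with
# a synonym. A real attack would query a pre-trained Spanish model and use an
# XAI attribution method; here a toy classifier stands in for both.

def toy_score(text):
    """Stand-in for a sentiment classifier: fraction of 'positive' words."""
    positive = {"excelente", "bueno", "maravilloso"}
    words = text.lower().split()
    return sum(1 for w in words if w in positive) / max(len(words), 1)

def word_importance(text):
    """Stage 1: leave-one-out saliency — drop each word, measure the change."""
    base = toy_score(text)
    words = text.split()
    scores = []
    for i in range(len(words)):
        reduced = " ".join(words[:i] + words[i + 1:])
        scores.append((abs(base - toy_score(reduced)), i))
    return sorted(scores, reverse=True)  # most influential word first

# Hypothetical synonym table; a real system might use a Spanish thesaurus
# or a generative model to propose replacements.
SYNONYMS = {"excelente": "sobresaliente", "bueno": "decente"}

def attack(text):
    """Stage 2: swap the most salient word that has a known synonym."""
    words = text.split()
    for _, i in word_importance(text):
        syn = SYNONYMS.get(words[i].lower())
        if syn is not None:
            words[i] = syn
            break
    return " ".join(words)

print(attack("un producto excelente y fiable"))
# → un producto sobresaliente y fiable
```

In this toy run, "excelente" is identified as the most influential word and replaced, which changes the classifier's score while keeping the sentence fluent; the paper's attack applies the same idea with real saliency scores and model-generated substitutions.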
