Automatic and Manual Evaluation of a Spanish Suicide Information Chatbot
Abstract
Chatbots have great potential in sensitive fields such as mental health; however, careful evaluation, whether by manual or automatic methods, is essential to ensure the reliability of these systems. In this work, we present a library for automatically evaluating Spanish Retrieval Augmented Generation (RAG) chatbots using Large Language Models (LLMs). We then conduct a thorough analysis of several candidate LLMs for a RAG system that provides suicide prevention information. To that end, we use a manual evaluation, an automatic metric-based evaluation, and an automatic LLM-based evaluation. All evaluation methods agree on a preferred model, but they exhibit subtle differences: automatic methods may overlook unsafe answers; the metric-based automatic methods correlate with human evaluation on precision and completeness but not on faithfulness; and some LLM-based automatic methods fail to detect certain errors. Overall, even though automatic methods can reduce the manual evaluation effort, manual evaluation remains essential, particularly in sensitive contexts such as mental health.