Assessing a Literary RAG System with a Human-Evaluated Synthetic QA Dataset Generated by an LLM: Experiments with Knowledge Graphs

Yanco Amor Torterolo-Orta, Sofía Micaela Roseti, Antonio Moreno-Sandoval

Abstract


This paper explores the use of an LLM-generated QA dataset to evaluate a RAG system, and presents experiments involving Knowledge Graphs to improve retrieval over literary pieces in the context of Digital Humanities. The RAG system leverages a custom Neo4j database containing the text of the Spanish literary work Trafalgar, by Benito Pérez Galdós. This posed the challenge of finding a suitable evaluation method for the system, which led to the generation of a synthetic dataset from the same book. Several models were used to create different versions of the dataset, which were then evaluated by four linguists (human evaluation), enabling comparisons between models. DeepEval RAG metrics were used to evaluate the system with the dataset version that obtained the highest score. Additionally, this work describes some retrieval techniques, such as text-to-Cypher generation and few-shot prompting.
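As a rough illustration of the text-to-Cypher and few-shot prompting techniques mentioned above, the sketch below assembles a few-shot prompt that asks an LLM to translate a natural-language question into a Cypher query over a book graph. The graph schema, example questions, and queries are hypothetical placeholders, not taken from the paper's actual Neo4j database.

```python
# Hypothetical sketch of few-shot prompting for text-to-Cypher generation.
# The schema (Character, Chapter nodes; APPEARS_IN relationship) and the
# example question/query pairs are illustrative assumptions only.

FEW_SHOT_EXAMPLES = [
    ("Which characters appear in chapter 3?",
     "MATCH (c:Character)-[:APPEARS_IN]->(ch:Chapter {number: 3}) "
     "RETURN c.name"),
    ("Who is the narrator of the story?",
     "MATCH (c:Character {role: 'narrator'}) RETURN c.name"),
]

def build_text_to_cypher_prompt(question: str) -> str:
    """Assemble a few-shot prompt for translating a question into Cypher."""
    lines = [
        "Translate each question into a Cypher query over the book graph.",
        "",
    ]
    # Each worked example shows the model the expected question -> query mapping.
    for q, cypher in FEW_SHOT_EXAMPLES:
        lines.append(f"Question: {q}")
        lines.append(f"Cypher: {cypher}")
        lines.append("")
    # The new question goes last; the model completes the final "Cypher:" line.
    lines.append(f"Question: {question}")
    lines.append("Cypher:")
    return "\n".join(lines)
```

In a full pipeline, the string returned by `build_text_to_cypher_prompt` would be sent to the LLM, and the generated Cypher would be executed against the Neo4j database to retrieve context for the RAG system.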
