Generating Multiple-Choice Questions in Spanish and Basque using LLMs: A Comparative Manual Evaluation

Maddalen López de Lacalle, Xabier Saralegi, Aitzol Saizar

Abstract


Multiple-Choice Questions (MCQs) are widely applied across various domains, such as education and assessing the technical skills of staff in companies. However, creating such questions manually is challenging and time-consuming, especially for specialized fields. In this paper, we explore how generative large language models (LLMs) can be leveraged to generate MCQs from instructional texts that serve as tests for vocational qualification assessment. We focus on two topics, basic first aid and production scheduling in companies, for which we created two datasets of parallel course texts in Spanish and Basque. The manual evaluation reveals that both the open-source instruction-tuned Llama3 models (8B and 70B) and the proprietary GPT-4o can generate MCQs of acceptable quality in a zero-shot setting for Spanish. No significant differences were observed in performance based on model size or licensing type, with performance rates of 91%, 84%, and 80% for GPT-4o, Llama3-70B, and Llama3-8B, respectively. However, the results for Basque show a marked decline, with performance dropping to 70% for GPT-4o and 59% for Llama3-70B, and a notably low 27% for Llama3-8B. Finally, few-shot generation using the Basque-adapted Llama-eus-8B foundation model shows promising potential.
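The zero-shot setup described in the abstract can be illustrated with a minimal prompt-construction sketch. This is an assumption-laden illustration, not the paper's actual prompt or evaluation protocol; the function name, wording, and parameters are all hypothetical.

```python
def build_mcq_prompt(source_text: str, n_questions: int = 3,
                     language: str = "Spanish") -> str:
    """Build an illustrative zero-shot prompt asking an LLM to generate
    MCQs from an instructional text (hypothetical; not the paper's exact
    prompt). The string would be sent as a single user message to a chat
    model such as GPT-4o or an instruction-tuned Llama3."""
    return (
        f"You are an expert exam writer. Based on the following "
        f"instructional text, write {n_questions} multiple-choice "
        f"questions in {language}. Each question must have exactly one "
        f"correct answer and three plausible distractors, labelled A-D. "
        f"Mark the correct option.\n\n"
        f"Text:\n{source_text}"
    )

# Example usage with a first-aid style passage
prompt = build_mcq_prompt(
    "Apply direct pressure with a clean cloth to stop external bleeding.",
    n_questions=2,
)
print(prompt)
```

In a few-shot variant (as explored with Llama-eus-8B), a handful of worked text-to-MCQ examples would be prepended to the prompt before the target text.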

Full text:

PDF