Creating an Image Description Model Specialized in Greek Archaeology

Enrique Garcia-Arias, Ana Garcia-Serrano

Abstract


The automatic generation of image descriptions (image captioning, IM) has advanced significantly in recent years with the integration of LLMs (Large Language Models). In general-purpose contexts the results are quite accurate, but specialized domains remain substantially more challenging, as the Arqueogriegos project illustrates. The multimodal corpus of this study comprises photos, plans, and texts from an archaeological context, covering sites, artifacts, and their historical environment. This domain is particularly complex because the images are decontextualized and lack an adequate descriptive text (caption), which makes them difficult to interpret. The primary objective of this study is to generate optimized automatic descriptions that bridge the disconnect between images and texts and overcome the limitations of isolated archaeological images. Rather than relying on off-the-shelf solutions or APIs, which proved insufficient for the complexity of the problem, a new methodology was designed: the task is broken down into phases according to its key components, and at each phase the candidate solutions are evaluated and the most effective one is implemented. This methodology is the main contribution of the work, as it overcomes the shortcomings of existing IM models and multimodal LLMs.
