TripLegal-CL: A Multi-Jurisdictional Spanish Legal Corpus for Contrastive Training of Dense Retrieval Models
Abstract
Training bi-encoder models for dense legal case retrieval in Spanish requires suitably structured data. However, most existing Spanish legal resources were designed for classification or entity extraction and do not provide training data tailored to dense retrieval. In this work, we present TripLegal-CL, a multi-jurisdictional corpus of 592,382 instances structured for contrastive learning, generated automatically with a large language model (LLM) from 148,637 publicly available legal documents. To assess the usefulness of the resource, we fine-tune multilingual bi-encoder models on the generated data through contrastive learning and compare them with their baseline versions. The fine-tuned models improve by up to +18.2 percentage points in Acc@1 and +15.3 percentage points in MAP@100. These results indicate that the corpus provides effective training data for the contrastive fine-tuning of dense retrievers in the legal domain.
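For readers unfamiliar with the two reported metrics, the following is a minimal sketch of how Acc@1 and MAP@k are commonly computed from ranked retrieval results. It is an illustration under stated assumptions, not the evaluation code used in this work: the function names, the input layout (one ranked list of document identifiers and one set of relevant identifiers per query), and the normalization of average precision by min(number of relevant documents, k) are all assumptions, and other MAP@k conventions exist.

```python
from typing import Sequence, Set


def acc_at_1(ranked: Sequence[Sequence[str]], relevant: Sequence[Set[str]]) -> float:
    """Fraction of queries whose top-ranked document is relevant."""
    hits = sum(1 for docs, rel in zip(ranked, relevant) if docs and docs[0] in rel)
    return hits / len(ranked)


def map_at_k(ranked: Sequence[Sequence[str]], relevant: Sequence[Set[str]], k: int = 100) -> float:
    """Mean over queries of average precision computed on the top-k results.

    Assumption: average precision is normalized by min(|relevant|, k).
    """
    total = 0.0
    for docs, rel in zip(ranked, relevant):
        hits = 0
        precision_sum = 0.0
        for rank, doc in enumerate(docs[:k], start=1):
            if doc in rel:
                hits += 1
                precision_sum += hits / rank  # precision at this rank
        denom = min(len(rel), k)
        total += precision_sum / denom if denom else 0.0
    return total / len(ranked)


# Toy example with hypothetical document identifiers.
ranked_lists = [["a", "b", "c"], ["x", "y", "z"]]
relevant_sets = [{"a"}, {"z"}]
print(acc_at_1(ranked_lists, relevant_sets))
print(map_at_k(ranked_lists, relevant_sets, k=100))
```

In the toy example, the first query has its relevant document at rank 1 and the second at rank 3, so only the first query counts toward Acc@1, while both contribute to MAP@k in proportion to how early the relevant document appears.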


