EriBERTa Private Surpasses her Public Alter Ego: Enhancing a Bilingual Pretrained Encoder with Limited Private Medical Data
Abstract
The secondary use of clinical reports is essential for improving patient care. While NLP tools have become instrumental in extracting insights from such reports, domain-specific language models for clinical Spanish remain scarce. To address this gap, we introduce EriBERTa, the first open-source bilingual clinical language model for English and Spanish, designed to advance clinical NLP in under-resourced settings. We evaluate its performance along multiple dimensions: public versus proprietary pretraining data, data availability, and cross-lingual transfer. Results show that pretraining on in-domain Electronic Health Records yields strong gains, especially on complex tasks such as clinical document section identification. EriBERTa also performs well on monolingual tasks and transfers effectively across languages, making it a valuable tool for multilingual clinical NLP. The model is publicly released to support further research.
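Since the model is publicly released, a minimal usage sketch may help readers get started. The sketch below assumes the encoder is hosted on the Hugging Face Hub under a repository identifier such as `HiTZ/EriBERTa-base`; that identifier is an assumption and should be replaced with the name given in the paper's release.

```python
# Minimal sketch: probing the released bilingual clinical encoder with a
# fill-mask query via Hugging Face transformers.
# NOTE: "HiTZ/EriBERTa-base" is an assumed repository name, not confirmed
# by this abstract; substitute the actual released checkpoint.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_id = "HiTZ/EriBERTa-base"  # assumption
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

# A clinical-style Spanish sentence with one masked token.
text = f"El paciente presenta dolor {tokenizer.mask_token} en el abdomen."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Locate the mask position and list the five most likely fillers.
mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
top_ids = logits[0, mask_pos].topk(5).indices[0].tolist()
print(tokenizer.convert_ids_to_tokens(top_ids))
```

Because the abstract describes a standard bilingual encoder, the same loading pattern should apply to downstream fine-tuning (e.g., token classification for section identification) by swapping in the corresponding `AutoModelFor...` head.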


