Universal Dependencies annotation of Old English with spaCy and MobileBERT. Evaluation and perspectives

Javier Martín Arista, Ana Elvira Ojanguren López, Sara Domínguez Barragán

Abstract


The aim of this article is to assess three training corpora of Old English and three configurations and training procedures with respect to their performance on the task of automatic annotation with Universal Dependencies (UD, Nivre et al., 2016). The method is designed to determine to what extent corpus size improves results and which configuration yields the best metrics. The training methods comprise a pipeline with the default configuration, pre-training of the tok2vec stage, and a transformer-based language model. For each training method, training corpora of four different sizes are tested: 1,000, 5,000, 10,000, and 20,000 words. The training and evaluation corpora are based on ParCorOEv2 (Martín Arista et al., 2021). The results can be summarised as follows. Larger training corpora result in improved performance at all stages of the pipeline, especially in POS tagging and dependency parsing. Pre-training the tok2vec stage yields better results than the default pipeline. It can be concluded that performance could improve with more training data or by fine-tuning the models. However, even with the limited training data selected for this study, satisfactory results have been obtained for the task of automatically annotating Old English with UD.
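Since the abstract reports POS tagging and dependency parsing metrics obtained with spaCy pipelines, the following minimal sketch illustrates how such an evaluation can be run on a held-out UD corpus in spaCy v3. It is not the authors' own script; the file paths ("training/model-best", "corpus/dev.spacy") are hypothetical placeholders standing in for a trained model and an evaluation set derived from ParCorOEv2.

```python
import spacy
from spacy.tokens import DocBin
from spacy.training import Example

# Load a trained pipeline from disk (hypothetical path to a model trained
# on one of the Old English training corpora).
nlp = spacy.load("training/model-best")

# Read the gold-standard evaluation corpus in spaCy's binary format
# (hypothetical path; the study derives its data from ParCorOEv2).
doc_bin = DocBin().from_disk("corpus/dev.spacy")
gold_docs = list(doc_bin.get_docs(nlp.vocab))

# Pair each gold document with an unannotated copy of its text;
# nlp.evaluate() runs the pipeline on the predicted side and scores it
# against the reference annotations.
examples = [Example(nlp.make_doc(gold.text), gold) for gold in gold_docs]
scores = nlp.evaluate(examples)

print("Tagging accuracy:", scores.get("tag_acc"))
print("Unlabelled attachment score (UAS):", scores.get("dep_uas"))
print("Labelled attachment score (LAS):", scores.get("dep_las"))
```

The tok2vec pre-training that the article finds beneficial would typically be carried out beforehand with spaCy's pretrain command, with the resulting weights passed to training through the init_tok2vec setting of the configuration file; the exact configuration used in the study is not reproduced here.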
