Constructing Corpus and Word Embedding for Spanish Covid-19 Data

Kyungjin Hwang


Severe acute respiratory syndrome coronavirus 2 (COVID 19), colloquially referred to as coronavirus, escalated into a global pandemic with severe transmission and mortality rates in 2019. Despite the escalation of the virus’ worldwide impact in 2020, numerous studies on Natural Language Processing in Spanish have neglected corpus construction or word embedding, especially conspicuous in its absence being the corpora involving coronavirus or infectious diseases. Additionally, corpus construction or word embedding conducted in the medical field do not display efficacy in production pertaining to coronavirus or infectious diseases. To supplement this potentially detrimental insufficiency, this study collects Spanish Language data to build a relevant coronavirus corpus through appropriate preprocessing and then obtains a word embedding. Performance of the corpus and word embedding are then tested through word similarity evaluations, a cosine similarity evaluation, and a visualization evaluation with the existing Spanish corpus. After comparison, corpus and word embedding suitable for coronavirus will be suggested.

Texto completo: