Collecting and POS-tagging a lexical resource of Japanese biomedical terms from a corpus

Carlos Herrero-Zorita, Leonardo Campillos-Llanos, Antonio Moreno-Sandoval


This paper describes the methodology used to create a morphologically tagged medical lexicon for Japanese. To build the resource, we took into account the morphosyntactic characteristics of the language as well as the origins and formation of its medical terms. We then compiled a term list using the Japanese MultiMedica corpus, special tags from a POS tagger, and several specialised medical dictionaries. After considering three taggers (ChaSen, MeCab, JUMAN), we chose JUMAN to tag the lexicon. Finally, we corrected oversegmented terms and normalised the tags. This resource is the base component of a medical term extractor.
