Parallel corpus alignment at the document, sentence and vocabulary levels

Rogelio Nazar


This paper presents a language-independent algorithm for the alignment of parallel corpora
at the document, sentence and vocabulary levels, using the corpus to be aligned as the only source of information. The input is a set of documents written in two unknown languages, A and B,
where every document in language A has its corresponding translation into language B.
The problem thus consists of: 1) dividing the set of documents into the two languages;
2) aligning at the document level to determine which document in language A is the
original (or translation) of each document in language B; 3) aligning at the sentence level to determine
which sentence in the original corresponds to each sentence in the translation; and 4)
aligning at the vocabulary level to determine which word in one language is the equivalent
of each word in the other language. The algorithm is iterative, using the resulting bilingual
vocabulary to re-align the corpus. Evaluation figures in English, Spanish and French show competitive results at all levels.
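
As an illustration of the vocabulary-level step only, the following is a minimal sketch of how word equivalents might be extracted from sentence pairs that are already aligned. It is not the algorithm of this paper: it assumes a generic co-occurrence association (the Dice coefficient), whitespace tokenization, and a hypothetical function name, dice_vocabulary, chosen here for illustration.

```python
from collections import Counter
from itertools import product


def dice_vocabulary(pairs):
    """Map each word of language A to its best-scoring candidate in language B.

    `pairs` is a list of (sentence_a, sentence_b) strings that are assumed
    to be aligned already. The Dice coefficient is an assumption of this
    sketch, not the association measure used in the paper.
    """
    freq_a, freq_b, joint = Counter(), Counter(), Counter()
    for sent_a, sent_b in pairs:
        # Count each word type once per sentence (naive whitespace tokens).
        words_a = set(sent_a.lower().split())
        words_b = set(sent_b.lower().split())
        freq_a.update(words_a)
        freq_b.update(words_b)
        # Every A-word co-occurs with every B-word of the paired sentence.
        joint.update(product(words_a, words_b))

    best = {}
    for (word_a, word_b), n in joint.items():
        # Dice = 2 * joint count / (count in A + count in B).
        score = 2 * n / (freq_a[word_a] + freq_b[word_b])
        if word_a not in best or score > best[word_a][1]:
            best[word_a] = (word_b, score)
    return best


if __name__ == "__main__":
    toy_pairs = [
        ("the house", "la casa"),
        ("the dog", "el perro"),
        ("a house", "una casa"),
    ]
    for word, (candidate, score) in sorted(dice_vocabulary(toy_pairs).items()):
        print(word, candidate, round(score, 3))
```

In an iterative setting such as the one described above, a vocabulary obtained in this way could be fed back to score candidate sentence pairs in the next pass. Words with tied scores are resolved arbitrarily in this sketch.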

