Open Access. Powered by Scholars. Published by Universities.®
Articles 1 - 1 of 1
Full-Text Articles in Entire DC Network
Data Selection Using Topic Adaptation For Statistical Machine Translation, Hitokazu Matsushita
Data Selection Using Topic Adaptation For Statistical Machine Translation, Hitokazu Matsushita
Theses and Dissertations
Statistical machine translation (SMT) requires large quantities of bitexts (i.e., bilingual parallel corpora) as training data to yield good quality translations. While obtaining a large amount of training data is critical, the similarity between training and test data also has a significant impact on SMT performance. Many SMT studies define data similarity in terms of domain-overlap, and domains are defined to be synonymous with data sources. Consequently, the SMT community has focused on domain adaptation techniques that augment small (in-domain) datasets with large datasets from other sources (hence, out-of-domain, per the definition). However, many training datasets consist of topically diverse …