Open Access. Powered by Scholars. Published by Universities.®

Digital Commons Network

Computer Sciences

Brigham Young University

Theses/Dissertations

2015

Data selection

Data Selection Using Topic Adaptation For Statistical Machine Translation, Hitokazu Matsushita Nov 2015

Theses and Dissertations

Statistical machine translation (SMT) requires large quantities of bitexts (i.e., bilingual parallel corpora) as training data to yield good-quality translations. While obtaining a large amount of training data is critical, the similarity between training and test data also has a significant impact on SMT performance. Many SMT studies define data similarity in terms of domain overlap and treat domains as synonymous with data sources. Consequently, the SMT community has focused on domain adaptation techniques that augment small in-domain datasets with large datasets from other sources (which are, by this definition, out-of-domain). However, many training datasets consist of topically diverse …