Open Access. Powered by Scholars. Published by Universities.®

Physical Sciences and Mathematics Commons

Brigham Young University

Faculty Publications

2007

Corpora data

Full-Text Articles in Physical Sciences and Mathematics

ADtrees for Sequential Data and N-Gram Counting, Robert Van Dam, Dan A. Ventura, Oct 2007

We consider the problem of efficiently storing n-gram counts for large n over very large corpora. In such cases, the efficient storage of sufficient statistics can have a dramatic impact on system performance. One popular structure for storing such statistics, when they are derived from tabular data sets with many attributes, is the ADtree. Here, we adapt the ADtree to exploit the sequential structure of corpus data. We demonstrate the usefulness of our approach on a portion of the well-known Wall Street Journal corpus from the Penn Treebank and show that our approach is exponentially more efficient than the naïve approach to …
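
For orientation, the n-gram counting task described in the abstract can be sketched with a flat table keyed by the full n-gram. This is only an illustrative baseline under stated assumptions: it is not the ADtree adaptation, and it is not necessarily the naïve approach the authors compare against. The function name `count_ngrams` and the toy token list are hypothetical.

```python
from collections import Counter


def count_ngrams(tokens, n):
    """Count every contiguous n-gram in a token sequence using a flat table.

    Illustrative baseline only (an assumption, not the paper's method).
    """
    # Each key is a tuple of n consecutive tokens; the value is its frequency.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


# Hypothetical toy corpus, used only for illustration.
tokens = "the cat sat on the cat mat".split()
bigrams = count_ngrams(tokens, 2)
```

A flat table like this stores one entry per distinct n-gram, so the number of possible keys grows quickly with n. That growth is the storage pressure the abstract refers to when it speaks of large n over very large corpora.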