Open Access. Powered by Scholars. Published by Universities.®

Physical Sciences and Mathematics Commons

University of Richmond

Series

2017

Data analysis

Full-Text Articles in Physical Sciences and Mathematics

A Tidy Data Model for Natural Language Processing Using cleanNLP, Taylor B. Arnold, Dec 2017

Department of Math & Statistics Faculty Publications

Recent advances in natural language processing have produced libraries that extract low-level features from a collection of raw texts. These features, known as annotations, are usually stored internally in hierarchical, tree-based data structures. This paper proposes a data model that represents annotations as a collection of normalized relational data tables optimized for exploratory data analysis and predictive modeling. The R package cleanNLP, which calls one of two state-of-the-art NLP libraries (CoreNLP or spaCy), is presented as an implementation of this data model. It takes raw text as input and returns a list of normalized tables. …
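To make the idea of normalized annotation tables concrete, here is a minimal sketch in plain Python. It is not the cleanNLP API and does not call CoreNLP or spaCy. The nested input, the `flatten_annotations` helper, the table names (`document`, `token`, `entity`), and the column names (`doc_id`, `sid`, `tid`, and so on) are all assumptions made up for illustration. The sketch only shows how a hierarchical annotation might be flattened into flat tables linked by shared keys.

```python
from collections import Counter

# Hypothetical hierarchical annotation for one document (hand-written,
# not produced by any real NLP library).
annotations = [
    {
        "doc_id": "d1",
        "sentences": [
            {
                "tokens": [
                    {"word": "Dogs", "lemma": "dog", "upos": "NOUN"},
                    {"word": "bark", "lemma": "bark", "upos": "VERB"},
                ],
                "entities": [],
            },
            {
                "tokens": [
                    {"word": "Ada", "lemma": "Ada", "upos": "PROPN"},
                    {"word": "Lovelace", "lemma": "Lovelace", "upos": "PROPN"},
                    {"word": "wrote", "lemma": "write", "upos": "VERB"},
                    {"word": "programs", "lemma": "program", "upos": "NOUN"},
                ],
                # Entity span given as 1-based token positions in the sentence.
                "entities": [{"start": 1, "end": 2, "type": "PERSON"}],
            },
        ],
    }
]


def flatten_annotations(docs):
    """Flatten nested annotations into three flat tables (lists of rows).

    Rows are linked by the assumed keys doc_id (document), sid (sentence
    index) and tid (token index within the sentence).
    """
    tables = {"document": [], "token": [], "entity": []}
    for doc in docs:
        tables["document"].append({"doc_id": doc["doc_id"]})
        for sid, sentence in enumerate(doc["sentences"], start=1):
            for tid, token in enumerate(sentence["tokens"], start=1):
                row = {"doc_id": doc["doc_id"], "sid": sid, "tid": tid}
                row.update(token)
                tables["token"].append(row)
            for entity in sentence["entities"]:
                tables["entity"].append(
                    {
                        "doc_id": doc["doc_id"],
                        "sid": sid,
                        "tid": entity["start"],
                        "tid_end": entity["end"],
                        "entity_type": entity["type"],
                    }
                )
    return tables


tables = flatten_annotations(annotations)

# With one row per token, exploratory summaries become simple aggregations.
pos_counts = Counter(row["upos"] for row in tables["token"])

# Tables can be joined on their shared keys, e.g. recovering entity text.
entity_text = [
    " ".join(
        tok["word"]
        for tok in tables["token"]
        if tok["doc_id"] == ent["doc_id"]
        and tok["sid"] == ent["sid"]
        and ent["tid"] <= tok["tid"] <= ent["tid_end"]
    )
    for ent in tables["entity"]
]
```

Because every row carries its own keys, each table can be filtered, aggregated, or joined independently, which is the property that makes a flat layout convenient for exploratory analysis and model building.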