Open Access. Powered by Scholars. Published by Universities.®

Physical Sciences and Mathematics Commons

Open Access. Powered by Scholars. Published by Universities.®

Brigham Young University

Theses/Dissertations

Computer Sciences

Data extraction

Publication Year

Articles 1 - 6 of 6

Full-Text Articles in Physical Sciences and Mathematics

A Green Form-Based Information Extraction System For Historical Documents, Tae Woo Kim May 2017

A Green Form-Based Information Extraction System For Historical Documents, Tae Woo Kim

Theses and Dissertations

Many historical documents are rich in genealogical facts. Extracting these facts by hand is tedious and almost impossible considering the hundreds of thousands of genealogically rich family-history books currently scanned and online. As one approach for helping to make the extraction feasible, we propose GreenFIE—a "Green" Form-based Information-Extraction tool which is "green" in the sense that it improves with use toward the goal of minimizing the cost of human labor while maintaining high extraction accuracy. Given a page in a historical document, the user's task is to fill out given forms with all facts on a page in a document …


Generating Data-Extraction Ontologies By Example, Yuanqiu Zhou Nov 2005

Generating Data-Extraction Ontologies By Example, Yuanqiu Zhou

Theses and Dissertations

Ontology-based data-extraction is a resilient web data-extraction approach. A major limitation of this approach is that ontology experts must manually develop and maintain data-extraction ontologies. The limitation prevents ordinary users who have little knowledge of conceptual models from making use of this resilient approach. In this thesis we have designed and implemented a general framework, OntoByE, to generate data-extraction ontologies semi-automatically through a small set of examples collected by users. With the assistance of a limited amount of prior knowledge, experimental evidence shows that OntoByE is capable of interacting with users to generate data-extraction ontologies for domains of interest to …


A Framework For Extraction Plans And Heuristics In An Ontology-Based Data-Extraction System, Alan E. Wessman Jan 2005

A Framework For Extraction Plans And Heuristics In An Ontology-Based Data-Extraction System, Alan E. Wessman

Theses and Dissertations

Extraction of information from semi-structured or unstructured documents, such as Web pages, is a useful yet complex task. Research has demonstrated that ontologies may be used to achieve a high degree of accuracy in data extraction while maintaining resiliency in the face of document changes. Ontologies do not, however, diminish the complexity of a data-extraction system. As research in the field progresses, the need for a modular data-extraction system that de-couples the various functional processes involved continues to grow.

In this thesis we propose a framework for such a system. The nature of the framework allows new algorithms and ideas …


Query Rewriting For Extracting Data Behind Html Forms, Xueqi Chen Apr 2004

Query Rewriting For Extracting Data Behind Html Forms, Xueqi Chen

Theses and Dissertations

Much of the information on the Web is stored in specialized searchable databases and can only be accessed by interacting with a form or a series of forms. As a result, enabling automated agents and Web crawlers to interact with form-based interfaces designed primarily for humans is of great value. This thesis describes a system that can fill out Web forms automatically according to a given user query against a global schema for an application domain and, to the extent possible, extract just the relevant data behind these Web forms. Experimental results on two application domains show that the approach …


Schema Matching And Data Extraction Over Html Tables, Cui Tao Sep 2003

Schema Matching And Data Extraction Over Html Tables, Cui Tao

Theses and Dissertations

Data on the Web in HTML tables is mostly structured, but we usually do not know the structure in advance. Thus, we cannot directly query for data of interest. We propose a solution to this problem for the case of mostly structured data in the form of HTML tables, based on document-independent extraction ontologies. The solution entails elements of table location and table understanding, data integration, and wrapper creation. Table location and understanding allows us to locate the table of interest, recognize attributes and values, pair attributes with values, and form records. Data-integration techniques allow us to match source records …


Ontology-Based Extraction Of Rdf Data From The World Wide Web, Timothy Adam Chartrand Mar 2003

Ontology-Based Extraction Of Rdf Data From The World Wide Web, Timothy Adam Chartrand

Theses and Dissertations

The simplicity and proliferation of the World Wide Web (WWW) has taken the availability of information to an unprecedented level. The next generation of the Web, the Semantic Web, seeks to make information more usable by machines by introducing a more rigorous structure based on ontologies. One hinderance to the Semantic Web is the lack of existing semantically marked-up data. Until there is a critical mass of Semantic Web data, few people will develop and use Semantic Web applications. This project helps promote the Semantic Web by providing content. We apply existing information-extraction techniques, in particular, the BYU ontologybased data-extraction …