Computer Sciences | Open Access Articles | Digital Commons Network™

Automatically Extract Information From Web Documents, Dipesh Sharma Dec 2007

Masters Theses & Specialist Projects

The Internet can be viewed as a reservoir of useful information in textual form: product catalogs, airline schedules, stock market quotations, weather forecasts, etc. There has been much interest in building systems that gather such information on a user's behalf. But because these information resources are formatted differently, mechanically extracting their content is difficult. Systems using such resources typically rely on hand-coded wrappers, that is, customized procedures for information extraction. Structured data objects are a very important type of information on the Web. Such data objects are often records from underlying databases that are displayed in Web pages using fixed templates. …

Predicting Coronary Artery Disease With Medical Profile And Gene Polymorphisms Data, Qiongyu Chen, Guoliang Li, Tze-Yun Leong, Chew-Kiat Heng Aug 2007

Research Collection School Of Computing and Information Systems

Coronary artery disease (CAD) is a leading cause of death worldwide. Finding cost-effective methods to predict CAD is a major challenge in public health. In this paper, we investigate the combined effects of genetic polymorphisms and non-genetic factors on predicting the risk of CAD by applying well-known classification methods, such as Bayesian networks, naïve Bayes, support vector machines, k-nearest neighbors, neural networks and decision trees. Our experiments show that all these classifiers are comparable in terms of accuracy, while Bayesian networks have the additional advantage of providing insights into the relationships among the variables. …

Structure Pattern Analysis Using Term Rewriting And Clustering Algorithm, Xuezheng Fu Jun 2007

Computer Science Dissertations

Biological data are accumulating at a fast pace. However, raw data are generally difficult to understand and not useful unless we unlock the information hidden in them. Knowledge can be extracted as the patterns or features buried within the data. Thus data mining, which aims at uncovering underlying rules, relationships, and patterns in data, has emerged as one of the most exciting fields in computational science. In this dissertation, we develop efficient approaches to the structure pattern analysis of RNA and protein three-dimensional structures. The major techniques used in this work include term rewriting and clustering algorithms. Firstly, a …

Multi-Class Classification Averaging Fusion For Detecting Steganography, Benjamin M. Rodriguez, Gilbert L. Peterson, Sos S. Agaian Apr 2007

Faculty Publications

Multiple classifier fusion can increase classification accuracy over individual classifier systems. This paper focuses on the development of a multi-class classification fusion based on weighted averaging of posterior class probabilities. This fusion system is applied to the steganography fingerprint domain, in which the classifier identifies the statistical patterns in an image that distinguish one steganography algorithm from another. Specifically, we focus on algorithms in which JPEG images provide the cover in order to communicate covertly. The embedding methods targeted are F5, JSteg, Model Based, OutGuess, and StegHide. The developed multi-class steganalysis system consists of three levels: (1) …
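The weighted averaging of posterior class probabilities mentioned in this abstract can be illustrated with a minimal sketch. The function names, the example posteriors, and the weights below are hypothetical and chosen only for illustration; the abstract does not say how the weights are derived, so this should not be read as the authors' system. Only the five class labels come from the abstract.

```python
from typing import List, Sequence

# Class labels taken from the abstract; all numbers below are made up.
LABELS = ["F5", "JSteg", "Model Based", "OutGuess", "StegHide"]


def fuse_posteriors(posteriors: Sequence[Sequence[float]],
                    weights: Sequence[float]) -> List[float]:
    """Return the weighted average of per-classifier posterior vectors.

    Weights are normalized to sum to 1, so the fused vector is again a
    probability distribution when every input vector is one.
    """
    if not posteriors or len(posteriors) != len(weights):
        raise ValueError("need one weight per classifier")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    n_classes = len(posteriors[0])
    fused = [0.0] * n_classes
    for probs, w in zip(posteriors, weights):
        if len(probs) != n_classes:
            raise ValueError("all posterior vectors must have the same length")
        for k, p in enumerate(probs):
            fused[k] += (w / total) * p
    return fused


def predict_label(fused: Sequence[float], labels: Sequence[str]) -> str:
    """Pick the label with the highest fused posterior probability."""
    best = max(range(len(fused)), key=lambda k: fused[k])
    return labels[best]


if __name__ == "__main__":
    # Three hypothetical classifiers scoring one image over the five classes.
    p1 = [0.50, 0.20, 0.10, 0.10, 0.10]
    p2 = [0.10, 0.60, 0.10, 0.10, 0.10]
    p3 = [0.15, 0.55, 0.10, 0.10, 0.10]
    fused = fuse_posteriors([p1, p2, p3], weights=[1.0, 2.0, 1.0])
    print(predict_label(fused, LABELS))
```

In this sketch the second classifier is trusted twice as much as the others, so its preferred class dominates the fused decision even though the first classifier disagrees.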

Generalized Component Analysis For Text With Heterogeneous Attributes, Xuerui Wang, Chris Pal, Andrew McCallum Jan 2007

Andrew McCallum

We present a class of richly structured, undirected hidden variable models suitable for simultaneously modeling text along with other attributes encoded in different modalities. Our model generalizes techniques such as Principal Component Analysis to heterogeneous data types. In contrast to other approaches, this framework allows modalities such as words, authors and timestamps to be captured in their natural, probabilistic encodings. We demonstrate the effectiveness of our framework on the task of author prediction from 13 years of the NIPS conference proceedings and for a recipient prediction task using a 10-month academic email archive of a researcher. Our approach should be …

Canonicalization Of Database Records Using Adaptive Similarity Measures, Aron Culotta, Michael Wick, Robert Hall, Matthew Marzilli, Andrew McCallum Jan 2007

Andrew McCallum

It is becoming increasingly common to construct databases from information automatically culled from many heterogeneous sources. For example, a research publication database can be constructed by automatically extracting titles, authors, and conference information from papers and their references. A common difficulty in consolidating data from multiple sources is that records are referenced in a variety of ways (e.g., abbreviations, aliases, and misspellings). It can therefore be difficult to construct a single, standard representation to present to the user. We refer to the task of constructing this representation as canonicalization. Despite its importance, there is very little existing work on canonicalization. …
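As a rough illustration of the canonicalization task described in this abstract, the sketch below picks, from a set of variant strings for one record, the variant that is on average most similar to all the others. This is an assumption made for illustration: it uses a fixed string similarity from Python's standard `difflib` module, not the adaptive (learned) similarity measures named in the title, and the function names are hypothetical.

```python
from difflib import SequenceMatcher
from typing import Sequence


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity in [0, 1] from difflib's SequenceMatcher."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def canonical_record(records: Sequence[str]) -> str:
    """Return the variant with the highest average similarity to the rest.

    Ties are broken in favor of the earliest variant in the input.
    """
    if not records:
        raise ValueError("records must not be empty")
    if len(records) == 1:
        return records[0]

    def average_similarity(i: int) -> float:
        others = [r for j, r in enumerate(records) if j != i]
        return sum(similarity(records[i], r) for r in others) / len(others)

    best = max(range(len(records)), key=average_similarity)
    return records[best]


if __name__ == "__main__":
    # Hypothetical variants of one author name, as might be extracted
    # from different sources.
    variants = ["Andrew McCallum", "A. McCallum", "Andrew Mccallum", "Andrew McCalum"]
    print(canonical_record(variants))
```

Choosing an existing variant keeps the output a string that actually occurred in some source; a system that merges fields across variants would need a different design.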

An Investigation Into The Application Of Data Mining Techniques To Characterize Agricultural Soil Profiles, Rowan J. Maddern Jan 2007

Theses : Honours

Advances in computing and information storage have made vast amounts of data available. The challenge has been to extract knowledge from this raw data; this has led to new methods and techniques, such as data mining, that can bridge the knowledge gap. This research applies these data mining techniques to a soil science database to establish whether meaningful relationships can be found. A data set extracted from the WA Department of Agriculture and Food (DAFWA) soils database has been used to conduct this research. The database contains measurements of soil profile data from …
