Physical Sciences and Mathematics | Open Access Articles

Context-Aware Statistical Debugging: From Bug Predictors To Faulty Control Flow Paths, Lingxiao Jiang, Zhendong Su Nov 2007

Research Collection School Of Computing and Information Systems

Effective bug localization is important for realizing automated debugging. One attractive approach is to apply statistical techniques to a collection of evaluation profiles of program properties to help localize bugs. Previous research has proposed various specialized techniques to isolate certain program predicates as bug predictors. However, because many bugs may not be directly associated with these predicates, these techniques are often ineffective in localizing bugs. Relevant control flow paths that may contain bug locations are more informative than stand-alone predicates for discovering and understanding bugs. In this paper, we propose an approach to automatically generate such faulty control flow paths …

A Data-Dependent Distance Measure For Transductive Instance-Based Learning, Jared Lundell, Dan A. Ventura Oct 2007

Faculty Publications

We consider learning in a transductive setting using instance-based learning (k-NN) and present a method for constructing a data-dependent distance “metric” using both labeled training data and the available unlabeled data (that is, the data to be classified by the model). This new data-driven measure of distance is empirically studied in the context of various instance-based models and is shown to reduce error (compared to traditional models) under certain learning conditions. Generalizations and improvements are suggested.

Predicting Coronary Artery Disease With Medical Profile And Gene Polymorphisms Data, Qiongyu Chen, Guoliang Li, Tze-Yun Leong, Chew-Kiat Heng Aug 2007

Research Collection School Of Computing and Information Systems

Coronary artery disease (CAD) is a leading cause of death worldwide. Finding cost-effective methods to predict CAD is a major challenge in public health. In this paper, we investigate the combined effects of genetic polymorphisms and non-genetic factors on predicting the risk of CAD by applying well-known classification methods, such as Bayesian networks, naïve Bayes, support vector machines, k-nearest neighbor, neural networks and decision trees. Our experiments show that all these classifiers are comparable in terms of accuracy, while Bayesian networks have the additional advantage of being able to provide insights into the relationships among the variables. …

Active Learning For Part-Of-Speech Tagging: Accelerating Corpus Annotation, George Busby, Marc Carmen, James Carroll, Robbie Haertel, Deryle W. Lonsdale, Peter Mcclanahan, Eric K. Ringger, Kevin Seppi Jun 2007

Faculty Publications

In the construction of a part-of-speech annotated corpus, we are constrained by a fixed budget. A fully annotated corpus is required, but we can afford to label only a subset. We train a Maximum Entropy Markov Model tagger from a labeled subset and automatically tag the remainder. This paper addresses the question of where to focus our manual tagging efforts in order to deliver an annotation of highest quality. In this context, we find that active learning is always helpful. We focus on Query by Uncertainty (QBU) and Query by Committee (QBC) and report on experiments with several baselines and …

Learning To Classify E-Mail, Irena Koprinska, Josiah Poon, James Clark, Jason Yuk Hin Chan May 2007

Research Collection School Of Computing and Information Systems

In this paper we study supervised and semi-supervised classification of e-mails. We consider two tasks: filing e-mails into folders and spam e-mail filtering. First, in a supervised learning setting, we investigate the use of random forest for automatic e-mail filing into folders and spam e-mail filtering. We show that random forest is a good choice for these tasks as it runs fast on large and high-dimensional databases, is easy to tune and is highly accurate, outperforming popular algorithms such as decision trees, support vector machines and naive Bayes. We introduce a new accurate feature selector with linear time complexity. …

