Open Access. Powered by Scholars. Published by Universities.®

Computer Sciences Commons

Data mining

Articles 121 - 150 of 167

Full-Text Articles in Computer Sciences

Comprehensive Evaluation Of Association Measures For Fault Localization, Lucia Lucia, David Lo, Lingxiao Jiang, Aditya Budi Sep 2010

Research Collection School Of Computing and Information Systems

In the statistics and data mining communities, many measures have been proposed to gauge the strength of association between two variables of interest, such as odds ratio, confidence, Yule-Y, Yule-Q, Kappa, and Gini index. These association measures have been used in various domains, for example, to evaluate whether a particular medical practice is positively associated with a cure for a disease or whether a particular marketing strategy is positively associated with an increase in revenue. This paper models the problem of locating faults as association between the execution or non-execution of particular program elements and failures. There have been …
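As an illustration of how such measures could be applied to fault localization, the following is a minimal sketch and not the paper's implementation. It assumes a 2x2 contingency table per program element built from test runs; the function name and the cell labels `a`, `b`, `c`, `d` are hypothetical. It uses the standard textbook definitions of odds ratio, confidence, Yule's Q and Yule's Y.

```python
import math

def association_measures(a, b, c, d):
    """Association between executing a program element and test failure.

    Hypothetical cell layout (an assumption of this sketch):
      a = failing runs that executed the element
      b = passing runs that executed the element
      c = failing runs that did not execute the element
      d = passing runs that did not execute the element
    """
    ad, bc = a * d, b * c
    # Odds ratio: odds of failure when executed vs. when not executed.
    odds_ratio = ad / bc if bc else math.inf
    # Confidence of the rule "element executed -> run fails".
    confidence = a / (a + b) if (a + b) else 0.0
    # Yule's Q and Yule's Y both range from -1 to 1.
    yule_q = (ad - bc) / (ad + bc) if (ad + bc) else 0.0
    root_ad, root_bc = math.sqrt(ad), math.sqrt(bc)
    yule_y = (root_ad - root_bc) / (root_ad + root_bc) if (root_ad + root_bc) else 0.0
    return {
        "odds_ratio": odds_ratio,
        "confidence": confidence,
        "yule_q": yule_q,
        "yule_y": yule_y,
    }

# Example: an element executed in 8 of 9 failing runs and 2 of 11 passing runs.
scores = association_measures(a=8, b=2, c=1, d=9)
```

Elements could then be ranked by any one of these scores, with higher values suggesting a stronger association with failure.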


Where In The World? Demographic Patterns In Access Data, Mimi Recker, Beijie Xu, Sherry Hsi, Christine Garrard Jun 2010

Instructional Technology and Learning Sciences Faculty Publications

Standard webmetrics tools record the IP address of users’ computers, thereby providing fodder for analyses of their geographical location, and for understanding the impact of e-learning and teaching. Here we describe how two web-based educational systems were engineered to collect geo-referenced data. This is followed by a description of joining these data with demographic and educational datasets for the United States, and mapping different datasets using geographic information system (GIS) techniques to visually display their relationships. Results from statistical analyses of these relationships that highlight areas of significance are given.


Stevent: Spatio-Temporal Event Model For Social Network Discovery, Hady W. Lauw, Ee Peng Lim, Hwee Hwa Pang, Teck-Tim Tan Jun 2010

Research Collection School Of Computing and Information Systems

Spatio-temporal data concerning the movement of individuals over space and time contains latent information on the associations among these individuals. Sources of spatio-temporal data include usage logs of mobile and Internet technologies. This article defines a spatio-temporal event by the co-occurrences among individuals that indicate potential associations among them. Each spatio-temporal event is assigned a weight based on the precision and uniqueness of the event. By aggregating the weights of events relating two individuals, we can determine the strength of association between them. We conduct extensive experimentation to investigate both the efficacy of the proposed model as well as the …


Partitioning Of Minimotifs Based On Function With Improved Prediction Accuracy, Sanguthevar Rajasekaran, Tian Mi, Jerlin Camilus Merlin, Aaron Oommen, Patrick R. Gradie, Martin R. Schiller Apr 2010

Life Sciences Faculty Research

Background

Minimotifs are short contiguous peptide sequences in proteins that are known to have a function in at least one other protein. One of the principal limitations in minimotif prediction is that false positives limit the usefulness of this approach. As a step toward resolving this problem we have built, implemented, and tested a new data-driven algorithm that reduces false-positive predictions.

Methodology/Principal Findings

Certain domains and minimotifs are known to be strongly associated with a known cellular process or molecular function. Therefore, we hypothesized that by restricting minimotif predictions to those where the minimotif containing protein and target protein have …


Interrogation Of Water Catchment Data Sets Using Data Mining Techniques, Ajdin Sehovic, Leisa Armstrong, Dean Diepeveen Jan 2010

Research outputs pre 2011

Current environmental challenges such as increasing dry land salinity, water logging, eutrophication and high nutrient runoff in south western regions of Western Australia (WA) may have both cultural and environmental implications in the near future. Advances in computing through the application of data mining and geographic information services provide the tools to conduct studies that can indicate possible changes in these water catchment areas of WA. The research examines the existing spatial data mining techniques that can be used to interpret trends in WA water catchment land use. Large GIS data sets of the water catchments in the Peel-Harvey region have …


An Attempt To Find Neighbors, Yong Shi, Ryan Rosenblum Jan 2010

Faculty Articles

In this paper, we present our continuing research on similarity search problems. Previously we proposed PanKNN [18], which is a novel technique that explores the meaning of K nearest neighbors from a new perspective, redefines the distances between data points and a given query point Q, and efficiently and effectively selects data points which are closest to Q. It can be applied in various data mining fields. In this paper, we present our approach to solving the similarity search problem in the presence of obstacles. We apply the concept of obstacle points and process the similarity search problems in a different way. …


Educational Data Mining Approaches For Digital Libraries, Mimi Recker, Sherry Hsi, Beijie Xu, Rob Rothfarb Nov 2009

Instructional Technology and Learning Sciences Faculty Publications

This collaborative research project between the Exploratorium and Utah State's Department of Instructional Technology and Learning Sciences investigates online evaluation approaches and the application of educational data mining to educational digital libraries and services. Much work over the past decades has focused on developing algorithms and methods for discovering patterns in large datasets, known as Knowledge Discovery from Data (KDD). Webmetrics, the application of KDD to web usage mining, is growing rapidly in areas such as e-commerce. Educational Data Mining (EDM) is just beginning to emerge as a tool to analyze the massive, longitudinal user data that are captured in …


Data Mining For Software Engineering, Tao Xie, Suresh Thummalapenta, David Lo, Chao Liu Aug 2009

Research Collection School Of Computing and Information Systems

To improve software productivity and quality, software engineers are increasingly applying data mining algorithms to various software engineering tasks. However, mining SE data poses several challenges. The authors present various algorithms to effectively mine sequences, graphs, and text from such data.


Sentiment Classification Of Reviews Using SentiWordNet, Bruno Ohana, Brendan Tierney Jan 2009

Conference papers

Sentiment classification concerns the use of automatic methods for predicting the orientation of subjective content in text documents, with applications in a number of areas including recommender and advertising systems, customer intelligence and information retrieval. SentiWordNet is an opinion lexicon derived from the WordNet database where each term is associated with numerical scores indicating positive and negative sentiment information. This research presents the results of applying the SentiWordNet lexical resource to the problem of automatic sentiment classification of film reviews. Our approach comprises counting positive and negative term scores to determine sentiment orientation, and an improvement is presented by building …
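The term-counting step described above can be sketched roughly as follows. This is an illustrative sketch only: `TOY_LEXICON` is a hypothetical stand-in with made-up scores, not real SentiWordNet data, and the tokenization and the tie-breaking rule are assumptions of this example rather than details from the paper.

```python
# Hypothetical lexicon: term -> (positive score, negative score).
# The values are invented for illustration and are NOT SentiWordNet scores.
TOY_LEXICON = {
    "good": (0.75, 0.0),
    "great": (0.875, 0.0),
    "bad": (0.0, 0.625),
    "boring": (0.0, 0.5),
    "plot": (0.0, 0.0),
}

def classify_review(text, lexicon=TOY_LEXICON):
    """Sum positive and negative term scores and compare the totals."""
    pos_total = 0.0
    neg_total = 0.0
    for token in text.lower().split():
        word = token.strip(".,!?;:\"'()")  # crude punctuation stripping
        pos, neg = lexicon.get(word, (0.0, 0.0))  # unknown terms score zero
        pos_total += pos
        neg_total += neg
    # Ties are labeled positive here; this is an arbitrary choice.
    label = "positive" if pos_total >= neg_total else "negative"
    return label, pos_total, neg_total

label, pos_total, neg_total = classify_review("A great film with a good plot.")
```

A real system would look terms up in the SentiWordNet resource itself and would need to decide how to handle part of speech and multiple senses per term, which this sketch ignores.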


An Enhanced Data Mining Life Cycle, Markus Hofmann, Brendan Tierney Jan 2009

Conference papers

Data mining projects are complex and can have a high failure rate. A life cycle is vital to improving the project management and success rates of such projects. This paper reports on a research project that was concerned with the life cycle development for data mining projects, its team members and their role. The paper provides a detailed view of the design and development of the data mining life cycle called DMLC. The life cycle aims to support all members of data mining project teams as well as IT managers and academic …


Word Sense Disambiguation In Biomedical Ontologies With Term Co-Occurrence Analysis And Document Clustering, Bill Andreopoulos, Dimitra Alexopoulou, Michael Schroeder Sep 2008

Faculty Publications, Computer Science

With more and more genomes being sequenced, a lot of effort is devoted to their annotation with terms from controlled vocabularies such as the GeneOntology. Manual annotation based on relevant literature is tedious, but automation of this process is difficult. One particularly challenging problem is word sense disambiguation. Terms such as "development" can refer to developmental biology or to the more general sense. Here, we present two approaches to address this problem by using term co-occurrences and document clustering. To evaluate our method we defined a corpus of 331 documents on development and developmental biology. Term co-occurrence analysis achieves an …


Relational Methodology For Data Mining And Knowledge Discovery, Engenii Vityaev, Boris Kovalerchuk Apr 2008

All Faculty Scholarship for the College of the Sciences

Knowledge discovery and data mining methods have been successful in many domains. However, their abilities to build or discover a domain theory remain unclear. This is largely due to the fact that many fundamental KDD&DM methodological questions are still unexplored such as (1) the nature of the information contained in input data relative to the domain theory, and (2) the nature of the knowledge that these methods discover. The goal of this paper is to clarify methodological questions of KDD&DM methods. This is done by using the concept of Relational Data Mining (RDM), representative measurement theory, an ontology of a …


Symbolic Methodology For Numeric Data Mining, Boris Kovalerchuk, Engenii Vityaev Apr 2008

All Faculty Scholarship for the College of the Sciences

Currently statistical and artificial neural network methods dominate in data mining applications. Alternative relational (symbolic) data mining methods have shown their effectiveness in robotics, drug design, and other areas. Neural networks and decision tree methods have serious limitations in capturing relations that may have a variety of forms. Learning systems based on symbolic first-order logic (FOL) representations capture relations naturally. The learned regularities are understandable directly in domain terms that help to build a domain theory. This paper describes relational data mining methodology and develops it further for numeric data such as financial and spatial data. This includes (1) comparing …


Using PLSI-U To Detect Insider Threats By Datamining Email, James S. Okolica, Gilbert L. Peterson, Robert F. Mills Feb 2008

Faculty Publications

Despite a technology bias that focuses on external electronic threats, insiders pose the greatest threat to an organisation. This paper discusses an approach to assist investigators in identifying potential insider threats. We discern employees' interests from e-mail using an extended version of PLSI. These interests are transformed into implicit and explicit social network graphs, which are used to locate potential insiders by identifying individuals who feel alienated from the organisation or have a hidden interest in a sensitive topic. By applying this technique to the Enron e-mail corpus, a small number of employees appear as potential insider threats.


Optrr: Optimizing Randomized Response Schemes For Privacy-Preserving Data Mining, Zhengli Huang, Wenliang Du Jan 2008

Electrical Engineering and Computer Science - All Scholarship

The randomized response (RR) technique is a promising technique to disguise private categorical data in Privacy-Preserving Data Mining (PPDM). Although a number of RR-based methods have been proposed for various data mining computations, no study has systematically compared them to find optimal RR schemes. The difficulty of comparison lies in the fact that to compare two PPDM schemes, one needs to consider two conflicting metrics: privacy and utility. An optimal scheme based on one metric is usually the worst based on the other metric. In this paper, we first describe a method to quantify privacy and utility. We formulate the …
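To make the privacy and utility trade-off concrete, here is a minimal sketch of the classic binary randomized response scheme. It is a generic textbook illustration under stated assumptions, not the optimized schemes the paper proposes: the function names and the retention probability `p` are hypothetical, and the private attribute is assumed to be a single bit.

```python
import random

def randomize(bit, p, rng):
    """Report the true bit with probability p, otherwise report the flipped bit."""
    return bit if rng.random() < p else 1 - bit

def estimate_true_rate(reports, p):
    """Estimate the fraction of true 1s from the disguised reports.

    If pi is the true rate, the expected observed rate is
    p * pi + (1 - p) * (1 - pi), which is inverted here.
    """
    if p == 0.5:
        raise ValueError("p = 0.5 destroys all information; the rate cannot be estimated")
    observed = sum(reports) / len(reports)
    return (observed - (1 - p)) / (2 * p - 1)

# Simulated survey: 30% of respondents truly hold the sensitive attribute.
rng = random.Random(0)
truth = [1] * 30000 + [0] * 70000
reports = [randomize(bit, 0.8, rng) for bit in truth]
estimate = estimate_true_rate(reports, 0.8)
```

Moving `p` toward 0.5 hides each individual answer better but makes the estimate noisier, while moving it toward 1 does the opposite; this is the conflict between the two metrics that the abstract describes.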


Mobile Semantic Computing, Karthik Gomadam, Anupam Joshi, Amit P. Sheth Jan 2008

Kno.e.sis Publications

We propose to organize a special session on research in the intersection of mobile computing, the Semantic Web and Web services.

This session will examine how the research in these areas can serve as a foundation for new architectural and communication paradigms that can enhance service creation, distribution, discovery, integration and utilization in distributed and ubiquitous environments. Some of the initial areas that our early research has highlighted are:

  1. Semantic annotation of data in bandwidth constrained environments such as mobile networks to promote efficient bandwidth utilization
  2. Possibilities of using microformats such as RDFa and opportunities that can be explored …


The Impact Of Directionality In Predications On Text Mining, Gondy Leroy, Marcelo Fiszman, Thomas C. Rindflesch Jan 2008

CGU Faculty Publications and Research

The number of publications in biomedicine is increasing enormously each year. To help researchers digest the information in these documents, text mining tools are being developed that present co-occurrence relations between concepts. Statistical measures are used to mine interesting subsets of relations. We demonstrate how directionality of these relations affects interestingness. Support and confidence, simple data mining statistics, are used as proxies for interestingness metrics. We first built a test bed of 126,404 directional relations extracted from biomedical abstracts, which we represent as graphs containing a central starting concept and 2 rings of associated relations. We manipulated directionality in four …
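As a rough illustration of why direction matters for these statistics, the sketch below computes support and confidence for a relation both with and without regard to direction. The relation tuples, the function name, and the way undirected counts are defined are all assumptions of this example; the abstract does not specify the paper's exact formulation.

```python
def support_confidence(relations, source, target, directed=True):
    """Support and confidence of a relation between two concepts.

    relations: list of (subject, object) pairs extracted from text.
    Directed:   count only source -> target, and normalize confidence by
                relations whose subject is `source`.
    Undirected: count the pair in either order, and normalize confidence by
                relations that mention `source` in either position.
    """
    total = len(relations)
    if directed:
        pair = sum(1 for s, o in relations if (s, o) == (source, target))
        base = sum(1 for s, _ in relations if s == source)
    else:
        pair = sum(1 for s, o in relations if {s, o} == {source, target})
        base = sum(1 for s, o in relations if source in (s, o))
    support = pair / total if total else 0.0
    confidence = pair / base if base else 0.0
    return support, confidence

# Hypothetical toy relations, invented for this example.
toy_relations = [
    ("aspirin", "headache"),
    ("aspirin", "headache"),
    ("headache", "aspirin"),
    ("aspirin", "fever"),
    ("stress", "headache"),
]
directed_scores = support_confidence(toy_relations, "aspirin", "headache", directed=True)
undirected_scores = support_confidence(toy_relations, "aspirin", "headache", directed=False)
```

On this toy data the same concept pair receives different support and confidence depending on whether direction is respected, so a ranking of interesting relations could change as well.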


Automatically Extract Information From Web Documents, Dipesh Sharma Dec 2007

Masters Theses & Specialist Projects

The Internet could be considered to be a reservoir of useful information in textual form: product catalogs, airline schedules, stock market quotations, weather forecasts, etc. There has been much interest in building systems that gather such information on a user's behalf. But because these information resources are formatted differently, mechanically extracting their content is difficult. Systems using such resources typically use hand-coded wrappers, customized procedures for information extraction. Structured data objects are a very important type of information on the Web. Such data objects are often records from underlying databases and displayed in Web pages with some fixed templates. …


Predicting Coronary Artery Disease With Medical Profile And Gene Polymorphisms Data, Qiongyu Chen, Guoliang Li, Tze-Yun Leong, Chew-Kiat Heng Aug 2007

Research Collection School Of Computing and Information Systems

Coronary artery disease (CAD) is a main cause of death in the world. Finding cost-effective methods to predict CAD is a major challenge in public health. In this paper, we investigate the combined effects of genetic polymorphisms and non-genetic factors on predicting the risk of CAD by applying well known classification methods, such as Bayesian networks, naïve Bayes, support vector machine, k-nearest neighbor, neural networks and decision trees. Our experiments show that all these classifiers are comparable in terms of accuracy, while Bayesian networks have the additional advantage of being able to provide insights into the relationships among the variables. …


Multi-Class Classification Averaging Fusion For Detecting Steganography, Benjamin M. Rodriguez, Gilbert L. Peterson, Sos S. Agaian Apr 2007

Faculty Publications

Multiple classifier fusion has the capability of increasing classification accuracy over individual classifier systems. This paper focuses on the development of a multi-class classification fusion based on weighted averaging of posterior class probabilities. This fusion system is applied to the steganography fingerprint domain, in which the classifier identifies the statistical patterns in an image which distinguish one steganography algorithm from another. Specifically we focus on algorithms in which JPEG images provide the cover in order to communicate covertly. The embedding methods targeted are F5, JSteg, Model Based, OutGuess, and StegHide. The developed multi-class steganalysis system consists of three levels: (1) …
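Weighted averaging of posterior class probabilities can be sketched as below. This is a minimal, generic illustration: the classifier outputs and weights are invented, and how the paper actually chooses its weights or structures its three levels is not stated in the abstract, so nothing here should be read as the authors' system. Only the five class names come from the text above.

```python
CLASSES = ["F5", "JSteg", "Model Based", "OutGuess", "StegHide"]

def fuse_posteriors(posteriors, weights):
    """Weighted average of per-classifier posterior distributions.

    posteriors: one probability list per classifier, all over the same classes.
    weights:    one non-negative weight per classifier (assumed given, e.g.
                from validation accuracy; that choice is an assumption here).
    """
    if len(posteriors) != len(weights):
        raise ValueError("need exactly one weight per classifier")
    total_weight = sum(weights)
    n_classes = len(posteriors[0])
    return [
        sum(w * p[k] for w, p in zip(weights, posteriors)) / total_weight
        for k in range(n_classes)
    ]

# Hypothetical outputs of three classifiers for a single image.
example_posteriors = [
    [0.6, 0.1, 0.1, 0.1, 0.1],
    [0.2, 0.5, 0.1, 0.1, 0.1],
    [0.5, 0.2, 0.1, 0.1, 0.1],
]
example_weights = [2.0, 1.0, 1.0]

fused = fuse_posteriors(example_posteriors, example_weights)
predicted = CLASSES[fused.index(max(fused))]
```

Because each input distribution sums to one and the weights are normalized, the fused vector is itself a valid probability distribution over the embedding methods.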


Bias And Controversy: Beyond The Statistical Deviation, Hady W. Lauw, Ee Peng Lim, Ke Wang Aug 2006

Research Collection School Of Computing and Information Systems

In this paper, we investigate how deviation in evaluation activities may reveal bias on the part of reviewers and controversy on the part of evaluated objects. We focus on a 'data-centric approach' where the evaluation data is assumed to represent the 'ground truth'. The standard statistical approaches take evaluation and deviation at face value. We argue that attention should be paid to the subjectivity of evaluation, judging the evaluation score not just on 'what is being said' (deviation), but also on 'who says it' (reviewer) as well as on 'whom it is said about' (object). Furthermore, we observe that bias …


Bi-Level Clustering Of Mixed Categorical And Numerical Biomedical Data, Bill Andreopoulos, Aijun An, Xiaogang Wang Jun 2006

Faculty Publications, Computer Science

Biomedical data sets often have mixed categorical and numerical types, where the former represent semantic information on the objects and the latter represent experimental results. We present the BILCOM algorithm for "Bi-Level Clustering of Mixed categorical and numerical data types". BILCOM performs a pseudo-Bayesian process, where the prior is categorical clustering. BILCOM partitions biomedical data sets of mixed types, such as hepatitis, thyroid disease and yeast gene expression data with Gene Ontology annotations, more accurately than if using one type alone.


Sgpm: Static Group Pattern Mining Using Apriori-Like Sliding Window, John Goh, David Taniar, Ee Peng Lim Apr 2006

Research Collection School Of Computing and Information Systems

Mobile user data mining is a field that focuses on extracting interesting patterns and knowledge from data generated by mobile users. Group pattern mining is a type of mobile user data mining method. In group pattern mining, group patterns from a given user movement database are found based on spatio-temporal distances. In this paper, we propose an improvement in efficiency by using an area method for locating mobile users and a sliding window for static group pattern mining. This reduces the complexity of the valid group pattern mining problem. We support the use of static method, which uses areas and sliding windows instead …


FISA: Feature-Based Instance Selection For Imbalanced Text Classification, Aixin Sun, Ee Peng Lim, Boualem Benatallah, Mahbub Hassan Apr 2006

Research Collection School Of Computing and Information Systems

Support Vector Machine (SVM) classifiers are widely used in text classification tasks and these tasks often involve imbalanced training. In this paper, we specifically address the cases where negative training documents significantly outnumber the positive ones. A generic algorithm known as FISA (Feature-based Instance Selection Algorithm) is proposed to select only a subset of negative training documents for training an SVM classifier. With a smaller, carefully selected training set, an SVM classifier can be more efficiently trained while delivering comparable or better classification accuracy. In our experiments on the 20-Newsgroups dataset, using only 35% negative training examples and 60% learning …


Data Mining Techniques To Study Therapy Success With Autistic Children, Gondy A. Leroy, Annika Irmscher, Marjorie H. Charlop Jan 2006

CGU Faculty Publications and Research

Autism spectrum disorder has become one of the most prevalent developmental disorders, characterized by a wide variety of symptoms. Many children need extensive therapy for years to improve their behavior and facilitate integration in society. However, few systematic evaluations are done on a large scale that can provide insights into when, where, and how therapy has an impact. We describe how data mining techniques can be used to provide insights into behavioral therapy as well as its effect on participants. To this end, we are developing a digital library of coded video segments that contains data on appropriate and inappropriate …


Application Of Information-Theoretic Data Mining Techniques In A National Ambulatory Practice Outcomes Research Network, Adam Wright, Thomas N. Ricciardi, Martin Zwick Oct 2005

Systems Science Faculty Publications and Presentations

The Medical Quality Improvement Consortium data warehouse contains de-identified data on more than 3.6 million patients including their problem lists, test results, procedures and medication lists. This study uses reconstructability analysis, an information-theoretic data mining technique, on the MQIC data warehouse to empirically identify risk factors for various complications of diabetes including myocardial infarction and microalbuminuria. The risk factors identified match those risk factors identified in the literature, demonstrating the utility of the MQIC data warehouse for outcomes research, and RA as a technique for mining clinical data warehouses.


Social Network Discovery By Mining Spatio-Temporal Events, Hady Lauw, Ee Peng Lim, Hwee Hwa Pang, Teck-Tim Tan Jul 2005

Research Collection School Of Computing and Information Systems

Knowing patterns of relationship in a social network is very useful for law enforcement agencies to investigate collaborations among criminals, for businesses to exploit relationships to sell products, or for individuals who wish to network with others. After all, it is not just what you know, but also whom you know, that matters. However, finding out who is related to whom on a large scale is a complex problem. Asking every single individual would be impractical, given the huge number of individuals and the changing dynamics of relationships. Recent advancement in technology has allowed more data about activities of individuals …


On The Optimization Of Visualizations Of Complex Phenomena, Donald H. House, Althea D. Bair, Colin Ware Jan 2005

Center for Coastal and Ocean Mapping

The problem of perceptually optimizing complex visualizations is a difficult one, involving perceptual as well as aesthetic issues. In our experience, controlled experiments are quite limited in their ability to uncover interrelationships among visualization parameters, and thus may not be the most useful way to develop rules-of-thumb or theory to guide the production of high-quality visualizations. In this paper, we propose a new experimental approach to optimizing visualization quality that integrates some of the strong points of controlled experiments with methods more suited to investigating complex highly-coupled phenomena. We use human-in-the-loop experiments to search through visualization parameter space, generating large …


Blocking Reduction Strategies In Hierarchical Text Classification, Ee Peng Lim, Aixin Sun, Wee-Keong Ng, Jaideep Srivastava Oct 2004

Research Collection School Of Computing and Information Systems

One common approach in hierarchical text classification involves associating classifiers with nodes in the category tree and classifying text documents in a top-down manner. Classification methods using this top-down approach can scale well and cope with changes to the category trees. However, all these methods suffer from blocking, which refers to documents that are wrongly rejected by the classifiers at higher levels and so cannot be passed to the classifiers at lower levels. We propose a classifier-centric performance measure known as blocking factor to determine the extent of the blocking. Three methods are proposed to address the blocking problem, namely, threshold reduction, restricted voting, and …


Enhancements To Crisp Possibilistic Reconstructability Analysis, Anas Al-Rabadi, Martin Zwick Aug 2004

Systems Science Faculty Publications and Presentations

Modified Reconstructability Analysis (MRA), a novel decomposition within the framework of set-theoretic (crisp possibilistic) Reconstructability Analysis, is presented. It is shown that in some cases while 3-variable NPN-classified Boolean functions are not decomposable using Conventional Reconstructability Analysis (CRA), they are decomposable using MRA. Also, it is shown that whenever a decomposition of 3-variable NPN-classified Boolean functions exists in both MRA and CRA, MRA yields simpler or equal complexity decompositions. A comparison of the corresponding complexities for Ashenhurst-Curtis (AC) decompositions and MRA is also presented. While both AC and MRA decompose some but …