Open Access. Powered by Scholars. Published by Universities.®

Digital Commons Network

2006

Data mining

Articles 1 - 28 of 28

Full-Text Articles in Entire DC Network

Multivariate Discretization Of Continuous Valued Attributes., Ehab Ahmed El Sayed Ahmed 1978- Dec 2006

Electronic Theses and Dissertations

The area of knowledge discovery and data mining is growing rapidly. Feature discretization is a crucial issue in Knowledge Discovery in Databases (KDD), or data mining, because most data sets used in real-world applications have features with continuous values. Discretization is performed as a preprocessing step of data mining to make data mining techniques useful for these data sets. This thesis addresses the discretization issue by proposing a multivariate discretization (MVD) algorithm. It begins with a number of common discretization algorithms like Equal width discretization, Equal frequency discretization, Naïve; Entropy based discretization, Chi square discretization, and orthogonal hyper planes. After …


Bias And Controversy: Beyond The Statistical Deviation, Hady W. Lauw, Ee Peng Lim, Ke Wang Aug 2006

Research Collection School Of Computing and Information Systems

In this paper, we investigate how deviation in evaluation activities may reveal bias on the part of reviewers and controversy on the part of evaluated objects. We focus on a 'data-centric approach' where the evaluation data is assumed to represent the 'ground truth'. The standard statistical approaches take evaluation and deviation at face value. We argue that attention should be paid to the subjectivity of evaluation, judging the evaluation score not just on 'what is being said' (deviation), but also on 'who says it' (reviewer) as well as on 'whom it is said about' (object). Furthermore, we observe that bias …


Mining Medical Data In A Clinical Environment, Tim V. Ivanovskiy Jul 2006

USF Tampa Graduate Theses and Dissertations

The availability of new treatments for a disease depends on the success of clinical trials. In order for a clinical trial to be successful and approved, medical researchers must first recruit patients with a specific set of conditions in order to test the effectiveness of the proposed treatment. In the past, the accrual process was tedious and time-consuming. Since accruals rely heavily on the ability of physicians and their staff to be familiar with the protocol eligibility criteria, candidates tend to be missed. This can result and has resulted in unsuccessful trials. A recent project at the University of South Florida …


A Framework For Spatio-Temporal Data Analysis And Hypothesis Exploration, Alexander Campbell, Binh Pham, Yu-Chu Tian Jul 2006

International Congress on Environmental Modelling and Software

We present a general framework for pattern discovery and hypothesis exploration in spatio-temporal data sets that is based on delay-embedding. This is a remarkable method of nonlinear time-series analysis that allows the full phase-space behaviour of a system to be reconstructed from only a single observable (accessible variable). Recent extensions to the theory that focus on a probabilistic interpretation extend its scope and allow practical application to noisy, uncertain and high-dimensional systems. Our framework uses these extensions to aid alignment of spatio-temporal sub-models (hypotheses) to empirical data - for example, satellite images plus remote-sensing - and to explore behaviours consistent with this alignment. The novel aspect …


Data Mining Approaches To Explaining Aerosol Formation, Saara Hyvonen, Heikki Junninen, Lauri Laakso, Miikka Dal Maso, Tiia Gronholm, Boris Bonn, Petri Keronen, Pasi Aalto, Veijo Hiltunen, Toivo Pohja, Samuli Launiainen, Pertti Hari, Heikki Mannila, H.C. Hansoon, M. Kulmala Jul 2006

International Congress on Environmental Modelling and Software

Atmospheric aerosol particle formation is frequently observed in various environments. Yet, despite numerous studies, the processes behind these so-called nucleation events remain unclear. In this work we describe the use of data mining techniques to detect factors influencing particle formation. These techniques are applied to a dataset of eight years of 80 variables collected at the boreal forest station (SMEAR II) in Southern Finland, including air pollutant, weather, gas and particle measurements. In a previous study, classification methods were used together with feature selection in order to understand what causes nucleation. Each day was classified as an event day, …


Data Mining And Image Segmentation Approaches For Classifying Defoliation In Aerial Forest Imagery, K. Fukuda, P. A. Pearson Jul 2006

International Congress on Environmental Modelling and Software

Experimental data mining and image segmentation approaches are developed to add insight towards aerial image interpretation for defoliation survey procedures. A decision tree classifier generated with a data mining package, WEKA [Witten and Frank, 2005], based on the contents of a small number of training data points, identified from known classes, is used to predict the extents of regions containing different levels of tree mortality (severe, moderate, light and non attack) and land cover (vegetation and ground surface). This approach is applicable to low quality imagery without traditional image pre-processing (e.g., normalization or noise reduction). To generate the decision tree, …


Data Mining As A Tool For Environmental Scientists, J. M. Spate, Karina Gibert, Miquel Sànchez-Marrè, E. Frank, Joaquim Comas, Ioannis N. Athanasiadis Jul 2006

International Congress on Environmental Modelling and Software

Over recent years a huge library of data mining algorithms has been developed to tackle a variety of problems in fields such as medical imaging and network traffic analysis. Many of these techniques are far more flexible than more classical modelling approaches and could be usefully applied to data-rich environmental problems. Certain techniques such as Artificial Neural Networks, Clustering, Case-Based Reasoning and more recently Bayesian Decision Networks have found application in environmental modelling while other methods, for example classification and association rule extraction, have not yet been taken up on any wide scale. We propose that these and other data …


Bi-Level Clustering Of Mixed Categorical And Numerical Biomedical Data, Bill Andreopoulos, Aijun An, Xiaogang Wang Jun 2006

Faculty Publications, Computer Science

Biomedical data sets often have mixed categorical and numerical types, where the former represent semantic information on the objects and the latter represent experimental results. We present the BILCOM algorithm for "Bi-Level Clustering of Mixed categorical and numerical data types". BILCOM performs a pseudo-Bayesian process, where the prior is categorical clustering. BILCOM partitions biomedical data sets of mixed types, such as hepatitis, thyroid disease and yeast gene expression data with Gene Ontology annotations, more accurately than if using one type alone.


Bi-Level Clustering Of Mixed Categorical And Numerical Biomedical Data, Bill Andreopoulos, Aijun An, Xiaogang Wang Jun 2006

William B. Andreopoulos

Biomedical data sets often have mixed categorical and numerical types, where the former represent semantic information on the objects and the latter represent experimental results. We present the BILCOM algorithm for "Bi-Level Clustering of Mixed categorical and numerical data types". BILCOM performs a pseudo-Bayesian process, where the prior is categorical clustering. BILCOM partitions biomedical data sets of mixed types, such as hepatitis, thyroid disease and yeast gene expression data with Gene Ontology annotations, more accurately than if using one type alone.


Enhancing Web Marketing By Using Ontology, Xuan Zhou May 2006

Dissertations

The existence of the Web has a major impact on people's lifestyles. Online shopping, online banking, email, instant messenger services, search engines and bulletin boards have gradually become parts of our daily life. All kinds of information can be found on the Web. Web marketing is one of the ways to make use of online information. By extracting demographic information and interest information from the Web, marketing knowledge can be augmented by applying data mining algorithms. This knowledge, which connects customers to products, can therefore be used for marketing purposes and for targeting existing and potential customers. The Web …


Temporal Data Mining In A Dynamic Feature Space, Brent K. Wenerstrom May 2006

Theses and Dissertations

Many interesting real-world applications for temporal data mining are hindered by concept drift. One particular form of concept drift is characterized by changes to the underlying feature space. Seemingly little has been done to address this issue. This thesis presents FAE, an incremental ensemble approach to mining data subject to concept drift. FAE achieves better accuracies over four large datasets when compared with a similar incremental learning algorithm.


Text Mining Comorbidity Codes In The Analysis Of Cardiopulmonary Rehabilitation Data., Jennifer Ferrell 1982- May 2006

Electronic Theses and Dissertations

The purpose of this paper is to examine the process of text mining and using the results to show the possible benefits of cardiopulmonary rehabilitation. The 555 patients enrolled in the study were receiving inpatient cardiopulmonary rehabilitation. Each patient had comorbidity codes associated with them. These codes are secondary diagnoses to the cardiac or pulmonary event that resulted in their hospitalization. The patients had secondary conditions ranging in number from 1 to 10. The patients were assessed at admission and discharge for functional independence. Since there are numerous comorbidity codes for each patient, it would be difficult to analyze each …


SGPM: Static Group Pattern Mining Using Apriori-Like Sliding Window, John Goh, David Taniar, Ee Peng Lim Apr 2006

Research Collection School Of Computing and Information Systems

Mobile user data mining is a field that focuses on extracting interesting patterns and knowledge from data generated by mobile users. Group pattern mining is a type of mobile user data mining method. In group pattern mining, group patterns are found from a given user movement database based on spatio-temporal distances. In this paper, we propose an improvement in efficiency by using an area method for locating mobile users and a sliding window for static group pattern mining. This reduces the complexity of the valid group pattern mining problem. We support the use of the static method, which uses areas and sliding windows instead …


FISA: Feature-Based Instance Selection For Imbalanced Text Classification, Aixin Sun, Ee Peng Lim, Boualem Benatallah, Mahbub Hassan Apr 2006

Research Collection School Of Computing and Information Systems

Support Vector Machine (SVM) classifiers are widely used in text classification tasks, and these tasks often involve imbalanced training. In this paper, we specifically address the cases where negative training documents significantly outnumber the positive ones. A generic algorithm known as FISA (Feature-based Instance Selection Algorithm) is proposed to select only a subset of negative training documents for training an SVM classifier. With a smaller, carefully selected training set, an SVM classifier can be trained more efficiently while delivering comparable or better classification accuracy. In our experiments on the 20-Newsgroups dataset, using only 35% negative training examples and 60% learning …


Detecting Potential Insider Threats Through Email Datamining, James S. Okolica Mar 2006

Theses and Dissertations

No abstract provided.


Text Mining With Exploitation Of User's Background Knowledge : Discovering Novel Association Rules From Text, Xin Chen Jan 2006

Dissertations

The goal of text mining is to find interesting and non-trivial patterns or knowledge from unstructured documents. Both objective and subjective measures have been proposed in the literature to evaluate the interestingness of discovered patterns. However, objective measures alone are insufficient because such measures do not consider knowledge and interests of the users. Subjective measures require explicit input of user expectations which is difficult or even impossible to obtain in text mining environments.

This study proposes a user-oriented text-mining framework and applies it to the problem of discovering novel association rules from documents. The developed system, uMining, consists of two …


Topics Over Time: A Non-Markov Continuous-Time Model Of Topical Trends, Xuerui Wang, Andrew McCallum Jan 2006

Andrew McCallum

This paper presents an LDA-style topic model that captures not only the low-dimensional structure of data, but also how the structure changes over time. Unlike other recent work that relies on Markov assumptions or discretization of time, here each topic is associated with a continuous distribution over timestamps, and for each generated document, the mixture distribution over topics is influenced by both word co-occurrences and the document's timestamp. Thus, the meaning of a particular topic can be relied upon as constant, but the topics' occurrence and correlations change significantly over time. We present results on nine months of personal email, …


Lost In Translation? Data Mining, National Security And The Adverse Inference Problem, Anita Ramasastry Jan 2006

Articles

To the extent that we permit data mining programs to proceed, they must provide adequate due process and redress mechanisms that permit individuals to clear their names. A crucial criterion for such a mechanism is to allow access to information that was used to make adverse assessments so that errors may be corrected. While some information may have to be kept secret for national security purposes, a degree of transparency is needed when individuals are trying to protect their right to travel or access government services free from suspicion.

Part II of this essay briefly outlines the government's ability to …


A Feeling Of Unease About Privacy Law, Ann Bartow Jan 2006

Law Faculty Scholarship

This essay responds to Daniel Solove's recent article, A Taxonomy of Privacy. I have read many of Daniel Solove's privacy-related writings, and he has made many important scholarly contributions to the field. As with his previous works about privacy and the law, it is an interesting and substantive piece of work. Where it falls short, in my estimation, is in failing to label and categorize the very real harms of privacy invasions in an adequately compelling manner. Most commentators agree that compromising a person's privacy will chill certain behaviors and change others, but a powerful list of the reasons why …


Data Mining And Substandard Medical Practice: The Difference Between Privacy, Secrets And Hidden Defects, Barry R. Furrow Jan 2006

Villanova Law Review

No abstract provided.


Data Mining Techniques To Study Therapy Success With Autistic Children, Gondy A. Leroy, Annika Irmscher, Marjorie H. Charlop Jan 2006

CGU Faculty Publications and Research

Autism spectrum disorder has become one of the most prevalent developmental disorders, characterized by a wide variety of symptoms. Many children need extensive therapy for years to improve their behavior and facilitate integration in society. However, few systematic evaluations are done on a large scale that can provide insights into how, where, and how therapy has an impact. We describe how data mining techniques can be used to provide insights into behavioral therapy as well as its effect on participants. To this end, we are developing a digital library of coded video segments that contains data on appropriate and inappropriate …


Comparison Of Data Mining And Statistical Techniques For Classification Model, Rochana Lahiri Jan 2006

LSU Master's Theses

The purpose of this study is to observe the performance of three statistical and data mining classification models, viz., logistic regression, decision tree and neural network models, for different sample sizes and sampling methods on three sets of data. It is a 3 by 2 by 3 by 8 study where each statistical or data mining method has been employed to build a model for each of 8 different sample sizes and two different sampling methods. The effect of sample size on the overall performance of each model against two sets of test data is observed and compared. It is …


An AHP Framework For Balancing Efficiency And Equity In The United States Liver Transplantation System, Vijayachandran M. Veerachandran Jan 2006

USF Tampa Graduate Theses and Dissertations

ABSTRACT: Liver transplantation and allocation has been a controversial issue in the United States for decades. One of the main concerns in the allocation system is the trade-off between the two main objectives, efficiency and equity. Unfortunately, it is difficult to reach consensus among transplantation policy makers, administrators, transplant surgeons and transplant candidates on how to develop allocation policies that aim at balancing efficiency and equity. Our research identifies and classifies the outcomes of liver allocation into two major categories, efficiency and equity, that are often conflicting. Previous researchers did not consider how to balance outcomes in these two categories. …


Socioeconomic Characteristics Of Cancer Mortality In The United States Of America: A Spatial Data Mining Approach, Srinivas Kumar Vinnakota Jan 2006

LSU Doctoral Dissertations

Cancer is the second leading cause of death in the United States of America. Though it is generally known that cancer is influenced by environment, its relation to socioeconomic conditions is still widely debated. This research analyzed the spatial distribution of breast, colorectal, lung, and prostate cancer mortalities and their associated socioeconomic characteristics using an association rule mining technique. The mortality patterns were analyzed at the county and health service area levels, corresponding to the years 1999–2002 and 1988–1992, respectively. Distinct socioeconomic characteristics of cancer mortality were revealed by the association rule mining technique. …