Open Access. Powered by Scholars. Published by Universities.®

Digital Commons Network

2010

Data mining

Articles 1 - 30 of 34

Full-Text Articles in Entire DC Network

Reconstructability Analysis Of Epistasis, Martin Zwick Dec 2010

Reconstructability Analysis Of Epistasis, Martin Zwick

Systems Science Faculty Publications and Presentations

The literature on epistasis describes various methods to detect epistatic interactions and to classify different types of epistasis. Reconstructability analysis (RA) has recently been used to detect epistasis in genomic data. This paper shows that RA offers a classification of types of epistasis at three levels of resolution (variable-based models without loops, variable-based models with loops, state-based models). These types can be defined by the simplest RA structures that model the data without information loss; a more detailed classification can be defined by the information content of multiple candidate structures. The RA classification can be augmented with structures from related …


Evolutionary Strategies For Data Mining, Rose Lowe Dec 2010

Evolutionary Strategies For Data Mining, Rose Lowe

All Dissertations

Learning classifier systems (LCS) have been successful in generating rules for solving classification problems in data mining. The rules are of the form IF condition THEN action. The condition encodes the features of the input space and the action encodes the class label. What is lacking in those systems is the ability to express each feature using a function that is appropriate for that feature. The genetic algorithm is capable of doing this but cannot because only one type of membership function is provided. Thus, the genetic algorithm learns only the shape and placement of the membership function, and in …


A Generic Framework For Context-Dependent Fusion With Application To Landmine Detection., Ahmed Chamseddine Ben Abdallah Dec 2010

A Generic Framework For Context-Dependent Fusion With Application To Landmine Detection., Ahmed Chamseddine Ben Abdallah

Electronic Theses and Dissertations

For complex detection and classification problems involving data with large intra-class variations and noisy inputs, no single source of information can provide a satisfactory solution. As a result, the combination of multiple classifiers is playing an increasing role in solving these complex pattern recognition problems, and has proven to be a viable alternative to using a single classifier. Over the past few years, a variety of schemes have been proposed for combining multiple classifiers. Most of these are global: they assign each classifier a degree of worthiness that is averaged over the entire training data. This may not be …


Data Mining And Analysis Of Lung Cancer Data., Guoxin Tang Dec 2010

Data Mining And Analysis Of Lung Cancer Data., Guoxin Tang

Electronic Theses and Dissertations

Lung cancer is the leading cause of cancer death in the United States and the world, with more than 1.3 million deaths worldwide per year. However, because of a lack of effective tools to diagnose lung cancer, more than half of all cases are diagnosed at an advanced stage, when surgical resection is unlikely to be feasible. The main purpose of this study is to examine the relationship between patient outcomes and the conditions of patients undergoing different treatments for lung cancer, and to develop models to predict lung cancer mortality. This study will identify the demographic, finance, …


Extreme Data Mining: Inference From Small Datasets, Răzvan Andonie Sep 2010

Extreme Data Mining: Inference From Small Datasets, Răzvan Andonie

All Faculty Scholarship for the College of the Sciences

Neural networks have been applied successfully in many fields. However, satisfactory results can only be obtained under large-sample conditions. With small training sets, performance may be poor, or the learning task may not be accomplished at all. This deficiency severely limits the applications of neural networks. The main reason small datasets cannot provide enough information is that gaps exist between samples, and even the domain of the samples cannot be ensured. Several computational intelligence techniques have been proposed to overcome the limits of learning from small datasets.

We have the following goals: i. To …


Comprehensive Evaluation Of Association Measures For Fault Localization, Lucia Lucia, David Lo, Lingxiao Jiang, Aditya Budi Sep 2010

Comprehensive Evaluation Of Association Measures For Fault Localization, Lucia Lucia, David Lo, Lingxiao Jiang, Aditya Budi

Research Collection School Of Computing and Information Systems

In statistics and data mining communities, there have been many measures proposed to gauge the strength of association between two variables of interest, such as odds ratio, confidence, Yule-Y, Yule-Q, Kappa, and gini index. These association measures have been used in various domains, for example, to evaluate whether a particular medical practice is associated positively to a cure of a disease or whether a particular marketing strategy is associated positively to an increase in revenue, etc. This paper models the problem of locating faults as association between the execution or non-execution of particular program elements with failures. There have been …
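As a rough illustration of three of the measures named in this abstract, the sketch below computes the odds ratio, Yule's Q, and Yule's Y from a 2x2 contingency table using their standard textbook definitions. The function name, the cell layout (executed/not executed against failed/passed), and the example counts are assumptions made for illustration; they are not taken from the paper.

```python
import math


def association_measures(a, b, c, d):
    """Association between executing a program element and test failure.

    Hypothetical 2x2 layout (an assumption, not the paper's notation):
        a: element executed,     run failed
        b: element executed,     run passed
        c: element not executed, run failed
        d: element not executed, run passed
    """
    ad, bc = a * d, b * c
    if ad + bc == 0:
        raise ValueError("measures are undefined when a*d and b*c are both zero")
    # Odds ratio: unbounded above, infinite when b*c == 0.
    odds_ratio = math.inf if bc == 0 else ad / bc
    # Yule's Q and Y rescale the odds ratio into the range [-1, 1].
    yule_q = (ad - bc) / (ad + bc)
    yule_y = (math.sqrt(ad) - math.sqrt(bc)) / (math.sqrt(ad) + math.sqrt(bc))
    return {"odds_ratio": odds_ratio, "yule_q": yule_q, "yule_y": yule_y}


# Made-up counts: the element ran in 8 of 9 failing runs and 2 of 11 passing runs.
scores = association_measures(a=8, b=2, c=1, d=9)
```

A higher score would suggest that executing the element is more strongly associated with failure, which is the sense in which such measures could be used to rank candidate fault locations.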


Event-Driven Similarity And Classification Of Scanpaths, Thomas Grindinger Aug 2010

Event-Driven Similarity And Classification Of Scanpaths, Thomas Grindinger

All Dissertations

Eye tracking experiments often involve recording the pattern of deployment of visual attention over the stimulus as viewers perform a given task (e.g., visual search). It is useful in training applications, for example, to make available an expert's sequence of eye movements, or scanpath, to novices for their inspection and subsequent learning. It may also be potentially useful to be able to assess the conformance of the novice's scanpath to that of the expert. A computational tool is proposed that provides a framework for performing such classification, based on the use of a probabilistic machine learning algorithm. The approach was …


Can A Computer Read A Doctor's Mind? Whether Using Data Mining As Proof In Healthcare Fraud Cases Is Consistent With The Law Of Evidence, Colin Caffrey Jul 2010

Can A Computer Read A Doctor's Mind? Whether Using Data Mining As Proof In Healthcare Fraud Cases Is Consistent With The Law Of Evidence, Colin Caffrey

Northern Illinois University Law Review

Healthcare fraud is a growing problem in the United States. Data mining is increasingly being used to combat it. After briefly explaining data mining, this article analyzes whether evidence obtained by data mining is admissible in court under the laws of evidence. It then examines the issue under both the Federal Rules of Evidence and the common law. This article focuses on three key questions: (1) Is the use of prior acts by practitioners proper under the law of evidence? (2) Is testimony based on data mining proper expert testimony? and (3) Does the methodology of data mining satisfy …


Rule Based, Automated Control And Compliance Systems: The Strategic Alignment Between Accounting And Data Mining, Catherine Dwyer, Susanner O'Callaghan Jul 2010

Rule Based, Automated Control And Compliance Systems: The Strategic Alignment Between Accounting And Data Mining, Catherine Dwyer, Susanner O'Callaghan

Cornerstone 3 Reports : Interdisciplinary Informatics

No abstract provided.


Predicting Polycyclic Aromatic Hydrocarbon Concentrations In Soil And Water Samples, Geoffrey Holmes, D. Fletcher, P. Reutemann Jul 2010

Predicting Polycyclic Aromatic Hydrocarbon Concentrations In Soil And Water Samples, Geoffrey Holmes, D. Fletcher, P. Reutemann

International Congress on Environmental Modelling and Software

Polycyclic Aromatic Hydrocarbons (PAHs) are compounds found in the environment that can be harmful to humans. They are typically formed due to incomplete combustion and as such remain after burning coal, oil, petrol, diesel, wood, household waste and so forth. Testing laboratories routinely screen soil and water samples taken from potentially contaminated sites for PAHs using Gas Chromatography Mass Spectrometry (GC-MS). A GC-MS device produces a chromatogram which is processed by an analyst to determine the concentrations of PAH compounds of interest. In this paper we investigate the application of data mining techniques to PAH chromatograms in order to provide …


Choosing The Right Data Mining Technique: Classification Of Methods And Intelligent Recommendation, Karina Gibert, Miquel Sànchez-Marrè, Víctor Codina Jul 2010

Choosing The Right Data Mining Technique: Classification Of Methods And Intelligent Recommendation, Karina Gibert, Miquel Sànchez-Marrè, Víctor Codina

International Congress on Environmental Modelling and Software

One of the most difficult tasks in the whole KDD process is choosing the right data mining technique, as commercial software tools offer more and more possibilities and the decision requires more and more methodological expertise. Indeed, many data mining techniques are available to an environmental scientist wishing to discover some model from her/his data. This diversity can cause trouble for the scientist, who often does not have a clear idea of what methods are available and, moreover, tends to have doubts about the most suitable method to …


Choosing The Right Data Mining Technique: Classification Of Methods And Intelligent Recommendation, Karina Gibert, Miquel Sànchez-Marrè, Víctor Codina Jul 2010

Choosing The Right Data Mining Technique: Classification Of Methods And Intelligent Recommendation, Karina Gibert, Miquel Sànchez-Marrè, Víctor Codina

International Congress on Environmental Modelling and Software

One of the most difficult tasks in the whole KDD process is choosing the right data mining technique, as commercial software tools offer more and more possibilities and the decision requires more and more methodological expertise. Indeed, many data mining techniques are available to an environmental scientist wishing to discover some model from her/his data. This diversity can cause trouble for scientists, who often do not have a clear idea of what methods are available and, moreover, frequently have doubts about the most suitable method to be …


The Tasks Of Pre And Post-Processing In Data Mining Applied To A Real World Problem, José Luis Díaz, M. Herrera, Joaquín Izquierdo, Rafael Pérez-García Jul 2010

The Tasks Of Pre And Post-Processing In Data Mining Applied To A Real World Problem, José Luis Díaz, M. Herrera, Joaquín Izquierdo, Rafael Pérez-García

International Congress on Environmental Modelling and Software

Pre and post-processing are crucial tasks in Knowledge Discovery in Databases (KDD). In this contribution we present an application to a data set from a real water supply network (WSN) in the town of Calarcá (Colombia), located in the so-called "Eje Cafetero" coffee region. We use traditional and well-known techniques of pre and post-processing with the aim of showing their importance in Data Mining (DM), and of emphasizing the need for interpretable results when dealing with real data sets. Pre and post-processing tools, as well as other DM tasks implemented in Clementine 9.0 (SPSS), have been used. Clementine 9.0 has …


Where In The World? Demographic Patterns In Access Data, Mimi Recker, Beijie Xu, Sherry Hsi, Christine Garrard Jun 2010

Where In The World? Demographic Patterns In Access Data, Mimi Recker, Beijie Xu, Sherry Hsi, Christine Garrard

Instructional Technology and Learning Sciences Faculty Publications

Standard webmetrics tools record the IP address of users’ computers, thereby providing fodder for analyses of their geographical location, and for understanding the impact of e-learning and teaching. Here we describe how two web-based educational systems were engineered to collect geo-referenced data. This is followed by a description of joining these data with demographic and educational datasets for the United States, and mapping different datasets using geographic information system (GIS) techniques to visually display their relationships. Results from statistical analyses of these relationships that highlight areas of significance are given.


An Iterative Feature Perturbation Method For Gene Selection From Microarray Data, Juana Canul Reich Jun 2010

An Iterative Feature Perturbation Method For Gene Selection From Microarray Data, Juana Canul Reich

USF Tampa Graduate Theses and Dissertations

Gene expression microarray datasets often consist of a limited number of samples relative to a large number of expression measurements, usually on the order of thousands of genes. These characteristics pose a challenge to any classification model as they might negatively impact its prediction accuracy. Therefore, dimensionality reduction is a core process prior to any classification task.

This dissertation introduces the iterative feature perturbation method (IFP), an embedded gene selector that iteratively discards non-relevant features. IFP considers relevant features as those which after perturbation with noise cause a change in the predictive accuracy of the classification model. Non-relevant features do …
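The perturbation idea described above can be sketched in a few lines. The code below is only a single screening pass on synthetic data, not the iterative embedded selector from the dissertation: it adds Gaussian noise to one feature at a time and records how much classification accuracy changes. The nearest-centroid classifier, the helper names, the noise level, and the toy data are all assumptions chosen to keep the example self-contained.

```python
import random


def fit_centroids(X, y):
    """Stand-in classifier (an assumption): one mean vector per class."""
    centroids = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids


def predict(centroids, x):
    """Return the label of the centroid closest to x (squared Euclidean distance)."""
    return min(
        centroids,
        key=lambda label: sum((xi - ci) ** 2 for xi, ci in zip(x, centroids[label])),
    )


def accuracy(centroids, X, y):
    return sum(predict(centroids, x) == t for x, t in zip(X, y)) / len(y)


def perturbation_sensitivity(centroids, X, y, sigma, rng):
    """Accuracy drop observed when each feature in turn is perturbed with noise."""
    base = accuracy(centroids, X, y)
    drops = []
    for j in range(len(X[0])):
        noisy = [x[:j] + [x[j] + rng.gauss(0.0, sigma)] + x[j + 1:] for x in X]
        drops.append(base - accuracy(centroids, noisy, y))
    return drops


# Synthetic data: feature 0 separates the classes, feature 1 carries no class signal.
X = [[0.0 if i % 2 == 0 else 10.0, float((i // 2) % 2)] for i in range(100)]
y = [i % 2 for i in range(100)]

model = fit_centroids(X, y)
sensitivity = perturbation_sensitivity(model, X, y, sigma=10.0, rng=random.Random(0))
```

In this toy setting, perturbing feature 0 lowers accuracy while perturbing feature 1 leaves it unchanged, so feature 1 would be the candidate to discard. An iterative selector would presumably repeat such a pass after each removal.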


Stevent: Spatio-Temporal Event Model For Social Network Discovery, Hady W. Lauw, Ee Peng Lim, Hwee Hwa Pang, Teck-Tim Tan Jun 2010

Stevent: Spatio-Temporal Event Model For Social Network Discovery, Hady W. Lauw, Ee Peng Lim, Hwee Hwa Pang, Teck-Tim Tan

Research Collection School Of Computing and Information Systems

Spatio-temporal data concerning the movement of individuals over space and time contains latent information on the associations among these individuals. Sources of spatio-temporal data include usage logs of mobile and Internet technologies. This article defines a spatio-temporal event by the co-occurrences among individuals that indicate potential associations among them. Each spatio-temporal event is assigned a weight based on the precision and uniqueness of the event. By aggregating the weights of events relating two individuals, we can determine the strength of association between them. We conduct extensive experimentation to investigate both the efficacy of the proposed model as well as the …


Knowledge Discovery And Analysis In Manufacturing, Mark Polczynski, Andzrej Kochanski Jun 2010

Knowledge Discovery And Analysis In Manufacturing, Mark Polczynski, Andzrej Kochanski

Electrical and Computer Engineering Faculty Research and Publications

The quality and reliability requirements for next-generation manufacturing are reviewed, and current approaches are cited. The potential for augmenting current quality/reliability technology is described, and characteristics of potential future directions are postulated. Methods based on knowledge discovery and analysis in manufacturing (KDAM) are reviewed, and related successful applications in business and social fields are discussed. Typical KDAM applications are noted, along with general functions and specific KDAM-related technologies. A systematic knowledge discovery process model is reviewed, and examples of current work are given, including description of successful applications of KDAM to creation of rules for optimizing gas porosity in sand …


Mimosa: A System For Minimotif Annotation, Jay Vyas, Ronald J. Nowling, Thomas Meusburger, David P. Sargeant, Krishna Kadaveru, Michael R. Gryk, Vamsi Kundeti, Sanguthevar Rajasekaran, Martin Schiller May 2010

Mimosa: A System For Minimotif Annotation, Jay Vyas, Ronald J. Nowling, Thomas Meusburger, David P. Sargeant, Krishna Kadaveru, Michael R. Gryk, Vamsi Kundeti, Sanguthevar Rajasekaran, Martin Schiller

Life Sciences Faculty Research

BACKGROUND:

Minimotifs are short peptide sequences within one protein, which are recognized by other proteins or molecules. While there are now several minimotif databases, they are incomplete. There are reports of many minimotifs in the primary literature, which have yet to be annotated, while entirely novel minimotifs continue to be published on a weekly basis. Our recently proposed function and sequence syntax for minimotifs enables us to build a general tool that will facilitate structured annotation and management of minimotif data from the biomedical literature.

RESULTS:

We have built the MimoSA application for minimotif annotation. The application supports management of …


Enterprise Users And Web Search Behavior, April Ann Lewis May 2010

Enterprise Users And Web Search Behavior, April Ann Lewis

Masters Theses

This thesis describes an analysis of user web query behavior associated with Oak Ridge National Laboratory’s (ORNL) Enterprise Search System (hereafter, ORNL Intranet). The ORNL Intranet provides users a means to search all kinds of data stores for relevant business and research information using a single query. The Global Intranet Trends for 2010 Report suggests the biggest current obstacle for corporate intranets is “findability and Siloed content”. Intranets differ from internets in the way they create, control, and share content, which can often make it difficult and sometimes impossible for users to find information. Stenmark (2006) first noted studies of corporate …


Dynamic Application Level Security Sensors, Christopher Thomas Rathgeb May 2010

Dynamic Application Level Security Sensors, Christopher Thomas Rathgeb

Masters Theses

The battle for cyber supremacy is a cat and mouse game: evolving threats from internal and external sources make it difficult to protect critical systems. With the diverse and high risk nature of these threats, there is a need for robust techniques that can quickly adapt and address this evolution. Existing tools such as Splunk, Snort, and Bro help IT administrators defend their networks by actively parsing through network traffic or system log data. These tools have been thoroughly developed and have proven to be a formidable defense against many cyberattacks. However, they are vulnerable to zero-day attacks, slow attacks, …


Partitioning Of Minimotifs Based On Function With Improved Prediction Accuracy, Sanguthevar Rajasekaran, Tian Mi, Jerlin Camilus Merlin, Aaron Oommen, Patrick R. Gradie, Martin R. Schiller Apr 2010

Partitioning Of Minimotifs Based On Function With Improved Prediction Accuracy, Sanguthevar Rajasekaran, Tian Mi, Jerlin Camilus Merlin, Aaron Oommen, Patrick R. Gradie, Martin R. Schiller

Life Sciences Faculty Research

Background

Minimotifs are short contiguous peptide sequences in proteins that are known to have a function in at least one other protein. One of the principal limitations in minimotif prediction is that false positives limit the usefulness of this approach. As a step toward resolving this problem we have built, implemented, and tested a new data-driven algorithm that reduces false-positive predictions.

Methodology/Principal Findings

Certain domains and minimotifs are known to be strongly associated with a known cellular process or molecular function. Therefore, we hypothesized that by restricting minimotif predictions to those where the minimotif containing protein and target protein have …


Leashing The Internet Watchdog: Legislative Restraints On Electronic Surveillance In The U.S. And U.K., John P. Heekin Apr 2010

Leashing The Internet Watchdog: Legislative Restraints On Electronic Surveillance In The U.S. And U.K., John P. Heekin

John P. Heekin

This article examines the legislative approaches undertaken by the United States and the United Kingdom to regulate the surveillance and interception of electronic communications. Drawing from the recognition of individual privacy in each country, the author explores the development and impact of statutory provisions enacted to accomplish effective oversight of the respective intelligence services. In the U.S., the shifting purposes and provisions of the Foreign Intelligence Surveillance Act of 1978 are tracked from implementation to its revisions following the terrorist attacks of September 11, 2001. Along that timeline, a distinct trend toward greater deference to Executive authority for electronic surveillance …


Enhancement Of Churn Prediction Algorithms, Matthew N. Anyanwu Mar 2010

Enhancement Of Churn Prediction Algorithms, Matthew N. Anyanwu

Electronic Theses and Dissertations

Customer churn can be described as the process by which consumers of goods and services discontinue the consumption of a product or service and switch over to a competitor. It is of great concern to many companies. Thus, decision support systems are needed to overcome this pressing issue and ensure good return on investments for organizations. Decision support systems use analytical models to provide the needed intelligence to analyze an integrated customer record database to predict customers that will churn and offer recommendations that will prevent them from churning. Customer churn prediction, unlike most conventional business intelligence techniques, deals with customer …


Reconstructability Analysis As A Tool For Identifying Gene-Gene Interactions In Studies Of Human Diseases, Stephen Shervais, Patricia L. Kramer, Shawn K. Westaway, Nancy J. Cox, Martin Zwick Mar 2010

Reconstructability Analysis As A Tool For Identifying Gene-Gene Interactions In Studies Of Human Diseases, Stephen Shervais, Patricia L. Kramer, Shawn K. Westaway, Nancy J. Cox, Martin Zwick

Systems Science Faculty Publications and Presentations

There are a number of common human diseases for which the genetic component may include an epistatic interaction of multiple genes. Detecting these interactions with standard statistical tools is difficult because there may be an interaction effect, but minimal or no main effect. Reconstructability analysis (RA) uses Shannon’s information theory to detect relationships between variables in categorical datasets. We applied RA to simulated data for five different models of gene-gene interaction, and find that even with heritability levels as low as 0.008, and with the inclusion of 50 non-associated genes in the dataset, we can identify the interacting gene pairs …
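The point about an interaction effect with no main effect can be illustrated with Shannon mutual information, the kind of information-theoretic quantity that RA builds on. The sketch below is not RA itself and does not reproduce any of the five models in the paper: it assumes a hypothetical, noise-free XOR relationship between two binary loci and a binary phenotype, and every name in it is illustrative.

```python
from collections import Counter
from itertools import product
from math import log2


def entropy(samples):
    """Shannon entropy in bits of the empirical distribution of `samples`."""
    n = len(samples)
    return -sum((c / n) * log2(c / n) for c in Counter(samples).values())


def mutual_information(xs, ys):
    """I(X;Y) = H(X) + H(Y) - H(X,Y), estimated from paired samples."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))


# Hypothetical data: phenotype = locus_a XOR locus_b, with all four
# genotype combinations equally frequent (25 copies of each).
rows = [(a, b, a ^ b) for a, b in product((0, 1), repeat=2)] * 25
locus_a = [r[0] for r in rows]
locus_b = [r[1] for r in rows]
phenotype = [r[2] for r in rows]

main_effect_a = mutual_information(locus_a, phenotype)
main_effect_b = mutual_information(locus_b, phenotype)
joint_effect = mutual_information(list(zip(locus_a, locus_b)), phenotype)
```

Under these assumptions each locus alone carries no information about the phenotype, while the pair taken jointly determines it completely, which is why single-variable screening can miss such interactions.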


A Study Of Factors Contributing To Self-Reported Anomalies In Civil Aviation, Chris Andrzejczak Jan 2010

A Study Of Factors Contributing To Self-Reported Anomalies In Civil Aviation, Chris Andrzejczak

Electronic Theses and Dissertations

A study was conducted to investigate which factors lead pilots to submit voluntary anomaly reports regarding their flight performance. The study employed statistical methods, text mining, clustering, and dimensional reduction techniques in an effort to determine relationships between factors and anomalies. A review of the literature was conducted to determine what factors contribute to these anomalous incidents, as well as what research exists on human error, its causes, and its management. Data from the NASA Aviation Safety Reporting System (ASRS) was analyzed using traditional statistical methods such as frequencies and multinomial logistic regression. Recently formalized approaches in text …


The Meaning And The Mining Of Legal Texts, Mireille Hildebrandt Jan 2010

The Meaning And The Mining Of Legal Texts, Mireille Hildebrandt

Mireille Hildebrandt

Positive law, inscribed in legal texts, entails an authority not inherent in literary texts, generating legal consequences that can have real effects on a person’s life and liberty. The interpretation of legal texts, necessarily a normative undertaking, resists the mechanical application of rules, though still requiring a measure of predictability, coherence with other relevant legal norms and compliance with constitutional safeguards. The present proliferation of legal texts on the internet (codes, statutes, judgments, treaties, doctrinal treatises) renders the selection of relevant texts and cases next to impossible. We may expect that systems to mine these texts to find arguments that …