Open Access. Powered by Scholars. Published by Universities.®

Computer Sciences Commons


Data mining

2012


Articles 1 - 22 of 22

Full-Text Articles in Computer Sciences

Human-Readable Real-Time Classifications Of Malicious Executables, Anselm Teh, Arran Stewart Dec 2012


Australian Information Security Management Conference

Shafiq et al. (2009a) propose a non–signature-based technique for detecting malware which applies data mining techniques to features extracted from executable files. Their technique has a high level of accuracy, a low false positive rate, and a speed on par with commercial anti-virus products. One portion of their technique uses a multi-layer perceptron as a classifier, which provides little insight into the reasons for classification. Our experience is that network security analysts prefer tools which provide human-comprehensible reasons for a classification, rather than operating as “black boxes”. We therefore build on the results of Shafiq et al. by demonstrating a …


Data Mining Of Pancreatic Cancer Protein Databases, Peter Revesz, Christopher Assi Dec 2012


CSE Conference and Workshop Papers

Data mining of protein databases poses special challenges because many protein databases are non-relational whereas most data mining and machine learning algorithms assume the input data to be a type of relational database that is also representable as an ARFF file. We developed a method to restructure protein databases so that they become amenable for various data mining and machine learning tools. Our restructuring method enabled us to apply both decision tree and support vector machine classifiers to a pancreatic protein database. The SVM classifier that used both GO term and PFAM families to characterize proteins gave …


Exploring Place Through User-Generated Content: Using Flickr Tags To Describe City Cores, Livia Hollenstein, Ross Purves Oct 2012


Journal of Spatial Information Science

Terms used to describe city centers, such as Downtown, are key concepts in everyday or vernacular language. Here, we explore such language by harvesting georeferenced and tagged metadata associated with 8 million Flickr images and thus consider how large numbers of people name city core areas. The nature of errors and imprecision in tagging and georeferencing are quantified, and automatically generated precision measures appear to mirror errors in the positioning of images. Users seek to ascribe appropriate semantics to images, though bulk-uploading and bulk-tagging may introduce bias. Between 0.5% and 2% of tags associated with georeferenced images analyzed describe city core areas …


Adaptive Grid Based Localized Learning For Multidimensional Data, Sheetal Saini Oct 2012


Doctoral Dissertations

Rapid advances in data-rich domains of science, technology, and business have amplified the computational challenges of "Big Data" synthesis necessary to slow the widening gap between the rate at which the data is being collected and analyzed for knowledge. This has led to the renewed need for efficient and accurate algorithms, frameworks, and algorithmic mechanisms essential for knowledge discovery, especially in the domains of clustering, classification, dimensionality reduction, feature ranking, and feature selection. However, data mining algorithms are frequently challenged by the sparseness due to the high dimensionality of the datasets in such domains which is particularly detrimental to the …


Semi-Automatic Simulation Initialization By Mining Structured And Unstructured Data Formats From Local And Web Data Sources, Olcay Sahin Oct 2012


Computational Modeling & Simulation Engineering Theses & Dissertations

Initialization is one of the most important processes for obtaining successful results from a simulation. However, initialization is a challenge when 1) a simulation requires hundreds or even thousands of input parameters or 2) the simulation must be re-initialized because of different initial conditions or runtime errors. These challenges lead to the modeler spending more time initializing a simulation and may lead to errors due to poor input data.

This thesis proposes two semi-automatic simulation initialization approaches that provide initialization using data mining from structured and unstructured data formats from local and web data sources. First, the System Initialization with Retrieval (SIR) …


Building A Computer Program To Support Children, Parents, And Distraction During Healthcare Procedures, Kirsten Hanrahan, Ann Marie McCarthy, Charmaine Kleiber, Kaan Ataman, W. Nick Street, M. Bridget Zimmerman, Annel L. Ersig Oct 2012


Business Faculty Articles and Research

This secondary data analysis used data mining methods to develop predictive models of child risk for distress during a healthcare procedure. Data used came from a study that predicted factors associated with children's responses to an intravenous catheter insertion while parents provided distraction coaching. From the 255 items used in the primary study, 44 predictive items were identified through automatic feature selection and used to build support vector machine regression models. Models were validated using multiple cross-validation tests and by comparing variables identified as explanatory in the traditional versus support vector machine regression. Rule-based approaches were applied to the model …


A Confidence-Prioritization Approach To Data Processing In Noisy Data Sets And Resulting Estimation Models For Predicting Streamflow Diel Signals In The Pacific Northwest, Nathaniel Lee Gustafson Aug 2012


Theses and Dissertations

Streams in small watersheds are often known to exhibit diel fluctuations, in which streamflow oscillates on a 24-hour cycle. Streamflow diel fluctuations, which we investigate in this study, are an informative indicator of environmental processes. However, in environmental data sets, as in many others, there is a range of noise associated with individual data points. Some points are extracted under relatively clear and defined conditions, while others may include a range of known or unknown confounding factors, which may decrease those points' validity. These points may or may not remain useful for training, depending on how much uncertainty they …


From Clickstreams To Searchstreams: Search Network Graph Evidence From A B2B E-Market, Mei Lin, M. F. Lin, Robert J. Kauffman Aug 2012


Research Collection School Of Computing and Information Systems

Consumers in e-commerce acquire information through search engines, yet to date there has been little empirical study on how users interact with the results produced by search engines. This is analogous to, but different from, the ever-expanding research on clickstreams, where users interact with static web pages. We propose a new network approach to analyzing search engine server log data. We call this searchstream data. We create graph representations based on the web pages that users traverse as they explore the search results that their use of search engines generates. We then analyze the graph-level properties of these search network …


Data Mining Of Protein Databases, Christopher Assi Jul 2012


Department of Computer Science and Engineering: Dissertations, Theses, and Student Research

Data mining of protein databases poses special challenges because many protein databases are non-relational whereas most data mining and machine learning algorithms assume the input data to be a relational database. Protein databases are non-relational mainly because they often contain set data types. We developed new data mining algorithms that can restructure non-relational protein databases so that they become relational and amenable for various data mining and machine learning tools. We applied the new restructuring algorithms to a pancreatic protein database. After the restructuring, we also applied two classification methods, decision tree and SVM classifiers, and compared their …
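
As a rough illustration of why set data types are a problem for relational tools, the sketch below expands one set-valued field into binary indicator columns. This is a common generic encoding, not the restructuring algorithm of the thesis, which the abstract does not describe. The function name `flatten_set_attribute`, the field name `go_terms`, and the identifiers `GO:A` and `GO:B` are hypothetical placeholders.

```python
def flatten_set_attribute(records, key):
    """Turn one set-valued field into 0/1 indicator columns.

    Generic encoding for illustration only; not the thesis's method.
    """
    # Collect every value that appears in the set-valued field.
    vocabulary = sorted({value for record in records for value in record[key]})
    rows = []
    for record in records:
        # Copy the ordinary (single-valued) fields unchanged.
        row = {k: v for k, v in record.items() if k != key}
        # Add one column per possible set member.
        for value in vocabulary:
            row[f"{key}={value}"] = 1 if value in record[key] else 0
        rows.append(row)
    return rows


# Hypothetical records: each protein carries a set of annotation terms.
proteins = [
    {"id": "P1", "go_terms": {"GO:A", "GO:B"}},
    {"id": "P2", "go_terms": {"GO:B"}},
]
table = flatten_set_attribute(proteins, "go_terms")
```

Every row of `table` has the same fixed columns, which is the shape that decision tree and SVM tools typically expect.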


Mining Input Sanitization Patterns For Predicting SQL Injection And Cross Site Scripting Vulnerabilities, Lwin Khin Shar, Hee Beng Kuan Tan Jun 2012


Research Collection School Of Computing and Information Systems

Static code attributes such as lines of code and cyclomatic complexity have been shown to be useful indicators of defects in software modules. As web applications adopt input sanitization routines to prevent web security risks, static code attributes that represent the characteristics of these routines may be useful for predicting web application vulnerabilities. In this paper, we classify various input sanitization methods into different types and propose a set of static code attributes that represent these types. Then we use data mining methods to predict SQL injection and cross site scripting vulnerabilities in web applications. Preliminary experiments show that our …


Data Mining Of Tetraloop-Tetraloop Receptors In RNA XML Files, Sinan Ramazanoglu May 2012


Theses

RNA (ribonucleic acid) motifs are tertiary structures that play an important role in the folding mechanism of the RNA molecule. The overall function of an RNA motif depends on the specific bp (base pair) sequence that constitutes its secondary structure. Data mining is a novel method for both discovering potential tertiary structures within DNA (deoxyribonucleic acid), RNA, and protein molecules and storing the information in databases. The RNA motif of interest is the tetraloop-tetraloop receptor, which is composed of a highly conserved 11 nt (nucleotide) sequence and a tetraloop with the generic form of GNRA (where N = any base …


Ensemble Of Feature Selection Techniques For High Dimensional Data, Sri Harsha Vege May 2012


Masters Theses & Specialist Projects

Data mining involves the use of data analysis tools to discover previously unknown, valid patterns and relationships from large amounts of data stored in databases, data warehouses, or other information repositories. Feature selection is an important preprocessing step of data mining that helps increase the predictive performance of a model. The main aim of feature selection is to choose a subset of features with high predictive information and eliminate irrelevant features with little or no predictive information. Using a single feature selection technique may generate local optima.

In this thesis we propose an ensemble approach for feature selection, where multiple …


Analysis And Characterization Of Author Contribution Patterns In Open Source Software Development, Quinn Carlson Taylor Mar 2012


Theses and Dissertations

Software development is a process fraught with unpredictability, in part because software is created by people. Human interactions add complexity to development processes, and collaborative development can become a liability if not properly understood and managed. Recent years have seen an increase in the use of data mining techniques on publicly-available repository data with the goal of improving software development processes, and by extension, software quality. In this thesis, we introduce the concept of author entropy as a metric for quantifying interaction and collaboration (both within individual files and across projects), present results from two empirical observational studies of open-source …
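
The abstract does not give a formula for author entropy. One plausible reading, assumed here purely for illustration, is the Shannon entropy of each author's share of the lines in a file. The function name `author_entropy` and the list-of-author-names input are hypothetical.

```python
import math
from collections import Counter


def author_entropy(line_authors):
    """Shannon entropy, in bits, of the author distribution over a file's lines.

    Assumed definition for illustration; the abstract does not state one.
    """
    counts = Counter(line_authors)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # Sum p * log2(1/p) over authors, where p is an author's share of lines.
    return sum((c / total) * math.log2(total / c) for c in counts.values())


single_author = author_entropy(["alice"] * 4)
even_split = author_entropy(["alice", "bob"])
```

Under this assumption a file written entirely by one author scores 0 bits, and the score rises as lines are spread more evenly across more authors.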


Applying Data Mining Techniques In The Selection Of Plant Traits, Dean Diepeveen, Leisa Armstrong Feb 2012


Leisa Armstrong

In the agricultural sector, farmers are provided with crop related information by various research agencies in order to make critical decisions about which is the most profitable crop variety choice. Research agencies provide information which is generic, rather than being tailored to the individual farmer's cropping situation. A number of specific plant and growth traits are used to establish the most suitable crop varieties. When selecting crop varieties for release to growers, the application of data mining techniques to crop research data enables the customization of information to each individual farmer's farming situation. The challenge for agricultural research perspective is …


An Evaluation Of Methodologies For eAgriculture In An Australian Context, Leisa Armstrong, Dean Diepeveen Feb 2012


Leisa Armstrong

Australian agricultural producers’ profits are dependent on the decisions they make about farm productivity systems. They may use recommendations and information provided by government agencies and private consultants. For cereal growers, success is dependent on decisions made about selection of crop varieties suitable for their agronomic and climatic conditions. This paper reports on research which aimed to evaluate some current eAgriculture methodologies for their application in the Western Australian agricultural industry. In particular the paper illustrates the findings from a project which aimed to explain the variability seen in crop varieties grown in Western Australia. The problems associated with crop …


An eAgriculture-Based Decision Support Framework For Information Dissemination, Leisa Armstrong, Dean Diepeveen, Khumphicha Tantisantisom Feb 2012


Leisa Armstrong

The ability of farmers to acquire knowledge to make decisions is limited by the information quality and applicability. Inconsistencies in information delivery and standards for the integration of information also limit decision making processes. This research uses a similar approach to the Knowledge Discovery in Databases (KDD) methodology to develop an ICT based framework which can be used to facilitate the acquisition of knowledge for farmers' decision making processes. This is one of the leading areas of research and development for information technology in an agricultural industry, which is yet to utilize such technologies fully. The Farmer Knowledge and Decision Support Framework (FKDSF) …


An Information-Based Decision Support Framework For eAgriculture, Leisa Armstrong, Dean Diepeveen Feb 2012


Leisa Armstrong

The ability of farmers to acquire knowledge to make decisions is limited by the information quality and applicability. An inconsistency in information delivery and standards for the integration of information also limits the decision making process. The Knowledge Discovery in Databases (KDD) methodology described for data mining is an example of how frameworks can be used to facilitate such data integration. This research will examine how such an ICT based framework can be used to facilitate the acquisition of knowledge for the farmer decision making process. The Farmer Knowledge and Decision Support Framework (FKDSF) takes information provided to farmers and …


Computer Methods For Pre-MicroRNA Secondary Structure Prediction, Dianwei Han Jan 2012


Theses and Dissertations--Computer Science

This thesis presents a new algorithm to predict the pre-microRNA secondary structure. An accurate prediction of the pre-microRNA secondary structure is important in miRNA informatics. Based on a recently proposed model, nucleotide cyclic motifs (NCM), to predict RNA secondary structure, we propose and implement a Modified NCM (MNCM) model with a physics-based scoring strategy to tackle the problem of pre-microRNA folding. Our microRNAfold is implemented using a global optimal algorithm based on the bottom-up local optimal solutions.

It has been shown that studying the functions of multiple genes and predicting the secondary structure of multiple related microRNA is more important …


Medical Data Analysis Method For Epilepsy, Ameen Eetemadi Jan 2012


Wayne State University Theses

Applying data mining techniques to medical databases that contain unstructured and semi-structured data is a challenging task. This is due not only to the complexity of such databases but also to the characteristics of the medical domain. This thesis describes how multiple layers of data mining techniques have been applied to a Human Brain Image Database system. It starts with data preparation which paves the way for conventional data analysis techniques to be applied to the data. A similarity based patient retrieval tool has been designed and developed to assist in treatment planning and outcome estimation for epileptic patients. …


Decision Rule Induction For Service Sector Using Data Mining- A Rough Set Theory Approach, Zhonghua Hu Jan 2012


Open Access Theses & Dissertations

Nowadays, data mining is more widely used than ever before, not only in academia but also in industry and business. Apart from execution of business processes, the creation of a knowledge base and its utilization for the benefit of the organization is becoming a strategic tool for competing. Despite having ever-growing databases, the finance company fails to capitalize fully on the benefits that can be gained from this great wealth of information. Data mining technology, rather than classic statistical analysis, has been developed to help people discover the …


Redistricting Using Constrained Polygonal Clustering, Deepti Joshi, Leen-Kiat Soh, Ashok Samal Jan 2012


School of Computing: Faculty Publications

Redistricting is the process of dividing a geographic area consisting of spatial units—often represented as spatial polygons—into smaller districts that satisfy some properties. It can therefore be formulated as a set partitioning problem where the objective is to cluster the set of spatial polygons into groups such that a value function is maximized [1]. Widely used algorithms developed for point-based data sets are not readily applicable because polygons introduce the concepts of spatial contiguity and other topological properties that cannot be captured by representing polygons as points. Furthermore, when clustering polygons, constraints such as spatial contiguity and unit distributedness should …


Hypotheses Generation As Supervised Link Discovery With Automated Class Labeling On Large-Scale Biomedical Concept Networks, Jayasimha R. Katukuri, Ying Xie, Vijay Raghavan, Ashish Gupta Jan 2012


Faculty and Research Publications

Computational approaches to generate hypotheses from biomedical literature have been studied intensively in recent years. Nevertheless, it remains a challenge to automatically discover novel, cross-silo biomedical hypotheses from large-scale literature repositories. In order to address this challenge, we first model a biomedical literature repository as a comprehensive network of biomedical concepts and formulate hypotheses generation as a process of link discovery on the concept network. We extract the relevant information from the biomedical literature corpus and generate a concept network and concept-author map on a cluster using the Map-Reduce framework. We extract a set of heterogeneous features such as random …