Open Access. Powered by Scholars. Published by Universities.®

Computer Sciences Commons

Data mining

Theses/Dissertations

Articles 91 - 117 of 117

Full-Text Articles in Computer Sciences

Evolutionary Strategies For Data Mining, Rose Lowe Dec 2010

All Dissertations

Learning classifier systems (LCS) have been successful in generating rules for solving classification problems in data mining. The rules are of the form IF condition THEN action. The condition encodes the features of the input space and the action encodes the class label. What is lacking in those systems is the ability to express each feature using a function that is appropriate for that feature. The genetic algorithm is capable of doing this but cannot because only one type of membership function is provided. Thus, the genetic algorithm learns only the shape and placement of the membership function, and in …


Event-Driven Similarity And Classification Of Scanpaths, Thomas Grindinger Aug 2010

All Dissertations

Eye tracking experiments often involve recording the pattern of deployment of visual attention over the stimulus as viewers perform a given task (e.g., visual search). It is useful in training applications, for example, to make available an expert's sequence of eye movements, or scanpath, to novices for their inspection and subsequent learning. It may also be potentially useful to be able to assess the conformance of the novice's scanpath to that of the expert. A computational tool is proposed that provides a framework for performing such classification, based on the use of a probabilistic machine learning algorithm. The approach was …


Enterprise Users And Web Search Behavior, April Ann Lewis May 2010

Masters Theses

This thesis describes an analysis of user web query behavior associated with Oak Ridge National Laboratory’s (ORNL) Enterprise Search System (hereafter, ORNL Intranet). The ORNL Intranet provides users a means to search all kinds of data stores for relevant business and research information using a single query. The Global Intranet Trends for 2010 Report suggests the biggest current obstacle for corporate intranets is “findability and Siloed content”. Intranets differ from internets in the way they create, control, and share content, which can make it difficult and sometimes impossible for users to find information. Stenmark (2006) first noted studies of corporate …


The Impact Of Overfitting And Overgeneralization On The Classification Accuracy In Data Mining, Huy Nguyen Anh Pham Jan 2010

LSU Doctoral Dissertations

Current classification approaches usually do not try to achieve a balance between fitting and generalization when they infer models from training data. Such approaches ignore the possibility of different penalty costs for the false-positive, false-negative, and unclassifiable types. Thus, their performances may not be optimal or may even be coincidental. This dissertation analyzes the above issues in depth. It also proposes two new approaches called the Homogeneity-Based Algorithm (HBA) and the Convexity-Based Algorithm (CBA) to address these issues. These new approaches aim at optimally balancing the data fitting and generalization behaviors of models when some traditional classification approaches are used. …


Detecting Malicious Software By Dynamic Execution, Jianyong Dai Jan 2009

Electronic Theses and Dissertations

The traditional way to detect malicious software is based on signature matching. However, signature matching only detects known malicious software. In order to detect unknown malicious software, it is necessary to analyze the software for its impact on the system when the software is executed. In one approach, the software code can be statically analyzed for any malicious patterns. Another approach is to execute the program and determine the nature of the program dynamically. Since the execution of malicious code may have a negative impact on the system, the code must be executed in a controlled environment. For that purpose, we have …


Parallel Mining Of Association Rules Using A Lattice Based Approach, Wessel Morant Thomas Jan 2009

CCE Theses and Dissertations

The discovery of interesting patterns from database transactions is one of the major problems in knowledge discovery in databases. One such interesting pattern is the association rules extracted from these transactions. Parallel algorithms are required for the mining of association rules due to the very large databases used to store the transactions. In this paper we present a parallel algorithm for the mining of association rules. We implemented a parallel algorithm that used a lattice approach for mining association rules. The Dynamic Distributed Rule Mining (DDRM) is a lattice-based algorithm that partitions the lattice into sublattices to be assigned to …


Effects Of Similarity Metrics On Document Clustering, Rushikesh Veni Jan 2009

UNLV Theses, Dissertations, Professional Papers, and Capstones

Document clustering, or unsupervised document classification, is an automated process of grouping documents with similar content. A typical technique uses a similarity function to compare documents. In the literature, many similarity functions, such as the dot product or cosine measures, are proposed for the comparison operator.

For the thesis, we evaluate the effects a similarity function may have on clustering. We start by representing a document and a query as vectors in a high-dimensional space corresponding to the keywords, and then use an appropriate distance measure in k-means to compute the similarity between the document vector and the query vector to …
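As an illustration of why the choice of similarity function can matter, the following minimal Python sketch compares the dot product and the cosine measure on toy keyword-count vectors. The vectors and function names are hypothetical and are not taken from the thesis; the sketch only shows that the dot product grows with document length while the cosine measure does not.

```python
import math

def dot(u, v):
    """Dot product of two equal-length keyword-count vectors."""
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    norm = math.sqrt(dot(u, u)) * math.sqrt(dot(v, v))
    return dot(u, v) / norm if norm else 0.0

# Hypothetical keyword counts: doc_long repeats the content of doc_short three times.
query = [1, 1, 0]
doc_short = [1, 1, 0]
doc_long = [3, 3, 0]

# The dot product rewards the longer document; the cosine measure treats both alike.
print(dot(doc_short, query), dot(doc_long, query))
print(cosine(doc_short, query), cosine(doc_long, query))
```

Under the dot product the longer document scores higher purely because of its length, whereas the cosine measure assigns both documents the same similarity to the query, which is one way a similarity function can change the resulting clusters.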


Investigating Data Mining Techniques For Extracting Information From Alzheimer's Disease Data, Vinh Quoc Dang Jan 2009

Theses : Honours

Data mining techniques have been used widely in many areas such as business, science, engineering and, more recently, clinical medicine. These techniques allow an enormous amount of high-dimensional data to be analysed for extraction of interesting information as well as the construction of models for prediction. One of the foci in health-related research is Alzheimer's disease, which is currently a non-curable disease where diagnosis can only be confirmed after death via an autopsy. Using multi-dimensional data and the applications of data mining techniques, researchers hope to find biomarkers that will diagnose Alzheimer's disease as early as possible. …


Bootstrapping Events And Relations From Text, Ting Liu Jan 2009

Legacy Theses & Dissertations (2009 - 2024)

Information Extraction (IE) is a technique for automatically extracting structured data from text documents. One of the key analytical tasks is extraction of important and relevant information from textual sources. While information is plentiful and readily available, from the Internet, news services, media, etc., extracting the critical nuggets that matter to business or to national security is a cognitively demanding and time-consuming task. Intelligence and business analysts spend many hours poring over endless streams of text documents pulling out references to entities of interest (people, locations, organizations) as well as their relationships as reported in text. Such extracted "information …


Data Exploration By Using The Monotonicity Property, Hongyi Chen Jan 2008

LSU Master's Theses

Dealing with different misclassification costs has been a big problem for classification. Some algorithms, like most rule induction methods, can predict quite accurately when the misclassification costs for each class are assumed to be the same. However, when the misclassification costs change, which is a common phenomenon in reality, these algorithms are not capable of adjusting their results. Some other algorithms, like the Bayesian methods, have the ability to yield probabilities of a certain unclassified example belonging to given classes, which is helpful for modifying the results according to different misclassification costs. The shortcoming of such algorithms is, when the …


Structure Pattern Analysis Using Term Rewriting And Clustering Algorithm, Xuezheng Fu Jun 2007

Computer Science Dissertations

Biological data is accumulated at a fast pace. However, raw data are generally difficult to understand and not useful unless we unlock the information hidden in the data. Knowledge/information can be extracted as the patterns or features buried within the data. Thus data mining, which aims at uncovering underlying rules, relationships, and patterns in data, has emerged as one of the most exciting fields in computational science. In this dissertation, we develop efficient approaches to the structure pattern analysis of RNA and protein three-dimensional structures. The major techniques used in this work include term rewriting and clustering algorithms. Firstly, a …


An Investigation Into The Application Of Data Mining Techniques To Characterize Agricultural Soil Profiles, Rowan J. Maddern Jan 2007

Theses : Honours

The advances in computing and information storage have provided vast amounts of data. The challenge has been to extract knowledge from this raw data; this has led to new methods and techniques such as data mining that can bridge the knowledge gap. The research aims to use these new data mining techniques and apply them to a soil science database to establish if meaningful relationships can be found. A data set extracted from the WA Department of Agriculture and Food (DAFWA) soils database has been used to conduct this research. The database contains measurements of soil profile data from …


Enhancing Web Marketing By Using Ontology, Xuan Zhou May 2006

Dissertations

The existence of the Web has a major impact on people's lifestyles. Online shopping, online banking, email, instant messenger services, search engines and bulletin boards have gradually become parts of our daily life. All kinds of information can be found on the Web. Web marketing is one of the ways to make use of online information. By extracting demographic information and interest information from the Web, marketing knowledge can be augmented by applying data mining algorithms. Therefore, this knowledge, which connects customers to products, can be used for marketing purposes and for targeting existing and potential customers. The Web …


Temporal Data Mining In A Dynamic Feature Space, Brent K. Wenerstrom May 2006

Theses and Dissertations

Many interesting real-world applications for temporal data mining are hindered by concept drift. One particular form of concept drift is characterized by changes to the underlying feature space. Seemingly little has been done to address this issue. This thesis presents FAE, an incremental ensemble approach to mining data subject to concept drift. FAE achieves better accuracies over four large datasets when compared with a similar incremental learning algorithm.


Detecting Potential Insider Threats Through Email Datamining, James S. Okolica Mar 2006

Theses and Dissertations

No abstract provided.


Text Mining With Exploitation Of User's Background Knowledge : Discovering Novel Association Rules From Text, Xin Chen Jan 2006

Dissertations

The goal of text mining is to find interesting and non-trivial patterns or knowledge from unstructured documents. Both objective and subjective measures have been proposed in the literature to evaluate the interestingness of discovered patterns. However, objective measures alone are insufficient because such measures do not consider knowledge and interests of the users. Subjective measures require explicit input of user expectations which is difficult or even impossible to obtain in text mining environments.

This study proposes a user-oriented text-mining framework and applies it to the problem of discovering novel association rules from documents. The developed system, uMining, consists of two …


Efficient Generation Of Social Network Data From Computer-Mediated Communication Logs, Jason Wei Sung Yee Mar 2005

Theses and Dissertations

The insider threat poses a significant risk to any network or information system. A general definition of the insider threat is an authorized user performing unauthorized actions, a broad definition with no specifications on severity or action. While limited research has been able to classify and detect insider threats, it is generally understood that insider attacks are planned, and that there is a time period in which the organization's leadership can intervene and prevent the attack. Previous studies have shown that the person's behavior will generally change, and it is possible that social network analysis could be used to observe …


Pattern Discovery In Structural Databases With Applications To Bioinformatics, Sen Zhang Jan 2005

Dissertations

Frequent structure mining (FSM) aims to discover and extract patterns frequently occurring in structural data such as trees and graphs. FSM finds many applications in bioinformatics, XML processing, Web log analysis, and so on. In this thesis, two new FSM techniques are proposed for finding patterns in unordered labeled trees. Such trees can be used to model evolutionary histories of different species, among others.

The first FSM technique finds cousin pairs in the trees. A cousin pair is a pair of nodes sharing the same parent, the same grandparent, or the same great-grandparent, etc. Given a tree T, our …
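The informal definition above can be sketched in Python as follows. This is only an illustration of the abstract's wording (two nodes whose k-th ancestors coincide for the same k), not the thesis's algorithm; the tree encoding as a child-to-parent dictionary, the function names, and the `max_k` cutoff are all assumptions made for this example.

```python
from itertools import combinations

def kth_ancestor(parent, node, k):
    """Walk k steps up the tree; return None if the walk passes the root."""
    for _ in range(k):
        node = parent.get(node)
        if node is None:
            return None
    return node

def cousin_pairs(parent, max_k=3):
    """Map each pair (u, v) to the smallest k <= max_k at which both nodes
    share the same k-th ancestor (k=1: same parent, k=2: same grandparent)."""
    pairs = {}
    for u, v in combinations(sorted(parent), 2):
        for k in range(1, max_k + 1):
            ancestor = kth_ancestor(parent, u, k)
            if ancestor is not None and ancestor == kth_ancestor(parent, v, k):
                pairs[(u, v)] = k
                break
    return pairs

# Hypothetical tree given as child -> parent; the root "r" maps to None.
tree = {"r": None, "a": "r", "b": "r", "c": "a", "d": "a", "e": "b"}
print(cousin_pairs(tree))
```

In this toy tree, `a` and `b` share a parent, `c` and `d` share a parent, and `c` and `e` (like `d` and `e`) share only a grandparent. The pairwise scan is quadratic in the number of nodes and is meant for clarity rather than efficiency.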


New Techniques For Improving Biological Data Quality Through Information Integration, Katherine Grace Herbert May 2004

Dissertations

As databases become more pervasive through the biological sciences, various data quality concerns are emerging. Biological databases tend to develop data quality issues regarding data legacy, data uniformity and data duplication. Due to the nature of this data, each of these problems is non-trivial and can cause many problems for the database. For biological data to be corrected and standardized, methods and frameworks must be developed to handle both structural and traditional data.

The BIG-AJAX framework has been developed for solving these problems through both data cleaning and data integration. This framework exploits declarative data cleaning and exploratory data mining …


Customer Relationship Management For Banking System, Pingyu Hou Jan 2004

Theses Digitization Project

The purpose of this project is to design, build, and implement a Customer Relationship Management (CRM) system for a bank. CRM BANKING is an online application that caters to strengthening and stabilizing customer relationships in a bank.


High Performance Data Mining Techniques For Intrusion Detection, Muazzam Ahmed Siddiqui Jan 2004

Electronic Theses and Dissertations

The rapid growth of computers transformed the way in which information and data were stored. With this new paradigm of data access comes the threat of this information being exposed to unauthorized and unintended users. Many systems have been developed which scrutinize the data for a deviation from the normal behavior of a user or system, or search for a known signature within the data. These systems are termed Intrusion Detection Systems (IDS). These systems employ different techniques varying from statistical methods to machine learning algorithms. Intrusion detection systems use audit data generated by operating systems, application software or …


Using Sequence Analysis To Perform Application-Based Anomaly Detection Within An Artificial Immune System Framework, Larissa A. O'Brien Mar 2003

Theses and Dissertations

The Air Force and other Department of Defense (DoD) computer systems typically rely on traditional signature-based network IDSs to detect various types of attempted or successful attacks. Signature-based methods are limited to detecting known attacks or similar variants; anomaly-based systems, by contrast, alert on behaviors previously unseen. The development of an effective anomaly-detecting, application-based IDS would increase the Air Force's ability to ward off attacks that are not detected by signature-based network IDSs, thus strengthening the layered defenses necessary to acquire and maintain safe, secure communication capability. This system follows the Artificial Immune System (AIS) framework, which relies on …


Analysis Of Gene Expression Data Using Expressionist 3.1 And Genespring 4.2, Indu Shrivastava Jan 2003

Theses

The purpose of this study was to determine the differences in the gene expression analysis methods of two data mining tools, Expressionist™ 3.1 and GeneSpring™ 4.2, with a focus on basic statistical analysis and clustering algorithms. The data for this analysis was derived from the hybridization of Rattus norvegicus RNA to the Affymetrix RG34A GeneChip. This analysis was derived from experiments designed to identify changes in gene expression patterns that were induced in vivo by an experimental treatment.

The tools were found to be comparable with respect to the list of statistically significant genes that were up-regulated by more …


Data Mining Feature Subset Weighting And Selection Using Genetic Algorithms, Okan Yilmaz Mar 2002

Theses and Dissertations

We present a simple genetic algorithm (sGA), developed under the Genetic Rule and Classifier Construction Environment (GRaCCE), to solve the feature subset selection and weighting problem and thereby improve the classification accuracy of the k-nearest neighbor (KNN) algorithm. Our hypotheses are that weighting the features will affect the performance of the KNN algorithm and will yield a better classification accuracy rate than that of binary classification. The weighted-sGA algorithm uses real-value chromosomes to find the weights for features, and binary-sGA uses integer-value chromosomes to select the subset of features from the original feature set. A Repair algorithm is developed for the weighted-sGA algorithm to guarantee …
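To illustrate how per-feature weights can change a KNN prediction, here is a minimal Python sketch that applies a weight vector inside the Euclidean distance. It does not reproduce sGA or GRaCCE: the weights are fixed by hand rather than evolved, and the data, function names, and weight values are hypothetical.

```python
import math
from collections import Counter

def weighted_distance(a, b, weights):
    """Euclidean distance with a non-negative weight per feature."""
    return math.sqrt(sum(w * (x - y) ** 2 for x, y, w in zip(a, b, weights)))

def knn_predict(train, query, weights, k=3):
    """Majority vote among the k training points nearest to the query."""
    nearest = sorted(train, key=lambda item: weighted_distance(item[0], query, weights))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical data: feature 0 separates the classes, feature 1 is noise on a larger scale.
train = [((0.1, 1.0), "a"), ((0.2, 2.0), "a"), ((0.9, 9.0), "b"), ((1.0, 8.0), "b")]
query = (0.0, 9.0)

print(knn_predict(train, query, weights=(1.0, 1.0)))    # all features weighted equally
print(knn_predict(train, query, weights=(1.0, 0.0)))    # binary selection: drop feature 1
print(knn_predict(train, query, weights=(1.0, 0.001)))  # real-valued weighting
```

A binary weight vector corresponds to feature subset selection, while real-valued weights correspond to feature weighting; in a genetic-algorithm setting, the weight vector would play the role of the chromosome and classification accuracy the role of the fitness.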


A Tool For Phylogenetic Data Cleaning And Searching, Viswanath Neelavalli Jan 2002

Theses

Data collection and cleaning is a very important part of an elaborate data mining system. 'TreeBASE' is a relational database of phylogenetic information at Harvard University with a keyword-based searching interface. 'TreeSearch' is a structure-based search engine implemented at NJIT that can be used for searching phylogenetic data. Phylogenetic trees are extracted from the flat-file database at Harvard University, available at {ftp://herbaria.harvard.edu/pub/piel/Data/files/}. There is a huge amount of information present in the files about the trees and the data matrices from which the trees are generated. The search tool implemented at NJIT is interested in using the string …


Data Warehouse Applications In Modern Day Business, Carla Mounir Issa Jan 2002

Theses Digitization Project

Data warehousing provides organizations with strategic tools to achieve the competitive advantage that organizations are constantly seeking. The use of tools such as data mining, indexing and summaries enables management to retrieve information and perform thorough analysis, planning and forecasting to meet the changes in the market environment. In addition, the data warehouse is providing security measures that, if properly implemented and planned, are helping organizations ensure that their data quality and validity remain intact.


Knowledge Discovery In Biological Databases : A Neural Network Approach, Qicheng Ma Aug 2000

Dissertations

Knowledge discovery in databases, also known as data mining, aims to find significant information in a set of data. The knowledge to be mined from the dataset may refer to patterns, association rules, classification and clustering rules, and so forth. In this dissertation, we present a neural network approach to finding knowledge in biological databases. Specifically, we propose new methods to process biological sequences in two case studies: the classification of protein sequences and the prediction of E. coli promoters in DNA sequences. Our proposed methods, based on neural network architectures, combine techniques ranging from Bayesian inference, coding theory, …