Open Access. Powered by Scholars. Published by Universities.®

Computer Sciences Commons

Data mining

Articles 301–329 of 329

Full-Text Articles in Computer Sciences

On The Optimization Of Visualizations Of Complex Phenomena, Donald H. House, Althea D. Bair, Colin Ware Jan 2005

Center for Coastal and Ocean Mapping

The problem of perceptually optimizing complex visualizations is a difficult one, involving perceptual as well as aesthetic issues. In our experience, controlled experiments are quite limited in their ability to uncover interrelationships among visualization parameters, and thus may not be the most useful way to develop rules of thumb or theory to guide the production of high-quality visualizations. In this paper, we propose a new experimental approach to optimizing visualization quality that integrates some of the strong points of controlled experiments with methods more suited to investigating complex, highly coupled phenomena. We use human-in-the-loop experiments to search through visualization parameter space, generating large …
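
The human-in-the-loop search the abstract describes lends itself to a small sketch: a hill climber walks a visualization parameter space, with a rating stub standing in for the human viewer. Everything below (parameter names, ranges, the rating function) is hypothetical illustration, not the authors' experimental setup.

    # Hypothetical sketch of human-in-the-loop search over a visualization
    # parameter space; a rating stub stands in for the human judgment.
    import random

    PARAM_RANGES = {"opacity": (0.0, 1.0), "glyph_size": (1.0, 10.0)}
    IDEAL = {"opacity": 0.7, "glyph_size": 4.0}  # hidden optimum for the stub

    def rate_visualization(params):
        # Placeholder for a human rating: closer to the ideal scores higher.
        return -sum(abs(params[k] - IDEAL[k]) / (hi - lo)
                    for k, (lo, hi) in PARAM_RANGES.items())

    def hill_climb(steps=200, scale=0.1):
        current = {k: random.uniform(lo, hi) for k, (lo, hi) in PARAM_RANGES.items()}
        best = rate_visualization(current)
        for _ in range(steps):
            cand = {k: min(hi, max(lo, current[k] + (hi - lo) * scale * random.uniform(-1, 1)))
                    for k, (lo, hi) in PARAM_RANGES.items()}
            score = rate_visualization(cand)
            if score > best:  # the "viewer" prefers the candidate
                current, best = cand, score
        return current, best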


Blocking Reduction Strategies In Hierarchical Text Classification, Ee Peng Lim, Aixin Sun, Wee-Keong Ng, Jaideep Srivastava Oct 2004

Research Collection School Of Computing and Information Systems

One common approach in hierarchical text classification involves associating classifiers with nodes in the category tree and classifying text documents in a top-down manner. Classification methods using this top-down approach can scale well and cope with changes to the category trees. However, all these methods suffer from blocking, which refers to documents being wrongly rejected by the classifiers at higher levels and therefore never reaching the classifiers at lower levels. We propose a classifier-centric performance measure known as blocking factor to determine the extent of the blocking. Three methods are proposed to address the blocking problem, namely, threshold reduction, restricted voting, and …
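
A rough sketch of the top-down scheme and the threshold-reduction remedy may help. The tree, node scores, and the simplified blocking measure below are ours, not the paper's exact definitions.

    # Hypothetical sketch: top-down routing with per-node thresholds, plus
    # threshold reduction at internal nodes so fewer documents are blocked.
    class Node:
        def __init__(self, name, threshold, children=()):
            self.name, self.threshold, self.children = name, threshold, list(children)

        def score(self, doc):
            # Stand-in for a trained classifier's confidence for this category.
            return doc.get(self.name, 0.0)

    def classify(node, doc, reduction=0.0):
        cut = node.threshold - (reduction if node.children else 0.0)
        if node.score(doc) < cut:
            return None  # document blocked at this node
        if not node.children:
            return node.name
        for child in node.children:
            label = classify(child, doc, reduction)
            if label is not None:
                return label
        return None

    def blocking_factor(node, docs_in_subtree, reduction=0.0):
        # Simplified: fraction of documents that truly belong below this
        # node but are rejected by its classifier.
        blocked = [d for d in docs_in_subtree if node.score(d) < node.threshold - reduction]
        return len(blocked) / max(1, len(docs_in_subtree))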


Enhancements To Crisp Possibilistic Reconstructability Analysis, Anas Al-Rabadi, Martin Zwick Aug 2004

Systems Science Faculty Publications and Presentations

Modified Reconstructability Analysis (MRA), a novel decomposition within the framework of set-theoretic (crisp possibilistic) Reconstructability Analysis, is presented. It is shown that in some cases while 3-variable NPN-classified Boolean functions are not decomposable using Conventional Reconstructability Analysis (CRA), they are decomposable using Modified Reconstructability Analysis (MRA). Also, it is shown that whenever a decomposition of 3-variable NPN-classified Boolean functions exists in both MRA and CRA, MRA yields simpler or equal complexity decompositions. A comparison of the corresponding complexities of Ashenhurst-Curtis (AC) decompositions and Modified Reconstructability Analysis (MRA) is also presented. While both AC and MRA decompose some but …


A Support-Ordered Trie For Fast Frequent Itemset Discovery, Ee Peng Lim, Yew-Kwong Woon, Wee-Keong Ng Jul 2004

Research Collection School Of Computing and Information Systems

The importance of data mining is apparent with the advent of powerful data collection and storage tools; raw data is so abundant that manual analysis is no longer possible. Unfortunately, data mining problems are difficult to solve and this prompted the introduction of several novel data structures to improve mining efficiency. Here, we critically examine existing preprocessing data structures used in association rule mining for enhancing performance in an attempt to understand their strengths and weaknesses. Our analyses culminate in a practical structure called the SOTrieIT (support-ordered trie itemset) and two synergistic algorithms to accompany it for the fast discovery …
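
As a sketch of the support-ordered idea: the published SOTrieIT keeps 1- and 2-itemset counts in a trie ordered by descending support, so the early Apriori passes need no further database scans. The simplified two-level version below (function and variable names are ours) illustrates why the ordering helps.

    from collections import Counter
    from itertools import combinations

    def build_sotrieit(transactions):
        # Count single items and item pairs in one scan of the data.
        single, pair = Counter(), Counter()
        for t in transactions:
            items = sorted(set(t))
            single.update(items)
            pair.update(combinations(items, 2))
        # Level 1: items by descending support; level 2: each item's
        # co-occurrence counts, also support-ordered.
        level1 = sorted(single.items(), key=lambda kv: -kv[1])
        level2 = {item: [] for item in single}
        for (a, b), c in pair.items():
            level2[a].append((b, c))
        for partners in level2.values():
            partners.sort(key=lambda kv: -kv[1])
        return level1, level2

    def frequent_pairs(level2, minsup):
        # Support ordering lets each branch be abandoned at the first
        # infrequent entry instead of scanned to the end.
        out = []
        for a, partners in level2.items():
            for b, c in partners:
                if c < minsup:
                    break
                out.append(((a, b), c))
        return out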


New Techniques For Improving Biological Data Quality Through Information Integration, Katherine Grace Herbert May 2004

Dissertations

As databases become more pervasive through the biological sciences, various data quality concerns are emerging. Biological databases tend to develop data quality issues regarding data legacy, data uniformity and data duplication. Due to the nature of this data, each of these problems is non-trivial and can cause many problems for the database. For biological data to be corrected and standardized, methods and frameworks must be developed to handle both structural and traditional data.

The BIG-AJAX framework has been developed for solving these problems through both data cleaning and data integration. This framework exploits declarative data cleaning and exploratory data mining …


Customer Relationship Management For Banking System, Pingyu Hou Jan 2004

Theses Digitization Project

The purpose of this project is to design, build, and implement a Customer Relationship Management (CRM) system for a bank. CRM BANKING is an online application that caters to strengthening and stabilizing customer relationships in a bank.


High Performance Data Mining Techniques For Intrusion Detection, Muazzam Ahmed Siddiqui Jan 2004

Electronic Theses and Dissertations

The rapid growth of computers transformed the way in which information and data are stored. With this new paradigm of data access comes the threat of this information being exposed to unauthorized and unintended users. Many systems have been developed which scrutinize the data for a deviation from the normal behavior of a user or system, or search for a known signature within the data. These systems are termed Intrusion Detection Systems (IDSs). They employ techniques varying from statistical methods to machine learning algorithms. Intrusion detection systems use audit data generated by operating systems, application software, or …


Reconstructability Analysis With Fourier Transforms, Martin Zwick Jan 2004

Systems Science Faculty Publications and Presentations

Fourier methods used in two‐ and three‐dimensional image reconstruction can be used also in reconstructability analysis (RA). These methods maximize a variance‐type measure instead of information‐theoretic uncertainty, but the two measures are roughly collinear and the Fourier approach yields results close to that of standard RA. The Fourier method, however, does not require iterative calculations for models with loops. Moreover, the error in Fourier RA models can be assessed without actually generating the full probability distributions of the models; calculations scale with the size of the data rather than the state space. State‐based modeling using the Fourier approach is also …


A Software Architecture For Reconstructability Analysis, Kenneth Willett, Martin Zwick Jan 2004

Systems Science Faculty Publications and Presentations

Software packages for reconstructability analysis (RA), as well as for related log-linear modeling, generally provide a fixed set of functions. Such packages are suitable for end-users applying RA in various domains, but do not provide a platform for research into the RA methods themselves. A new software system, Occam3, is being developed to address three goals which often conflict with one another: to provide a general and flexible infrastructure for experimentation with RA methods and algorithms; an easily configured system allowing methods to be combined in novel ways, without requiring deep software expertise; and a system which …


An Overview Of Reconstructability Analysis, Martin Zwick Jan 2004

Systems Science Faculty Publications and Presentations

This paper is an overview of reconstructability analysis (RA), a discrete multivariate modeling methodology developed in the systems literature; an earlier version of this tutorial is Zwick (2001). RA was derived from Ashby (1964), and was developed by Broekstra, Cavallo, Cellier, Conant, Jones, Klir, Krippendorff, and others (Klir, 1986, 1996). RA resembles and partially overlaps log-linear (LL) statistical methods used in the social sciences (Bishop et al., 1978; Knoke and Burke, 1980). RA also resembles and overlaps methods used in logic design and machine learning (LDL) in electrical and computer engineering (e.g. Perkowski et al., 1997). Applications of RA, like …


A Comparison Of Modified Reconstructability Analysis And Ashenhurst‐Curtis Decomposition Of Boolean Functions, Anas Al-Rabadi, Marek Perkowski, Martin Zwick Jan 2004

Systems Science Faculty Publications and Presentations

Modified reconstructability analysis (MRA), a novel decomposition technique within the framework of set‐theoretic (crisp possibilistic) reconstructability analysis, is applied to three‐variable NPN‐classified Boolean functions. MRA is superior to conventional reconstructability analysis, i.e. it decomposes more NPN functions. MRA is compared to Ashenhurst‐Curtis (AC) decomposition using two different complexity measures: log‐functionality, a measure suitable for machine learning, and the count of the total number of two‐input gates, a measure suitable for circuit design. MRA is superior to AC using the first of these measures, and is comparable to, but different from AC, using the second.


Modified Reconstructability Analysis For Many-Valued Functions And Relations, Anas Al-Rabadi, Martin Zwick Jan 2004

Systems Science Faculty Publications and Presentations

A novel many-valued decomposition within the framework of lossless Reconstructability Analysis is presented. In previous work, Modified Reconstructability Analysis (MRA) was applied to Boolean functions, where it was shown that most Boolean functions not decomposable using conventional Reconstructability Analysis (CRA) are decomposable using MRA. Also, it was previously shown that whenever decomposition exists in both MRA and CRA, MRA yields simpler or equal complexity decompositions. In this paper, MRA is extended to many-valued logic functions, and logic structures that correspond to such decomposition are developed. It is shown that many-valued MRA can decompose many-valued functions when CRA fails to do …


Directed Extended Dependency Analysis For Data Mining, Thaddeus T. Shannon, Martin Zwick Jan 2004

Systems Science Faculty Publications and Presentations

Extended dependency analysis (EDA) is a heuristic search technique for finding significant relationships between nominal variables in large data sets. The directed version of EDA searches for maximally predictive sets of independent variables with respect to a target dependent variable. The original implementation of EDA was an extension of reconstructability analysis. Our new implementation adds a variety of statistical significance tests at each decision point that allow the user to tailor the algorithm to a particular objective. It also utilizes data structures appropriate for the sparse data sets customary in contemporary data mining problems. Two examples that illustrate different approaches …
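
A hedged sketch of the directed search with a significance gate at each decision point may clarify the idea. A chi-square test is used here as one plausible choice among the "variety of statistical significance tests" the abstract mentions; the column names, stopping rule, and composite-state encoding are illustrative, not the implementation's.

    import pandas as pd
    from scipy.stats import chi2_contingency

    def greedy_predictor_search(df, target, alpha=0.05, max_vars=3):
        chosen = []
        candidates = [c for c in df.columns if c != target]
        while candidates and len(chosen) < max_vars:
            best, best_p = None, alpha
            for var in candidates:
                # Composite state of the chosen set plus the candidate,
                # cross-tabulated against the target variable.
                composite = df[chosen + [var]].astype(str).agg("|".join, axis=1)
                _, p, _, _ = chi2_contingency(pd.crosstab(composite, df[target]))
                if p < best_p:
                    best, best_p = var, p
            if best is None:
                break  # no significant improvement remains
            chosen.append(best)
            candidates.remove(best)
        return chosen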


State-Based Reconstructability Analysis, Martin Zwick, Michael S. Johnson Jan 2004

Systems Science Faculty Publications and Presentations

Reconstructability analysis (RA) is a method for detecting and analyzing the structure of multivariate categorical data. While Jones and his colleagues extended the original variable‐based formulation of RA to encompass models defined in terms of system states, their focus was the analysis and approximation of real‐valued functions. In this paper, we separate two ideas that Jones had merged together: the “g to k” transformation and state‐based modeling. We relate the idea of state‐based modeling to established variable‐based RA concepts and methods, including structure lattices, search strategies, metrics of model quality, and the statistical evaluation of model fit for analyses based …


Reconstructability Analysis Detection Of Optimal Gene Order In Genetic Algorithms, Martin Zwick, Stephen Shervais Jan 2004

Systems Science Faculty Publications and Presentations

The building block hypothesis implies that genetic algorithm efficiency will be improved if sets of genes that improve fitness through epistatic interaction are near to one another on the chromosome. We demonstrate this effect with a simple problem, and show that information-theoretic reconstructability analysis can be used to decide on optimal gene ordering.
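
The disruption argument behind this can be made concrete: under one-point crossover, two loci at distance d on a chromosome of length L are separated with probability d/(L-1), so co-adapted genes placed far apart lose their linkage most often. A quick simulation (our illustration, not the paper's RA method):

    import random

    def separation_rate(L, i, j, trials=100_000):
        # Probability that a random one-point crossover splits loci i and j.
        sep = 0
        for _ in range(trials):
            cut = random.randint(1, L - 1)  # crossover point between genes
            if min(i, j) < cut <= max(i, j):
                sep += 1
        return sep / trials

    L = 20
    print(separation_rate(L, 0, 1))   # adjacent loci: ~1/19 ≈ 0.053
    print(separation_rate(L, 0, 19))  # maximally distant: 19/19 = 1.0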


Reversible Modified Reconstructability Analysis Of Boolean Circuits And Its Quantum Computation, Anas Al-Rabadi, Martin Zwick Jan 2004

Systems Science Faculty Publications and Presentations

Modified Reconstructability Analysis (MRA) can be realized reversibly by utilizing Boolean reversible (3,3) logic gates that are universal in two arguments. The quantum computation of the reversible MRA circuits is also introduced. The reversible MRA transformations are given a quantum form by using the normal matrix representation of such gates. The MRA-based quantum decomposition may play an important role in the synthesis of logic structures using future technologies that consume less power and occupy less space.
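
For a concrete sense of the "normal matrix representation" of a (3,3) reversible gate, here is the Toffoli gate, one universal example, as an 8x8 permutation matrix. This is purely our illustration; the specific gates the paper employs may differ.

    import numpy as np

    # Toffoli flips the target bit when both control bits are 1, i.e. it
    # permutes the basis states |110> <-> |111> (indices 6 and 7).
    T = np.eye(8)
    T[[6, 7]] = T[[7, 6]]

    assert np.allclose(T @ T.T, np.eye(8))  # unitary (real permutation matrix)
    assert np.allclose(T @ T, np.eye(8))    # self-inverse, hence reversible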


Using Reconstructability Analysis To Select Input Variables For Artificial Neural Networks, Stephen Shervais, Martin Zwick Jul 2003

Systems Science Faculty Publications and Presentations

We demonstrate the use of Reconstructability Analysis to reduce the number of input variables for a neural network. Using the heart disease dataset, we reduce the number of independent variables from 13 to 2, while providing results that are statistically indistinguishable from those of NNs using the full variable set. We also demonstrate that rule lookup tables obtained directly from the data for the RA models are almost as effective as NNs trained on model variables.
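
The rule-lookup-table idea reduces to mapping each joint state of the selected variables to the majority class observed for that state. A minimal sketch with hypothetical binarized predictors (the variable coding here is ours, not the paper's):

    from collections import Counter, defaultdict

    def build_lookup(rows, labels):
        # Majority class for each joint state of the selected variables.
        votes = defaultdict(Counter)
        for x, y in zip(rows, labels):
            votes[tuple(x)][y] += 1
        return {state: c.most_common(1)[0][0] for state, c in votes.items()}

    def predict(table, x, default=None):
        return table.get(tuple(x), default)

    # Two binarized predictors, echoing the 13-to-2 reduction:
    rows = [(0, 1), (1, 1), (0, 0), (1, 1), (0, 1)]
    labels = ["disease", "healthy", "healthy", "healthy", "disease"]
    table = build_lookup(rows, labels)
    print(predict(table, (0, 1)))  # -> "disease"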


Genescene: Biomedical Text And Data Mining, Gondy Leroy, Hsinchun Chen, Jesse D. Martinez, Shauna Eggers, Ryan R. Falsey, Kerri L. Kislin, Zan Huang, Jiexun Li, Jie Xu, Daniel M. Mcdonald, Gavin Ng May 2003

CGU Faculty Publications and Research

To access the content of digital texts efficiently, it is necessary to provide more sophisticated access than keyword-based searching. GeneScene provides biomedical researchers with research findings and background relations automatically extracted from text and experimental data. These provide a more detailed overview of the information available. The extracted relations were evaluated by qualified researchers and are precise. A qualitative ongoing evaluation of the current online interface indicates that this method of searching the literature is more useful and efficient than keyword-based searching.


Using Sequence Analysis To Perform Application-Based Anomaly Detection Within An Artificial Immune System Framework, Larissa A. O'Brien Mar 2003

Theses and Dissertations

The Air Force and other Department of Defense (DoD) computer systems typically rely on traditional signature-based network IDSs to detect various types of attempted or successful attacks. Signature-based methods are limited to detecting known attacks or similar variants; anomaly-based systems, by contrast, alert on behaviors previously unseen. The development of an effective anomaly-detecting, application-based IDS would increase the Air Force's ability to ward off attacks that are not detected by signature-based network IDSs, thus strengthening the layered defenses necessary to acquire and maintain safe, secure communication capability. This system follows the Artificial Immune System (AIS) framework, which relies on …
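
A sketch of sequence-based anomaly detection in the AIS spirit: build a "self" database of fixed-length windows from normal traces, then score new traces by how many of their windows are foreign. This is a stide-like simplification of our own, not the thesis's exact detector.

    def windows(trace, k=4):
        return {tuple(trace[i:i + k]) for i in range(len(trace) - k + 1)}

    def train(normal_traces, k=4):
        # "Self" = every length-k window seen in normal behavior.
        self_db = set()
        for trace in normal_traces:
            self_db |= windows(trace, k)
        return self_db

    def anomaly_score(trace, self_db, k=4):
        ws = [tuple(trace[i:i + k]) for i in range(len(trace) - k + 1)]
        unseen = sum(1 for w in ws if w not in self_db)
        return unseen / max(1, len(ws))  # fraction of foreign windows

    normal = [["open", "read", "write", "close"] * 3]
    db = train(normal)
    print(anomaly_score(["open", "read", "exec", "close", "write"], db))  # 1.0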


Analysis Of Gene Expression Data Using Expressionist 3.1 And Genespring 4.2, Indu Shrivastava Jan 2003

Theses

The purpose of this study was to determine the differences in the gene expression analysis methods of two data mining tools, Expressionist™ 3.1 and GeneSpring™ 4.2, with a focus on basic statistical analysis and clustering algorithms. The data for this analysis were derived from the hybridization of Rattus norvegicus RNA to the Affymetrix RG34A GeneChip. This analysis was derived from experiments designed to identify changes in gene expression patterns that were induced in vivo by an experimental treatment.

The tools were found to be comparable with respect to the list of statistically significant genes that were up-regulated by more …


A Pseudo Nearest-Neighbor Approach For Missing Data Recovery On Gaussian Random Data Sets, Xiaolu Huang, Qiuming Zhu Nov 2002

Computer Science Faculty Publications

Missing data handling is an important preparation step for most data discrimination or mining tasks. Inappropriate treatment of missing data may cause large errors or false results. In this paper, we study the effect of a missing data recovery method, namely the pseudo-nearest-neighbor substitution approach, on Gaussian distributed data sets that represent typical cases in data discrimination and data mining applications. The error rate of the proposed recovery method is evaluated by comparing the clustering results of the recovered data sets to the clustering results obtained on the originally complete data sets. The results are also compared with …
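
A hedged sketch of the baseline idea: fill each missing value from the closest complete record, measuring distance over the observed attributes only. The paper's "pseudo" variant refines this; the function below is our simplification.

    import numpy as np

    def nn_impute(X):
        X = np.asarray(X, dtype=float)
        complete = X[~np.isnan(X).any(axis=1)]  # fully observed records
        out = X.copy()
        for i, row in enumerate(X):
            mask = np.isnan(row)
            if not mask.any():
                continue
            # Distance computed only over this row's observed attributes.
            d = np.linalg.norm(complete[:, ~mask] - row[~mask], axis=1)
            out[i, mask] = complete[np.argmin(d), mask]
        return out

    X = [[1.0, 2.0, 3.0], [1.1, np.nan, 2.9], [5.0, 6.0, 7.0]]
    print(nn_impute(X))  # the NaN is filled from the nearest complete row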


An Iterative Initial-Points Refinement Algorithm For Categorical Data Clustering, Ying Sun, Qiuming Zhu, Zhengxin Chen May 2002

Computer Science Faculty Publications

The original k-means clustering algorithm is designed to work primarily on numeric data sets. This prohibits the algorithm from being directly applied to categorical data clustering in many data mining applications. The k-modes algorithm [Z. Huang, Clustering large data sets with mixed numeric and categorical value, in: Proceedings of the First Pacific Asia Knowledge Discovery and Data Mining Conference. World Scientific, Singapore, 1997, pp. 21–34] extended the k-means paradigm to cluster categorical data by using a frequency-based method to update the cluster modes versus the k-means fashion of minimizing a numerically valued cost. However, as is …
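For reference, the k-modes loop the paper builds on replaces Euclidean distance with a simple matching (Hamming) measure and updates each cluster's representative by per-attribute frequency. A minimal sketch of that base algorithm (the paper's contribution, refining the initial points, is omitted):

    from collections import Counter
    import random

    def hamming(a, b):
        # Simple matching distance for categorical records.
        return sum(x != y for x, y in zip(a, b))

    def mode_of(cluster):
        # Frequency-based mode: most common value in each attribute column.
        return tuple(Counter(col).most_common(1)[0][0] for col in zip(*cluster))

    def k_modes(records, k, iters=10, seed=0):
        random.seed(seed)
        modes = random.sample(records, k)   # naive initial points
        clusters = [[] for _ in range(k)]
        for _ in range(iters):
            clusters = [[] for _ in range(k)]
            for r in records:
                clusters[min(range(k), key=lambda j: hamming(r, modes[j]))].append(r)
            modes = [mode_of(c) if c else modes[j] for j, c in enumerate(clusters)]
        return modes, clusters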


Data Mining Feature Subset Weighting And Selection Using Genetic Algorithms, Okan Yilmaz Mar 2002

Theses and Dissertations

We present a simple genetic algorithm (sGA), developed within the Genetic Rule and Classifier Construction Environment (GRaCCE), to solve the feature subset selection and weighting problem and achieve better classification accuracy with the k-nearest neighbor (KNN) algorithm. Our hypotheses are that weighting the features will affect the performance of the KNN algorithm and will yield a better classification accuracy rate than binary feature selection. The weighted-sGA algorithm uses real-valued chromosomes to find the weights for features, while the binary-sGA uses integer-valued chromosomes to select the subset of features from the original feature set. A repair algorithm is developed for the weighted-sGA algorithm to guarantee …
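
A sketch of the weighted-KNN evaluation such a GA drives: one real-valued weight per feature scales that feature's contribution to the distance. The chromosome encoding and the repair step are omitted, and all names here are ours.

    import math

    def weighted_dist(a, b, w):
        # Per-feature weights scale each dimension's contribution.
        return math.sqrt(sum(wi * (x - y) ** 2 for wi, x, y in zip(w, a, b)))

    def knn_predict(train, labels, x, w, k=3):
        ranked = sorted(range(len(train)),
                        key=lambda i: weighted_dist(train[i], x, w))
        top = [labels[i] for i in ranked[:k]]
        return max(set(top), key=top.count)  # majority vote

    # A GA individual is just the weight vector w; its fitness is the
    # accuracy knn_predict achieves on held-out data with those weights.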


A Tool For Phylogenetic Data Cleaning And Searching, Viswanath Neelavalli Jan 2002

Theses

Data collection and cleaning are very important parts of an elaborate data mining system. 'TreeBASE' is a relational database of phylogenetic information at Harvard University with a keyword-based search interface. 'TreeSearch' is a structure-based search engine implemented at NJIT that can be used for searching phylogenetic data. Phylogenetic trees are extracted from the flat-file database at Harvard University, available at {ftp://herbaria.harvard.edu/pub/piel/Data/files/}. There is a huge amount of information present in the files about the trees and the data matrices from which the trees are generated. The search tool implemented at NJIT aims to use the string …


Studying The Functional Genomics Of Stress Responses In Loblolly Pine With The Expresso Microarray Experiment Management System, Lenwood S. Heath, Naren Ramakrishnan, Ronald R. Sederoff, Ross W. Whetten, Boris I. Chevone, Craig Struble, Vincent Y. Jouenne, Dawei Chen, Leonel Van Zyl, Ruth Grene Jan 2002

Mathematics, Statistics and Computer Science Faculty Research and Publications

Conception, design, and implementation of cDNA microarray experiments present a variety of bioinformatics challenges for biologists and computational scientists. The multiple stages of data acquisition and analysis have motivated the design of Expresso, a system for microarray experiment management. Salient aspects of Expresso include support for clone replication and randomized placement; automatic gridding, extraction of expression data from each spot, and quality monitoring; flexible methods of combining data from individual spots into information about clones and functional categories; and the use of inductive logic programming for higher-level data analysis and mining. The development of Expresso is occurring in parallel with …


Data Warehouse Applications In Modern Day Business, Carla Mounir Issa Jan 2002

Theses Digitization Project

Data warehousing provides organizations with strategic tools to achieve the competitive advantage that organizations are constantly seeking. The use of tools such as data mining, indexing, and summaries enables management to retrieve information and perform thorough analysis, planning, and forecasting to meet changes in the market environment. In addition, the data warehouse provides security measures that, if properly implemented and planned, help organizations ensure that their data quality and validity remain intact.


Predictive Self-Organizing Networks For Text Categorization, Ah-Hwee Tan Apr 2001

Research Collection School Of Computing and Information Systems

This paper introduces a class of predictive self-organizing neural networks known as Adaptive Resonance Associative Map (ARAM) for classification of free-text documents. Whereas most statistical approaches to text categorization derive classification knowledge based on training examples alone, ARAM performs supervised learning and integrates user-defined classification knowledge in the form of IF-THEN rules. Through our experiments on the Reuters-21578 news database, we showed that ARAM performed reasonably well in mining categorization knowledge from sparse and high dimensional document feature space. In addition, ARAM predictive accuracy and learning efficiency can be improved by incorporating a set of rules derived from …


Knowledge Discovery In Biological Databases : A Neural Network Approach, Qicheng Ma Aug 2000

Dissertations

Knowledge discovery in databases, also known as data mining, aims to find significant information in a set of data. The knowledge to be mined from the dataset may refer to patterns, association rules, classification and clustering rules, and so forth. In this dissertation, we present a neural network approach to finding knowledge in biological databases. Specifically, we propose new methods to process biological sequences in two case studies: the classification of protein sequences and the prediction of E. coli promoters in DNA sequences. Our proposed methods, based on neural network architectures, combine techniques ranging from Bayesian inference, coding theory, …


Clouds: A Decision Tree Classifier For Large Datasets, Khaled Alsabti, Sanjay Ranka, Vineet Singh Jan 1998

Electrical Engineering and Computer Science - All Scholarship

Classification for very large datasets has many practical applications in data mining. Techniques such as discretization and dataset sampling can be used to scale up decision tree classifiers to large datasets. Unfortunately, both of these techniques can cause a significant loss in accuracy. We present a novel decision tree classifier called CLOUDS, which samples the splitting points for numeric attributes, followed by an estimation step to narrow the search space of the best split. CLOUDS reduces computation and I/O complexity substantially compared to state-of-the-art classifiers, while maintaining the quality of the generated trees in terms of accuracy …
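
A hedged sketch of the split-sampling idea: evaluate the Gini index only at a handful of sampled quantile points of a numeric attribute instead of at every distinct value. The estimation step that narrows the search around the best sampled split, which is CLOUDS's distinctive contribution, is omitted here, and the quantile grid is our choice.

    import numpy as np

    def gini(labels):
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return 1.0 - np.sum(p ** 2)

    def best_sampled_split(x, y, q=8):
        order = np.argsort(x)
        x, y = np.asarray(x)[order], np.asarray(y)[order]
        # Candidate thresholds at q quantiles rather than all values.
        candidates = np.quantile(x, np.linspace(0.1, 0.9, q))
        best = (None, float("inf"))
        for t in candidates:
            left, right = y[x <= t], y[x > t]
            if len(left) == 0 or len(right) == 0:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if score < best[1]:
                best = (t, score)
        return best  # (threshold, weighted Gini)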