Open Access. Powered by Scholars. Published by Universities.®

Physical Sciences and Mathematics Commons

Articles 1 - 6 of 6

Full-Text Articles in Physical Sciences and Mathematics

A Pseudo Nearest-Neighbor Approach For Missing Data Recovery On Gaussian Random Data Sets, Xiaolu Huang, Qiuming Zhu Nov 2002

Computer Science Faculty Publications

Missing data handling is an important preparation step for most data discrimination or mining tasks. Inappropriate treatment of missing data may cause large errors or false results. In this paper, we study the effect of a missing data recovery method, namely the pseudo-nearest neighbor substitution approach, on Gaussian distributed data sets that represent typical cases in data discrimination and data mining applications. The error rate of the proposed recovery method is evaluated by comparing the clustering results of the recovered data sets to the clustering results obtained on the originally complete data sets. The results are also compared with …
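As an illustration of the general idea of nearest-neighbor substitution, here is a minimal Python sketch. It is not the paper's pseudo-nearest-neighbor method, whose details are not given in this excerpt. It assumes missing values are marked with `None`, that at least one row is complete, and that distance is Euclidean over the coordinates observed in the incomplete row. The function name `nn_impute` is hypothetical.

```python
import math

def nn_impute(rows):
    """Fill None entries in each row from its nearest complete row.

    Assumption: plain nearest-neighbor substitution with Euclidean
    distance over the observed coordinates only.
    """
    complete = [r for r in rows if None not in r]
    if not complete:
        raise ValueError("need at least one complete row")
    result = []
    for row in rows:
        if None not in row:
            result.append(list(row))
            continue
        observed = [i for i, v in enumerate(row) if v is not None]

        def dist(candidate):
            # Compare only on coordinates the incomplete row actually has.
            return math.sqrt(sum((row[i] - candidate[i]) ** 2 for i in observed))

        nearest = min(complete, key=dist)
        # Copy the neighbor's value wherever this row is missing one.
        result.append([nearest[i] if v is None else v for i, v in enumerate(row)])
    return result

data = [[1.0, 2.0], [5.0, 6.0], [1.1, None]]
filled = nn_impute(data)
```

In this toy input the third row is closest to the first on its observed coordinate, so its missing value is copied from that row.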


An Iterative Initial-Points Refinement Algorithm For Categorical Data Clustering, Ying Sun, Qiuming Zhu, Zhengxin Chen May 2002

Computer Science Faculty Publications

The original k-means clustering algorithm is designed to work primarily on numeric data sets. This prohibits the algorithm from being directly applied to categorical data clustering in many data mining applications. The k-modes algorithm [Z. Huang, Clustering large data sets with mixed numeric and categorical value, in: Proceedings of the First Pacific Asia Knowledge Discovery and Data Mining Conference. World Scientific, Singapore, 1997, pp. 21–34] extended the k-means paradigm to cluster categorical data by using a frequency-based method to update the cluster modes versus the k-means fashion of minimizing a numerically valued cost. However, as is …
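The frequency-based mode update mentioned in the abstract can be sketched roughly as follows. This is a simplified illustration of the k-modes idea, not the refinement algorithm the paper proposes. It assumes records are tuples of categorical values and uses simple matching (count of differing attributes) as the dissimilarity. The function names are hypothetical.

```python
from collections import Counter

def matching_distance(a, b):
    """Simple matching dissimilarity: number of attributes that differ."""
    return sum(x != y for x, y in zip(a, b))

def cluster_mode(records):
    """Return the per-attribute most frequent value of a cluster.

    This plays the role that the mean plays in k-means.
    """
    columns = zip(*records)
    return tuple(Counter(col).most_common(1)[0][0] for col in columns)

cluster = [("red", "small"), ("red", "large"), ("blue", "small")]
mode = cluster_mode(cluster)
```

Here each attribute of the mode is chosen independently, so the mode need not coincide with any single record in the cluster.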


Data Mining Feature Subset Weighting And Selection Using Genetic Algorithms, Okan Yilmaz Mar 2002

Theses and Dissertations

We present a simple genetic algorithm (sGA), developed under the Genetic Rule and Classifier Construction Environment (GRaCCE), to solve the feature subset selection and weighting problem and thereby improve the classification accuracy of the k-nearest neighbor (KNN) algorithm. Our hypotheses are that weighting the features will affect the performance of the KNN algorithm and will yield a better classification accuracy rate than that of binary classification. The weighted-sGA algorithm uses real-value chromosomes to find the weights for the features, and the binary-sGA algorithm uses integer-value chromosomes to select the subset of features from the original feature set. A Repair algorithm is developed for the weighted-sGA algorithm to guarantee …
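To show how feature weights can change a KNN decision, here is a small Python sketch. It covers only the weighted distance and the vote, not the genetic search or the Repair step described in the thesis. The weight vectors are hand-picked for illustration rather than evolved, and the function names are hypothetical. A weight vector containing only 0 and 1 corresponds to binary feature selection.

```python
import math
from collections import Counter

def weighted_distance(a, b, weights):
    """Euclidean distance with a non-negative weight per feature."""
    return math.sqrt(sum(w * (x - y) ** 2 for x, y, w in zip(a, b, weights)))

def knn_predict(train, query, weights, k=3):
    """Majority vote among the k training points nearest to query."""
    neighbors = sorted(
        train, key=lambda item: weighted_distance(item[0], query, weights)
    )[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

train = [((0.0, 0.0), "a"), ((1.0, 5.0), "b")]
query = (0.9, 0.5)

# Keeping only the first feature and keeping only the second one
# lead to different nearest neighbors for the same query.
first_only = knn_predict(train, query, (1.0, 0.0), k=1)
second_only = knn_predict(train, query, (0.0, 1.0), k=1)
```

In a genetic-algorithm setting, a chromosome would encode the weight vector and its fitness could be the classification accuracy obtained with those weights on held-out data.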


A Tool For Phylogenetic Data Cleaning And Searching, Viswanath Neelavalli Jan 2002

Theses

Data collection and cleaning are a very important part of an elaborate data mining system. 'TreeBASE' is a relational database of phylogenetic information at Harvard University with a keyword-based searching interface. 'TreeSearch' is a structure-based search engine implemented at NJIT that can be used for searching phylogenetic data. Phylogenetic trees are extracted from the flat-file database at Harvard University, available at {ftp://herbaria.harvard.edu/pub/piel/Data/files/}. There is a huge amount of information present in the files about the trees and the data matrices from which the trees are generated. The search tool implemented at NJIT is interested in using the string …


Studying The Functional Genomics Of Stress Responses In Loblolly Pine With The Expresso Microarray Experiment Management System, Lenwood S. Heath, Naren Ramakrishnan, Ronald R. Sederoff, Ross W. Whetten, Boris I. Chevone, Craig Struble, Vincent Y. Jouenne, Dawei Chen, Leonel Van Zyl, Ruth Grene Jan 2002

Mathematics, Statistics and Computer Science Faculty Research and Publications

Conception, design, and implementation of cDNA microarray experiments present a variety of bioinformatics challenges for biologists and computational scientists. The multiple stages of data acquisition and analysis have motivated the design of Expresso, a system for microarray experiment management. Salient aspects of Expresso include support for clone replication and randomized placement; automatic gridding, extraction of expression data from each spot, and quality monitoring; flexible methods of combining data from individual spots into information about clones and functional categories; and the use of inductive logic programming for higher-level data analysis and mining. The development of Expresso is occurring in parallel with …


Data Warehouse Applications In Modern Day Business, Carla Mounir Issa Jan 2002

Theses Digitization Project

Data warehousing provides organizations with strategic tools to achieve the competitive advantage that organizations are constantly seeking. The use of tools such as data mining, indexing and summaries enables management to retrieve information and perform thorough analysis, planning and forecasting to meet the changes in the market environment. In addition, the data warehouse provides security measures that, if properly planned and implemented, help organizations ensure that their data quality and validity remain intact.