Open Access. Powered by Scholars. Published by Universities.®

Computer Sciences Commons

Computer Science Faculty Publications

Series

Data mining

Articles 1 - 3 of 3

Full-Text Articles in Computer Sciences

OSFS-Vague: Online Streaming Feature Selection Algorithm Based On A Vague Set, Jie Yang, Zhijun Wang, Guoyin Wang, Yanmin Liu, Yi He, Di Wu, Jan 2024

Online streaming feature selection (OSFS), an online learning approach for handling streaming features, is critical in addressing high-dimensional data. In real big-data applications, the patterns and distributions of streaming features change constantly over time because of dynamic data generation environments. However, existing OSFS methods rely on preset, fixed hyperparameters, which lead to poor selection performance when encountering dynamic features. To address these shortcomings, the authors propose a novel OSFS algorithm based on a vague set, named OSFS-Vague. Its main idea is to combine uncertainty and three-way decision theories to improve feature selection from the …


A Pseudo Nearest-Neighbor Approach For Missing Data Recovery On Gaussian Random Data Sets, Xiaolu Huang, Qiuming Zhu, Nov 2002

Missing data handling is an important preparation step for most data discrimination or mining tasks. Inappropriate treatment of missing data may cause large errors or false results. In this paper, we study the effect of a missing data recovery method, namely the pseudo-nearest-neighbor substitution approach, on Gaussian-distributed data sets that represent typical cases in data discrimination and data mining applications. The error rate of the proposed recovery method is evaluated by comparing the clustering results of the recovered data sets to the clustering results obtained on the originally complete data sets. The results are also compared with …
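
As a rough illustration of substitution-based recovery, the sketch below fills each missing value from the closest complete record. It is a plain nearest-neighbor substitution, not the paper's pseudo-nearest-neighbor variant, whose details are not given in the truncated abstract. The function name `nn_impute`, the use of `None` for missing values, and the sample rows are all assumptions made for this example.

```python
import math

def nn_impute(rows):
    """Fill each None with the value from the closest complete row.

    Distance is Euclidean and is measured only over the attributes
    that the incomplete row actually has (an assumption of this sketch).
    """
    complete = [r for r in rows if None not in r]
    filled = []
    for r in rows:
        if None not in r:
            filled.append(list(r))
            continue
        # Indices of the attributes that are present in this row.
        observed = [i for i, v in enumerate(r) if v is not None]
        # The donor is the complete row nearest on the observed attributes.
        donor = min(
            complete,
            key=lambda c: math.dist([r[i] for i in observed],
                                    [c[i] for i in observed]),
        )
        filled.append([donor[i] if v is None else v for i, v in enumerate(r)])
    return filled

# Hypothetical sample data: the third row is missing its second attribute.
rows = [[1.0, 2.0, 3.0], [8.0, 9.0, 10.0], [1.2, None, 2.9]]
recovered = nn_impute(rows)
```

Here the incomplete row is closest to the first row on its observed attributes, so it borrows that row's second value. An evaluation in the spirit of the abstract would then cluster `recovered` and the originally complete data and compare the two results.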


An Iterative Initial-Points Refinement Algorithm For Categorical Data Clustering, Ying Sun, Qiuming Zhu, Zhengxin Chen, May 2002

The original k-means clustering algorithm is designed to work primarily on numeric data sets. This prohibits the algorithm from being directly applied to categorical data clustering in many data mining applications. The k-modes algorithm [Z. Huang, Clustering large data sets with mixed numeric and categorical values, in: Proceedings of the First Pacific Asia Knowledge Discovery and Data Mining Conference. World Scientific, Singapore, 1997, pp. 21–34] extended the k-means paradigm to cluster categorical data by using a frequency-based method to update the cluster modes rather than the k-means fashion of minimizing a numerically valued cost. However, as is …
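
The frequency-based mode update mentioned above can be sketched as follows. This is a minimal illustration of one k-modes assignment and update step under stated assumptions, not the refinement algorithm proposed in the paper: it uses simple matching (Hamming) dissimilarity, and the helper names and toy records are hypothetical.

```python
from collections import Counter

def hamming(a, b):
    """Simple matching dissimilarity: number of attributes where a and b differ."""
    return sum(x != y for x, y in zip(a, b))

def update_mode(cluster):
    """Frequency-based mode: the most common category in each attribute column."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*cluster))

def assign(records, modes):
    """Index of the nearest mode for each record."""
    return [min(range(len(modes)), key=lambda j: hamming(r, modes[j]))
            for r in records]

# Hypothetical categorical records (color, size, shape).
records = [
    ("red", "small", "round"),
    ("red", "small", "square"),
    ("red", "large", "round"),
    ("blue", "large", "square"),
    ("blue", "large", "round"),
    ("blue", "small", "square"),
]

# Initial modes are picked by hand here; their choice affects the result.
modes = [records[0], records[3]]
labels = assign(records, modes)

# Recompute each mode from the records assigned to it.
new_modes = [
    update_mode([r for r, l in zip(records, labels) if l == j])
    for j in range(len(modes))
]
```

Because no mean exists for categorical attributes, each cluster center is the per-attribute most frequent category, which is the contrast with k-means that the abstract draws.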