Open Access. Powered by Scholars. Published by Universities.®

Physical Sciences and Mathematics Commons

Databases and Information Systems

Singapore Management University

Clustering

Articles 1 - 20 of 20

Full-Text Articles in Physical Sciences and Mathematics

Measuring Data Collection Diligence For Community Healthcare, Galawala Ramesha Samurdhi Karunasena, M. S. Ambiya, Arunesh Sinha, R. Nagar, S. Dalal, Abdullah. H., D. Thakkar, D. Narayanan, M. Tambe Oct 2021

Research Collection School Of Computing and Information Systems

Data analytics has tremendous potential to provide targeted benefit in low-resource communities; however, the availability of high-quality public health data is a significant challenge in developing countries, primarily due to non-diligent data collection by community health workers (CHWs). Our use of the word non-diligence here is to emphasize that poor data collection is often not a deliberate action by the CHW but arises due to a myriad of factors, sometimes beyond the control of the CHW. In this work, we define and test a data collection diligence score. This challenging unlabeled data problem is handled by building upon domain expert’s guidance …


Robust Graph Learning From Noisy Data, Zhao Kang, Haiqi Pan, Steven C. H. Hoi, Zenglin Xu May 2020

Research Collection School Of Computing and Information Systems

Learning graphs from data automatically has shown encouraging performance on clustering and semi-supervised learning tasks. However, real data are often corrupted, which may cause the learned graph to be inexact or unreliable. In this paper, we propose a novel robust graph learning scheme to learn reliable graphs from real-world noisy data by adaptively removing noise and errors in the raw data. We show that our proposed model can also be viewed as a robust version of manifold regularized robust principal component analysis (RPCA), where the quality of the graph plays a critical role. The proposed model is able to …


Salience-Aware Adaptive Resonance Theory For Large-Scale Sparse Data Clustering, Lei Meng, Ah-Hwee Tan, Chunyan Miao Dec 2019

Research Collection School Of Computing and Information Systems

Sparse data is known to pose challenges to cluster analysis, as the similarity between data tends to be ill-posed in the high-dimensional Hilbert space. Solutions in the literature typically extend either k-means or spectral clustering with additional steps on representation learning and/or feature weighting. However, adding these usually introduces new parameters and increases computational cost, thus inevitably lowering the robustness of these algorithms when handling massive ill-represented data. To alleviate these issues, this paper presents a class of self-organizing neural networks, called the salience-aware adaptive resonance theory (SA-ART) model. SA-ART extends Fuzzy ART with measures for cluster-wise salient feature modeling. …


Topicsummary: A Tool For Analyzing Class Discussion Forums Using Topic Based Summarizations, Swapna Gottipati, Venky Shankararaman, Renjini Ramesh Oct 2019

Research Collection School Of Computing and Information Systems

This Innovative Practice full paper describes the application of text mining techniques for extracting insights from a course-based online discussion forum through the generation of topic-based summaries. Discussions, whether in the classroom or online, provide an opportunity for collaborative learning through an exchange of ideas that leads to enhanced learning through active participation. Online discussions offer a number of benefits, namely providing additional time to reflect and synthesize information before writing, providing a natural platform for students to voice their ideas without any one student dominating the conversation, and providing a record of the student’s thoughts. An online discussion forum provides a …


REDPC: A Residual Error-Based Density Peak Clustering Algorithm, Milan Parmar, Di Wang, Xiaofeng Zhang, Ah-Hwee Tan, Chunyan Miao, You Zhou Jul 2019

Research Collection School Of Computing and Information Systems

The density peak clustering (DPC) algorithm was designed to identify arbitrary-shaped clusters by finding density peaks in the underlying dataset. Owing to its relatively low computational complexity and small number of control parameters, DPC soon became widely adopted. However, because DPC takes the entire data space into consideration when computing local density, which is then used to generate a decision graph for identifying cluster centroids, DPC may have difficulty differentiating overlapping clusters and dealing with low-density data points. In this paper, we propose a residual error-based density peak clustering …


CURE: Flexible Categorical Data Representation By Hierarchical Coupling Learning, Songlei Jian, Guansong Pang, Longbing Cao, Kai Lu, Hang Gao May 2019

Research Collection School Of Computing and Information Systems

The representation of categorical data with hierarchical value coupling relationships (i.e., various value-to-value cluster interactions) is critical yet challenging for capturing complex data characteristics in learning tasks. This paper proposes a novel and flexible coupled unsupervised categorical data representation (CURE) framework, which not only captures the hierarchical couplings but is also flexible enough to be instantiated for contrastive learning tasks. CURE first learns the value clusters of different granularities based on multiple value coupling functions and then learns the value representation from the couplings between the obtained value clusters. With two complementary value coupling functions, CURE is instantiated into …


Using Smart Card Data To Model Commuters’ Responses Upon Unexpected Train Delays, Xiancai Tian, Baihua Zheng Dec 2018

Research Collection School Of Computing and Information Systems

The mass rapid transit (MRT) network is playing an increasingly important role in Singapore's transit network, thanks to its advantages of higher capacity and faster speed. Unfortunately, due to aging infrastructure, increasing demand, and other reasons such as adverse weather conditions, commuters in Singapore have recently been facing an increasing number of unexpected train delays (UTDs), which have become a source of frustration for both commuters and operators. Most, if not all, existing works on delay management do not consider commuters' behavior. We dedicate this paper to the study of commuters' behavior during UTDs. We adopt a data-driven approach to analyzing the six months' real …


Exploiting The Interdependency Of Land Use And Mobility For Urban Planning, Kasthuri Jayarajah, Andrew Tan, Archan Misra Oct 2018

Research Collection School Of Computing and Information Systems

Urban planners and economists alike have a strong interest in understanding the interdependency of land use and people flow. The two-pronged problem entails systematic modeling and understanding of how land use impacts crowd flow to an area and, in turn, how the influx of people to an area (or lack thereof) can influence the viability of business entities in that area. With cities becoming increasingly sensor-rich, for example, through digitized payments for public transportation and constant trajectory tracking of buses and taxis, understanding and modelling crowd flows at the city scale, as well as at finer granularity such as at the neighborhood …


How Does Developer Interaction Relate To Software Quality? An Examination Of Product Development Data, Subhajit Datta Jun 2018

Research Collection School Of Computing and Information Systems

Industrial software systems are being increasingly developed by large and distributed teams. Tools like collaborative development environments (CDE) are used to facilitate interaction between members of such teams, with the expectation that social factors around the interaction would facilitate team functioning. In this paper, we first identify typically social characteristics of interaction in a software development team: reachability, connection, association, and clustering. We then examine how these factors relate to the quality of software produced by a team, in terms of the number of defects, through an empirical study of 70+ teams, involving 900+ developers in total, spread across 30+ …


A Novel Density Peak Clustering Algorithm Based On Squared Residual Error, Milan Parmar, Di Wang, Ah-Hwee Tan, Chunyan Miao, Jianhua Jiang, You Zhou Dec 2017

Research Collection School Of Computing and Information Systems

The density peak clustering (DPC) algorithm is designed to quickly identify intricate-shaped clusters with high dimensionality by finding high-density peaks in a non-iterative manner and using only one threshold parameter. However, DPC has certain limitations in processing low-density data points because it only takes the global data density distribution into account. As such, DPC may be limited in forming low-density data clusters; in other words, DPC may fail to detect anomalies and borderline points. In this paper, we analyze the limitations of DPC and propose a novel density peak clustering algorithm to better handle low-density clustering tasks. Specifically, our algorithm …


Adaptive Scaling Of Cluster Boundaries For Large-Scale Social Media Data Clustering, Lei Meng, Ah-Hwee Tan, Donald C. Wunsch Dec 2015

Research Collection School Of Computing and Information Systems

The large scale and complex nature of social media data raise the need to scale clustering techniques to big data and make them capable of automatically identifying data clusters with few empirical settings. In this paper, we present our investigation and three algorithms based on the fuzzy adaptive resonance theory (Fuzzy ART) that have linear computational complexity, use a single parameter, i.e., the vigilance parameter, to identify data clusters, and are robust to modest parameter settings. The contribution of this paper lies in two aspects. First, we theoretically demonstrate how complement coding, commonly known as a normalization method, changes the …


Online Multimodal Co-Indexing And Retrieval Of Weakly Labeled Web Image Collections, Lei Meng, Ah-Hwee Tan, Cyril Leung, Liqiang Nie, Tan-Seng Chua, Chunyan Miao Jun 2015

Research Collection School Of Computing and Information Systems

Weak supervisory information of web images, such as captions, tags, and descriptions, makes it possible to better understand images at the semantic level. In this paper, we propose a novel online multimodal co-indexing algorithm based on Adaptive Resonance Theory, named OMC-ART, for the automatic co-indexing and retrieval of images using their multimodal information. Compared with existing studies, OMC-ART has several distinct characteristics. First, OMC-ART is able to perform online learning of sequential data. Second, OMC-ART builds a two-layer indexing structure, in which the first layer co-indexes the images by the key visual and textual features based on the generalized distributions …


Dynamic Clustering Of Contextual Multi-Armed Bandits, Trong T. Nguyen, Hady W. Lauw Nov 2014

Research Collection School Of Computing and Information Systems

With the prevalence of the Web and social media, users increasingly express their preferences online. In learning these preferences, recommender systems need to balance the trade-off between exploitation, by providing users with more of the "same", and exploration, by providing users with something "new" so as to expand the systems' knowledge. Multi-armed bandit (MAB) is a framework to balance this trade-off. Most of the previous work in MAB either models a single bandit for the whole population, or one bandit for each user. We propose an algorithm to divide the population of users into multiple clusters, and to customize the …


Scalable Visual Instance Mining With Threads Of Features, Wei Zhang, Hongzhi Li, Chong-Wah Ngo, Shih-Fu Chang Nov 2014

Research Collection School Of Computing and Information Systems

We address the problem of visual instance mining, which is to extract frequently appearing visual instances automatically from a multimedia collection. We propose a scalable mining method by exploiting Thread of Features (ToF). Specifically, ToF, a compact representation that links consistent features across images, is extracted to reduce noise, discover patterns, and speed up processing. Various instances, especially small ones, can be discovered by exploiting correlated ToFs. Our approach is significantly more effective than other methods in mining small instances. At the same time, it is also more efficient, requiring far fewer hash tables. We compared with several state-of-the-art …


Extracting And Normalizing Entity-Actions From Users' Comments, Swapna Gottipati, Jing Jiang Dec 2012

Research Collection School Of Computing and Information Systems

With the growing popularity of opinion-rich resources on the Web, new opportunities and challenges arise in helping people actively use such information to understand the opinions of others. The opinion mining process currently focuses on extracting the sentiments of users on products and on social, political, and economic issues. In many instances, users not only express their sentiments but also contribute their ideas, requests, and suggestions through comments. Such comments are useful for domain experts and are referred to as actionable content. Extracting actionable knowledge from online social media has attracted growing interest from both academia and industry. We …


A Generalized Cluster Centroid Based Classifier For Text Categorization, Guansong Pang, Shengyi Jiang Nov 2012

Research Collection School Of Computing and Information Systems

In this paper, a Generalized Cluster Centroid based Classifier (GCCC) and its variants for text categorization are proposed by utilizing a clustering algorithm to integrate two well-known classifiers, i.e., the K-nearest-neighbor (KNN) classifier and the Rocchio classifier. KNN, a lazy learning method, suffers from inefficiency in online categorization while achieving remarkable effectiveness. Rocchio, which has efficient categorization performance, fails to obtain an expressive categorization model due to its inherent linear separability assumption. Our proposed method mainly focuses on two points: one is that we use a clustering algorithm to strengthen the expressiveness of the Rocchio model; the other is …


The Social Network Of Software Engineering Research, Subhajit Datta, Nishant Kumar, Santonu Sarkar Feb 2012

Research Collection School Of Computing and Information Systems

The social network perspective has served as a useful framework for studying scientific research collaboration in different disciplines. Although collaboration in computer science research has received some attention, software engineering research collaboration has remained unexplored to a large extent. In this paper, we examine the collaboration networks based on co-authorship information of papers from ten software engineering publication venues over the 1976-2010 time period. We compare time variations of certain parameters of these networks with corresponding parameters of collaboration networks from other disciplines. We also explore whether software engineering collaboration networks manifest symptoms of the small-world phenomenon, conform to the …


Multi-Order Neurons For Evolutionary Higher Order Clustering And Growth, Kiruthika Ramanathan, Sheng Uei Guan Dec 2007

Research Collection School Of Computing and Information Systems

This letter proposes to use multiorder neurons for clustering irregularly shaped data arrangements. Multiorder neurons are an evolutionary extension of the use of higher-order neurons in clustering. Higher-order neurons parametrically model complex neuron shapes by replacing the classic synaptic weight with higher-order tensors. The multiorder neuron goes one step further and eliminates two problems associated with higher-order neurons. First, it uses evolutionary algorithms to select the best neuron order for a given problem. Second, it obtains more information about the underlying data distribution by identifying the correct order for a given cluster of patterns. Empirically, we observed that when the …


Towards Personalised Web Intelligence, Ah-Hwee Tan, Hwee-Leng Ong, Hong Pan, Jamie Ng, Qiu-Xiang Li Sep 2004

Research Collection School Of Computing and Information Systems

The Flexible Organizer for Competitive Intelligence (FOCI) is a personalised web intelligence system that provides an integrated platform for gathering, organising, tracking, and disseminating competitive information on the web. FOCI builds personalised information portfolios through a novel method called User-Configurable Clustering, which allows a user to personalise his/her portfolios in terms of the content as well as the organisational structure. This paper outlines the key challenges we face in personalised information management and gives a detailed account of FOCI’s underlying personalisation mechanism. For a quantitative evaluation of the system’s performance, we propose a set of performance indices based on information …


Modified ART 2A Growing Network Capable Of Generating A Fixed Number Of Nodes, Ji He, Ah-Hwee Tan, Chew-Lim Tan May 2004

Research Collection School Of Computing and Information Systems

This paper introduces the Adaptive Resonance Theory under Constraint (ART-C 2A) learning paradigm based on ART 2A, which is capable of generating a user-defined number of recognition nodes through online estimation of an appropriate vigilance threshold. Empirical experiments quantitatively compare the cluster validity and the learning efficiency of ART-C 2A with those of ART 2A, as well as three closely related clustering methods, namely online K-Means, batch K-Means, and SOM. Besides retaining the online cluster creation capability of ART 2A, ART-C 2A gives an alternative clustering solution, which allows direct control over the number of output …