Open Access. Powered by Scholars. Published by Universities.®

Computer Sciences Commons

Data mining

2018

Articles 1 - 24 of 24

Full-Text Articles in Computer Sciences

Data Mining Approach To The Detection Of Suicide In Social Media: A Case Study Of Singapore, Jane H. K. Seah, Kyong Jin Shim Dec 2018

Research Collection School Of Computing and Information Systems

In this research, we focus on the social phenomenon of suicide. Specifically, we perform social sensing on digital traces obtained from Reddit. We analyze the posts and comments that are related to depression and suicide. We perform natural language processing to better understand different aspects of human life that relate to suicide.


Multidimensional Feature Engineering For Post-Translational Modification Prediction Problems, Norman Mapes Jr. Nov 2018

Doctoral Dissertations

Protein sequence data has been produced at an astounding speed. This creates an opportunity to characterize these proteins for the treatment of illness. A crucial characterization of proteins is their post-translational modifications (PTMs). There are 20 amino acids coded by DNA; after coding (translation), nearly every protein is modified at the amino acid level. We focus on three specific PTMs. First is the bonding formed between two cysteine amino acids, which introduces a loop into the straight chain of a protein. Second, we predict which cysteines can generally be modified (oxidized). Finally, we predict which lysine amino acids are …


Enhancing Value-Based Healthcare With Reconstructability Analysis: Predicting Cost Of Care In Total Hip Replacement, Cecily Corrine Froemke, Martin Zwick Nov 2018

Systems Science Faculty Publications and Presentations

Legislative reforms aimed at slowing the growth of US healthcare costs are focused on achieving greater value per dollar. To increase value, healthcare providers must not only provide high-quality care, but also deliver this care at a sustainable cost. Predicting risks that may lead to poor outcomes and higher costs enables providers to augment decision making for optimizing patient care and informs the risk stratification necessary in emerging reimbursement models. Healthcare delivery systems are looking at their high-volume service lines and identifying variation in cost and outcomes in order to determine the patient factors that are driving this variation and …


Analyzing And Modeling Users In Multiple Online Social Platforms, Roy Lee Ka Wei Nov 2018

Dissertations and Theses Collection (Open Access)

This dissertation addresses the empirical analysis of user-generated data from multiple online social platforms (OSPs) and the modeling of latent user factors in a multiple-OSP setting.

In the first part of this dissertation, we conducted cross-platform empirical studies to better understand users' social and work activities in multiple OSPs. In particular, we proposed new methodologies to analyze users' friendship maintenance and collaborative activities in multiple OSPs. We also applied the proposed methodologies to real-world OSP datasets, and the findings from our empirical studies have provided us with a better understanding of users' social and work activities which were previously not uncovered …


Malware Analysis On Android Using Supervised Machine Learning Techniques, Md Shohel Rana, Andrew H. Sung Oct 2018

Faculty Publications

In recent years, the growth of malware has prompted widespread research in the domain of malware analysis and detection on Android devices. Android, a mobile operating system with more than one billion active users and a high market impact, has inspired the expansion of malware by cyber criminals. Android implements a distinct architecture and security controls to address the problems caused by malware, such as a unique user ID (UID) for each application, system permissions, and its distribution platform, Google Play. There are numerous ways to violate that fortification, and how the complexity of creating a …


Automating Intention Mining, Qiao Huang, Xin Xia, David Lo, Gail C. Murphy Oct 2018

Research Collection School Of Computing and Information Systems

Developers frequently discuss aspects of the systems they are developing online. The comments they post to discussions form a rich information source about the system. Intention mining, a process introduced by Di Sorbo et al., classifies sentences in developer discussions to enable further analysis. As one example of use, intention mining has been used to help build various recommenders for software developers. The technique introduced by Di Sorbo et al. to categorize sentences is based on linguistic patterns derived from two projects. The limited number of data sources used in this earlier work introduces questions about the comprehensiveness of intention …


Traffic-Cascade: Mining And Visualizing Lifecycles Of Traffic Congestion Events Using Public Bus Trajectories, Agus Trisnajaya Kwee, Meng-Fen Chiang, Philips Kokoh Prasetyo, Ee-Peng Lim Oct 2018

Research Collection School Of Computing and Information Systems

As road transportation supports both economic and social activities in developed cities, it is important to maintain smooth traffic on all highways and local roads. Whenever possible, traffic congestion should be detected early and resolved quickly. While existing traffic monitoring dashboard systems have been put in place in many cities, these systems require high-cost vehicle speed monitoring instruments and detect traffic congestion as independent events. There is a lack of low-cost dashboards to inspect and analyze the lifecycle of traffic congestion, which is critical in assessing the overall impact of congestion, determining the possible source(s) of congestion and its …


Introduction To Reconstructability Analysis, Martin Zwick Jul 2018

Systems Science Faculty Publications and Presentations

This talk will introduce Reconstructability Analysis (RA), a data modeling methodology deriving from the 1960s work of Ross Ashby and developed in the systems community in the 1980s and afterwards. RA, based on information theory and graph theory, is a member of the family of methods known as ‘graphical models,’ which also include Bayesian networks and log-linear techniques. It is designed for exploratory modeling, although it can also be used for confirmatory hypothesis testing. RA can discover high ordinality and nonlinear interactions that are not hypothesized in advance. Its conceptual framework illuminates the relationships between wholes and parts, a subject …


Preliminary Results Of Bayesian Networks And Reconstructability Analysis Applied To The Electric Grid, Marcus Harris, Martin Zwick Jul 2018

Systems Science Faculty Publications and Presentations

Reconstructability Analysis (RA) is an analytical approach developed in the systems community that combines graph theory and information theory. Graph theory provides the structure of relations (model of the data) between variables and information theory characterizes the strength and the nature of the relations. RA has three primary approaches to model data: variable based (VB) models without loops (acyclic graphs), VB models with loops (cyclic graphs) and state-based models (nearly always cyclic, individual states specifying model constraints). These models can either be directed or neutral. Directed models focus on a single response variable whereas neutral models focus on all relations …


Reconstructability & Dynamics Of Elementary Cellular Automata, Martin Zwick Jul 2018

Systems Science Faculty Publications and Presentations

Reconstructability analysis (RA) is a method to determine whether a multivariate relation, defined set- or information-theoretically, is decomposable with or without loss into lower ordinality relations. Set-theoretic RA (SRA) is used to characterize the mappings of elementary cellular automata. The decomposition possible for each mapping without loss is a better predictor of chaos than the λ parameter (Walker & Ashby, Langton), and non-decomposable mappings tend to produce chaos. SRA yields not only the simplest lossless structure but also a vector of losses for all structures, indexed by parameter τ. These losses are analogous to transmissions in information-theoretic RA (IRA). IRA …


Efficient Representative Subset Selection Over Sliding Windows, Yanhao Wang, Yuchen Li, Kian-Lee Tan Jul 2018

Research Collection School Of Computing and Information Systems

Representative subset selection (RSS) is an important tool for users to draw insights from massive datasets. Existing literature models RSS as submodular maximization to capture the "diminishing returns" property of representativeness, but often only has a single constraint, which limits its applications to many real-world problems. To capture the recency issue and support various constraints, we formulate dynamic RSS as maximizing submodular functions subject to general d-knapsack constraints (SMDK) over sliding windows. We propose a KnapWindow framework (KW) for SMDK. KW utilizes KnapStream (KS) for SMDK in append-only streams as a subroutine. It maintains a sequence of checkpoints and …
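
As a rough illustration of the problem setting only, the sketch below maximizes a simple submodular coverage function under d knapsack constraints with a generic offline greedy heuristic. It is not KnapWindow or KnapStream, it ignores streams and sliding windows, and it claims no approximation guarantee. The names `coverage` and `greedy_knapsack`, the toy items, and the assumption that every cost is strictly positive are all hypothetical.

```python
def coverage(selected, covers):
    """Submodular utility: number of distinct topics covered by the selection."""
    return len(set().union(*(covers[i] for i in selected)))


def greedy_knapsack(covers, costs, budgets):
    """Greedy heuristic for d knapsack constraints (illustrative only).

    Repeatedly adds the feasible item with the best marginal gain per unit
    of its largest budget-normalised cost. Assumes all costs are > 0.
    """
    selected, used = [], [0.0] * len(budgets)
    remaining = set(covers)
    while True:
        best, best_ratio = None, 0.0
        for i in sorted(remaining):
            if any(u + c > b for u, c, b in zip(used, costs[i], budgets)):
                continue  # item would violate one of the d constraints
            gain = coverage(selected + [i], covers) - coverage(selected, covers)
            ratio = gain / max(c / b for c, b in zip(costs[i], budgets))
            if ratio > best_ratio:
                best, best_ratio = i, ratio
        if best is None:
            return selected  # nothing feasible adds value
        selected.append(best)
        used = [u + c for u, c in zip(used, costs[best])]
        remaining.remove(best)


# Hypothetical items: the topics each one covers and its cost in d = 2 dimensions.
covers = {"a": {1, 2, 3}, "b": {3, 4}, "c": {5}, "d": {1, 2}}
costs = {"a": (2.0, 1.0), "b": (1.0, 1.0), "c": (1.0, 2.0), "d": (1.0, 1.0)}
budgets = (3.0, 3.0)
chosen = greedy_knapsack(covers, costs, budgets)
```

The "diminishing returns" property shows up directly in `coverage`: adding item "d" to an empty selection gains two topics, while adding it after "a" gains none.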


Clustering Method Based On Graph Data Model And Reliability Detection, Yanyun Cheng, Huisong Bian, Changsheng Bian Jun 2018

Journal of System Simulation

Abstract: For data in feature space, traditional clustering algorithms can perform cluster analysis directly. However, the clustering results for high-dimensional spatial data cannot be visualized intuitively and effectively in a 2D plane. Graph data can clearly reflect the similarity relationships between objects. According to the distances between data objects, the feature space data are iteratively modeled as graph data. Modularity-based cluster analysis is then carried out on the modeled graph data. Two-dimensional visualization of clusters of non-spherically distributed data, and of the clustering result, is achieved. The concept of credibility of the clustering result is proposed, and a method is proposed, …


Learning Latent Characteristics Of Locations Using Location-Based Social Networking Data, Thanh Nam Doan May 2018

Dissertations and Theses Collection (Open Access)

This dissertation addresses the modeling of latent characteristics of locations to describe the mobility of users of location-based social networking platforms. With many users signing up for location-based social networking platforms to share their daily activities, these platforms become a gold mine for researchers to study human visitation behavior and location characteristics. Modeling such visitation behavior and location characteristics can benefit many useful applications such as urban planning and location-aware recommender systems. In this dissertation, we focus on modeling two latent characteristics of locations, namely area attraction and neighborhood competition effects, using location-based social network data. Our literature survey …


Efficient Reduced Bias Genetic Algorithm For Generic Community Detection Objectives, Aditya Karnam Gururaj Rao Apr 2018

Theses

The problem of community structure identification has been an extensively investigated area for biology, physics, social sciences, and computer science in recent years for studying the properties of networks representing complex relationships. Most traditional methods, such as K-means and hierarchical clustering, are based on the assumption that communities have spherical configurations. Lately, Genetic Algorithms (GA) are being utilized for efficient community detection without imposing sphericity. GAs are machine learning methods which mimic natural selection and scale with the complexity of the network. However, traditional GA approaches employ a representation method that dramatically increases the solution space to be searched by …


The Algorithmic Composition Of Classical Music Through Data Mining, Tom Donald Richmond, Imad Rahal Apr 2018

All College Thesis Program, 2016-2019

The desire to teach a computer how to algorithmically compose music has been a topic in the world of computer science since the 1950s, with roots of computer-less algorithmic composition dating back to Mozart himself. One limitation of algorithmically composing music has been the difficulty of eliminating the human intervention required to achieve a musically homogeneous composition. We attempt to remedy this issue by teaching a computer how the rules of composition differ between the six distinct eras of classical music by having it examine a dataset of musical scores, rather than explicitly telling the computer the formal rules of …


Opportunity Identification For New Product Planning: Ontological Semantic Patent Classification, Farshad Madani Feb 2018

Dissertations and Theses

Intelligence tools have been developed and applied widely in many different areas in engineering, business and management. Many commercialized tools for business intelligence are available in the market. However, no practically useful tools for technology intelligence are available at this time, and very little academic research in technology intelligence methods has been conducted to date.

Patent databases are the most important data source for technology intelligence tools, but patents inherently contain unstructured data. Consequently, extracting text data from patent databases, converting that data to meaningful information and generating useful knowledge from this information become complex tasks. These tasks are currently …


Statistical Analysis Of Network Change, Teresa D. Schmidt, Martin Zwick Feb 2018

Systems Science Faculty Publications and Presentations

Networks are rarely subjected to hypothesis tests for difference, but when they are inferred from datasets of independent observations, statistical testing is feasible. To demonstrate, a healthcare provider network is tested for significant change after an intervention using Medicaid claims data. First, the network is inferred for each time period with (1) partial least squares (PLS) regression and (2) reconstructability analysis (RA). Second, network distance (i.e., change between time periods) is measured as the mean absolute difference in (1) coefficient matrices for PLS and (2) calculated probability distributions for RA. Third, the network distance is compared against a reference distribution …
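
The second step, network distance as a mean absolute difference, can be sketched as follows. This is a minimal illustration under assumptions, not the authors' code: the function name `network_distance` and the toy 2x2 matrices are hypothetical, the matrices are assumed to be non-empty, and the comparison against a reference distribution is not covered because the listing truncates that part of the abstract.

```python
def network_distance(a, b):
    """Mean absolute difference between two equally shaped, non-empty matrices."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("matrices must have the same shape")
    diffs = [abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb)]
    return sum(diffs) / len(diffs)


# Hypothetical coefficient matrices inferred for two time periods.
before = [[0.0, 0.5], [0.5, 0.0]]
after = [[0.0, 0.25], [0.75, 0.0]]
distance = network_distance(before, after)
```

The same function applies to two probability distributions written as one-row matrices.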


Classification Using Association Rules, Colin Kane Jan 2018

Dissertations

This research investigates the use of an unsupervised learning technique, association rules, to make class predictions. The use of association rules to make class predictions is a growing area of focus within data mining research. The research to date has focused predominantly on balanced datasets or synthesized imbalanced datasets. There have been concerns raised that the algorithms using association rules to make classifications do not perform well on imbalanced datasets. This research comprehensively evaluates the accuracy of a number of association rule classifiers in predicting home loan sales in an Irish retail banking context. The experiments designed test three associative …
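
To make the idea of classifying with association rules concrete, here is a minimal sketch that scores a rule of the form "antecedent implies class" by support and confidence, then predicts with the highest-confidence matching rule. It is a generic illustration, not one of the associative classifiers evaluated in the dissertation; the function names, the toy transactions, and the item labels are all hypothetical.

```python
def rule_stats(transactions, antecedent, label):
    """Return (support, confidence) of the rule antecedent -> label.

    transactions: list of (set_of_items, class_label) pairs.
    """
    antecedent = set(antecedent)
    matches = [cls for items, cls in transactions if antecedent <= items]
    hits = sum(1 for cls in matches if cls == label)
    support = hits / len(transactions)
    confidence = hits / len(matches) if matches else 0.0
    return support, confidence


def classify(rules, items, default):
    """Predict with the highest-confidence rule whose antecedent matches."""
    matching = [(conf, label) for antecedent, label, conf in rules
                if set(antecedent) <= items]
    return max(matching)[1] if matching else default


# Hypothetical customer records: (observed attributes, class label).
transactions = [
    ({"online", "young"}, "buy"),
    ({"online", "young"}, "buy"),
    ({"online"}, "no"),
    ({"branch"}, "no"),
]
support, confidence = rule_stats(transactions, {"online"}, "buy")

rules = [({"online", "young"}, "buy", 1.0), ({"branch"}, "no", 1.0)]
prediction = classify(rules, {"online", "young", "urban"}, default="no")
```

On an imbalanced dataset, rules for the minority class tend to have low support, which is one reason a fixed minimum-support threshold can discard them before they are ever used for prediction.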


Continuous Restricted Boltzmann Machines, Robert W. Harrison Jan 2018

EBCS Articles

Restricted Boltzmann machines are generative neural networks. They summarize their input data to build a probabilistic model that can then be used to reconstruct missing data or to classify new data. Unlike discrete Boltzmann machines, where the data are mapped to the space of integers or bitstrings, continuous Boltzmann machines directly use floating point numbers and therefore represent the data with higher fidelity. The primary limitation in using Boltzmann machines for big-data problems is the efficiency of the training algorithm. This paper describes an efficient deterministic algorithm for training continuous machines.


Real-Time Power System Dynamic Security Assessment Based On Advanced Feature Selection For Decision Tree Classifiers, Qusay Al-Gubri, Mohd Aifaa Mohd Ariff Jan 2018

Turkish Journal of Electrical Engineering and Computer Sciences

This paper proposes a novel algorithm based on an advanced feature selection technique for the decision tree (DT) classifier to assess the dynamic security in a power system. The proposed methodology utilizes symmetrical uncertainty (SU) to reduce the data redundancy in a dataset for DT classifier-based dynamic security assessment (DSA) tools. The results show that SU reduces the dimension of the dataset used for DSA significantly. Subsequently, the approach improves the performance of the DT classifier. The effectiveness of the proposed technique is demonstrated on the modified IEEE 30-bus test system model. The results show that the DT classifier with …
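
For reference, symmetrical uncertainty between two discrete variables is commonly defined as SU(X, Y) = 2 I(X; Y) / (H(X) + H(Y)), which ranges from 0 (independent) to 1 (each variable fully determines the other). The sketch below computes only this standard quantity. It does not reproduce the paper's selection procedure, it assumes the features have already been discretized, and the variable names are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def symmetrical_uncertainty(x, y):
    """SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)), a value between 0 and 1."""
    h_x, h_y = entropy(x), entropy(y)
    if h_x + h_y == 0:
        return 0.0  # both variables are constant
    h_xy = entropy(list(zip(x, y)))  # joint entropy H(X, Y)
    mutual_info = h_x + h_y - h_xy
    return 2.0 * mutual_info / (h_x + h_y)


# Hypothetical discretized feature columns and a security label.
label = [0, 0, 1, 1]
informative = [0, 0, 1, 1]   # mirrors the label exactly
irrelevant = [0, 1, 0, 1]    # statistically independent of the label
su_informative = symmetrical_uncertainty(informative, label)
su_irrelevant = symmetrical_uncertainty(irrelevant, label)
```

A feature whose SU with the class label is near zero carries little information about it, and two features with SU near one against each other are largely redundant.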


Exploratory Reconstructability Analysis Of Accident TBI Data, Martin Zwick, Nancy Ann Carney, Rosemary Nettleton Jan 2018

Systems Science Faculty Publications and Presentations

This paper describes the use of reconstructability analysis to perform a secondary study of traumatic brain injury data from automobile accidents. Neutral searches were done and their results displayed with a hypergraph. Directed searches, using both variable-based and state-based models, were applied to predict performance on two cognitive tests and one neurological test. Very simple state-based models gave large uncertainty reductions for all three DVs and sizeable improvements in percent correct for the two cognitive test DVs which were equally sampled. Conditional probability distributions for these models are easily visualized with simple decision trees. Confounding variables and counter-intuitive findings are …


Clicking Into Mortgage Arrears: A Study Into Arrears Prediction With Clickstream Data, Gavin O'Brien Jan 2018

Dissertations

This research project investigates the predictive capability of clickstream data when used for the purpose of mortgage arrears prediction. With an ever growing number of people switching to digital channels to handle their daily banking requirements, there is a wealth of ever increasing online usage data, otherwise known as clickstream data. If leveraged correctly, this clickstream data can be a powerful data source for organisations as it provides detailed information about how their customers are interacting with their digital channels. Much of the current literature associated with clickstream data relates to organisations employing it within their customer relationship management mechanisms …


An Efficient System For Subgraph Discovery, Aparna Joshi Jan 2018

Legacy Theses & Dissertations (2009 - 2024)

Subgraph discovery in a single data graph, that is, finding subsets of vertices and edges satisfying user-specified criteria, is an essential and general graph analytics operation with a wide spectrum of applications. Depending on the criteria, subgraphs of interest may correspond to cliques of friends in social networks, interconnected entities in RDF data, or frequent patterns in protein interaction networks, to name a few. Existing systems usually examine a large number of subgraphs while employing many computers and often produce an enormous result set of subgraphs. How can we enable fast discovery of only the most relevant subgraphs while minimizing the computational requirements?


Clinical Information Extraction From Unstructured Free-Texts, Mingzhe Tao Jan 2018

Legacy Theses & Dissertations (2009 - 2024)

Information extraction (IE) is a fundamental component of natural language processing (NLP) that provides a deeper understanding of texts. In the clinical domain, documents prepared by medical experts (e.g., discharge summaries, drug labels, medical history records) contain a significant amount of clinically relevant information that is crucial to the overall well-being of patients. Unfortunately, in many cases, clinically relevant information is presented in an unstructured format, predominantly consisting of free text, making it inaccessible to computerized methods. Automatic extraction of this information can improve accessibility. However, the presence of synonymous expressions, medical acronyms, misspellings, negated phrases, and ambiguous terminologies makes automatic extraction …