Open Access. Powered by Scholars. Published by Universities.®

Computer Sciences Commons

Data mining

2014

Articles 1 - 29 of 29

Full-Text Articles in Computer Sciences

Drip - Data Rich, Information Poor: A Concise Synopsis Of Data Mining, Muhammad Obeidat, Max North, Lloyd Burgess, Sarah North Dec 2014

Faculty and Research Publications

As data production grows exponentially at a drastically lower cost, the data mining required to extract and discover valuable information is becoming paramount. To be useful in any business or industry, data must be capable of supporting sound decision-making and plausible prediction. The purpose of this paper is to provide a concise but broad synopsis of the technology and theory of data mining, giving an enhanced comprehension of the methods by which massive data can be transformed into meaningful information.


Twitter Location (Sometimes) Matters: Exploring The Relationship Between Georeferenced Tweet Content And Nearby Feature Classes, Stefan Hahmann, Ross S. Purves, Dirk Burghardt Dec 2014

Journal of Spatial Information Science

In this paper, we investigate whether microblogging texts (tweets) produced on mobile devices are related to the geographical locations where they were posted. For this purpose, we correlate tweet topics to areas. In doing so, classified points of interest from OpenStreetMap serve as validation points. We adopted the classification and geolocation of these points to correlate with tweet content by means of manual, supervised, and unsupervised machine learning approaches. Evaluation showed the manual classification approach to be of the highest quality, followed by the supervised method, and that the unsupervised classification was of low quality. We found that the degree to which …


Social Fingerprinting: Identifying Users Of Social Networks By Their Data Footprint, Denise Koessler Gosnell Dec 2014

Doctoral Dissertations

This research defines, models, and quantifies a new metric for social networks: the social fingerprint. Just as one's fingers leave behind a unique trace in a print, this dissertation introduces and demonstrates that the manner in which people interact with other accounts on social networks creates a unique data trail. Accurate identification of a user's social fingerprint can address the growing demand for improved techniques in unique user account analysis, computational forensics and social network analysis.

In this dissertation, we theorize, construct and test novel software and methodologies which quantify features of social network data. All approaches and methodologies are …


Time-Series Data Mining In Transportation: A Case Study On Singapore Public Train Commuter Travel Patterns, Roy Ka Wei Lee, Tin Seong Kam Oct 2014

Research Collection School Of Computing and Information Systems

The adoption of smart card technologies and automated data collection systems (ADCS) in the transportation domain has given public transport planners opportunities to amass a huge and continuously increasing amount of time-series data about the behaviors and travel patterns of commuters. However, the explosive growth of temporal-related databases has far outpaced the transport planners’ ability to interpret these data using conventional statistical techniques, creating an urgent need for new techniques to support the analyst in transforming the data into actionable information and knowledge. This research study thus explores and discusses the potential use of time-series data mining, a relatively new …


Collaborative Online Multitask Learning, Guangxia Li, Steven C. H. Hoi, Kuiyu Chang, Wenting Liu, Ramesh Jain Aug 2014

Research Collection School Of Computing and Information Systems

We study the problem of online multitask learning for solving multiple related classification tasks in parallel, aiming at classifying every sequence of data received by each task accurately and efficiently. One practical example of online multitask learning is the micro-blog sentiment detection on a group of users, which classifies micro-blog posts generated by each user into emotional or non-emotional categories. This particular online learning task is challenging for a number of reasons. First of all, to meet the critical requirements of online applications, a highly efficient and scalable classification solution that can make immediate predictions with low learning cost is …


A Knowledge Discovery Approach For The Detection Of Power Grid State Variable Attacks, Nathan Wallace Jul 2014

Doctoral Dissertations

As the level of sophistication in power system technologies increases, the number of system state parameters being recorded also increases. This data not only provides an opportunity for monitoring and diagnostics of a power system, but it also creates an environment wherein security can be maintained. Being able to extract relevant information from this pool of data is one of the key challenges yet to be overcome in the smart grid. The potential exists for the creation of innovative power grid cybersecurity applications, which harness the information gained from advanced analytics. Such analytics can be based on the extraction …


Mining Branching-Time Scenarios, Dirk Fahland, David Lo, Shahar Maoz Jun 2014

David LO

Specification mining extracts candidate specifications from existing systems, to be used for downstream tasks such as testing and verification. Specifically, we are interested in the extraction of behavior models from execution traces. In this paper we introduce mining of branching-time scenarios in the form of existential, conditional Live Sequence Charts, using a statistical data-mining algorithm. We show the power of branching scenarios to reveal alternative scenario-based behaviors, which could not be mined by previous approaches. The work contrasts and complements previous works on mining linear-time scenarios. An implementation and evaluation over execution trace sets recorded from several real-world applications shows …


Automated Library Recommendation, Ferdian Thung, David Lo, Julia Lawall Jun 2014

David LO

Many third-party libraries are available to be downloaded and used. Using such libraries can reduce development time and make the developed software more reliable. However, developers are often unaware of suitable libraries for their projects and thus miss out on these benefits. To help developers better take advantage of the available libraries, we propose a new technique that automatically recommends libraries to developers. Our technique takes as input the set of libraries that an application currently uses, and recommends other libraries that are likely to be relevant. We follow a hybrid approach that combines association …


Ar-Miner: Mining Informative Reviews For Developers From Mobile App Marketplace, Ning Chen, Jialiu Lin, Steven C. H. Hoi, Xiaokui Xiao, Boshen Zhang Jun 2014

Research Collection School Of Computing and Information Systems

With the popularity of smartphones and mobile devices, mobile application (a.k.a. “app”) markets have been growing exponentially in terms of number of users and downloads. App developers spend considerable effort on collecting and exploiting user feedback to improve user satisfaction, but suffer from the absence of effective user review analytics tools. To help mobile app developers discover the most “informative” user reviews from a large and rapidly increasing pool of user reviews, we present “AR-Miner”, a novel computational framework for App Review Mining, which performs comprehensive analytics from raw user reviews by (i) first extracting informative user reviews by …


Ranking-Based Approaches For Localizing Faults, Lucia Lucia Jun 2014

Dissertations and Theses Collection (Open Access)

A fault is the root cause of program failures where a program behaves differently from the intended behavior. Finding or localizing faults is often laborious (especially so for complex programs), yet it is an important task in the software lifecycle. An automated technique that can accurately and quickly identify the faulty code is greatly needed to alleviate the costs of software debugging. Many fault localization techniques assume that faults are localizable, i.e., each fault manifests only in a single or a few lines of code that are close to one another. To verify this assumption, we study how faults spread …


A Continuous Learning Strategy For Self-Organizing Maps Based On Convergence Windows, Gregory T. Breard May 2014

Senior Honors Projects

A self-organizing map (SOM) is a type of artificial neural network that has applications in a variety of fields and disciplines. The SOM algorithm uses unsupervised learning to produce a low-dimensional representation of high-dimensional data. This is done by 'fitting' a grid of nodes to a data set over a fixed number of iterations. With each iteration, the nodes of the map are adjusted so that they appear more like the data points. The low-dimensionality of the resulting map means that it can be presented graphically and be more intuitively interpreted by humans. However, it is still essential to …
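
The fitting procedure described in this abstract can be illustrated with a minimal sketch of a standard SOM training loop. It is not the author's convergence-window strategy, which the truncated abstract does not detail. The function names `train_som` and `best_matching_unit`, the linear decay of the learning rate and radius, and the Gaussian neighbourhood are all assumptions made for illustration.

```python
import math
import random


def best_matching_unit(weights, x):
    """Return the grid coordinate whose weight vector is closest to x."""
    return min(
        weights,
        key=lambda k: sum((wi - xi) ** 2 for wi, xi in zip(weights[k], x)),
    )


def train_som(data, rows, cols, iterations, lr0=0.5, seed=0):
    """Fit a rows x cols grid of nodes to data over a fixed number of iterations.

    data is a list of equal-length lists of floats. The decay schedules and
    the Gaussian neighbourhood below are illustrative assumptions.
    """
    rng = random.Random(seed)
    dim = len(data[0])
    # Each grid node holds a weight vector living in the input space.
    weights = {
        (r, c): [rng.random() for _ in range(dim)]
        for r in range(rows)
        for c in range(cols)
    }
    radius0 = max(rows, cols) / 2.0
    for t in range(iterations):
        frac = t / iterations
        lr = lr0 * (1.0 - frac)                    # learning rate decays linearly
        radius = max(radius0 * (1.0 - frac), 0.5)  # neighbourhood shrinks over time
        x = rng.choice(data)                       # pick one sample per iteration
        bmu = best_matching_unit(weights, x)
        for (r, c), w in weights.items():
            grid_dist2 = (r - bmu[0]) ** 2 + (c - bmu[1]) ** 2
            influence = math.exp(-grid_dist2 / (2.0 * radius ** 2))
            # Move the node towards the sample, more strongly near the BMU.
            for i in range(dim):
                w[i] += lr * influence * (x[i] - w[i])
    return weights
```

Calling `train_som(data, rows=3, cols=3, iterations=200)` returns a dictionary mapping each grid coordinate to its fitted weight vector; `best_matching_unit` then maps any data point to a cell of the low-dimensional grid.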


Corl8: A System For Analyzing Diagnostic Measures In Wireless Sensor Networks, Loren Klingman May 2014

All Theses

Due to an increasing demand to monitor the physical world, researchers are deploying wireless sensor networks more than ever before. These networks comprise a large number of sensors integrated with small, low-power wireless transceivers used to transmit data to a central processing and storage location. These devices are often deployed in harsh, volatile locations, which increases their failure rate and decreases the rate at which packets can be successfully transmitted. Existing sensor debugging tools, such as Sympathy and EmStar, rely on add-in network protocols to report status information, and to collectively diagnose network problems. Some protocols rely on a central …


On Finding The Point Where There Is No Return: Turning Point Mining On Game Data, Wei Gong, Ee Peng Lim, Feida Zhu, Achananuparp Palakorn, David Lo Apr 2014

Research Collection School Of Computing and Information Systems

Gaming expertise is usually accumulated through playing or watching many game instances, and identifying critical moments in these game instances called turning points. Turning point rules (abbreviated as TPRs) are game patterns that almost always lead to some irreversible outcomes. In this paper, we formulate the notion of irreversible outcome property which can be combined with pattern mining so as to automatically extract TPRs from any given game dataset. We specifically extend the well-known PrefixSpan sequence mining algorithm by incorporating the irreversible outcome property. To show the usefulness of TPRs, we apply them to Tetris, a popular game. We mine …


Machine Learning In Wireless Sensor Networks: Algorithms, Strategies, And Applications, Mohammad Abu Alsheikh, Shaowei Lin, Dusit Niyato, Hwee-Pink Tan Apr 2014

Research Collection School Of Computing and Information Systems

Wireless sensor networks (WSNs) monitor dynamic environments that change rapidly over time. This dynamic behavior is either caused by external factors or initiated by the system designers themselves. To adapt to such conditions, sensor networks often adopt machine learning techniques to eliminate the need for unnecessary redesign. Machine learning also inspires many practical solutions that maximize resource utilization and prolong the lifespan of the network. In this paper, we present an extensive literature review over the period 2002-2013 of machine learning methods that were used to address common issues in WSNs. The advantages and disadvantages of each proposed algorithm are …


Applicability Of Latent Dirichlet Allocation To Multi-Disk Search, George E. Noel, Gilbert L. Peterson Mar 2014

Faculty Publications

Digital forensics practitioners face a continual increase in the volume of data they must analyze, which exacerbates the problem of finding relevant information in a noisy domain. Current technologies make use of keyword-based search to isolate relevant documents and minimize false positives with respect to investigative goals. Unfortunately, selecting appropriate keywords is a complex and challenging task. Latent Dirichlet Allocation (LDA) offers a possible way to relax keyword selection by returning topically similar documents. This research compares regular expression search techniques and LDA using the Real Data Corpus (RDC). The RDC, a set of over 2400 disks from real …


On Predicting User Affiliations Using Social Features In Online Social Networks, Minh Thap Nguyen Mar 2014

Dissertations and Theses Collection (Open Access)

User profiling, such as user affiliation prediction in online social networks, is a challenging task, with many important applications in targeted marketing and personalized recommendation. The research task here is to predict some user affiliation attributes that suggest user participation in different social groups.


A Computational Approach To Qualitative Analysis In Large Textual Datasets, Michael Evans Feb 2014

Dartmouth Scholarship

In this paper I introduce computational techniques to extend qualitative analysis into the study of large textual datasets. I demonstrate these techniques by using probabilistic topic modeling to analyze a broad sample of 14,952 documents published in major American newspapers from 1980 through 2012. I show how computational data mining techniques can identify and evaluate the significance of qualitatively distinct subjects of discussion across a wide range of public discourse. I also show how examining large textual datasets with computational methods can overcome methodological limitations of conventional qualitative methods, such as how to measure the impact of particular cases on …


Hot Zone Identification: Analyzing Effects Of Data Sampling On Spam Clustering, Rasib Khan, Mainul Mizan, Ragib Hasan, Alan Sprague Jan 2014

Journal of Digital Forensics, Security and Law

Email is the most common and comparatively the most efficient means of exchanging information in today's world. However, given the widespread use of emails in all sectors, they have been the target of spammers since the beginning. Filtering spam emails has now led to critical actions such as forensic activities based on mining spam email. The data mine for spam emails at the University of Alabama at Birmingham is considered to be one of the most prominent resources for mining and identifying spam sources. It is a widely researched repository used by researchers from different global organizations. The usual process …


M-Fdbscan: A Multicore Density-Based Uncertain Data Clustering Algorithm, Atakan Erdem, Taflan İmre Gündem Jan 2014

Turkish Journal of Electrical Engineering and Computer Sciences

In many data mining applications, we use a clustering algorithm on a large amount of uncertain data. In this paper, we adapt an uncertain data clustering algorithm called fast density-based spatial clustering of applications with noise (FDBSCAN) to multicore systems in order to have fast processing. The new algorithm, which we call multicore FDBSCAN (M-FDBSCAN), splits the data domain into c rectangular regions, where c is the number of cores in the system. The FDBSCAN algorithm is then applied to each rectangular region simultaneously. After the clustering operation is completed, semiclusters that occur during splitting are detected and merged to …
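
The domain-splitting step mentioned in this abstract can be sketched as follows. This is only an illustration of partitioning a 2-D data domain into c rectangular regions; it does not implement FDBSCAN or the merging of semiclusters. The function name `split_into_strips` and the choice of equal-width vertical strips are assumptions, since the abstract does not say how the rectangles are chosen.

```python
def split_into_strips(points, c):
    """Partition 2-D points into c vertical strips of equal width.

    Hypothetical helper: equal-width strips along the x-axis are an
    assumption, one simple way to obtain c rectangular regions.
    """
    xs = [p[0] for p in points]
    lo, hi = min(xs), max(xs)
    # Fall back to a width of 1.0 when all x values coincide.
    width = (hi - lo) / c or 1.0
    regions = [[] for _ in range(c)]
    for p in points:
        # Clamp so the right-most point lands in the last strip.
        idx = min(int((p[0] - lo) / width), c - 1)
        regions[idx].append(p)
    return regions
```

In a multicore setting, c would typically be set to the number of available cores, and each region would be handed to a separate worker running the clustering algorithm.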


Discovery Of Hydrometeorological Patterns, Mete Çelik, Filiz Dadaşer Çelik, Ahmet Şakir Dokuz Jan 2014

Turkish Journal of Electrical Engineering and Computer Sciences

Hydrometeorological patterns can be defined as meaningful and nontrivial associations between hydrological and meteorological parameters over a region. Discovering hydrometeorological patterns is important for many applications, including forecasting hydrometeorological hazards (floods and droughts), predicting the hydrological responses of ungauged basins, and filling in missing hydrological or meteorological records. However, discovering these patterns is challenging due to the special characteristics of hydrological and meteorological data, and is computationally complex due to the archival history of the datasets. Moreover, defining monotonic interest measures to quantify these patterns is difficult. In this study, we propose a new monotonic interest measure, called the hydrometeorological …


Data Mining Based Hybridization Of Meta-Raps, Fatemah Al-Duoli, Ghaith Rabadi Jan 2014

Engineering Management & Systems Engineering Faculty Publications

Though metaheuristics have been frequently employed to improve the performance of data mining algorithms, the opposite is not true. This paper discusses the process of employing a data mining algorithm to improve the performance of a metaheuristic algorithm. The targeted algorithms to be hybridized are the Meta-heuristic for Randomized Priority Search (Meta-RaPS) and an algorithm used to create an Inductive Decision Tree. This hybridization focuses on using a decision tree to perform on-line tuning of the parameters in Meta-RaPS. The process makes use of the information collected during the iterative construction and improvement phases Meta-RaPS performs. The data mining algorithm …


Multi-Threaded Implementation Of Association Rule Mining With Visualization Of The Pattern Tree, Eera Gupta Jan 2014

LSU Master's Theses

Motor Vehicle fatalities per 100,000 population in the United States has been reported to be 10.69% in the year 2012 as per NHTSA (National Highway Traffic Safety Administration). The fatality rate has increased by 0.27% in 2012 compared to the rate in the year 2011. As per the reports, there are many factors involved in increasing the fatality rate drastically such as driving under influence, texting while driving, and various other weather phenomena. Decision makers need to analyze the factors contributing to the increase in the accident rate to take implied measures. Current methods used to perform the data analysis …


An Urgent Precaution System To Detect Students At Risk Of Substance Abuse Through Classification Algorithms, Faruk Bulut, İhsan Ömür Bucak Jan 2014

Turkish Journal of Electrical Engineering and Computer Sciences

In recent years, the use of addictive drugs and substances has turned out to be a challenging social problem worldwide. The illicit use of these types of drugs and substances appears to be increasing among elementary and high school students. After becoming addicted to drugs, life becomes unbearable and gets even worse for their users. Scientific studies show that it becomes extremely difficult for an individual to break this habit after being a user. Hence, preventing teenagers from addiction becomes an important issue. This study focuses on an urgent precaution system that helps families and educators prevent teenagers from developing …


Geospatial Data Pre-Processing On Watershed Datasets: A Gis Approach, Sreedhar Nallan, Leisa Armstrong, Barry Croke, Amiya K. Tripathy Jan 2014

Research outputs 2014 to 2021

Spatial data mining helps to identify interesting patterns from spatial data sets. However, geospatial data requires substantial pre-processing before it can be interrogated further using data mining techniques. Multi-dimensional spatial data has been used to explain the spatial analysis and SOLAP for pre-processing data. This paper examines some of the methods for pre-processing of the data using ArcGIS 10.2 and Spatial Analyst with a case study dataset of a watershed.


Graph Mining And Module Detection In Protein-Protein Interaction Networks, Ru Shen Jan 2014

Legacy Theses & Dissertations (2009 - 2024)

Graphs are intuitive representations of relational data. Graphs have been widely used to represent biological molecular networks that operate in the living systems. In the study of systems biology, using graph mining techniques and graph-theory-based algorithms to …


Roughened Random Forests For Binary Classification, Kuangnan Xiong Jan 2014

Legacy Theses & Dissertations (2009 - 2024)

Binary classification plays an important role in many decision-making processes. Random forests can build a strong ensemble classifier by combining weaker classification trees that are de-correlated. The strength and correlation among individual classification trees are the key factors that contribute to the ensemble performance of random forests. We propose roughened random forests, a new set of tools which show further improvement over random forests in binary classification. Roughened random forests modify the original dataset for each classification tree and further reduce the correlation among individual classification trees. This data modification process is composed of artificially imposing missing data that are …


Detecting Click Fraud In Online Advertising: A Data Mining Approach, Richard Oentaryo, Ee Peng Lim, Michael Finegold, David Lo, Feida Zhu, Clifton Phua, Eng-Yeow Cheu, Ghim-Eng Yap, Kelvin Sim, Kasun Perera, Bijay Neupane, Mustafa Faisal, Zeyar Aung, Wei Lee Woon, Wei Chen, Dhaval Patel, Daniel Berrar Jan 2014

Research Collection School Of Computing and Information Systems

Click fraud - the deliberate clicking on advertisements with no real interest in the product or service offered - is one of the most daunting problems in online advertising. Building an effective fraud detection method is thus pivotal for online advertising businesses. We organized a Fraud Detection in Mobile Advertising (FDMA) 2012 Competition, opening the opportunity for participants to work on real-world fraud data from BuzzCity Pte. Ltd., a global mobile advertising company based in Singapore. In particular, the task is to identify fraudulent publishers who generate illegitimate clicks, and distinguish them from normal publishers. The competition was held from …


How Can Consumer Preferences Be Leveraged For Targeted Upselling In Cable Tv Services?, Bing Tian Dai Jan 2014

Research Collection School Of Computing and Information Systems

Internet TV has attracted a significant amount of attention from conventional cable TV service providers by offering customized TV programs at preferred time slots. The cable TV service providers are seeking to retain their customers by giving them a better experience: by understanding their customers’ preferences and upselling them the right products to cater to their interests. It is not easy to understand customer preferences, though, since customers are not able to watch channels to which they have not subscribed. As a result, it is difficult to predict what they would like to watch. In this paper, I …


Guiding Data-Driven Transportation Decisions, Kristin A. Tufte, Basem Elazzabi, Nathan Hall, Morgan Harvey, Kath Knobe, David Maier, Veronika Margaret Megler Jan 2014

Computer Science Faculty Publications and Presentations

Urban transportation professionals are under increasing pressure to perform data-driven decision making and to provide data-driven performance metrics. This pressure comes from sources including the federal government and is driven, in part, by the increased volume and variety of transportation data available. This sudden increase of data is partially a result of improved technology for sensors and mobile devices as well as reduced device and storage costs. However, using this proliferation of data for decisions and performance metrics is proving to be difficult. In this paper, we describe a proposed structure for a system to support data-driven decision making. A …