Open Access. Powered by Scholars. Published by Universities.®

Digital Commons Network

Physical Sciences and Mathematics

PDF

2014

Data mining

Articles 1 - 30 of 38

Full-Text Articles in Entire DC Network

DRIP - Data Rich, Information Poor: A Concise Synopsis Of Data Mining, Muhammad Obeidat, Max North, Lloyd Burgess, Sarah North Dec 2014

Faculty Articles

As data production grows exponentially at drastically lower cost, data mining, which extracts and discovers valuable information, is becoming increasingly important. To be useful in any business or industry, data must be capable of supporting sound decision-making and plausible prediction. This paper provides a concise but broad synopsis of the technology and theory of data mining, giving an enhanced understanding of the methods by which massive data can be transformed into meaningful information.


Twitter Location (Sometimes) Matters: Exploring The Relationship Between Georeferenced Tweet Content And Nearby Feature Classes, Stefan Hahmann, Ross S. Purves, Dirk Burghardt Dec 2014

Journal of Spatial Information Science

In this paper, we investigate whether microblogging texts (tweets) produced on mobile devices are related to the geographical locations where they were posted. For this purpose, we correlate tweet topics to areas. In doing so, classified points of interest from OpenStreetMap serve as validation points. We adopted the classification and geolocation of these points to correlate with tweet content by means of manual, supervised, and unsupervised machine learning approaches. Evaluation showed the manual classification approach to be of the highest quality, followed by the supervised method, while the unsupervised classification was of low quality. We found that the degree to which …


Social Fingerprinting: Identifying Users Of Social Networks By Their Data Footprint, Denise Koessler Gosnell Dec 2014

Doctoral Dissertations

This research defines, models, and quantifies a new metric for social networks: the social fingerprint. Just as one's fingers leave behind a unique trace in a print, this dissertation demonstrates that the manner in which people interact with other accounts on social networks creates a unique data trail. Accurate identification of a user's social fingerprint can address the growing demand for improved techniques in unique user account analysis, computational forensics and social network analysis.

In this dissertation, we theorize, construct and test novel software and methodologies which quantify features of social network data. All approaches and methodologies are …


Optimal Ranking Regime Analysis Of Intra- To Multidecadal U.S. Climate Variability. Part I: Temperature, Eugene C. Cordero, Steven A. Mauget Dec 2014

Faculty Publications, Meteorology and Climate Science

The optimal ranking regime (ORR) method was used to identify intradecadal to multidecadal (IMD) time windows containing significant ranking sequences in U.S. climate division temperature data. The simplicity of the ORR procedure’s output—a time series’ most significant nonoverlapping periods of high or low rankings—makes it possible to graphically identify common temporal breakpoints and spatial patterns of IMD variability in the analyses of 102 climate division temperature series. This approach is also applied to annual Atlantic multidecadal oscillation (AMO) and Pacific decadal oscillation (PDO) climate indices, a Northern Hemisphere annual temperature (NHT) series, and divisional annual and seasonal temperature data during …


Optimal Ranking Regime Analysis Of Intra- To Multidecadal U.S. Climate Variability. Part I: Temperature, Eugene C. Cordero, Steven A. Mauget Dec 2014

Eugene C. Cordero

The optimal ranking regime (ORR) method was used to identify intradecadal to multidecadal (IMD) time windows containing significant ranking sequences in U.S. climate division temperature data. The simplicity of the ORR procedure’s output—a time series’ most significant nonoverlapping periods of high or low rankings—makes it possible to graphically identify common temporal breakpoints and spatial patterns of IMD variability in the analyses of 102 climate division temperature series. This approach is also applied to annual Atlantic multidecadal oscillation (AMO) and Pacific decadal oscillation (PDO) climate indices, a Northern Hemisphere annual temperature (NHT) series, and divisional annual and seasonal temperature data during …


Time-Series Data Mining In Transportation: A Case Study On Singapore Public Train Commuter Travel Patterns, Roy Ka Wei Lee, Tin Seong Kam Oct 2014

Research Collection School Of Computing and Information Systems

The adoption of smart card technologies and automated data collection systems (ADCS) in the transportation domain has provided public transport planners with opportunities to amass a huge and continuously increasing amount of time-series data about the behaviors and travel patterns of commuters. However, the explosive growth of temporal databases has far outpaced the transport planners’ ability to interpret these data using conventional statistical techniques, creating an urgent need for new techniques to support the analyst in transforming the data into actionable information and knowledge. This research study thus explores and discusses the potential use of time-series data mining, a relatively new …


The Doubly Adaptive Lasso Methods For Time Series Analysis, Zi Zhen Liu Aug 2014

Electronic Thesis and Dissertation Repository

In this thesis, we propose a systematic approach called the doubly adaptive LASSO tailored to time series analysis, which includes four specific methods for four time series models, respectively:

The PAC-weighted adaptive LASSO for univariate autoregressive (AR) models. Although the LASSO methodology has been applied to AR models, the existing methods in the literature ignore the temporal dependence information embedded in AR time series data. Consequently, the methods may not reflect the characteristics of underlying AR processes, especially, the lag order of AR models. The PAC-weighted adaptive LASSO incorporates the partial autocorrelation (PAC) into the adaptive LASSO weights. The PAC-weighted …


Data Mining For Design Flood Prediction, James E. Ball Aug 2014

International Conference on Hydroinformatics

Design flood estimation remains a problem for many professionals involved in the management of rural and urban catchments. Advice is required regarding design flood characteristics for many design problems including the design of culverts and bridges necessary for cross drainage of transport routes, the design of urban drainage systems, the design of flood mitigation levees and other flood mitigation structures, design of dam spillways, and many environmental flow problems. When a risk based approach is adopted as the design paradigm, there is a need to predict both the magnitude of the hazard and the exceedance probability of the hazard. In …


Self-Organizing Maps For Knowledge Discovery From Corporate Databases To Develop Risk Based Prioritization For Stagnation, Stephen Robert Mounce, Rebecca Sharpe, Vanessa Speight, Barrie Holden, Joby Boxall Aug 2014

International Conference on Hydroinformatics

Stagnation or low turnover of water within water distribution systems may result in water quality issues, even for relatively short durations of stagnation / low turnover, if other factors such as deteriorated aging pipe infrastructure are present. As leakage management strategies, including the creation of smaller pressure management zones, are implemented, more dead ends are being created within networks, and hence there is potentially an increasing risk to water quality due to stagnation / low turnover. This paper presents results of applying data-driven tools to the large corporate databases maintained by UK water companies. These databases include multiple …


Monitoring Spatiotemporal Total Organic Carbon Concentrations In Lake Mead With Integrated Data Fusion And Mining (IDFM) Technique, Sanaz Imen, Ni-Bin Chang, Y. Jeffrey Yang Aug 2014

International Conference on Hydroinformatics

Forest fires, soil erosion, and land use changes in the Lake Mead watershed near Las Vegas Wash are considered sources of water quality impairment in Lake Mead. These conditions result in higher concentrations of Total Organic Carbon (TOC). TOC in contact with chlorine, which is often used to disinfect drinking water supplies, causes the formation of trihalomethanes (THMs). THMs are toxic carcinogens controlled by the EPA’s Disinfection By-Product Rule. As a result of the threat posed to the drinking water of 25 million people downstream, the recreation area, and the wildlife habitat of Lake Mead, it is necessary to …


Detecting Pipe Bursts In Water Distribution Networks Using EPR Modeling Paradigm, Luigi Berardi, Daniele Laucelli, Orazio Giustolisi, Dragan A. Savić Aug 2014

International Conference on Hydroinformatics

Sustainable management of water distribution networks requires the timely detection of water leakages from pipelines. Timely detection reduces wastage of the resource, decreases the cost of treatment and pumping, cuts third-party damage, and reduces greenhouse gas emissions. Some recently developed methodologies permit real-time detection of pipe burst events by analyzing signals from pressure and flow meters located in District Metered Areas. These procedures are conceptually based on: (i) data preparation (e.g. de-noising; reconstruction); (ii) predictions based on data-driven models; (iii) identification of anomalies in flow/pressure and raising alerts based on a mismatch between model predictions and signals from meters. …
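
Steps (ii) and (iii) of the procedure listed in this abstract can be sketched roughly as follows. This is a minimal illustration and not the EPR model of the paper: a moving average stands in for the data-driven predictor, step (i) is omitted, and the function name `detect_bursts`, the window size, and the threshold are assumptions made for the example.

```python
from statistics import mean, stdev


def detect_bursts(flow, window=4, threshold=3.0):
    """Return indices where a reading departs sharply from its prediction.

    Hypothetical helper: a moving average replaces the data-driven model.
    """
    alerts = []
    residuals = []  # mismatches observed under normal conditions
    for t in range(window, len(flow)):
        # Step (ii): predict the next reading from the recent past.
        predicted = mean(flow[t - window:t])
        residual = flow[t] - predicted
        # Step (iii): raise an alert when the mismatch is large compared
        # with the spread of earlier, non-anomalous mismatches.
        if len(residuals) >= 2:
            spread = stdev(residuals)
            if spread > 0 and abs(residual) > threshold * spread:
                alerts.append(t)
                continue  # keep the anomaly out of the baseline
        residuals.append(residual)
    return alerts


if __name__ == "__main__":
    readings = [10.0, 10.2, 9.8, 10.1, 9.9, 10.0, 10.2, 9.9, 10.1, 15.0]
    print(detect_bursts(readings))
```

Because the stand-in predictor averages recent readings, a burst value also distorts the next few predictions; a model that excludes flagged readings from its inputs would avoid that side effect.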


Collaborative Online Multitask Learning, Guangxia Li, Steven C. H. Hoi, Kuiyu Chang, Wenting Liu, Ramesh Jain Aug 2014

Research Collection School Of Computing and Information Systems

We study the problem of online multitask learning for solving multiple related classification tasks in parallel, aiming at classifying every sequence of data received by each task accurately and efficiently. One practical example of online multitask learning is the micro-blog sentiment detection on a group of users, which classifies micro-blog posts generated by each user into emotional or non-emotional categories. This particular online learning task is challenging for a number of reasons. First of all, to meet the critical requirements of online applications, a highly efficient and scalable classification solution that can make immediate predictions with low learning cost is …


Using Weka To Mine Temporal Work Patterns Of Programming Students, Dale E. Parson Jul 2014

Computer Science and Information Technology Faculty

Using Weka to Mine Temporal Work Patterns of Programming Students consists of notes on analyzing datasets using the Weka tool presented at the July 2014 FECS'14 Conference in Las Vegas.


A Knowledge Discovery Approach For The Detection Of Power Grid State Variable Attacks, Nathan Wallace Jul 2014

Doctoral Dissertations

As the level of sophistication in power system technologies increases, the number of system state parameters being recorded also increases. This data not only provides an opportunity for monitoring and diagnostics of a power system, but it also creates an environment wherein security can be maintained. Being able to extract relevant information from this pool of data is one of the key challenges yet to be overcome in the smart grid. The potential exists for the creation of innovative power grid cybersecurity applications, which harness the information gained from advanced analytics. Such analytics can be based on the extraction …


Mining Branching-Time Scenarios, Dirk Fahland, David Lo, Shahar Maoz Jun 2014

David LO

Specification mining extracts candidate specifications from existing systems, to be used for downstream tasks such as testing and verification. Specifically, we are interested in the extraction of behavior models from execution traces. In this paper we introduce mining of branching-time scenarios in the form of existential, conditional Live Sequence Charts, using a statistical data-mining algorithm. We show the power of branching scenarios to reveal alternative scenario-based behaviors, which could not be mined by previous approaches. The work contrasts and complements previous works on mining linear-time scenarios. An implementation and evaluation over execution trace sets recorded from several real-world applications shows …


Automated Library Recommendation, Ferdian Thung, David Lo, Julia Lawall Jun 2014

David LO

Many third party libraries are available to be downloaded and used. Using such libraries can reduce development time and make the developed software more reliable. However, developers are often unaware of suitable libraries to be used for their projects and thus they miss out on these benefits. To help developers better take advantage of the available libraries, we propose a new technique that automatically recommends libraries to developers. Our technique takes as input the set of libraries that an application currently uses, and recommends other libraries that are likely to be relevant. We follow a hybrid approach that combines association …
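
The input and output described in this abstract (a set of libraries already in use, and a ranked list of other likely relevant libraries) can be illustrated with a small co-occurrence sketch. The abstract is truncated, so the paper's hybrid approach is not reproduced here; the scoring below is simply the confidence of single-antecedent association rules, and the function name `recommend` and the library names are hypothetical.

```python
from collections import Counter
from itertools import permutations


def recommend(projects, used, top_n=3):
    """Rank libraries not in `used` by rule confidence P(candidate | used lib).

    Hypothetical helper: `projects` is a list of library collections, one
    per existing project.
    """
    single = Counter()  # how many projects use each library
    pair = Counter()    # how many projects use both a and b, keyed (a, b)
    for libs in projects:
        libs = set(libs)
        single.update(libs)
        pair.update(permutations(libs, 2))

    scores = {}
    for (a, b), together in pair.items():
        if a in used and b not in used:
            confidence = together / single[a]
            scores[b] = max(scores.get(b, 0.0), confidence)
    # Highest confidence first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda lib: (-scores[lib], lib))[:top_n]


if __name__ == "__main__":
    history = [
        ["http", "json"],
        ["http", "json", "log"],
        ["http", "log"],
        ["test", "log"],
        ["http", "json"],
    ]
    print(recommend(history, used={"http"}))
```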


Ranking-Based Approaches For Localizing Faults, Lucia Lucia Jun 2014

Dissertations and Theses Collection (Open Access)

A fault is the root cause of program failures where a program behaves differently from the intended behavior. Finding or localizing faults is often laborious (especially so for complex programs), yet it is an important task in the software lifecycle. An automated technique that can accurately and quickly identify the faulty code is greatly needed to alleviate the costs of software debugging. Many fault localization techniques assume that faults are localizable, i.e., each fault manifests only in a single or a few lines of code that are close to one another. To verify this assumption, we study how faults spread …


AR-Miner: Mining Informative Reviews For Developers From Mobile App Marketplace, Ning Chen, Jialiu Lin, Steven C. H. Hoi, Xiaokui Xiao, Boshen Zhang Jun 2014

Research Collection School Of Computing and Information Systems

With the popularity of smartphones and mobile devices, mobile application (a.k.a. “app”) markets have been growing exponentially in terms of number of users and downloads. App developers spend considerable effort on collecting and exploiting user feedback to improve user satisfaction, but suffer from the absence of effective user review analytics tools. To help mobile app developers discover the most “informative” user reviews from a large and rapidly increasing pool of user reviews, we present “AR-Miner”, a novel computational framework for App Review Mining, which performs comprehensive analytics from raw user reviews by (i) first extracting informative user reviews by …


A Continuous Learning Strategy For Self-Organizing Maps Based On Convergence Windows, Gregory T. Breard May 2014

Senior Honors Projects

A self-organizing map (SOM) is a type of artificial neural network that has applications in a variety of fields and disciplines. The SOM algorithm uses unsupervised learning to produce a low-dimensional representation of high-dimensional data. This is done by 'fitting' a grid of nodes to a data set over a fixed number of iterations. With each iteration, the nodes of the map are adjusted so that they appear more like the data points. The low dimensionality of the resulting map means that it can be presented graphically and be more intuitively interpreted by humans. However, it is still essential to …
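
The fixed-iteration fitting described in this abstract can be sketched as below. This is a minimal illustration of conventional SOM training, not the convergence-window strategy the project proposes; the function name `train_som`, the one-dimensional row of nodes, the Gaussian neighborhood, and the linear decay schedules are all assumptions chosen for brevity.

```python
import math
import random


def train_som(data, n_nodes=5, iterations=200, seed=0):
    """Fit a 1-D row of nodes to `data` (a list of equal-length tuples)."""
    rng = random.Random(seed)
    dim = len(data[0])
    # Start each node at a randomly chosen data point.
    nodes = [list(rng.choice(data)) for _ in range(n_nodes)]

    for step in range(iterations):
        frac = step / iterations
        rate = 0.5 * (1.0 - frac)                      # learning rate decays
        radius = max(n_nodes / 2 * (1.0 - frac), 0.5)  # neighborhood shrinks

        x = rng.choice(data)
        # Best matching unit: the node closest to the sampled point.
        bmu = min(range(n_nodes), key=lambda i: math.dist(nodes[i], x))

        # Pull the BMU and its neighbors on the row toward the sample.
        for i in range(n_nodes):
            influence = math.exp(-((i - bmu) ** 2) / (2 * radius ** 2))
            for d in range(dim):
                nodes[i][d] += rate * influence * (x[d] - nodes[i][d])
    return nodes


if __name__ == "__main__":
    points = [(0.0, 0.0), (1.0, 0.5), (0.5, 1.0),
              (10.0, 10.0), (9.0, 9.5), (9.5, 9.0)]
    for node in train_som(points, n_nodes=4, iterations=100):
        print([round(c, 2) for c in node])
```

Each update moves a node part of the way toward a data point, which is the sense in which the nodes come to "appear more like the data points" over the iterations.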


Corl8: A System For Analyzing Diagnostic Measures In Wireless Sensor Networks, Loren Klingman May 2014

All Theses

Due to an increasing demand to monitor the physical world, researchers are deploying wireless sensor networks more than ever before. These networks comprise a large number of sensors integrated with small, low-power wireless transceivers used to transmit data to a central processing and storage location. These devices are often deployed in harsh, volatile locations, which increases their failure rate and decreases the rate at which packets can be successfully transmitted. Existing sensor debugging tools, such as Sympathy and EmStar, rely on add-in network protocols to report status information, and to collectively diagnose network problems. Some protocols rely on a central …


Machine Learning In Wireless Sensor Networks: Algorithms, Strategies, And Applications, Mohammad Abu Alsheikh, Shaowei Lin, Dusit Niyato, Hwee-Pink Tan Apr 2014

Research Collection School Of Computing and Information Systems

Wireless sensor networks (WSNs) monitor dynamic environments that change rapidly over time. This dynamic behavior is either caused by external factors or initiated by the system designers themselves. To adapt to such conditions, sensor networks often adopt machine learning techniques to eliminate the need for unnecessary redesign. Machine learning also inspires many practical solutions that maximize resource utilization and prolong the lifespan of the network. In this paper, we present an extensive literature review over the period 2002-2013 of machine learning methods that were used to address common issues in WSNs. The advantages and disadvantages of each proposed algorithm are …


On Finding The Point Where There Is No Return: Turning Point Mining On Game Data, Wei Gong, Ee Peng Lim, Feida Zhu, Achananuparp Palakorn, David Lo Apr 2014

Research Collection School Of Computing and Information Systems

Gaming expertise is usually accumulated through playing or watching many game instances, and identifying critical moments in these game instances called turning points. Turning point rules (abbreviated as TPRs) are game patterns that almost always lead to some irreversible outcomes. In this paper, we formulate the notion of irreversible outcome property which can be combined with pattern mining so as to automatically extract TPRs from any given game dataset. We specifically extend the well-known PrefixSpan sequence mining algorithm by incorporating the irreversible outcome property. To show the usefulness of TPRs, we apply them to Tetris, a popular game. We mine …


On Predicting User Affiliations Using Social Features In Online Social Networks, Minh Thap Nguyen Mar 2014

Dissertations and Theses Collection (Open Access)

User profiling, such as user affiliation prediction in online social networks, is a challenging task, with many important applications in targeted marketing and personalized recommendation. The research task here is to predict some user affiliation attributes that suggest user participation in different social groups.


Applicability Of Latent Dirichlet Allocation To Multi-Disk Search, George E. Noel, Gilbert L. Peterson Mar 2014

Faculty Publications

Digital forensics practitioners face a continual increase in the volume of data they must analyze, which exacerbates the problem of finding relevant information in a noisy domain. Current technologies make use of keyword based search to isolate relevant documents and minimize false positives with respect to investigative goals. Unfortunately, selecting appropriate keywords is a complex and challenging task. Latent Dirichlet Allocation (LDA) offers a possible way to relax keyword selection by returning topically similar documents. This research compares regular expression search techniques and LDA using the Real Data Corpus (RDC). The RDC, a set of over 2400 disks from real …


A Computational Approach To Qualitative Analysis In Large Textual Datasets, Michael Evans Feb 2014

Dartmouth Scholarship

In this paper I introduce computational techniques to extend qualitative analysis into the study of large textual datasets. I demonstrate these techniques by using probabilistic topic modeling to analyze a broad sample of 14,952 documents published in major American newspapers from 1980 through 2012. I show how computational data mining techniques can identify and evaluate the significance of qualitatively distinct subjects of discussion across a wide range of public discourse. I also show how examining large textual datasets with computational methods can overcome methodological limitations of conventional qualitative methods, such as how to measure the impact of particular cases on …


Flint International Statistics Conference Announcement, Kettering University Jan 2014

Flint: One City, 100 Years of Variability

CONFERENCE ANNOUNCEMENT POSTER:

Kettering University is organizing this international conference to celebrate the IYS 2013 and the 175th anniversary of the American Statistical Association.

The main focus of this conference will be on STATISTICAL METHODS & STUDIES OF HISTORICAL DATA.

Participants may use any data. Data on Flint, consisting of up to 100 years of demographic, health, labor, census and crime records, will be summarized and made available to participants. Sessions will include presentations of the statistical achievements and perspectives, followed by several talks on current results.


An Urgent Precaution System To Detect Students At Risk Of Substance Abuse Through Classification Algorithms, Faruk Bulut, İhsan Ömür Bucak Jan 2014

Turkish Journal of Electrical Engineering and Computer Sciences

In recent years, the use of addictive drugs and substances has become a challenging social problem worldwide. The illicit use of these types of drugs and substances appears to be increasing among elementary and high school students. Once users become addicted, life becomes unbearable and grows even worse for them. Scientific studies show that it becomes extremely difficult for an individual to break this habit after becoming a user. Hence, preventing teenagers from addiction becomes an important issue. This study focuses on an urgent precaution system that helps families and educators prevent teenagers from developing …


M-FDBSCAN: A Multicore Density-Based Uncertain Data Clustering Algorithm, Atakan Erdem, Taflan İmre Gündem Jan 2014

Turkish Journal of Electrical Engineering and Computer Sciences

In many data mining applications, we use a clustering algorithm on a large amount of uncertain data. In this paper, we adapt an uncertain data clustering algorithm called fast density-based spatial clustering of applications with noise (FDBSCAN) to multicore systems in order to have fast processing. The new algorithm, which we call multicore FDBSCAN (M-FDBSCAN), splits the data domain into c rectangular regions, where c is the number of cores in the system. The FDBSCAN algorithm is then applied to each rectangular region simultaneously. After the clustering operation is completed, semiclusters that occur during splitting are detected and merged to …
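
The domain-splitting step described in this abstract can be sketched as follows. This covers only the partitioning of the data domain into c rectangular regions and the dispatch of each region to a worker; FDBSCAN itself and the merging of semiclusters are not implemented. The function names, the choice of equal-width vertical strips, and the placeholder `cluster_region` are assumptions for illustration, and a thread pool is used only to keep the sketch portable (CPU-bound clustering would normally call for a process pool).

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_strips(points, c):
    """Partition 2-D points into c vertical strips of equal width."""
    xs = [p[0] for p in points]
    lo, hi = min(xs), max(xs)
    width = (hi - lo) / c or 1.0  # fall back when all x values are equal
    regions = [[] for _ in range(c)]
    for p in points:
        # Clamp so that points on the right edge land in the last strip.
        idx = min(int((p[0] - lo) / width), c - 1)
        regions[idx].append(p)
    return regions


def cluster_region(region):
    """Placeholder for the per-region clustering step (hypothetical).

    Here it only counts the points it was given.
    """
    return len(region)


def process_in_parallel(points, c):
    """Run the per-region step on all c regions concurrently."""
    regions = split_into_strips(points, c)
    with ThreadPoolExecutor(max_workers=c) as pool:
        return list(pool.map(cluster_region, regions))


if __name__ == "__main__":
    pts = [(0, 0), (1, 5), (2, 1), (3, 3), (4, 4), (9, 9), (10, 0)]
    print(process_in_parallel(pts, 2))
```

Clusters that straddle a strip boundary are cut in two by this kind of split, which is why the abstract describes a later step that detects and merges semiclusters.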


Discovery Of Hydrometeorological Patterns, Mete Çelik, Filiz Dadaşer Çelik, Ahmet Şakir Dokuz Jan 2014

Turkish Journal of Electrical Engineering and Computer Sciences

Hydrometeorological patterns can be defined as meaningful and nontrivial associations between hydrological and meteorological parameters over a region. Discovering hydrometeorological patterns is important for many applications, including forecasting hydrometeorological hazards (floods and droughts), predicting the hydrological responses of ungauged basins, and filling in missing hydrological or meteorological records. However, discovering these patterns is challenging due to the special characteristics of hydrological and meteorological data, and is computationally complex due to the archival history of the datasets. Moreover, defining monotonic interest measures to quantify these patterns is difficult. In this study, we propose a new monotonic interest measure, called the hydrometeorological …


Graph Mining And Module Detection In Protein-Protein Interaction Networks, Ru Shen Jan 2014

Legacy Theses & Dissertations (2009 - 2024)

Graphs are intuitive representations of relational data. Graphs have been widely used to represent biological molecular networks that operate in the living systems. In the study of systems biology, using graph mining techniques and graph-theory-based algorithms to