Open Access. Powered by Scholars. Published by Universities.®

Computer Sciences Commons

2015

Data Mining

Articles 1 - 12 of 12

Full-Text Articles in Computer Sciences

Classification And Visualization Of Crime-Related Tweets, Ransen Niu, Jiawei Zhang, David S. Ebert Aug 2015

The Summer Undergraduate Research Fellowship (SURF) Symposium

Millions of Twitter posts per day can give law enforcement officials insight for improved situational awareness. In this paper, we propose a natural-language-processing (NLP) pipeline for the classification and visualization of crime-related tweets. The work is divided into two parts. First, we collect crime-related tweets by classification. Unlike conventional written text, social media such as Twitter includes substantial non-standard tokens and semantics, so we focus on exploring the underlying semantic features of crime-related tweets, including parts-of-speech properties and intention verbs. We then use these features to train a classification model using a Support Vector Machine. The second part is to utilize visual …
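
The classification step described in this abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy tweets, the function names, the bag-of-words features (a stand-in for the parts-of-speech and intention-verb features the abstract mentions), and the training procedure, which is a plain subgradient descent on the regularized hinge loss of a linear SVM without a bias term. It is not the authors' implementation.

```python
from collections import defaultdict

def featurize(text):
    """Bag-of-words counts; a stand-in for POS / intention-verb features."""
    counts = defaultdict(float)
    for token in text.lower().split():
        counts[token] += 1.0
    return counts

def train_linear_svm(samples, epochs=200, lr=0.1, lam=0.01):
    """Subgradient descent on the L2-regularized hinge loss (no bias term)."""
    weights = defaultdict(float)
    for _ in range(epochs):
        for text, label in samples:  # label is +1 (crime) or -1 (other)
            x = featurize(text)
            margin = label * sum(weights[t] * v for t, v in x.items())
            # Regularization shrinks every weight slightly on each step.
            for t in list(weights):
                weights[t] *= 1.0 - lr * lam
            # Samples inside the margin also pull weights toward their label.
            if margin < 1.0:
                for t, v in x.items():
                    weights[t] += lr * label * v
    return weights

def predict(weights, text):
    """Return +1 if the linear score is positive, otherwise -1."""
    score = sum(weights.get(t, 0.0) * v for t, v in featurize(text).items())
    return 1 if score > 0 else -1

# Hypothetical toy data, invented for this sketch.
TRAINING = [
    ("robbery reported downtown last night", 1),
    ("police arrest suspect after shooting", 1),
    ("car stolen near the station", 1),
    ("great coffee downtown this morning", -1),
    ("watching the game last night", -1),
    ("new phone arrived today", -1),
]

weights = train_linear_svm(TRAINING)
print(predict(weights, "suspect stolen car"))
print(predict(weights, "great game today"))
```

In practice a library SVM and richer linguistic features would replace both the hand-written trainer and the word counts; the sketch only shows how features and labels feed a margin-based classifier.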


A Study On The Efficacy Of Sentiment Analysis In Author Attribution, Michael J. Schneider Aug 2015

Electronic Theses and Dissertations

The field of authorship attribution seeks to characterize an author’s writing style well enough to determine whether he or she has written a text of interest. One subfield of authorship attribution, stylometry, seeks to find the necessary literary attributes to quantify an author’s writing style. The research presented here sought to determine the efficacy of sentiment analysis as a new stylometric feature, by comparing its performance in attributing authorship against the performance of traditional stylometric features. Experimentation, with a corpus of sci-fi texts, found sentiment analysis to have a much lower performance in assigning authorship than the traditional stylometric features.


Domain Specific Document Retrieval Framework For Real-Time Social Health Data, Swapnil Soni Jul 2015

Kno.e.sis Publications

With the advent of web search and microblogging, the percentage of Online Health Information Seekers (OHIS) using these online services to share and seek real-time health information has increased exponentially. OHIS use web search engines or microblogging search services to seek out the latest, relevant, and reliable health information. When OHIS turn to microblogging search services to search for real-time content, trends, breaking news, etc., the search results are not promising. Two major challenges in current microblogging search engines are that they rely on keyword-based techniques and that their results do not contain real-time information. To address these challenges, …


Welcome To The Machine: Privacy And Workplace Implications Of Predictive Analytics, Robert Sprague Apr 2015

Robert Sprague

Predictive analytics use a method known as data mining to identify trends, patterns, or relationships among data, which can then be used to develop a predictive model. Data mining itself relies upon big data, which is “big” not solely because of its size but also because its analytical potential is qualitatively different. “Big data” analysis allows organizations, including government and businesses, to combine diverse digital datasets and then use statistics and other data mining techniques to extract from them both hidden information and surprising correlations. These data are not necessarily tracking transactional records of atomized behavior, such as the purchasing …


Using Support Vector Machine Ensembles For Target Audience Classification On Twitter, Siaw Ling Lo, Raymond Chiong, David Cornforth Apr 2015

Research Collection School Of Computing and Information Systems

The vast amount and diversity of the content shared on social media can pose a challenge for any business wanting to use it to identify potential customers. In this paper, our aim is to investigate the use of both unsupervised and supervised learning methods for target audience classification on Twitter with minimal annotation efforts. Topic domains were automatically discovered from contents shared by followers of an account owner using Twitter Latent Dirichlet Allocation (LDA). A Support Vector Machine (SVM) ensemble was then trained using contents from different account owners of the various topic domains identified by Twitter LDA. Experimental results …


Temporal Mining For Distributed Systems, Yexi Jiang Mar 2015

FIU Electronic Theses and Dissertations

Many systems and applications continuously produce events. These events record the status of a system and trace its behavior. By examining these events, system administrators can check for potential problems in these systems. If the temporal dynamics of the systems are further investigated, the underlying patterns can be discovered. The uncovered knowledge can be leveraged to predict future system behavior or to mitigate potential risks. Moreover, system administrators can use the temporal patterns to set up event management rules that make the system more intelligent.

With the …


Sensitivity Analysis For The Winning Algorithm In Knowledge Discovery And Data Mining (KDD) Cup Competition 2014, Fakhri Ghassan Abbas Mar 2015

Theses and Dissertations

This thesis applies multi-way sensitivity analysis to the winning algorithm in the Knowledge Discovery and Data Mining (KDD) Cup competition 2014, "Predicting Excitement at DonorsChoose.org". Because of the highly advanced nature of this competition, analyzing the winning solution under a variety of different conditions provides insight into each of the models the winning team used in the competition. The study follows the Cross-Industry Standard Process for Data Mining (CRISP-DM) to examine the steps taken to prepare, model, and evaluate the model. The thesis focuses on a gradient boosting model. After careful examination of the models created by the researchers …


Privacy Preserving Data Mining For Numerical Matrices, Social Networks, And Big Data, Lian Liu Jan 2015

Theses and Dissertations--Computer Science

Motivated by increasing public awareness of the possible abuse of confidential information, which is considered a significant hindrance to the development of e-society and of medical and financial markets, a privacy-preserving data mining framework is presented so that data owners can carefully process data in order to preserve confidential information and guarantee information functionality within an acceptable boundary.

First, among the many privacy-preserving methodologies, data perturbation methods are a popular group of techniques for balancing data utility and information privacy: they add a noise signal, following a statistical distribution, to an original numerical matrix. With the help …
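
As a rough illustration of additive data perturbation, the sketch below adds independent noise to every entry of a small numerical matrix. The abstract says only that the noise follows "a statistical distribution"; the zero-mean Gaussian used here, the function name `perturb_matrix`, its parameters, and the sample values are all assumptions made for this example and do not come from the dissertation.

```python
import random

def perturb_matrix(matrix, sigma=1.0, seed=None):
    """Return a copy of `matrix` with zero-mean Gaussian noise added per entry.

    `sigma` is the noise standard deviation: larger values hide the original
    values better but distort the data more. `seed` makes the noise repeatable.
    """
    rng = random.Random(seed)
    return [[value + rng.gauss(0.0, sigma) for value in row] for row in matrix]

# Hypothetical records (rows) with three numerical attributes (columns).
original = [
    [52.0, 118.5, 7.2],
    [47.0, 131.0, 6.8],
]

# The data owner would release the perturbed copy instead of the original.
released = perturb_matrix(original, sigma=0.5, seed=42)
for row in released:
    print([round(value, 2) for value in row])
```

The choice of `sigma` is where the trade-off between data utility and privacy shows up: with `sigma=0.0` the released matrix equals the original, and as `sigma` grows the released values carry less information about any individual entry.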


Feature Selection And Classification Methods For Decision Making: A Comparative Analysis, Osiris Villacampa Jan 2015

CCE Theses and Dissertations

The use of data mining methods in corporate decision making has been increasing in the past decades. Its popularity can be attributed to better utilization of data mining algorithms, increased computer performance, and results that can be measured and applied to decision making. The effective use of data mining methods to analyze various types of data has shown great advantages in various application domains. While some data sets need little preparation to be mined, others, in particular high-dimensional data sets, need to be preprocessed in order to be mined due to the complexity and inefficiency in mining high dimensional …


Analysis Into Developing Accurate And Efficient Intrusion Detection Approaches, Priya Rabadia, Craig Valli Jan 2015

Australian Digital Forensics Conference

Cyber-security has become more prevalent as more organisations rely on cyber-enabled infrastructures to conduct their daily activities. Consequently, cybercrime and cyber-attacks are increasing. An Intrusion Detection System (IDS) is a cyber-security tool that is used to mitigate cyber-attacks. An IDS is a system deployed to monitor network traffic and trigger an alert when unauthorised activity has been detected. It is important for IDSs to accurately identify cyber-attacks against assets on cyber-enabled infrastructures, while also being efficient at processing current and predicted network traffic flows. The purpose of the paper is to outline the importance of developing an accurate and …


Domain-Specific Document Retrieval Framework For Near Real-Time Social Health Data, Swapnil Soni Jan 2015

Browse all Theses and Dissertations

With the advent of web search and microblogging, the percentage of Online Health Information Seekers (OHIS) using these services to share and seek health information in real time has increased exponentially. Recently, Twitter has emerged as one of the primary mediums for sharing and seeking the latest information on a variety of topics, including health. Although Twitter is an excellent information source, identifying useful information in the deluge of tweets is one of the major challenges. Twitter search is limited to keyword-based techniques to retrieve information for a given query and sometimes the results do not …


Contrast Pattern Aided Regression And Classification, Vahid Taslimitehrani Jan 2015

Browse all Theses and Dissertations

Regression and classification techniques play an essential role in many data mining tasks and have broad applications. However, most state-of-the-art regression and classification techniques are often unable to adequately model the interactions among predictor variables in highly heterogeneous datasets. New techniques that can effectively model such complex and heterogeneous structures are needed to significantly improve prediction accuracy. In this dissertation, we propose a novel type of accurate and interpretable regression and classification models, named Pattern Aided Regression (PXR) and Pattern Aided Classification (PXC), respectively. Both PXR and PXC rely on identifying regions in the data space where …