Open Access. Powered by Scholars. Published by Universities.®

Computer Sciences Commons

Open Access. Powered by Scholars. Published by Universities.®

Data mining

2020

Discipline
Institution
Publication
Publication Type

Articles 1 - 17 of 17

Full-Text Articles in Computer Sciences

Hierarchical Aggregation Of Multidimensional Data For Efficient Data Mining, Safaa Khalil Alwajidi Dec 2020

Hierarchical Aggregation Of Multidimensional Data For Efficient Data Mining, Safaa Khalil Alwajidi

Dissertations

Big data analysis is essential for many smart applications in areas such as connected healthcare, intelligent transportation, human activity recognition, environment, and climate change monitoring. Traditional data mining algorithms do not scale well to big data due to the enormous number of data points and the velocity of their generation. Mining and learning from big data need time and memory efficiency techniques, albeit the cost of possible loss in accuracy. This research focuses on the mining of big data using aggregated data as input. We developed a data structure that is to be used to aggregate data at multiple resolutions. …


Structured Data Mining Networks, Time Series, And Time Series Of Networks, Lin Zhang Dec 2020

Structured Data Mining Networks, Time Series, And Time Series Of Networks, Lin Zhang

Legacy Theses & Dissertations (2009 - 2024)

The rate at which data is generated in modern applications has created an unprecedented demand for novel methods to effectively and efficiently extract insightful patterns. Methods aware of known domain-specific structure in the data tend to be advantageous. In particular, a joint temporal and networked view of observations offers a holistic lens to many real-world systems. Example domains abound: activity of social network users, gene interactions over time, a temporal load of infrastructure networks, and others. Existing analysis and mining approaches for such data exhibit limited quality and scalability due to their sensitivity to noise, missing observations, and the need …


Hybrid Deep Neural Networks For Mining Heterogeneous Data, Xiurui Hou Aug 2020

Hybrid Deep Neural Networks For Mining Heterogeneous Data, Xiurui Hou

Dissertations

In the era of big data, the rapidly growing flood of data represents an immense opportunity. New computational methods are desired to fully leverage the potential that exists within massive structured and unstructured data. However, decision-makers are often confronted with multiple diverse heterogeneous data sources. The heterogeneity includes different data types, different granularities, and different dimensions, posing a fundamental challenge in many applications. This dissertation focuses on designing hybrid deep neural networks for modeling various kinds of data heterogeneity.

The first part of this dissertation concerns modeling diverse data types, the first kind of data heterogeneity. Specifically, image data and …


Mobile Location Data Analytics, Privacy, And Security, Yunhe Feng Aug 2020

Mobile Location Data Analytics, Privacy, And Security, Yunhe Feng

Doctoral Dissertations

Mobile location data are ubiquitous in the digital world. People intentionally and unintentionally generate numerous location data when connecting to cellular networks or sharing posts on social networks. As mobile devices normally choose to communicate with nearby cell towers outdoor, it is reasonable to infer human locations based on cell tower coordinates. Many social networking platforms, such as Twitter, allow users to geo-tag their posts optionally, publishing personal locations to friends or everyone. These location data are particularly useful for understanding mobile usage behaviors and human mobility patterns. Meanwhile, the public expresses great concern about the privacy and security of …


Detecting Undisclosed Paid Editing In Wikipedia, Nikesh Joshi Aug 2020

Detecting Undisclosed Paid Editing In Wikipedia, Nikesh Joshi

Boise State University Theses and Dissertations

Wikipedia is a free and open-collaboration based online encyclopedia. The website has millions of pages that are maintained by thousands of volunteer editors. It is part of Wikipedia’s fundamental principles that pages are written with a neutral point of view and are maintained by volunteer editors for free with well-defined guidelines in order to avoid or disclose any conflict of interest. However, there have been several known incidents where editors intentionally violate such guidelines in order to get paid (or even extort money) for maintaining promotional spam articles without disclosing such information.

This thesis addresses for the first time the …


Prevalence, Contents And Automatic Detection Of Kl-Satd, Leevi Rantala, Mika Mantyla, David Lo Aug 2020

Prevalence, Contents And Automatic Detection Of Kl-Satd, Leevi Rantala, Mika Mantyla, David Lo

Research Collection School Of Computing and Information Systems

When developers use different keywords such as TODO and FIXME in source code comments to describe self-admitted technical debt (SATD), we refer it as Keyword-Labeled SATD (KL-SATD). We study KL-SATD from 33 software repositories with 13,588 KL-SATD comments. We find that the median percentage of KL-SATD comments among all comments is only 1,52%. We find that KL-SATD comment contents include words expressing code changes and uncertainty, such as remove, fix, maybe and probably. This makes them different compared to other comments. KL-SATD comment contents are similar to manually labeled SATD comments of prior work. Our machine learning classifier using logistic …


Joint Lattice Of Reconstructability Analysis And Bayesian Network General Graphs, Marcus Harris, Martin Zwick Jul 2020

Joint Lattice Of Reconstructability Analysis And Bayesian Network General Graphs, Marcus Harris, Martin Zwick

Systems Science Faculty Publications and Presentations

This paper integrates the structures considered in Reconstructability Analysis (RA) and those considered in Bayesian Networks (BN) into a joint lattice of probabilistic graphical models. This integration and associated lattice visualizations are done in this paper for four variables, but the approach can easily be expanded to more variables. The work builds on the RA work of Klir (1985), Krippendorff (1986), and Zwick (2001), and the BN work of Pearl (1985, 1987, 1988, 2000), Verma (1990), Heckerman (1994), Chickering (1995), Andersson (1997), and others. The RA four variable lattice and the BN four variable lattice partially overlap: there are ten …


Reconstructability Analysis & Its Occam Implementation, Martin Zwick Jul 2020

Reconstructability Analysis & Its Occam Implementation, Martin Zwick

Systems Science Faculty Publications and Presentations

This talk will describe Reconstructability Analysis (RA), a probabilistic graphical modeling methodology deriving from the 1960s work of Ross Ashby and developed in the systems community in the 1980s and afterwards. RA, based on information theory and graph theory, resembles and partially overlaps Bayesian networks (BN) and log-linear techniques, but also has some unique capabilities. (A paper explaining the relationship between RA and BN will be given in this special session.) RA is designed for exploratory modeling although it can also be used for confirmatory hypothesis testing. In RA modeling, one either predicts some DV from a set of IVs …


Probabilistic Value Selection For Space Efficient Model, Gunarto Sindoro Njoo, Baihua Zheng, Kuo-Wei Hsu, Wen-Chih Peng Jul 2020

Probabilistic Value Selection For Space Efficient Model, Gunarto Sindoro Njoo, Baihua Zheng, Kuo-Wei Hsu, Wen-Chih Peng

Research Collection School Of Computing and Information Systems

An alternative to current mainstream preprocessing methods is proposed: Value Selection (VS). Unlike the existing methods such as feature selection that removes features and instance selection that eliminates instances, value selection eliminates the values (with respect to each feature) in the dataset with two purposes: reducing the model size and preserving its accuracy. Two probabilistic methods based on information theory's metric are proposed: PVS and P + VS. Extensive experiments on the benchmark datasets with various sizes are elaborated. Those results are compared with the existing preprocessing methods such as feature selection, feature transformation, and instance selection methods. Experiment results …


Detecting Credit Card Fraud: An Analysis Of Fraud Detection Techniques, William Lovo May 2020

Detecting Credit Card Fraud: An Analysis Of Fraud Detection Techniques, William Lovo

Senior Honors Projects, 2020-current

Advancements in the modern age have brought many conveniences, one of those being credit cards. Providing an individual the ability to hold their entire purchasing power in the form of pocket-sized plastic cards have made credit cards the preferred method to complete financial transactions. However, these systems are not infallible and may provide criminals and other bad actors the opportunity to abuse them. Financial institutions and their customers lose billions of dollars every year to credit card fraud. To combat this issue, fraud detection systems are deployed to discover fraudulent activity after they have occurred. Such systems rely on advanced …


A Studying Of Webcontent Mining Tools, Rasha Hani Salman, Mahmood Zaki, Nadia A. Shiltag Apr 2020

A Studying Of Webcontent Mining Tools, Rasha Hani Salman, Mahmood Zaki, Nadia A. Shiltag

Al-Qadisiyah Journal of Pure Science

The web today has become an archive of information in any structure such content, sound, video, designs, and multimedia, with the progression of time overall web, the world wide web is now crowded with different data making extraction of virtual data burdensome process, web utilizes various information mining strategies to mine helpful information from page substance and web hyperlink. The fundamental employments of web content mining are to gather, sort out, classify, providing the best data accessible on the web for the client who needs to get it. The WCM tools are needful to examining some HTML reports, content and …


Developing Big Data Projects In Open University Engineering Courses: Lessons Learned, Juan A. Lara, Aurea Anguera De Sojo, Shadi Aljawarneh, Robert P. Schumaker, Bassam Al-Shargabi Feb 2020

Developing Big Data Projects In Open University Engineering Courses: Lessons Learned, Juan A. Lara, Aurea Anguera De Sojo, Shadi Aljawarneh, Robert P. Schumaker, Bassam Al-Shargabi

Computer Science Faculty Publications and Presentations

Big Data courses in which students are asked to carry out Big Data projects are becoming more frequent as a part of University Engineering curriculum. In these courses, instructors and students must face a series of special characteristics, difficulties and challenges that it is important to know about beforehand, so the lecturer can better plan the subject and manage the teaching methods in order to prevent students' academic dropout and low performance. The goal of this research is to approach this problem by sharing the lessons learned in the process of teaching e-learning courses where students are required to develop …


A Framework For Online Social Network Volatile Data Analysis: A Case For The Fast Fashion Industry, Anoud Bani-Hani, Feras Al-Obeidat, Elhadj Benkhelifa, Oluwasegun Adedugbe Jan 2020

A Framework For Online Social Network Volatile Data Analysis: A Case For The Fast Fashion Industry, Anoud Bani-Hani, Feras Al-Obeidat, Elhadj Benkhelifa, Oluwasegun Adedugbe

All Works

Consumer satisfaction is an important part for any business as it has been shown to be a major factor for consumer loyalty. Identifying satisfaction in products is also important as it allows businesses alter production plans based on the level of consumer satisfaction for a product. With consumer satisfaction data being very volatile for some products due to a short requirement period for such products, current consumer satisfaction must be identified within a shorter period before the data becomes obsolete. The fast fashion industry, which is part of the fashion industry, is adopted as a case study in this research. …


Analysis And Optimization Of Combustion Characteristics Of Cement Kiln Cooperatively Disposing Domestic Refuse, Jingbing Wu, Hanqing Tang, Xu Jun Jan 2020

Analysis And Optimization Of Combustion Characteristics Of Cement Kiln Cooperatively Disposing Domestic Refuse, Jingbing Wu, Hanqing Tang, Xu Jun

Journal of System Simulation

Abstract: Because the traditional methods can hardly analyze the complex combustion characteristics of cement kiln mixed with domestic refuse, a data mining technology is introduced. A domestic cement plant is selected as the object, and its operating data and relevant parameters are collected. The influence coefficient of each parameter on coal consumption and NOx emission is analyzed by using Stability Selection algorithm. The mathematical model of coal consumption and NOx emission is established with Random Forest algorithm, and the key optimization parameters and their optimal values are obtained by K-means clustering algorithm. The result shows that this method …


Asking Questions Is Easy, Asking Great Questions Is Hard: Constructing Effective Stack Overflow Questions, Jane W. Hsieh Jan 2020

Asking Questions Is Easy, Asking Great Questions Is Hard: Constructing Effective Stack Overflow Questions, Jane W. Hsieh

Honors Papers

This paper explores and seeks to improve the ways in which Stack Overflow question posts can elicit answers. Using statistical data analysis approaches and reviews of existing literature, we pin- point three key factors that are found in many previously success- ful/answerable questions. We then present a prototypical sidebar for the ask page that leverages these factors to dynamically (1) evaluate the quality of questions in construction (2) display answer previews of relevant questions and (3) scaffold the identified factors to subsequent askers during their question development processes.


Multitask-Based Association Rule Mining, Peli̇n Yildirim Taşer, Kökten Ulaş Bi̇rant, Derya Bi̇rant Jan 2020

Multitask-Based Association Rule Mining, Peli̇n Yildirim Taşer, Kökten Ulaş Bi̇rant, Derya Bi̇rant

Turkish Journal of Electrical Engineering and Computer Sciences

Recently, there has been a growing interest in association rule mining (ARM) in various fields. However, standard ARM algorithms fail to discover rules for multitask problems as they do not consider task-oriented investigation and, therefore, they ignore the correlation among the tasks. Considering this situation, this paper proposes a novel algorithm, named multitask association rule miner (MTARM), that tends to jointly discover rules by considering multiple tasks. This paper also introduces two novel concepts: single-task rule and multiple-task rule. In the first phase of the proposed approach, highly frequent local rules (single-task rules) are explored for each task separately and …


Searching For Needles In The Cosmic Haystack, Thomas Ryan Devine Jan 2020

Searching For Needles In The Cosmic Haystack, Thomas Ryan Devine

Graduate Theses, Dissertations, and Problem Reports

Searching for pulsar signals in radio astronomy data sets is a difficult task. The data sets are extremely large, approaching the petabyte scale, and are growing larger as instruments become more advanced. Big Data brings with it big challenges. Processing the data to identify candidate pulsar signals is computationally expensive and must utilize parallelism to be scalable. Labeling benchmarks for supervised classification is costly. To compound the problem, pulsar signals are very rare, e.g., only 0.05% of the instances in one data set represent pulsars. Furthermore, there are many different approaches to candidate classification with no consensus on a best …