Open Access. Powered by Scholars. Published by Universities.®

Physical Sciences and Mathematics Commons

Computer Sciences

Theses/Dissertations

Big data

Articles 1 - 30 of 51

Full-Text Articles in Physical Sciences and Mathematics

Interposition Based Container Optimization For Data Intensive Applications, Rohan Tikmany Jul 2023

College of Computing and Digital Media Dissertations

Reproducibility of applications is paramount in several scenarios such as collaborative work and software testing. Containers provide an easy way of addressing reproducibility by packaging the application's software and data dependencies into one executable unit, which can be executed multiple times in different environments. With the increased use of containers in industry as well as academia, current research has examined the provisioning and storage cost of containers and has shown that container deployments often include unnecessary software packages. Current methods to optimize the container size prune unnecessary data at the granularity of files and thus make binary decisions. We show …


Digital DNA: The Ethical Implications Of Big Data As The World’s New-Age Commodity, Clark H. Dotson May 2023

Honors Theses

In the emerging digital world that we find ourselves in, it becomes apparent that data collection has become a staple of daily life, whether we like it or not. This research discussion aims to bring light to just how much one’s own digital identity is valued in the technologically-infused world of today, with distinct research and local examples to bring awareness to the ethical implications of your online presence. The paper in question examines anecdotal and research evidence of the collection of data, both through true and unjust means, as well as ethical implications of what this information truly represents. …


A Study Of The Impact Of Data Intelligence On Software Delivery Performance, Yongdong Dong Mar 2023

Dissertations and Theses Collection (Open Access)

With the rise of big data and artificial intelligence, data intelligence has gradually become the focus of academia and industry. Data intelligence has two obvious characteristics: it is driven by big data and by application scenarios. More and more enterprises extract valuable patterns contained in data with prediction and decision analysis methods and technologies such as large-scale data mining, machine learning and deep learning, and use them to improve management and decision-making in complex practice, so as to promote changes in business models, organizational structures and even business strategies, and to improve the operational efficiency of organizations. However, there are few studies …


Compilation Optimizations To Enhance Resilience Of Big Data Programs And Quantum Processors, Travis D. Lecompte Nov 2022

LSU Doctoral Dissertations

Modern computers can experience a variety of transient errors due to the surrounding environment, known as soft faults. Although the frequency of these faults is low enough to not be noticeable on personal computers, they become a considerable concern during large-scale distributed computations or systems in more vulnerable environments like satellites. These faults occur as a bit flip of some value in a register, operation, or memory during execution. They surface as either program crashes, hangs, or silent data corruption (SDC), each of which can waste time, money, and resources. Hardware methods, such as shielding or error correcting memory (ECM), …


Efficient And Scalable Triangle Centrality Algorithms In The Arkouda Framework, Joseph Thomas Patchett Aug 2022

Theses

Graph data structures provide a unique challenge for both analysis and algorithm development. These data structures are irregular in that memory accesses are not known a priori and accesses to these structures tend to lack locality.

Despite these challenges, graph data structures are a natural way to represent relationships between entities and to exhibit unique features about these relationships. The network created from these relationships can create unique local structures that can describe the behavior between members of these structures. Graphs can be analyzed in a number of different ways including at a high level in community detection and at …


On Performance Optimization And Prediction Of Parallel Computing Frameworks In Big Data Systems, Haifa Alquwaiee Dec 2021

Dissertations

A wide spectrum of big data applications in science, engineering, and industry generate large datasets, which must be managed and processed in a timely and reliable manner for knowledge discovery. These tasks are now commonly executed in big data computing systems exemplified by Hadoop based on parallel processing and distributed storage and management. For example, many companies and research institutions have developed and deployed big data systems on top of NoSQL databases such as HBase and MongoDB, and parallel computing frameworks such as MapReduce and Spark, to ensure timely data analyses and efficient result delivery for decision making and business …


Translation Of Array-Based Loop Programs To Optimized SQL-Based Distributed Programs, Md Hasanuzzaman Noor Dec 2021

Computer Science and Engineering Dissertations

Most programs written to operate on data are expressed in terms of array operations in sequential loops. However, these programs do not scale to the large amounts of data generated by scientific experiments and industrial and commercial markets. Given the success of machine learning algorithms on large amounts of data and the recent shift of industries to data-driven decision making, data scientists who are not familiar with Big Data frameworks have to rewrite sequential programs as distributed data-parallel programs by hand. We present a novel framework, called SQLgen, that automatically translates sequential loops to distributed data-parallel programs. SQLgen …


Data-Driven Operational And Safety Analysis Of Emerging Shared Electric Scooter Systems, Qingyu Ma Dec 2021

Computational Modeling & Simulation Engineering Theses & Dissertations

The rapid rise of shared electric scooter (E-Scooter) systems offers many urban areas a new micro-mobility solution. The portable and flexible characteristics have made E-Scooters a competitive mode for short-distance trips. Compared to other modes such as bikes, E-Scooters allow riders to freely ride on different facilities such as streets, sidewalks, and bike lanes. However, sharing lanes with vehicles and other users tends to cause safety issues for riding E-Scooters. Conventional methods are often not applicable for analyzing such safety issues because well-archived historical crash records are not commonly available for emerging E-Scooters.

Perceiving the growth of such a micro-mobility …


Learning From Multi-Class Imbalanced Big Data With Apache Spark, William C. Sleeman IV Jan 2021

Theses and Dissertations

With data becoming a new form of currency, its analysis has become a top priority in both academia and industry, furthering advancements in high-performance computing and machine learning. However, these large, real-world datasets come with additional complications such as noise and class overlap. Problems are magnified when multi-class data is presented, especially since many of the popular algorithms were originally designed for binary data. Another challenge arises when the examples are not evenly distributed across all classes in a dataset. This often causes classifiers to favor the majority class over the minority classes, leading to undesirable results …


Binary Black Widow Optimization Algorithm For Feature Selection Problems, Ahmed Al-Saedi Jan 2021

Theses and Dissertations (Comprehensive)

This thesis addresses the feature selection (FS) problem, a primary stage in data mining. FS is a significant pre-processing stage that enhances the performance of the process with regard to computation cost and accuracy and offers a better comprehension of stored data by removing unnecessary and irrelevant features from the basic dataset. However, because of the size of the problem, FS is known to be very challenging and has been classified as an NP-hard problem. Traditional methods can only be used to solve small problems. Therefore, metaheuristic algorithms (MAs) are becoming powerful methods for addressing FS problems. …


Performance Optimization Of Big Data Computing Workflows For Batch And Stream Data Processing In Multi-Clouds, Huiyan Cao Dec 2020

Dissertations

Workflow techniques have been widely used as a major computing solution in many science domains. With the rapid deployment of cloud infrastructures around the globe and the economic benefits of cloud-based computing and storage services, an increasing number of scientific workflows have migrated or are in active transition to clouds. As the scale of scientific applications continues to grow, it is now common to deploy various data- and network-intensive computing workflows such as serial computing workflows, MapReduce/Spark-based workflows, and Storm-based stream data processing workflows in multi-cloud environments, where inter-cloud data transfer oftentimes plays a significant role in both workflow performance …


Improving A Wireless Localization System Via Machine Learning Techniques And Security Protocols, Zachary Yorio Dec 2020

Masters Theses, 2020-current

The recent advancements made in Internet of Things (IoT) devices have brought forth new opportunities for technologies and systems to be integrated into our everyday life. In this work, we investigate how edge nodes can effectively utilize 802.11 wireless beacon frames being broadcast from pre-existing access points in a building to achieve room-level localization. We explain the needed hardware and software for this system and demonstrate a proof of concept with experimental data analysis. Improvements to localization accuracy are shown via machine learning by implementing the random forest algorithm. Using this algorithm, historical data can train the model and make …


Mr_Qp: A Scalable Approach To Query Processing On Arbitrary-Size Graphs Using The Map/Reduce Framework, Harshit Ashokkumar Modi May 2020

Computer Science and Engineering Theses

The utility and widespread use of Relational Database Management Systems (RDBMSs) come not only from their simple, easy-to-understand data model (a relation or a set) but mainly from the ability to write non-procedural queries and their optimization by the system. Queries produce exact answers that match the contents of the database. Query processing in RDBMSs has been researched for more than four decades and includes extensions to more complex analysis on data warehouses. In contrast, search has not been addressed by RDBMSs. As the use of other data types (key-value store, column-store, and graphs to name a few) are becoming …


Exploring Strategies To Transition To Big Data Technologies From DW Technologies, Mbah Johnas Fortem Jan 2020

Walden Dissertations and Doctoral Studies

As a result of innovation and technological improvements, organizations are now capable of capturing and storing massive amounts of data from various sources and domains. This increase in the volume of data resulted in traditional tools used for processing, storing, and analyzing large amounts of data becoming increasingly inefficient. Grounded in the extended technology acceptance model, the purpose of this qualitative multiple case study was to explore the strategies data managers use to transition from traditional data warehousing technologies to big data technologies. The participants included data managers from 6 organizations (medium and large size) based in Munich, Germany, who …


Performance Modeling And Resource Provisioning For Data-Intensive Applications, Zhongwei Li Dec 2019

Computer Science and Engineering Theses

Performance evaluation and resource provisioning are the two most critical factors for designers of distributed systems at modern warehouse-scale data centers to consider. The ever-increasing volumes of data in recent years have pushed many businesses to move their computing tasks to the Cloud, which offers many benefits including low system management and maintenance costs and better scalability. As a result, the most prominent emerging workloads are data-intensive, calling for scaling out the workload to a large number of servers for parallel processing. Questions can be asked as to what factors impact the system scaling performance, and how to efficiently schedule …


Performance Modeling And Resource Provisioning For Data-Intensive Applications, Zhongwei Li Dec 2019

Computer Science and Engineering Dissertations

Performance evaluation and resource provisioning are the two most critical factors for designers of distributed systems at modern warehouse-scale data centers to consider. The ever-increasing volumes of data in recent years have pushed many businesses to move their computing tasks to the Cloud, which offers many benefits including low system management and maintenance costs and better scalability. As a result, the most prominent emerging workloads are data-intensive, calling for scaling out the workload to a large number of servers for parallel processing. Questions can be asked as to what factors impact the system scaling performance, and how to efficiently schedule …


High-Performance Computing Frameworks For Large-Scale Genome Assembly, Sayan Goswami Jun 2019

LSU Doctoral Dissertations

Genome sequencing technology has witnessed tremendous progress in terms of throughput and cost per base pair, resulting in an explosion in the size of data. Typical de Bruijn graph-based assembly tools demand a lot of processing power and memory and cannot assemble big datasets unless running on a scaled-up server with terabytes of RAM or a scaled-out cluster with several dozen nodes. In the first part of this work, we present a distributed next-generation sequence (NGS) assembler called Lazer that achieves both scalability and memory efficiency by using partitioned de Bruijn graphs. By enhancing the memory-to-disk swapping and reducing the …


Statistical Machine Learning Methods For Mining Spatial And Temporal Data, Fei Tan May 2019

Dissertations

Spatial and temporal dependencies are ubiquitous properties of data in numerous domains. The popularity of spatial and temporal data mining has thus grown with the increasing prevalence of massive data. The presence of spatial and temporal attributes not only provides complementary useful perspectives, but also poses new challenges to the representation and integration into the learning procedure. In this dissertation, the involved spatial and temporal dependencies are explored with three genres: sample-wise, feature-wise, and target-wise. A family of novel methodologies is developed accordingly for the dependency representation in respective scenarios.

First, dependencies among discrete, continuous and repeated observations are studied …


Google Trends Data As A Proxy For Interest In Leadership, Finley W. Walker Apr 2019

Doctor of Education (Ed.D)

The purpose of this quantitative study was to investigate the observable patterns of online search behavior on the topic of leadership using Google Trends data. Institutions have historically had a difficult time predicting good leadership candidates. Better predictions can be made by using the big data offered by groups such as Google to learn who is interested in leadership, and where and when. The study utilized descriptive, comparative, and correlative methodologies to study Google users’ interest in leadership from 2004 to 2017. Society has placed great value on leadership throughout history, and though overall interest remains strong, it appears that …


A Data-Driven Approach For Modeling Agents, Hamdi Kavak Apr 2019

Computational Modeling & Simulation Engineering Theses & Dissertations

Agents are commonly created on a set of simple rules driven by theories, hypotheses, and assumptions. Such a modeling premise makes limited use of real-world data and is challenged when modeling real-world systems due to the lack of empirical grounding. Simultaneously, the last decade has witnessed the production and availability of large-scale data from various sensors that carry behavioral signals. These data sources have the potential to change the way we create agent-based models, from rule-driven to data-driven. Despite this opportunity, the literature has neglected to offer a modeling approach to generate granular agent behaviors from data, creating …


Privacy Preservation In Social Media Environments Using Big Data, Katrina Ward Jan 2019

Doctoral Dissertations

With the pervasive use of mobile devices, social media, home assistants, and smart devices, the idea of individual privacy is fading. More than ever, the public is giving up personal information in order to take advantage of what are now considered everyday conveniences and ignoring the consequences. Even seemingly harmless information is making headlines for its unauthorized use (18). Among this data is user trajectory data, which can be described as a user's location information over a time period (6). This data is generated whenever users access their devices to record their location, query the location of a point …


Leveraging Tiled Display For Big Data Visualization Using D3.js, Ujjwal Acharya Aug 2018

Boise State University Theses and Dissertations

Data visualization has proven effective at detecting patterns and drawing inferences from raw data by transforming it into visual representations. As data grows large, visualizing it faces two major challenges: 1) limited resolution i.e. a screen is limited to a few million pixels but the data can have a billion data points, and 2) computational load i.e. processing of this data becomes computationally challenging for a single node system. This work addresses both of these issues for efficient big data visualization. In the developed system, a High Pixel Density and Large Format display was used enabling the display of fine …


Deep Data Locality On Apache Hadoop, Sungchul Lee May 2018

UNLV Theses, Dissertations, Professional Papers, and Capstones

The amount of data being collected in various areas such as social media, networks, scientific instruments, mobile devices, and sensors is growing continuously, and the technology to process it is also advancing rapidly. One of the fundamental technologies to process big data is Apache Hadoop, which has been adopted by many commercial products, such as InfoSphere by IBM, or Spark by Cloudera. MapReduce on Hadoop has been widely used in many data science applications. As a dominant big data processing platform, the performance of MapReduce on Hadoop system has a significant impact on the big data processing capability across multiple …


Secure Multiparty Protocol For Differentially-Private Data Release, Anthony Harris May 2018

Boise State University Theses and Dissertations

In an era where big data is the new norm, a higher emphasis has been placed on models which guarantee the release and exchange of data. The need for privacy-preserving data arose as more sophisticated data-mining techniques led to breaches of sensitive information. In this thesis, we present a secure multiparty protocol for the purpose of integrating multiple datasets simultaneously such that the contents of each dataset are not revealed to any of the data owners, and the contents of the integrated data do not compromise individuals’ privacy. We utilize privacy by simulation to prove that the protocol is privacy-preserving, …


Efficient Reduced Bias Genetic Algorithm For Generic Community Detection Objectives, Aditya Karnam Gururaj Rao Apr 2018

Theses

The problem of community structure identification has been an extensively investigated area for biology, physics, social sciences, and computer science in recent years for studying the properties of networks representing complex relationships. Most traditional methods, such as K-means and hierarchical clustering, are based on the assumption that communities have spherical configurations. Lately, Genetic Algorithms (GA) are being utilized for efficient community detection without imposing sphericity. GAs are machine learning methods which mimic natural selection and scale with the complexity of the network. However, traditional GA approaches employ a representation method that dramatically increases the solution space to be searched by …


Supporting Big Data At The Vehicular Edge, Lloyd Decker Apr 2018

Computer Science Theses & Dissertations

Vehicular networks are commonplace, and many applications have been developed to utilize their sensor and computing resources. This is a great utilization of these resources as long as they are mobile. The question to ask is whether these resources could be put to use when the vehicle is not mobile. If the vehicle is parked, the resources are simply dormant and waiting for use. If the vehicle has a connection to a larger computing infrastructure, then it can put its resources towards that infrastructure. With enough vehicles interconnected, there exists a computing environment that could handle many cloud-based application services. …


Assessment Of Factors Influencing Intent-To-Use Big Data Analytics In An Organization: A Survey Study, Wayne Madhlangobe Jan 2018

CCE Theses and Dissertations

The central question was how the relationship between trust-in-technology and intent-to-use Big Data Analytics in an organization is mediated by both Perceived Risk and Perceived Usefulness. Big Data Analytics is quickly becoming a critically important driver for business success. Many organizations are increasing their Information Technology budgets for Big Data Analytics capabilities. The Technology Acceptance Model stands out as a critical theoretical lens primarily due to its assessment approach and its capacity to predict and explain individual behaviors in the adoption of technology. Big Data Analytics use in this study was considered a voluntary act and is therefore well aligned with the Theory of …


Analytic Extensions To The Data Model For Management Analytics And Decision Support In The Big Data Environment, Nsikak Etim Akpakpan Jan 2018

Walden Dissertations and Doctoral Studies

From 2006 to 2016, an estimated average of 50% of big data analytics and decision support projects failed to deliver acceptable and actionable outputs to business users. The resulting management inefficiency came with high cost, and wasted investments estimated at $2.7 trillion in 2016 for companies in the United States. The purpose of this quantitative descriptive study was to examine the data model of a typical data analytics project in a big data environment for opportunities to improve the information created for management problem-solving. The research questions focused on finding artifacts within enterprise data to model key business scenarios for …


Offline And Online Density Estimation For Large High-Dimensional Data, Aref Majdara Jan 2018

Dissertations, Master's Theses and Master's Reports

Density estimation has wide applications in machine learning and data analysis techniques including clustering, classification, multimodality analysis, bump hunting and anomaly detection. In high-dimensional space, the sparsity of data in local neighborhoods makes many parametric and nonparametric density estimation methods inefficient.

This work presents the development of computationally efficient algorithms for high-dimensional density estimation, based on Bayesian sequential partitioning (BSP). A copula transform is used to separate the estimation of marginal and joint densities, with the purpose of reducing the computational complexity and estimation error. Using this separation, a parallel implementation of the density estimation algorithm on a 4-core CPU is …


Visual Logging Framework Using ELK Stack, Ravi Nishant Dec 2017

Computer Science and Engineering Theses

Logging is the process of storing information for future reference and audit purposes. In software applications, logging plays a critical role as a development utility and helps ensure code quality. It acts as an enabler for developers and support professionals by giving them the capability to see an application’s functionality and understand any issues with it. Data logging is widely used in scientific experiments and analytical systems. Major systems that rely heavily on data logging include weather reporting services, digital advertising, search engines, and space exploration systems, to name a few. Although data logging increases the productivity and efficiency of a software system, …