Open Access. Powered by Scholars. Published by Universities.®

Physical Sciences and Mathematics Commons

2020

Data Science

Articles 1 - 30 of 223

Full-Text Articles in Physical Sciences and Mathematics

Distributed Load Testing By Modeling And Simulating User Behavior, Chester Ira Parrott Dec 2020

LSU Doctoral Dissertations

Modern human-machine systems such as microservices rely upon agile engineering practices which require changes to be tested and released more frequently than classically engineered systems. A critical step in the testing of such systems is the generation of realistic workloads or load testing. Generated workload emulates the expected behaviors of users and machines within a system under test in order to find potentially unknown failure states. Typical testing tools rely on static testing artifacts to generate realistic workload conditions. Such artifacts can be cumbersome and costly to maintain; however, even model-based alternatives can prevent adaptation to changes in a system …


Data: The Good, The Bad And The Ethical, John D. Kelleher, Filipe Cabral Pinto, Luis M. Cortesao Dec 2020

Articles

It is often the case with new technologies that it is very hard to predict their long-term impacts and as a result, although new technology may be beneficial in the short term, it can still cause problems in the longer term. This is what happened with oil by-products in different areas: the use of plastic as a disposable material did not take into account the hundreds of years necessary for its decomposition and its related long-term environmental damage. Data is said to be the new oil. The message to be conveyed is associated with its intrinsic value. But as in …


Analysis And Implementation Of The Maximum Likelihood Expectation Maximization Algorithm For Find, Angus Boyd Jameson Dec 2020

Student Research Projects

This thesis presents an organized explanation and breakdown of the Maximum Likelihood Expectation Maximization image reconstruction algorithm. This background research was used to develop a means of implementing the algorithm into the imaging code for UNH's Field Deployable Imaging Neutron Detector to improve its ability to resolve complex neutron sources. This thesis provides an overview of this implementation scheme and includes the results of a couple of reconstruction tests for the algorithm. A discussion is given on the current state of the algorithm and its integration with the neutron detector system, and suggestions are given for how the work and …
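
For readers unfamiliar with the algorithm named in this abstract, the following is a minimal sketch of the textbook MLEM multiplicative update, not the thesis's implementation. The names `system_matrix` (probability that an emission from pixel j is recorded in detector bin i), `counts`, and `mlem` are hypothetical, and a tiny dense list-of-lists matrix is assumed purely for illustration.

```python
def mlem(system_matrix, counts, iterations=20):
    """Textbook MLEM update (illustrative sketch, hypothetical names).

    system_matrix[i][j]: assumed probability that pixel j contributes to bin i.
    counts[i]: measured counts in detector bin i.
    """
    n_bins = len(system_matrix)
    n_pixels = len(system_matrix[0])

    # Sensitivity of each pixel: column sums of the system matrix.
    sensitivity = [
        sum(system_matrix[i][j] for i in range(n_bins)) for j in range(n_pixels)
    ]

    # Start from a uniform, strictly positive image estimate.
    estimate = [1.0] * n_pixels

    for _ in range(iterations):
        # Forward-project the current estimate into detector space.
        projection = [
            sum(system_matrix[i][j] * estimate[j] for j in range(n_pixels))
            for i in range(n_bins)
        ]
        # Ratio of measured to predicted counts (0 where nothing is predicted).
        ratio = [
            counts[i] / projection[i] if projection[i] > 0 else 0.0
            for i in range(n_bins)
        ]
        # Back-project the ratio and apply the multiplicative correction.
        estimate = [
            estimate[j]
            * sum(system_matrix[i][j] * ratio[i] for i in range(n_bins))
            / sensitivity[j]
            if sensitivity[j] > 0
            else 0.0
            for j in range(n_pixels)
        ]
    return estimate
```

With an identity system matrix, a single iteration reproduces the measured counts; for a general matrix the update keeps the estimate non-negative and preserves the total number of measured counts.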


Bayesian Semi-Supervised Keyphrase Extraction And Jackknife Empirical Likelihood For Assessing Heterogeneity In Meta-Analysis, Guanshen Wang Dec 2020

Statistical Science Theses and Dissertations

This dissertation investigates: (1) A Bayesian Semi-supervised Approach to Keyphrase Extraction with Only Positive and Unlabeled Data, (2) Jackknife Empirical Likelihood Confidence Intervals for Assessing Heterogeneity in Meta-analysis of Rare Binary Events.

In the big data era, people are blessed with a huge amount of information. However, the availability of information may also pose great challenges. One big challenge is how to extract useful yet succinct information in an automated fashion. As one of the first few efforts, keyphrase extraction methods summarize an article by identifying a list of keyphrases. Many existing keyphrase extraction methods focus on the unsupervised setting, …


Analysis Of Github Pull Requests, Canon Ellis Dec 2020

Computer Science and Engineering Theses and Dissertations

The popularity of the software repository site GitHub has driven a rise in the use of the pull-based development model. An essential part of pull-based development is the creation of Pull Requests. Pull Requests often have to be reviewed by an individual before they are approved and accepted into the Master branch of a software repository. The reviewing process can often be time-consuming and introduce a relatively high level of lost development time. This paper examines thousands of pull requests to understand the most valuable metadata of pull requests. We then introduce metrics in comparing the metadata of pull requests to understand …


Designing Surveys On Youth Immigration Reform: Lessons From The 2016 Cces Anomaly, Saige Calkins Dec 2020

Masters Theses

Even with clear advantages to using internet-based survey research, there are still some uncertainties about which survey methods are most conducive to an online platform. Most survey method literature, whether focusing on online, telephone, or in-person formats, tends to observe little to no differences between using various survey modes and survey results. Despite this, there is little research focused on the interaction effect between survey formatting, in terms of design and framing, and public opinion on social issues, specifically child immigration policies, a recent topic of popular debate. This paper examines an anomalous result found within the 2016 …


Machine Learning Model Selection For Predicting Global Bathymetry, Nicholas P. Moran Dec 2020

University of New Orleans Theses and Dissertations

This work is concerned with the viability of Machine Learning (ML) in training models for predicting global bathymetry, and whether there is a best-fit model for predicting that bathymetry. The desired result is an investigation of the ability of ML to be used in future prediction models and to experiment with multiple trained models to determine an optimum selection. Ocean features were aggregated from a set of external studies and placed into two-minute spatial grids representing the earth's oceans. A set of regression models, classification models, and a novel classification model were then fit to this data and …


Automatically Classifying Non-Functional Requirements With Feature Extraction And Supervised Machine Learning Techniques, Mahtab Ezzatikarami Dec 2020

Electronic Thesis and Dissertation Repository

Abstract. Context and Motivation: Non-functional requirements (NFRs) of a system need to be classified into different types such as usability, performance, etc. This would enable stakeholders to ensure the completeness of their work by extracting specific NFRs related to their expertise. Question/Problem: Because of the size and complexity of requirement specification documents, the manual classification of NFRs is time-consuming, labour-intensive, and error-prone. We thus need an automated solution that can provide a highly accurate and efficient categorization of NFRs. Principal ideas/results: In this investigation, using natural language processing and supervised machine learning (SML) techniques, we investigate with feature extraction techniques …


Classifying Imbalanced Financial Fraud Data Utilizing Enhanced Random Forest Algorithm, Charles Gardner Dec 2020

Master of Science in Computer Science Theses

Imbalanced datasets have been a unique challenge for machine learning, requiring specialized approaches to correctly classify the minority class. Financial fraud detection involves using highly imbalanced datasets with a class imbalance of up to 0.01% frauds to 99.99% regular transactions. It is essential to identify all frauds in financial fraud detection, even if some classifications' precision is low. I developed a random forest assembly that separates fraudulent transactions into tiers of precision. With this approach, 96% of fraudulent transactions are identified, showing an 8% increase in recall when compared to standard approaches. 59% of fraud classifications' precision increases by 10% …
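
Because this abstract reports its results in terms of recall and precision, a short sketch of how those two metrics are computed for a minority "fraud" class may help. This is a generic illustration rather than the author's evaluation code; the function name `precision_recall` and the toy labels in the usage note are hypothetical.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Precision and recall for one class (illustrative, hypothetical helper)."""
    pairs = list(zip(y_true, y_pred))
    # True positives: frauds that were flagged as fraud.
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    # False positives: regular transactions flagged as fraud.
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    # False negatives: frauds that were missed.
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    # Precision: share of fraud flags that are real frauds.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: share of real frauds that were flagged.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall
```

For example, flagging every transaction as fraud in a made-up sample such as `precision_recall([1, 1, 0, 0], [1, 1, 1, 1])` catches every fraud (recall 1.0) at the cost of precision (0.5), which is the trade-off the abstract describes.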


Reading Pdfs Using Adversarially Trained Convolutional Neural Network Based Optical Character Recognition, Michael B. Brewer, Michael Catalano, Yat Leung, David Stroud Dec 2020

SMU Data Science Review

A common problem that has plagued companies for years is digitizing documents and making use of the data contained within. Optical Character Recognition (OCR) technology has flooded the market, but companies still face challenges productionizing these solutions at scale. Although these technologies can identify and recognize the text on the page, they fail to classify the data to the appropriate datatype in an automated system that uses OCR technology as its data mining process. The research contained in this paper presents a novel framework for the identification of datapoints on check stub images by utilizing generative adversarial networks (GANs) to …


Principal Component Analysis For Predicting The Party Of The Legislators, Afsana Mimi Dec 2020

Publications and Research

In Spring 2020, I did a project, "Decision Tree Predicting the Party of Legislators," and constructed a decision tree model to predict legislators' parties based on their votes. We also used this model to identify legislators who frequently voted against their parties. We used the legislators' roll call votes from the Office of the Clerk, U.S. House of Representatives data sets (categorical values), collected in 2018 and 2019. In this new project, we study the 2018 and 2019 vote data using Principal Component Analysis (PCA). The goal is to find a (compressed) model using unsupervised learning to distinguish the legislators' parties, and PCA …


Data Science In The Time Of Covid-19, Tony Breitzman Dec 2020

Faculty Scholarship for the College of Science & Mathematics

No abstract provided.


Introduction To Data Science Lti 110, Joanna Burkhardt Dec 2020

Library Impact Statements

No abstract provided.


Introduction To Data Science, Joanna Burkhardt Dec 2020

Library Impact Statements

No abstract provided.


Spatial Frequency Implications For Global And Local Processing In Autistic Children, Riya Mody, Ayra Tusneem, Louanne Boyd, Vincent Berardi Dec 2020

Student Scholar Symposium Abstracts and Posters

Visual processing in humans is done by integrating and updating multiple streams of global and local sensory input. Interaction between these two systems can be disrupted in individuals with ASD and other learning disabilities. When this integration is not done smoothly, it becomes difficult to see the “big picture”, which has been found to have implications on emotion recognition, social skills, and conversation skills. An example of this phenomenon is local interference, which is when local details are prioritized over the global features. Previous research in this field has aimed to decrease local interference by developing and evaluating a filter …


Factors Affecting Computer Science Research Productivity And Impact In Nigeria: A Bibliometric Evidence, Azubuike Ezenwoke Dec 2020

Library Philosophy and Practice (e-journal)

Computer science is a burgeoning research field and has the potential to accelerate the rate of industrialisation and subsequently, economic development. Using bibliometric data obtained from Scopus, this study employed a 15-year bibliometric analysis to highlight Nigeria’s productivity and impact trends in the computer science research landscape. Our findings are summarised as follows: First, Nigeria’s computer science research contribution and citations are meager in comparison to the global output. Secondly, international collaboration is generally weak as most collaborations are national in scope. Third, Nigeria’s computer science-related research is published in low-quality outlets, as Scopus has discontinued the indexing of most …


Nonparametric Bayesian Deep Learning For Scientific Data Analysis, Devanshu Agrawal Dec 2020

Doctoral Dissertations

Deep learning (DL) has emerged as the leading paradigm for predictive modeling in a variety of domains, especially those involving large volumes of high-dimensional spatio-temporal data such as images and text. With the rise of big data in scientific and engineering problems, there is now considerable interest in the research and development of DL for scientific applications. The scientific domain, however, poses unique challenges for DL, including special emphasis on interpretability and robustness. In particular, a priority of the Department of Energy (DOE) is the research and development of probabilistic ML methods that are robust to overfitting and offer reliable …


Exploration Of Mid To Late Paleozoic Tectonics Along The Cincinnati Arch Using Gis And Python To Automate Geologic Data Extraction From Disparate Sources, Kenneth Steven Boling Dec 2020

Doctoral Dissertations

Structure contour maps are one of the most common methods of visualizing geologic horizons as three-dimensional surfaces. In addition to their practical applications in the oil and gas and mining industries, these maps can be used to evaluate the relationships of different geologic units in order to unravel the tectonic history of an area. The construction of high-resolution regional structure contour maps of a particular geologic horizon requires a significant volume of data that must be compiled from all available surface and subsurface sources. Processing these data using conventional methods and even basic GIS tools can be tedious and very …


Healthcare Regulation And Governance: Big Data Analytics And Healthcare Data Protection, Xuejuan Zhang Dec 2020

School of Continuing and Professional Studies Student Papers

No abstract provided.


Evaluating The Reproducibility Of Physiological Stress Detection Models, Varun Mishra, Sougata Sen, Grace Chen, Tian Hao, Jeffrey Rogers, Ching-Hua Chen, David Kotz Dec 2020

Dartmouth Scholarship

Recent advances in wearable sensor technologies have led to a variety of approaches for detecting physiological stress. Even with over a decade of research in the domain, there still exist many significant challenges, including a near-total lack of reproducibility across studies. Researchers often use some physiological sensors (custom-made or off-the-shelf), conduct a study to collect data, and build machine-learning models to detect stress. There is little effort to test the applicability of the model with similar physiological data collected from different devices, or the efficacy of the model on data collected from different studies, populations, or demographics.

This paper takes …


Detecting Hacker Threats: Performance Of Word And Sentence Embedding Models In Identifying Hacker Communications, Susan Mckeever, Brian Keegan, Andrei Quieroz Dec 2020

Conference papers

Abstract—Cyber security is striving to find new forms of protection against hacker attacks. An emerging approach nowadays is the investigation of security-related messages exchanged on deep/dark web and even surface web channels. This approach can be supported by the use of supervised machine learning models and text mining techniques. In our work, we compare a variety of machine learning algorithms, text representations and dimension reduction approaches for the detection accuracies of software-vulnerability-related communications. Given the imbalanced nature of the three public datasets used, we investigate appropriate sampling approaches to boost detection accuracies of our models. In addition, we examine how …


Unifying Chemistry And Machine Learning For The Study Of Noncovalent Interactions, Jacob A. Townsend Dec 2020

Doctoral Dissertations

Gas separations are in great demand for carbon emission reduction, natural gas purification, oxygen isolation, and much more. Many of these separations rely on cost-prohibitive methods such as cryogenic distillation or strong-binding solvents. As a result, novel materials are being developed to subvert the energetic expense of gas separation processes. These studies focus on improving the performance of alternative materials, including (but not limited to) metal-organic frameworks, covalent organic frameworks, dense polymeric membranes, porous polymers, and ionic liquids.

In this work, the atomistic effects of functional units are explored for gas separations processes using electronic structure theory and machine learning. …


Hierarchical Aggregation Of Multidimensional Data For Efficient Data Mining, Safaa Khalil Alwajidi Dec 2020

Dissertations

Big data analysis is essential for many smart applications in areas such as connected healthcare, intelligent transportation, human activity recognition, environment, and climate change monitoring. Traditional data mining algorithms do not scale well to big data due to the enormous number of data points and the velocity of their generation. Mining and learning from big data need time- and memory-efficient techniques, albeit at the cost of a possible loss in accuracy. This research focuses on the mining of big data using aggregated data as input. We developed a data structure that is to be used to aggregate data at multiple resolutions. …


Computational Behavioral Analytics: Estimating Psychological Traits In Foreign Languages., Kristopher Wayne Reese Dec 2020

Electronic Theses and Dissertations

The rise of technology proliferating into the workplace has increased the threat of loss of intellectual property, classified, and proprietary information for companies, governments, and academics. This can cause economic damage to the creators of new IP, companies, and whole economies. This technology proliferation has also assisted terror groups and lone wolf actors in pushing their message to a larger audience or finding similar tribal groups that share common, sometimes flawed, beliefs across various social media platforms. These types of challenges have created numerous studies in psycholinguistics, as well as commercial tools, that look to assist in identifying potential threats …


Open Data, Collaborative Working Platforms, And Interdisciplinary Collaboration: Building An Early Career Scientist Community Of Practice To Leverage Ocean Observatories Initiative Data To Address Critical Questions In Marine Science, Robert M. Levine, Kristen E. Fogaren, Johna E. Rudzin, Christopher J. Russoniello, Dax C. Soule, Justine M. Whitaker Dec 2020

Publications and Research

Ocean observing systems are well-recognized as platforms for long-term monitoring of near-shore and remote locations in the global ocean. High-quality observatory data is freely available and accessible to all members of the global oceanographic community—a democratization of data that is particularly useful for early career scientists (ECS), enabling ECS to conduct research independent of traditional funding models or access to laboratory and field equipment. The concurrent collection of distinct data types with relevance for oceanographic disciplines including physics, chemistry, biology, and geology yields a unique incubator for cutting-edge, timely, interdisciplinary research. These data are both an opportunity and an incentive …


Exploring Information For Quantum Machine Learning Models, Michael Telahun Dec 2020

Electronic Theses and Dissertations

Quantum computing performs calculations by using physical phenomena and quantum mechanics principles to solve problems. This form of computation has been shown theoretically to provide speedups on some problems of modern-day processing. With much anticipation, the utilization of quantum phenomena in the field of Machine Learning has become apparent. The work here develops models from two software frameworks, TensorFlow Quantum (TFQ) and PennyLane, for machine learning purposes. Both developed models utilize an information encoding technique, amplitude encoding, for preparation of states in a quantum learning model. This thesis explores both the capacity for amplitude encoding to provide enriched state …
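
As background on the amplitude encoding mentioned in this abstract, the sketch below shows only the classical preprocessing step: a feature vector is zero-padded to a power-of-two length and scaled to unit norm so that its entries can serve as the amplitudes of a quantum state. It does not use TFQ or PennyLane and is not the thesis's code; the function name `amplitude_encode` is hypothetical.

```python
import math


def amplitude_encode(values):
    """Pad to length 2**n and L2-normalize (illustrative, hypothetical helper)."""
    if not values:
        raise ValueError("need at least one feature value")

    # Smallest qubit count n (at least 1) whose 2**n amplitudes fit the vector.
    n_qubits = max(1, (len(values) - 1).bit_length())
    dim = 2 ** n_qubits

    # Zero-pad so every basis state has an amplitude.
    padded = [float(v) for v in values] + [0.0] * (dim - len(values))

    # Scale so the squared amplitudes (measurement probabilities) sum to 1.
    norm = math.sqrt(sum(v * v for v in padded))
    if norm == 0.0:
        raise ValueError("cannot encode the all-zero vector")
    return [v / norm for v in padded]
```

A vector of length N therefore needs only about log2(N) qubits, which is the usual motivation for this encoding; preparing the resulting state on hardware is a separate step that this sketch does not cover.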


Data And Assessment Management In Collegiate Recreation, Jeana Carow Dec 2020

Graduate Theses and Dissertations

Collegiate recreation programs and centers typically provide traditional programming space in addition to a range of physical activity spaces and resources, as a valuable part of the student experience. The external pressures of identifying and communicating departmental value and impact on the campus community have resulted in collegiate recreation departments’ use of data to communicate the effectiveness and impact of their work. The purpose of the study was to identify the data collection and assessment management practices of collegiate recreation departments, particularly focusing on the organization of data and assessment strategies as well as data collection, storage, reporting, analyzing, and …


Carbon Metabolism In Cave Subaerial Biofilms, Victoria E. Frazier Dec 2020

Masters Theses

Subaerial biofilms (SABs) grow at the interface between the atmosphere and rock surfaces in terrestrial and subterranean environments around the world. Multi-colored SABs colonizing relatively dry and nutrient-limited cave surfaces are known to contain microbes putatively involved in chemolithoautotrophic processes using inorganic carbon like carbon dioxide (CO2) or methane (CH4). However, the importance of CO2 and CH4 to SAB biomass production has not been quantified, the environmental conditions influencing biomass production and diversity have not been thoroughly evaluated, and stable carbon and nitrogen isotope compositions have yet to be determined from epigenic cave SABs. …


Incorporating Shear Resistance Into Debris Flow Triggering Model Statistics, Noah J. Lyman Dec 2020

Master's Theses

Several regions of the Western United States utilize statistical binary classification models to predict and manage debris flow initiation probability after wildfires. As the occurrence of wildfires and large-intensity rainfall events increases, so has the frequency with which development occurs in the steep and mountainous terrain where these events arise. This resulting intersection brings with it an increasing need to derive improved results from existing models, or develop new models, to reduce the economic and human impacts that debris flows may bring. Any development or change to these models could also theoretically increase the ease of collection, processing, and …


Creating Optimal Conditions For Reproducible Data Analysis In R With ‘Fertile’, Audrey M. Bertin, Benjamin Baumer Nov 2020

Statistical and Data Sciences: Faculty Publications

The advancement of scientific knowledge increasingly depends on ensuring that data-driven research is reproducible: that two people with the same data obtain the same results. However, while the necessity of reproducibility is clear, there are significant behavioral and technical challenges that impede its widespread implementation and no clear consensus on standards of what constitutes reproducibility in published research. We present fertile, an R package that focuses on a series of common mistakes programmers make while conducting data science projects in R, primarily through the RStudio integrated development environment. fertile operates in two modes: proactively, to prevent reproducibility mistakes from happening …