Open Access. Powered by Scholars. Published by Universities.®

Data Science Commons

1,397 Full-Text Articles 2,757 Authors 273,342 Downloads 186 Institutions

All Articles in Data Science

1,397 full-text articles. Page 7 of 69.

Static Malware Family Clustering Via Structural And Functional Characteristics, David George, Andre Mauldin, Josh Mitchell, Sufiyan Mohammed, Robert Slater 2023 Southern Methodist University

SMU Data Science Review

Static and dynamic analyses are the two primary approaches to analyzing malicious applications. The primary distinction between the two is that the application is analyzed without execution in static analysis, whereas the dynamic approach executes the malware and records the behavior exhibited during execution. Although each approach has advantages and disadvantages, dynamic analysis has been more widely accepted and utilized by the research community whereas static analysis has not seen the same attention. This study aims to apply advancements in static analysis techniques to demonstrate the identification of fine-grained functionality, and show, through clustering, how malicious applications may be grouped …


Using Geographic Information To Explore Player-Specific Movement And Its Effects On Play Success In The NFL, Hayley Horn, Eric Laigaie, Alexander Lopez, Shravan Reddy 2023 Southern Methodist University

SMU Data Science Review

American Football is a billion-dollar industry in the United States. The analytical aspect of the sport is an ever-growing domain, with open-source competitions like the NFL Big Data Bowl accelerating this growth. With the amount of player movement during each play, tracking data can prove valuable in many areas of football analytics. While concussion detection, catch recognition, and completion percentage prediction are all existing use cases for this data, player-specific movement attributes, such as speed and agility, may be helpful in predicting play success. This research calculates player-specific speed and agility attributes from tracking data and supplements them with descriptive …


A Hybrid Ensemble Of Learning Models, Bivin Sadler, Dhruba Dey, Duy Nguyen, Tavin Weeda 2023 Southern Methodist University

SMU Data Science Review

Statistical models for time series forecasting have long faced the prospect of being superseded by deep learning models. This research proposes a new hybrid ensemble of forecasting models that combines the strengths of several strong candidates from these two model types. The proposed ensemble aims to improve the accuracy of forecasts and reduce computational complexity by leveraging the strengths of each candidate model.


Forecasting COVID-19 With Temporal Hierarchies And Ensemble Methods, Li Shandross 2023 University of Massachusetts Amherst

Masters Theses

Infectious disease forecasting efforts underwent rapid growth during the COVID-19 pandemic, providing guidance for pandemic response and about potential future trends. Yet despite their importance, short-term forecasting models often struggled to produce accurate real-time predictions of this complex and rapidly changing system. This gap in accuracy persisted into the pandemic and warrants the exploration and testing of new methods to glean fresh insights.

In this work, we examined the application of the temporal hierarchical forecasting (THieF) methodology to probabilistic forecasts of COVID-19 incident hospital admissions in the United States. THieF is an innovative forecasting technique that aggregates time-series data into …


Graph Representation Learning With Box Embeddings, Dongxu Zhang 2023 University of Massachusetts Amherst

Doctoral Dissertations

Graphs are ubiquitous data structures, present in many machine-learning tasks, such as link prediction of products and node classification of scientific papers. As gradient descent drives the training of most modern machine learning architectures, the ability to encode graph-structured data using a differentiable representation is essential to make use of this data. Most approaches encode graph structure in Euclidean space; however, it is non-trivial to model directed edges. The naive solution is to represent each node using a separate "source" and "target" vector; however, this can decouple the representation, making it harder for the model to capture information within longer …


Crop Monitoring And Nutrient Prediction Using Satellite Imagery And Soil Data, Olatunde D. Akanbi, Brian Gonzalez Hernandez, Erika I. Barcelos, Arafath Nihar, Laura S. Bruckman, Yinghui Wu, Jeffrey Yarus, Roger H. French 2023 Case Western Reserve University

Student Scholarship

No abstract provided.


Math And Democracy, Kimberly A. Roth, Erika L. Ward 2023 Juniata College

Journal of Humanistic Mathematics

Math and Democracy is a math class containing topics such as voting theory, weighted voting, apportionment, and gerrymandering. It was first designed by Erika Ward for math master’s students, mostly educators, but then adapted separately by both Erika Ward and Kim Roth for a general audience of undergraduates. The course contains materials that can be explored in mathematics classes from those for non-majors through graduate students. As such, it serves students from all majors and allows for discussion of fairness, racial justice, and politics while exploring mathematics that non-major students might not otherwise encounter. This article serves as a guide …


Responsible Data Science For Genocide Prevention, Victor Piercey 2023 Ferris State University

Journal of Humanistic Mathematics

The term "genocide" emerged out of an effort to describe mass atrocities committed in the first half of the 20th century. Despite a convention of the United Nations outlawing genocide as a matter of international law, the problem persists. Some organizations (including the United Nations) are developing indicator frameworks and “early-warning” systems that leverage data science to produce risk assessments of countries where conflict is present. These tools raise questions about responsible data use, specifically regarding the data sources and social biases built into algorithms through their training data. This essay seeks to engage mathematicians in discussing these concerns.


The Impacts Of Transfer Learning For Ungulate Recognition At Sevilleta National Wildlife Refuge, Michael Gurule 2023 University of New Mexico - Main Campus

Geography ETDs

As camera traps have grown in popularity, their utilization has expanded to numerous fields, including wildlife research, conservation, and ecological studies. The information gathered using this equipment gives researchers a precise and comprehensive understanding about the activities of animals in their natural environments. For this type of data to be useful, camera trap images must be labeled so that the species in the images can be classified and counted. This has typically been done by teams of researchers and volunteers, and it can be said that the process is at best time-consuming. With recent developments in deep learning, the process …


Genetic Programming To Optimize Performance Of Machine Learning Algorithms On Unbalanced Data Set, Asitha Thumpati 2023 California State University, San Bernardino

Electronic Theses, Projects, and Dissertations

Data collected from the real world is often imbalanced, meaning that the distribution of data across known classes is biased or skewed. When using machine learning classification models on such imbalanced data, predictive performance tends to be lower because these models are designed with the assumption of balanced classes or a relatively equal number of instances for each class. To address this issue, we employ data preprocessing techniques such as SMOTE (Synthetic Minority Oversampling Technique) for oversampling data and random undersampling for undersampling data on unbalanced datasets. Once the dataset is balanced, genetic programming is utilized for feature selection to …


Insights Into The Application Of Deep Reinforcement Learning In Healthcare And Materials Science, Benjamin R. Smith 2023 University of Tennessee, Knoxville

Doctoral Dissertations

Reinforcement learning (RL) is a type of machine learning designed to optimize sequential decision-making. While controlled environments have served as a foundation for RL research, due to the growth in data volumes and deep learning methods, it is now increasingly being applied to real-world problems. In our work, we explore and attempt to overcome challenges that occur when applying RL to solve problems in healthcare and materials science.

First, we explore how issues in bias and data completeness affect healthcare applications of RL. To understand how bias has already been considered in this area, we survey the literature for existing …


A Data-Driven Multi-Regime Approach For Predicting Real-Time Energy Consumption Of Industrial Machines., Abdulgani Kahraman 2023 University of Louisville

Electronic Theses and Dissertations

This thesis focuses on methods for improving energy consumption prediction performance in complex industrial machines. Working with real-world industrial machines brings several challenges, including data access, algorithmic bias, data privacy, and the interpretation of machine learning algorithms. To effectively manage energy consumption in the industrial sector, it is essential to develop a framework that enhances prediction performance, reduces energy costs, and mitigates air pollution in heavy industrial machine operations. This study aims to assist managers in making informed decisions and driving the transition towards green manufacturing. The energy consumption of industrial machinery is substantial, and the recent increase in CO2 …


A Method For Generating A Non-Manual Feature Model For Sign Language Processing, Robert G. Smith Dr, Markus Hofmann Dr 2023 Technological University Dublin

Articles

While recent approaches to sign language processing have shifted to the domain of Machine Learning (ML), the treatment of Non-Manual Features (NMFs) remains an open question. The principal challenge facing this method is the comparatively small sign language corpora available for training machine learning models. This study produces a statistical model which may be used in future ML, rules-based, and hybrid-learning approaches for sign language processing tasks. In doing so, this research explores the emerging patterns of non-manual articulation concerning grammatical classes in Irish Sign Language (ISL). The experimental method applied here is a novel implementation of an association rules …


Cannabidiol Tweet Miner: A Framework For Identifying Misinformation In CBD Tweets., Jason Turner 2023 University of Louisville

Electronic Theses and Dissertations

As regulations surrounding cannabis continue to develop, the demand for cannabis-based products is on the rise. Despite not producing the psychoactive effects commonly associated with THC, products containing cannabidiol (CBD) have gained immense popularity in recent years as a potential treatment option for a range of conditions, particularly those associated with pain or sleep disorders. However, due to current federal policies, these products have yet to undergo comprehensive safety and efficacy testing. Fortunately, utilizing advanced natural language processing (NLP) techniques, data harvested from social networks have been employed to investigate various social trends within healthcare, such as disease tracking and …


Physics-Guided Deep Learning For Solar Wind Modeling At L1 Point, Robert M. Johnson 2023 Utah State University

All Graduate Theses and Dissertations, Spring 1920 to Summer 2023

Neural networks are adept at finding patterns that are too long and too small for humans to find in data. Usually, this power is used to generate predictions with greater accuracy than most alternative models. However, we can also use this power to understand more about the data we train these networks on. We do this by changing the data that the networks train on and the data they are tested on. This allows us to both control the maximum length of a pattern and to compare data between different groups, in our case, different solar cycles. This thesis is …


Computational Analysis Of Antibody Binding Mechanisms To The Omicron RBD Of SARS-CoV-2 Spike Protein: Identification Of Epitopes And Hotspots For Developing Effective Therapeutic Strategies, Mohammed Alshahrani 2023 Chapman University

Computational and Data Sciences (PhD) Dissertations

The advent of the Omicron strain of SARS-CoV-2 has elicited apprehension regarding its potential influence on the effectiveness of current vaccines and antibody treatments. The present investigation involved the implementation of mutational scanning analyses to examine the impact of Omicron mutations on the binding affinity of four categories of antibodies that target the Omicron receptor binding domain (RBD) of the Spike protein. The study demonstrates that the Omicron variant harbors 23 unique mutations across the RBD regions I, II, III, and IV. Of these mutations, seven are shared between RBD regions I and II, while three are shared among RBD …


The Influence Of Allostery Governing The Changes In Protein Dynamics Upon Substitution, Joseph Hess 2023 Clemson University

All Dissertations

The focus of this research is to investigate the effects of allostery on the function/activity of an enzyme, human immunodeficiency virus type 1 (HIV-1) protease, using well-defined statistical analyses of the dynamic changes of the protein and variants with unique single point substitutions [1]. The experimental data [1] evaluated here only characterized HIV-1 protease with one of its potential target substrates. Probing the dynamic interactions of the residues of an enzyme and its variants can offer insight into the developmental importance for allosteric signaling and their connection to a protein’s function. The realignment of the secondary structure elements can …


Understanding The Role Of Interactivity And Explanation In Adaptive Experiences, Lijie Guo 2023 Clemson University

All Dissertations

Adaptive experiences have been an active area of research in the past few decades, accompanied by advances in technology such as machine learning and artificial intelligence. Whether the currently ongoing research on adaptive experiences has focused on personalization algorithms, explainability, user engagement, or privacy and security, there is growing interest and resources in developing and improving these research focuses. Even though the research on adaptive experiences has been dynamic and rapidly evolving, achieving a high level of user engagement in adaptive experiences remains a challenge. This dissertation aims to uncover ways to engage users in adaptive experiences by incorporating …


Topological Data Analysis Of Convolutional Neural Networks Using Depthwise Separable Convolutions, Eliot Courtois 2023 University of Missouri-St. Louis

Dissertations

In this dissertation, we present our contribution to a growing body of work combining the fields of Topological Data Analysis (TDA) and machine learning. The object of our analysis is the Convolutional Neural Network, or CNN, a predictive model with a large number of parameters organized using a grid-like geometry. This geometry is engineered to resemble patches of pixels in an image, and thus CNNs are a conventional choice for an image-classifying model.

CNNs belong to a larger class of neural network models, which, starting at a random initialization state, undergo a gradual fitting (or training) process, often a …


Hyperspectral Point Cloud Projection For The Semantic Segmentation Of Multimodal Hyperspectral And Lidar Data With Point Convolution-Based Deep Fusion Neural Networks, Kevin T. Decker, Brett J. Borghetti 2023 Riverside Research Institute

Faculty Publications

The fusion of dissimilar data modalities in neural networks presents a significant challenge, particularly in the case of multimodal hyperspectral and lidar data. Hyperspectral data, typically represented as images with potentially hundreds of bands, provide a wealth of spectral information, while lidar data, commonly represented as point clouds with millions of unordered points in 3D space, offer structural information. The complementary nature of these data types presents a unique challenge due to their fundamentally different representations requiring distinct processing methods. In this work, we introduce an alternative hyperspectral data representation in the form of a hyperspectral point cloud (HSPC), which …

