Open Access. Powered by Scholars. Published by Universities.®

Data Science Commons

Open Access. Powered by Scholars. Published by Universities.®

Computer Sciences

Theses/Dissertations

Machine learning

Institution
Publication Year
Publication

Articles 1 - 30 of 57

Full-Text Articles in Data Science

Automated Identification And Mapping Of Interesting Mineral Spectra In Crism Images, Arun M. Saranathan Mar 2024

Automated Identification And Mapping Of Interesting Mineral Spectra In Crism Images, Arun M. Saranathan

Doctoral Dissertations

The Compact Reconnaissance Imaging Spectrometer for Mars (CRISM) has proven to be an invaluable tool for the mineralogical analysis of the Martian surface. It has been crucial in identifying and mapping the spatial extents of various minerals. Primarily, the identification and mapping of these mineral spectral-shapes have been performed manually. Given the size of the CRISM image dataset, manual analysis of the full dataset would be arduous/infeasible. This dissertation attempts to address this issue by describing an (machine learning based) automated processing pipeline for CRISM data that can be used to identify and map the unique mineral signatures present in …


Data To Science With Ai And Human-In-The-Loop, Gustavo Perez Sarabia Mar 2024

Data To Science With Ai And Human-In-The-Loop, Gustavo Perez Sarabia

Doctoral Dissertations

AI has the potential to accelerate scientific discovery by enabling scientists to analyze vast datasets more efficiently than traditional methods. For example, this thesis considers the detection of star clusters in high-resolution images of galaxies taken from space telescopes, as well as studying bird migration from RADAR images. In these applications, the goal is to make measurements to answer scientific questions, such as how the star formation rate is affected by mass, or how the phenology of bird migration is influenced by climate change. However, current computer vision systems are far from perfect for conducting these measurements directly. They may …


Adaptive Multi-Label Classification On Drifting Data Streams, Martha Roseberry Jan 2024

Adaptive Multi-Label Classification On Drifting Data Streams, Martha Roseberry

Theses and Dissertations

Drifting data streams and multi-label data are both challenging problems. When multi-label data arrives as a stream, the challenges of both problems must be addressed along with additional challenges unique to the combined problem. Algorithms must be fast and flexible, able to match both the speed and evolving nature of the stream. We propose four methods for learning from multi-label drifting data streams. First, a multi-label k Nearest Neighbors with Self Adjusting Memory (ML-SAM-kNN) exploits short- and long-term memories to predict the current and evolving states of the data stream. Second, a punitive k nearest neighbors algorithm with a self-adjusting …


Learning Mortality Risk For Covid-19 Using Machine Learning And Statistical Methods, Shaoshi Zhang Dec 2023

Learning Mortality Risk For Covid-19 Using Machine Learning And Statistical Methods, Shaoshi Zhang

Electronic Thesis and Dissertation Repository

This research investigates the mortality risk of COVID-19 patients across different variant waves, using the data from Centers for Disease Control and Prevention (CDC) websites. By analyzing the available data, including patient medical records, vaccination rates, and hospital capacities, we aim to discern patterns and factors associated with COVID-19-related deaths.

To explore features linked to COVID-19 mortality, we employ different techniques such as Filter, Wrapper, and Embedded methods for feature selection. Furthermore, we apply various machine learning methods, including support vector machines, decision trees, random forests, logistic regression, K-nearest neighbours, na¨ıve Bayes methods, and artificial neural networks, to uncover underlying …


Cm-Ii Meditation As An Intervention To Reduce Stress And Improve Attention: A Study Of Ml Detection, Spectral Analysis, And Hrv Metrics, Sreekanth Gopi Dec 2023

Cm-Ii Meditation As An Intervention To Reduce Stress And Improve Attention: A Study Of Ml Detection, Spectral Analysis, And Hrv Metrics, Sreekanth Gopi

Master of Science in Computer Science Theses

Students frequently face heightened stress due to academic and social pressures, particularly in de- manding fields like computer science and engineering. These challenges are often associated with serious mental health issues, including ADHD (Attention Deficit Hyperactivity Disorder), depression, and an increased risk of suicide. The average student attention span has notably decreased from 21⁄2 minutes to just 47 seconds, and now it typically takes about 25 minutes to switch attention to a new task (Mark, 2023). Research findings suggest that over 95% of individuals who die by suicide have been diagnosed with depression (Shahtahmasebi, 2013), and almost 20% of students …


Exact Models, Heuristics, And Supervised Learning Approaches For Vehicle Routing Problems, Zefeng Lyu Dec 2023

Exact Models, Heuristics, And Supervised Learning Approaches For Vehicle Routing Problems, Zefeng Lyu

Doctoral Dissertations

This dissertation presents contributions to the field of vehicle routing problems by utilizing exact methods, heuristic approaches, and the integration of machine learning with traditional algorithms. The research is organized into three main chapters, each dedicated to a specific routing problem and a unique methodology. The first chapter addresses the Pickup and Delivery Problem with Transshipments and Time Windows, a variant that permits product transfers between vehicles to enhance logistics flexibility and reduce costs. To solve this problem, we propose an efficient mixed-integer linear programming model that has been shown to outperform existing ones. The second chapter discusses a practical …


Towards Robust Long-Form Text Generation Systems, Kalpesh Krishna Nov 2023

Towards Robust Long-Form Text Generation Systems, Kalpesh Krishna

Doctoral Dissertations

Text generation is an important emerging AI technology that has seen significant research advances in recent years. Due to its closeness to how humans communicate, mastering text generation technology can unlock several important applications such as intelligent chat-bots, creative writing assistance, or newer applications like task-agnostic few-shot learning. Most recently, the rapid scaling of large language models (LLMs) has resulted in systems like ChatGPT, capable of generating fluent, coherent and human-like text. However, despite their remarkable capabilities, LLMs still suffer from several limitations, particularly when generating long-form text. In particular, (1) long-form generated text is filled with factual inconsistencies to …


Spoken Language Processing And Modeling For Aviation Communications, Aaron Van De Brook Oct 2023

Spoken Language Processing And Modeling For Aviation Communications, Aaron Van De Brook

Doctoral Dissertations and Master's Theses

With recent advances in machine learning and deep learning technologies and the creation of larger aviation-specific corpora, applying natural language processing technologies, especially those based on transformer neural networks, to aviation communications is becoming increasingly feasible. Previous work has focused on machine learning applications to natural language processing, such as N-grams and word lattices. This thesis experiments with a process for pretraining transformer-based language models on aviation English corpora and compare the effectiveness and performance of language models transfer learned from pretrained checkpoints and those trained from their base weight initializations (trained from scratch). The results suggest that transformer language …


Data-Optimized Spatial Field Predictions For Robotic Adaptive Sampling: A Gaussian Process Approach, Zachary Nathan May 2023

Data-Optimized Spatial Field Predictions For Robotic Adaptive Sampling: A Gaussian Process Approach, Zachary Nathan

Computer Science Senior Theses

We introduce a framework that combines Gaussian Process models, robotic sensor measurements, and sampling data to predict spatial fields. In this context, a spatial field refers to the distribution of a variable throughout a specific area, such as temperature or pH variations over the surface of a lake. Whereas existing methods tend to analyze only the particular field(s) of interest, our approach optimizes predictions through the effective use of all available data. We validated our framework on several datasets, showing that errors can decline by up to two-thirds through the inclusion of additional colocated measurements. In support of adaptive sampling, …


Beyond News Values On Twitter: Predicting Factors That Drive User Engagement In News, Zhiyan Zhong Apr 2023

Beyond News Values On Twitter: Predicting Factors That Drive User Engagement In News, Zhiyan Zhong

Dartmouth College Master’s Theses

When deciding on what news stories to cover, traditional journalism determines news values by following several elements of newsworthiness, such as impact, timeliness, and prominence. However, these guidelines do not always seem to correspond with the success of content on social media. As people are increasingly turning to social media for news, our research aims to understand and predict factors that drive user engagement for news on social media. In this study, we analyze news content published on Twitter, and examine a diverse set of characteristics like metrics retrieved from the Twitter API and semantics by natural language processing, including …


Visual Analytics And Modeling Of Materials Property Data, Diwas Bhattarai Jan 2023

Visual Analytics And Modeling Of Materials Property Data, Diwas Bhattarai

LSU Doctoral Dissertations

Due to significant advancements in experimental and computational techniques, materials data are abundant. To facilitate data-driven research, it calls for a system for managing and sharing data and supporting a set of tools for effective data analysis and modeling. Generally, a given material property M can be considered as a multivariate data problem. The dimensions of M are the values of the property itself, the conditions (pressure P, temperature T, and multi-component composition X) that control the concerned property, and relevant metadata I (source, date).

Here we present a comprehensive database considering both experimental and computational sources …


Invasive Buckthorn Mapping: A Uav-Based Approach Utilizing Machine Learning, Gis, And Remote Sensing Techniques In The Upper Peninsula Of Michigan, Vikranth Madeppa Jan 2023

Invasive Buckthorn Mapping: A Uav-Based Approach Utilizing Machine Learning, Gis, And Remote Sensing Techniques In The Upper Peninsula Of Michigan, Vikranth Madeppa

Dissertations, Master's Theses and Master's Reports

An Invasive species is a species that is alien or non-native to the ecosystem which causes harm to economic, environmental, or human health (E.O. 13112 of Feb 3, 1999). Invasive species have posed a serious threat to ecosystems across the globe. These invasive species have impacts on the biodiversity and productivity of invaded forests. Remotely sensed data is a valuable resource for understanding and addressing issues related to invasive species. This study presents a novel approach for mapping the distribution of two invasive plant species, Common and Glossy Buckthorn, using unmanned aerial vehicles (UAVs), machine learning algorithms, geographic information systems …


Unlocking User Identity: A Study On Mouse Dynamics In Dual Gaming Environments For Continuous Authentication, Marcho Setiawan Handoko Jan 2023

Unlocking User Identity: A Study On Mouse Dynamics In Dual Gaming Environments For Continuous Authentication, Marcho Setiawan Handoko

All Graduate Theses, Dissertations, and Other Capstone Projects

With the surge in information management technology reliance and the looming presence of cyber threats, user authentication has become paramount in computer security. Traditional static or one-time authentication has its limitations, prompting the emergence of continuous authentication as a frontline approach for enhanced security. Continuous authentication taps into behavior-based metrics for ongoing user identity validation, predominantly utilizing machine learning techniques to continually model user behaviors. This study elucidates the potential of mouse movement dynamics as a key metric for continuous authentication. By examining mouse movement patterns across two contrasting gaming scenarios - the high-intensity "Team Fortress" and the low-intensity strategic …


The Basil Technique: Bias Adaptive Statistical Inference Learning Agents For Learning From Human Feedback, Jonathan Indigo Watson Jan 2023

The Basil Technique: Bias Adaptive Statistical Inference Learning Agents For Learning From Human Feedback, Jonathan Indigo Watson

Theses and Dissertations--Computer Science

We introduce a novel approach for learning behaviors using human-provided feedback that is subject to systematic bias. Our method, known as BASIL, models the feedback signal as a combination of a heuristic evaluation of an action's utility and a probabilistically-drawn bias value, characterized by unknown parameters. We present both the general framework for our technique and specific algorithms for biases drawn from a normal distribution. We evaluate our approach across various environments and tasks, comparing it to interactive and non-interactive machine learning methods, including deep learning techniques, using human trainers and a synthetic oracle with feedback distorted to varying degrees. …


Time Series Forecasting For Stock Market Prices, Albert Zhou Jan 2023

Time Series Forecasting For Stock Market Prices, Albert Zhou

Senior Honors Projects

No abstract provided.


Application Of Big Data Technology, Text Classification, And Azure Machine Learning For Financial Risk Management Using Data Science Methodology, Oluwaseyi A. Ijogun Jan 2023

Application Of Big Data Technology, Text Classification, And Azure Machine Learning For Financial Risk Management Using Data Science Methodology, Oluwaseyi A. Ijogun

Electronic Theses and Dissertations

Data science plays a crucial role in enabling organizations to optimize data-driven opportunities within financial risk management. It involves identifying, assessing, and mitigating risks, ultimately safeguarding investments, reducing uncertainty, ensuring regulatory compliance, enhancing decision-making, and fostering long-term sustainability. This thesis explores three facets of Data Science projects: enhancing customer understanding, fraud prevention, and predictive analysis, with the goal of improving existing tools and enabling more informed decision-making. The first project examined leveraged big data technologies, such as Hadoop and Spark, to enhance financial risk management by accurately predicting loan defaulters and their repayment likelihood. In the second project, we investigated …


The Interaction Of Different Primary Producers And Physical And Chemical Dynamics Of An Urban Shallow Lake, Majid Sahin Sep 2022

The Interaction Of Different Primary Producers And Physical And Chemical Dynamics Of An Urban Shallow Lake, Majid Sahin

Dissertations, Theses, and Capstone Projects

An artificial urban shallow lake, Prospect Park Lake (PPL), is situated on a terminal moraine in Brooklyn New York, and supplied with municipal water treated with ortho-phosphates. The constant input of the phosphate nutrient is the primary source of eutrophication in the lake. The numerous pools along the water course houses various aquatic phototrophs, which influence the water quality and the state of the system, driving conditions into favoring the survival of their species. In the first half of the dissertation, the focus of the project is on analyzing how the different primary producers in different regions of PPL affect …


Data Collection And Machine Learning Methods For Automated Pedestrian Facility Detection And Mensuration, Joseph Bailey Luttrell Iv Aug 2022

Data Collection And Machine Learning Methods For Automated Pedestrian Facility Detection And Mensuration, Joseph Bailey Luttrell Iv

Dissertations

Large-scale collection of pedestrian facility (crosswalks, sidewalks, etc.) presence data is vital to the success of efforts to improve pedestrian facility management, safety analysis, and road network planning. However, this kind of data is typically not available on a large scale due to the high labor and time costs that are the result of relying on manual data collection methods. Therefore, methods for automating this process using techniques such as machine learning are currently being explored by researchers. In our work, we mainly focus on machine learning methods for the detection of crosswalks and sidewalks from both aerial and street-view …


Solving The Challenges Of Concept Drift In Data Stream Classification., Hanqing Hu Aug 2022

Solving The Challenges Of Concept Drift In Data Stream Classification., Hanqing Hu

Electronic Theses and Dissertations

The rise of network connected devices and applications leads to a significant increase in the volume of data that are continuously generated overtime time, called data streams. In real world applications, storing the entirety of a data stream for analyzing later is often not practical, due to the data stream’s potentially infinite volume. Data stream mining techniques and frameworks are therefore created to analyze streaming data as they arrive. However, compared to traditional data mining techniques, challenges unique to data stream mining also emerge, due to the high arrival rate of data streams and their dynamic nature. In this dissertation, …


Leveraging Context Patterns For Medical Entity Classification, Garrett Johnston Jun 2022

Leveraging Context Patterns For Medical Entity Classification, Garrett Johnston

Computer Science Senior Theses

The ability of patients to understand health-related text is important for optimal health outcomes. A system that can automatically annotate medical entities could help patients better understand health-related text. Such a system would also accelerate manual data annotation for this low-resource domain as well as assist in down- stream medical NLP tasks such as finding textual similarity, identifying conflicting medical advice, and aspect-based sentiment analysis. In this work, we investigate a state-of-the-art entity set expansion model, BootstrapNet, for the task of medical entity classification on a new dataset of medical advice text. We also propose EP SBERT, a simple model …


Un-Fair Trojan: Targeted Backdoor Attacks Against Model Fairness, Nicholas Furth May 2022

Un-Fair Trojan: Targeted Backdoor Attacks Against Model Fairness, Nicholas Furth

Theses

Machine learning models have been shown to be vulnerable against various backdoor and data poisoning attacks that adversely affect model behavior. Additionally, these attacks have been shown to make unfair predictions with respect to certain protected features. In federated learning, multiple local models contribute to a single global model communicating only using local gradients, the issue of attacks become more prevalent and complex. Previously published works revolve around solving these issues both individually and jointly. However, there has been little study on the effects of attacks against model fairness. Demonstrated in this work, a flexible attack, which we call Un-Fair …


Computational Approaches To Facilitate Automated Interchange Between Music And Art, Rao Hamza Ali May 2022

Computational Approaches To Facilitate Automated Interchange Between Music And Art, Rao Hamza Ali

Computational and Data Sciences (PhD) Dissertations

Recently, there has been a tremendous increase in generating and synthesizing music and art using various computational techniques. An area that is still under-researched, however, is how one medium can be converted into the other, while maintaining the overall aesthetics. Over the last few centuries, artists, composers, and scholars, have attempted to use substitute one form of art for the other: by proposing techniques where music notes are synonymous to colors, by inventing instruments that combine the aesthetics of music and visual art, and by incorporating the two media in live performances. A widely accepted computational approach, for the conversion, …


Intraday Algorithmic Trading Using Momentum And Long Short-Term Memory Network Strategies, Andrew R. Whitinger Ii May 2022

Intraday Algorithmic Trading Using Momentum And Long Short-Term Memory Network Strategies, Andrew R. Whitinger Ii

Undergraduate Honors Theses

Intraday stock trading is an infamously difficult and risky strategy. Momentum and reversal strategies and long short-term memory (LSTM) neural networks have been shown to be effective for selecting stocks to buy and sell over time periods of multiple days. To explore whether these strategies can be effective for intraday trading, their implementations were simulated using intraday price data for stocks in the S&P 500 index, collected at 1-second intervals between February 11, 2021 and March 9, 2021 inclusive. The study tested 160 variations of momentum and reversal strategies for profitability in long, short, and market-neutral portfolios, totaling 480 portfolios. …


Dataset Evaluation For Data Trading Using Expected Loss And Homomorphic Encryption, Minsung Joo May 2022

Dataset Evaluation For Data Trading Using Expected Loss And Homomorphic Encryption, Minsung Joo

Senior Honors Papers / Undergraduate Theses

Supervised machine learning suffers from the ``garbage-in garbage-out" phenomenon where the performance of a model is limited by the quality of the data. While a myriad of data is collected every second, there is no general rigorous method of evaluating the quality of a given dataset. This hinders fair pricing of data in scenarios where a buyer may look to buy data for use with machine learning. In this work, I propose using the expected loss corresponding to a dataset as a measure of its quality, relying on Bayesian methods for uncertainty quantification. Furthermore, I present a secure multi-party computation …


Beyond Accuracy In Machine Learning., Aneseh Alvanpour May 2022

Beyond Accuracy In Machine Learning., Aneseh Alvanpour

Electronic Theses and Dissertations

Machine Learning (ML) algorithms are widely used in our daily lives. The need to increase the accuracy of ML models has led to building increasingly powerful and complex algorithms known as black-box models which do not provide any explanations about the reasons behind their output. On the other hand, there are white-box ML models which are inherently interpretable while having lower accuracy compared to black-box models. To have a productive and practical algorithmic decision system, precise predictions may not be sufficient. The system may need to have transparency and be able to provide explanations, especially in applications with safety-critical contexts …


New Debiasing Strategies In Collaborative Filtering Recommender Systems: Modeling User Conformity, Multiple Biases, And Causality., Mariem Boujelbene May 2022

New Debiasing Strategies In Collaborative Filtering Recommender Systems: Modeling User Conformity, Multiple Biases, And Causality., Mariem Boujelbene

Electronic Theses and Dissertations

Recommender Systems are widely used to personalize the user experience in a diverse set of online applications ranging from e-commerce and education to social media and online entertainment. These State of the Art AI systems can suffer from several biases that may occur at different stages of the recommendation life-cycle. For instance, using biased data to train recommendation models may lead to several issues, such as the discrepancy between online and offline evaluation, decreasing the recommendation performance, and hurting the user experience. Bias can occur during the data collection stage where the data inherits the user-item interaction biases, such as …


Intra-Hour Solar Forecasting Using Cloud Dynamics Features Extracted From Ground-Based Infrared Sky Images, Guillermo Terrén-Serrano Apr 2022

Intra-Hour Solar Forecasting Using Cloud Dynamics Features Extracted From Ground-Based Infrared Sky Images, Guillermo Terrén-Serrano

Electrical and Computer Engineering ETDs

Due to the increasing use of photovoltaic systems, power grids are vulnerable to the projection of shadows from moving clouds. An intra-hour solar forecast provides power grids with the capability of automatically controlling the dispatch of energy, reducing the additional cost for a guaranteed, reliable supply of energy (i.e., energy storage). This dissertation introduces a novel sky imager consisting of a long-wave radiometric infrared camera and a visible light camera with a fisheye lens. The imager is mounted on a solar tracker to maintain the Sun in the center of the images throughout the day, reducing the scattering effect produced …


A Predictive Model To Predict Cyberattack Using Self-Normalizing Neural Networks, Oluwapelumi Eniodunmo Jan 2022

A Predictive Model To Predict Cyberattack Using Self-Normalizing Neural Networks, Oluwapelumi Eniodunmo

Theses, Dissertations and Capstones

Cyberattack is a never-ending war that has greatly threatened secured information systems. The development of automated and intelligent systems provides more computing power to hackers to steal information, destroy data or system resources, and has raised global security issues. Statistical and Data mining tools have received continuous research and improvements. These tools have been adopted to create sophisticated intrusion detection systems that help information systems mitigate and defend against cyberattacks. However, the advancement in technology and accessibility of information makes more identifiable elements that can be used to gain unauthorized access to systems and resources. Data mining and classification tools …


A Citizen-Science Approach For Urban Flood Risk Analysis Using Data Science And Machine Learning, Candace Agonafir Jan 2022

A Citizen-Science Approach For Urban Flood Risk Analysis Using Data Science And Machine Learning, Candace Agonafir

Dissertations and Theses

Street flooding is problematic in urban areas, where impervious surfaces, such as concrete, brick, and asphalt prevail, impeding the infiltration of water into the ground. During rain events, water ponds and rise to levels that cause considerable economic damage and physical harm. The main goal of this dissertation is to develop novel approaches toward the comprehension of urban flood risk using data science techniques on crowd-sourced data. This is accomplished by developing a series of data-driven models to identify flood factors of significance and localized areas of flood vulnerability in New York City (NYC). First, the infrastructural (catch basin clogs, …


Classifying Blood Glucose Levels Through Noninvasive Features, Rishi Reddy Jan 2022

Classifying Blood Glucose Levels Through Noninvasive Features, Rishi Reddy

Graduate Theses, Dissertations, and Problem Reports

Blood glucose monitoring is a key process in the prevention and management of certain chronic diseases, such as diabetes. Currently, glucose monitoring for those interested in their blood glucose levels are confronted with options that are primarily invasive and relatively costly. A growing topic of note is the development of non-invasive monitoring methods for blood glucose. This development holds a significant promise for improvement to the quality of life of a significant portion of the population and is overall met with great enthusiasm from the scientific community as well as commercial interest. This work aims to develop a potential pipeline …