Open Access. Powered by Scholars. Published by Universities.®

Physical Sciences and Mathematics Commons

Data Science

William & Mary

Undergraduate Honors Theses

Articles 1 - 14 of 14

Full-Text Articles in Physical Sciences and Mathematics

Security And Interpretability In Large Language Models, Lydia Danas May 2024

Undergraduate Honors Theses

Large Language Models (LLMs) have the capability to model long-term dependencies in sequences of tokens, and are consequently often utilized to generate text through language modeling. These capabilities are increasingly being used for code generation tasks; however, LLM-powered code generation tools such as GitHub's Copilot have been generating insecure code and thus pose a cybersecurity risk. To generate secure code we must first understand why LLMs are generating insecure code. This non-trivial task can be realized through interpretability methods, which investigate the hidden state of a neural network to explain model outputs. A new interpretability method is rationales, which obtains …


Code Syntax Understanding In Large Language Models, Cole Granger May 2024

Undergraduate Honors Theses

In recent years, tasks for automated software engineering have been achieved using Large Language Models trained on source code, such as Seq2Seq, LSTM, GPT, T5, BART and BERT. The inherent textual nature of source code allows it to be represented as a sequence of sub-words (or tokens), drawing parallels to prior work in NLP. Although these models have shown promising results according to established metrics (e.g., BLEU, CodeBLEU), there remains a deeper question about the extent of syntax knowledge they truly grasp when trained and fine-tuned for specific tasks.

To address this question, this thesis introduces a taxonomy of syntax …


Roads And Corresponding Travel Time To Markets: Assessing Climate Vulnerability In Nepal, Kaitlyn Crowley May 2024

Undergraduate Honors Theses

Roads exist as a physical and theoretical connection between people and places around the globe. In addition to providing a route from one point to another, roads are also an indicator of access to markets and of poverty. However, current road datasets, particularly the Global Roads Open Access Data Set, are out of date or incomplete, necessitating new sources of data for analyses involving road networks. This study explores the relationship between climate change and access to markets in Nepal. We seek to identify isolated communities that are likely to experience detrimental outcomes associated with environmental threats, such as increasing …


Parameter Estimation For Patient Enrollment In Clinical Trials, Junyan Liu Dec 2023

Undergraduate Honors Theses

In this paper, we study the Poisson-gamma model for recruitment time in clinical trials. We proved several properties of this model that match our intuitions from a reliability perspective, did simulations on this model, and used different optimization methods to estimate the parameters. Although the behaviors of the optimization methods were unfavorable and unstable, we identified certain conditions and provided potential explanations for this phenomenon and further insights into the Poisson-gamma model.
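
As a rough illustration of the kind of simulation the abstract mentions, the sketch below assumes one common formulation of the Poisson-gamma recruitment model: each center enrolls patients as a Poisson process whose rate is drawn from a gamma distribution. The abstract does not give the thesis's exact setup, so the function names, parameter names, and parameter values here are hypothetical.

```python
import random


def simulate_recruitment_time(n_patients, n_centers, shape, rate, rng):
    """Simulate the time needed to enroll n_patients across all centers.

    Assumption (not taken from the thesis): every center's enrollment rate is
    an independent Gamma(shape, rate) draw, and enrollment at each center is
    a Poisson process with that rate.
    """
    # random.gammavariate expects a scale parameter, so pass 1 / rate.
    center_rates = [rng.gammavariate(shape, 1.0 / rate) for _ in range(n_centers)]
    total_rate = sum(center_rates)
    # Independent Poisson processes superpose into one Poisson process with
    # the summed rate, so the waiting time until the n-th enrollment is
    # Gamma(n_patients, scale = 1 / total_rate).
    return rng.gammavariate(n_patients, 1.0 / total_rate)


def mean_recruitment_time(n_patients, n_centers, shape, rate, n_runs=2000, seed=0):
    """Monte Carlo estimate of the expected recruitment time."""
    rng = random.Random(seed)
    times = [
        simulate_recruitment_time(n_patients, n_centers, shape, rate, rng)
        for _ in range(n_runs)
    ]
    return sum(times) / n_runs


if __name__ == "__main__":
    # Hypothetical values: 100 patients, 20 centers, Gamma(shape=2, rate=1).
    print(mean_recruitment_time(100, 20, shape=2.0, rate=1.0))
```

Estimating `shape` and `rate` from observed enrollment counts, which is the part the abstract reports as unstable, is not attempted in this sketch.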


Seeing What We Can't: Evaluating Implicit Biases In Deep Learning Satellite Imagery Models Trained For Poverty Prediction, Joseph O'Brien May 2023

Undergraduate Honors Theses

Previous studies have sought to use Convolutional Neural Networks for regional estimation of poverty levels. However, there is limited research into possible implicit biases in deep neural networks in the context of satellite imagery. In this work, we develop a deep learning model to predict the tertile of per-capita asset consumption, trained on satellite imagery and World Bank Living Standards Measurements Study data. Using satellite imagery collected via survey location data as inputs, we use transfer learning to train a VGG-16 Convolutional Neural Network to classify images based on per-capita consumption. The model achieves an $R^2$ of 0.74, using thousands …


A Satellite Imagery Approach To Estimating Migratory Flows In Guatemala Using Convolutional Neural Networks, Sarah Larimer May 2023

Undergraduate Honors Theses

Being able to predict migratory flows is important in ensuring political, social, and economic stability. In the wake of violence, unrest, natural disasters, and social pressures, millions of migrants have fled Central America in search of a better life. However, due to the infrequent nature and high cost of census data, there is a need for more remote and up-to-date approaches. Convolutional Neural Networks offer a computer-vision-based approach that is cheaper and has significantly less lag. In this study, we seek to evaluate the effectiveness of different convolutional neural networks in predicting …


Considering The Accuracy Of Fiat Boundaries: Ontology And Quantification, Lydia Troup May 2023

Undergraduate Honors Theses

Administrative boundaries - i.e., states, counties, or districts - are fiat boundaries; they exist purely as defined by human interpretation. Because of this, and despite their critical importance to government functions, the accuracy of data products claiming to represent such boundaries is difficult to measure. Here, I explore this topic using three boundary data sets: the open source geoBoundaries data set, the humanitarian UN OCHA’s Common Operational Datasets (COD), and Esri’s commercial administrative divisions 0 and 1 data sets in the Living Atlas. The accuracy of each was quantified as the percent overlap between each data set and an authoritative …


Identifying Social Media Users That Are Susceptible To Phishing Attacks, Zoe Metzger May 2023

Undergraduate Honors Theses

Phishing scams are a billion-dollar problem. According to Threatpost, in 2020, business email compromise phishing attacks cost the US economy $1.8 billion. Social media phishing scams are also on the rise, with 74% of companies experiencing social media attacks in 2021 according to Proofpoint. Educating users about phishing scams is an effective strategy for reducing phishing attacks. Despite efforts to combat phishing, the number of attacks continues to rise, likely indicative of a reticence of users to change online behaviors. Existing research into predicting vulnerable social media users that are susceptible to phishing mostly focuses on content analysis of …


Using Deep Learning With Satellite Imagery To Estimate Deforestation Rates, Maeve Naughton-Rockwell May 2022

Undergraduate Honors Theses

Previous studies have used Convolutional Neural Networks for regional detection of deforestation breaks. However, there is limited research into the capability of deep neural networks to identify sudden shifts in global forest cover from satellite imagery. Additionally, many deforestation detection models are trained on region specific data and need manual input thresholds. In this work, we develop a deep learning model to predict the percent of deforestation in a region between two points in time, trained on globally sourced data. Using the before and after satellite images of a deforestation event as inputs, we implemented a two input Convolutional Neural …


Using A Machine Learning Model To Predict Plant Inflorescences Based Upon Its Soil Microbiome, Luke Denoncourt May 2022

Undergraduate Honors Theses

The UN estimates that the global population could reach 9.7 billion by 2050 (United Nations). As a result, the amount of food required to feed humanity is expected to double by 2050 (Ray et al., 2012). Humanity must find a way to increase crop production without increasing fertilizer usage and eutrophication, which can be done using the soil microbiome. In potted plants with soils inoculated with Pseudomonas alcaligenes, Pseudomonas denitrificans, Bacillus polymyxa, and Mycobacterium phlei, both the shoot and root growth of pea and cotton plants were significantly increased (Egamberdieva & Höflich, 2004). In this study, utilizing a random forest …


The Pandemic From Above: Estimating Covid-19 Cases Using Deep Learning And Satellite Imagery, John Hennin Apr 2022

Undergraduate Honors Theses

Monitoring the spread of an outbreak of disease (such as COVID-19) is an important component of any coordinated pandemic response. Across the globe, our ability to conduct such monitoring - especially at early stages of the COVID-19 pandemic - was highly limited due to a lack of public reporting mechanisms. Today, the process of case data collection remains expensive and, in some regions, is subject to political considerations. Researchers have turned to some techniques leveraging Google Trends and Twitter data to overcome limitations in public data sources. Here, we provide another approach which leverages satellite information to provide estimates …


Machine Learning In Healthcare: Improving The Diagnosis Of Pulmonary Embolism In Covid-19 Patients, Soheb Osmani Apr 2022

Undergraduate Honors Theses

The severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has created new challenges for clinicians diagnosing pulmonary embolism (PE). Clinicians currently rely on D-Dimer levels in conjunction with clinical prediction scores to rule out and diagnose PE. However, patients with COVID-19 (the disease caused by SARS-CoV-2) often present with elevated D-Dimer levels. D-Dimer levels in COVID-19 patients have been found to be positively correlated with the severity of disease. Symptoms of COVID-19 also often align with symptoms of PE. Therefore, it becomes more difficult for clinicians to identify which COVID-19 positive patients should undergo further testing for PE. This study evaluates …


Molecular Cluster Fragment Machine Learning Training Techniques To Predict Energetics Of Brown Carbon Aerosol Clusters, Emily E. Chappie May 2021

Undergraduate Honors Theses

Density functional theory (DFT) has become a popular method for computational work involving larger molecular systems as it provides accuracy that rivals ab initio methods while lowering computational cost. Nevertheless, computational cost is still high for systems greater than ten atoms in size, preventing their application in modeling realistic atmospheric systems at the molecular level. Machine learning techniques, however, show promise as cost-effective tools in predicting chemical properties when properly trained. In the interest of furthering chemical machine learning in the field of atmospheric science, I have developed a training method for predicting cluster energetics of newly characterized nitrogen-based brown …


Scope: Building And Testing An Integrated Manual-Automated Event Extraction Tool For Online Text-Based Media Sources, Matthew Crittenden May 2021

Undergraduate Honors Theses

Building on insights from two years of manually extracting events information from online news media, an interactive information extraction environment (IIEE) was developed. SCOPE, the Scientific Collection of Open-source Policy Evidence, is a Python Django-based tool divided across specialized modules for extracting structured events data from unstructured text. These modules are grouped into a flexible framework which enables the user to tailor the tool to meet their needs. Following principles of user-oriented learning for information extraction (IE), SCOPE offers an alternative approach to developing AI-assisted IE systems. In this piece, we detail the ongoing development of the SCOPE tool, present …