Open Access. Powered by Scholars. Published by Universities.®

Applied Statistics Commons

Articles 1 - 30 of 105

Full-Text Articles in Applied Statistics

Differentiation Of Human, Dog, And Cat Hair Fibers Using DART TOFMS And Machine Learning, Laura Ahumada, Erin R. McClure-Price, Chad Kwong, Edgard O. Espinoza, John Santerre Dec 2023

SMU Data Science Review

Hair is found in over 90% of crime scenes and has long been analyzed as trace evidence. However, recent reviews of traditional hair fiber analysis techniques, primarily morphological examination, have cast doubt on its reliability. To address these concerns, this study employed machine learning algorithms, specifically Linear Discriminant Analysis (LDA) and Random Forest, on Direct Analysis in Real Time time-of-flight mass spectra collected from human, cat, and dog hair samples. The objective was to develop a chemistry- and statistics-based classification method for unbiased taxonomic identification of hair. The results of the study showed that LDA and Random Forest were highly …


Bayesian Statistical Modeling Of Spatially Resolved Transcriptomics Data, Xi Jiang Oct 2023

Statistical Science Theses and Dissertations

Spatially resolved transcriptomics (SRT) quantifies expression levels at different spatial locations, providing a new and powerful tool for gaining novel biological insights. As experimental technologies improve in both capacity and efficiency, demand grows for new analytical methodologies.

One question in SRT data analysis is how to identify genes whose expression exhibits spatially correlated patterns, called spatially variable (SV) genes. Most current methods to identify SV genes are built upon a geostatistical model with a Gaussian process, which could limit the models' ability to identify complex spatial patterns. In order to overcome this challenge and capture more types …


Reu-Deim Classification Of Hispanic Voters In Hispanic Groups Using Name And Zip Code Data In Palm Beach, Florida, Kamila Soto-Ortiz Sep 2023

Beyond: Undergraduate Research Journal

When registering to vote, Hispanic voters can only register as “Hispanic” in the “Race/Ethnicity” category, which makes it difficult to analyze voting trends within the Hispanic community. Given the recent recognition that not all Hispanic groups vote the same way, the goal is to create a model that may identify a voter’s Hispanic group from the information provided in the public Florida voter file. This is accomplished using name and zip code data for all voters in Palm Beach, Florida. This paper explores the model implemented, its findings, and its limitations. Palm Beach, Florida, is met with low confidence …


Sentiment Analysis Before And During The Covid-19 Pandemic, Emily Musgrove Jul 2023

Mathematics Summer Fellows

This study examines the change in connotative language use before and during the Covid-19 pandemic. By analyzing news articles from several major US newspapers, we found that there is a statistically significant correlation between the sentiment of the text and the publication period. Specifically, we document a large, systematic, and statistically significant decline in the overall sentiment of articles published in major news outlets. While our results do not directly gauge the sentiment of the population, our findings have important implications regarding the social responsibility of journalists and media outlets especially in times of crisis.


A Comparison Of Confidence Intervals In State Space Models, Jinyu Du Jul 2023

Statistical Science Theses and Dissertations

This thesis develops general procedures for constructing confidence intervals (CIs) of the error disturbance parameters (standard deviations) and transformations of the error disturbance parameters in time-invariant state space models (SSMs). With only a set of observations, estimating individual error disturbance parameters accurately in the presence of other unknown parameters in SSMs is a very challenging problem. We attempted to construct four different types of confidence intervals (Wald, likelihood ratio, score, and higher-order asymptotic intervals) for both the simple local level model and general time-invariant SSMs. We show that for a simple local level model, both the …


Optimizing Tumor Xenograft Experiments Using Bayesian Linear And Nonlinear Mixed Modelling And Reinforcement Learning, Mary Lena Bleile May 2023

Statistical Science Theses and Dissertations

Tumor xenograft experiments are a popular tool of cancer biology research. In a typical such experiment, one implants a set of animals with an aliquot of the human tumor of interest, applies various treatments of interest, and observes the subsequent response. Efficient analysis of the data from these experiments is therefore of utmost importance. This dissertation proposes three methods for optimizing cancer treatment and data analysis in the tumor xenograft context. The first of these is applicable to tumor xenograft experiments in general, and the second two seek to optimize the combination of radiotherapy with immunotherapy in the tumor xenograft …


Modeling The Probability Of A Successful Stolen Base Attempt In Major League Baseball, Cade Stanley Apr 2023

Senior Theses

In Major League Baseball (MLB), the outcome of a stolen base attempt has important implications. Success moves the runner closer to scoring, while failure records an out and removes the runner from the basepaths altogether. Therefore, it is important that the decision by a coach or player to steal a base is well-informed. In this thesis, I explore a statistical approach to making this decision. I train logistic regression and random forest models, using data about the game situation and about the runner, pitcher, and catcher involved in the stolen base attempt, to estimate the probability that a stolen base …


Dynamic Prediction For Alternating Recurrent Events Using A Semiparametric Joint Frailty Model, Jaehyeon Yun Aug 2022

Statistical Science Theses and Dissertations

Alternating recurrent events data arise commonly in health research; examples include hospital admissions and discharges of diabetes patients; exacerbations and remissions of chronic bronchitis; and quitting and restarting smoking. Recent work has involved formulating and estimating joint models for the recurrent event times considering non-negligible event durations. However, prediction models for transition between recurrent events are lacking. We consider the development and evaluation of methods for predicting future events within these models. Specifically, we propose a tool for dynamically predicting transition between alternating recurrent events in real time. Under a flexible joint frailty model, we derive the predictive probability of …


Generating A Dataset For Comparing Linear Vs. Non-Linear Prediction Methods In Education Research, Jack Mauro, Elena Martinez, Anna Bargagliotti May 2022

Honors Thesis

Machine learning is often used to build predictive models by extracting patterns from large data sets. Such techniques are increasingly being utilized to predict outcomes in the social sciences. One such application is predicting student success. Machine learning can be applied to predicting student acceptance and success in academia. Using these tools for education-related data analysis may enable the evaluation of programs, resources, and curriculum. Currently, research is needed to examine application, admissions, and retention data in order to address equity in college computer science programs. However, most student-level data sets contain sensitive data that cannot be made public. To …


Understanding And Improving The System: The Effects Of Weighting On The Accuracy Of Political Polling In Arkansas, Beck Williams May 2022

Political Science Undergraduate Honors Theses

In an effort to increase the accuracy of statewide political polling in Arkansas, we explore the statistical strategy of weighting with a focus on one yearly opinion poll: The Arkansas Poll. We conduct over 70 weighting experiments on the 2016 and 2020 Arkansas Polls using a variety of variables and opinion questions. From these experiments, we find that while some weighted variables tend to create larger changes, weighting typically results in a single-digit percentage change that does not substantially shift or “flip” the majorities. Due to a greater rate of change through weighting in the 2020 Poll compared to the …


A Monte Carlo Analysis Of Seven Dichotomous Variable Confidence Interval Equations, Morgan Juanita Dubose Apr 2022

Masters Theses & Specialist Projects

Department of Psychological Sciences, Western Kentucky University

There are two options to estimate a range of likely values for the population mean of a continuous variable: one for when the population standard deviation is known and another for when the population standard deviation is unknown. There are seven proposed equations to calculate the confidence interval for the population mean of a dichotomous variable: normal approximation interval, Wilson interval, Jeffreys interval, Clopper-Pearson, Agresti-Coull, arcsine transformation, and logit transformation. In this study, I compared the percent effectiveness of each equation using a Monte Carlo analysis and the interval range over a range …


Finding The Best Predictors For Foot Traffic In US Seafood Restaurants, Isabel Paige Beaulieu Jan 2022

Honors Theses and Capstones

COVID-19 caused state and nation-wide lockdowns, which altered human foot traffic, especially in restaurants. The seafood sector in particular suffered greatly: illegal fishing increased, its goods are perishable, it is seasonal in some places, and imports and exports were slowed. Foot traffic data helps business owners know how much to order, how many employees to schedule, etc. One issue is that the data is very expensive, hard to get, and not available until months after it is recorded. Our goal is to not only find covariates that …


Trade Bait: Season 3, Ben Bagley Oct 2021

WWU Honors College Senior Projects

A 5-episode podcast series dissecting the use of statistics in the NFL and NFL Media.


An Introduction To Calling Bullshit: Learning To Think Outside The Black Box, Jevin D. West, Carl T. Bergstrom Aug 2021

Numeracy

Bergstrom, Carl T. and Jevin D. West. 2020. Calling Bullshit: The Art of Skepticism in a Data-Driven World. (New York: Random House) 336 pp. ISBN 978-0525509202.

While statistical methods receive greater attention, the art of critically evaluating information in everyday life more commonly depends on thinking outside the black box of the algorithm. In this piece we introduce readers to our book and associated online teaching materials—for readers who want to more capably call “bullshit” or to teach their students to do the same.


Data Analysis And Visualization To Dismantle Gender Discrimination In The Field Of Technology, Quinn Bolewicki Jun 2021

Dissertations, Theses, and Capstone Projects

In the United States, a significant population is facing an uphill battle trying to thrive in an industry that has seen exponential growth in recent years. Women, who account for approximately 50.8% of the U.S. population, are statistically underpaid and underrepresented in science, technology, engineering, and mathematics (STEM). Despite women-led technology teams establishing a 21% greater return on investment than teams that are not women-led, and young women largely outperforming men in math according to a 2015 study, there are only three Fortune 500 companies led by women, and women comprise only 10% of internet entrepreneurs. Research generates hundreds of articles, infographics, …


Guidelines For Regression Analysis In SAS And R: A Case Study, Sarah Milligan May 2021

Honors Program Theses and Projects

When a player is a free agent, an individual who is able to sign with any team, one wonders what their best option is. Will signing with Team A or Team B provide them with the largest salary? What factors will affect their salary the most? Do last year’s statistics have a strong impact on next year’s salary? These questions can be answered by performing a regression analysis on previous years’ data. The primary focus of this project is to determine the most important variables related to an NBA salary. Likewise, the statistical programs SAS and R will be compared …


How Risk-Related Statistics, As Reported In News And Social Media, Are Linked To The Use Of The Public Transit System, Prashiddhi Pokhrel Apr 2021

Thinking Matters Symposium

Due to the pandemic, people have started relying more on television, social media, and other news outlets for guidance. Moreover, with the increasing amount of news, data, and information, there is also an increase in the amount of misleading statistics. People’s opinions and decisions significantly depend on the data, statistics, and information that they are exposed to, as well as their sources. For this project, we want to look at how information and its sources are affecting the decisions made by the general public about using the Portland Transit System. It is very important to know why …


Fourth Down Decision Making: Challenging The Conservative Nature Of NFL Coaches, Will Palmquist, Ryan Elmore, Benjamin Williams Jan 2021

DU Undergraduate Research Journal Archive

This thesis analyzes the hypothesis that coaches in the National Football League are often too conservative in their decision making on fourth downs. I used RStudio and NFL play-by-play data to simulate actual football plays and drives according to different fourth down strategies. By measuring expected points per drive over thousands of simulated drives, we are able to evaluate the effectiveness of different fourth down strategies. This research points to a number of conclusions regarding the nature of NFL coaches on fourth downs as well as the complexity of modeling and simulating decision making in a complex sport such …


Ensemble Protein Inference Evaluation, Kyle Lee Lucke Jan 2021

Graduate Student Theses, Dissertations, & Professional Papers

Protein inference is becoming an increasingly important tool that aids in the characterization of complex proteomes and analysis of complex protein samples. In bottom-up shotgun proteomics experiments, the metrics for evaluation (like AUC and calibration error) are based on an often imperfect target-decoy database. These metrics make the inherent assumption that all of the proteins in the target set are present in the sample being analyzed. In general, this is not the case; the target set is typically a mix of present and absent proteins. To objectively evaluate inference methods, protein standard datasets are used. These datasets are special in …


Bayesian Semi-Supervised Keyphrase Extraction And Jackknife Empirical Likelihood For Assessing Heterogeneity In Meta-Analysis, Guanshen Wang Dec 2020

Statistical Science Theses and Dissertations

This dissertation investigates: (1) A Bayesian Semi-supervised Approach to Keyphrase Extraction with Only Positive and Unlabeled Data, (2) Jackknife Empirical Likelihood Confidence Intervals for Assessing Heterogeneity in Meta-analysis of Rare Binary Events.

In the big data era, people are blessed with a huge amount of information. However, the availability of information may also pose great challenges. One big challenge is how to extract useful yet succinct information in an automated fashion. As one of the first few efforts, keyphrase extraction methods summarize an article by identifying a list of keyphrases. Many existing keyphrase extraction methods focus on the unsupervised setting, …


Improved Statistical Methods For Time-Series And Lifetime Data, Xiaojie Zhu Dec 2020

Statistical Science Theses and Dissertations

In this dissertation, improved statistical methods for time-series and lifetime data are developed. First, an improved trend test for time series data is presented. Then, robust parametric estimation methods based on system lifetime data with known system signatures are developed.

In the first part of this dissertation, we consider a test for the monotonic trend in time series data proposed by Brillinger (1989). It has been shown that when there are highly correlated residuals or short record lengths, Brillinger’s test procedure tends to have a significance level much higher than the nominal level. This could be related to the discrepancy between …


Applying The Data: Predictive Analytics In Sport, Anthony Teeter, Margo Bergman Nov 2020

Access*: Interdisciplinary Journal of Student Research and Scholarship

The history of wagering predictions and their impact on wide-reaching disciplines such as statistics and economics dates to at least the 1700s, if not before. Predicting the outcomes of sports is a multibillion-dollar business that capitalizes on these tools but is in constant development with the addition of big data analytics methods. Sportsline.com, a popular website for fantasy sports leagues, provides odds predictions in multiple sports, produces proprietary computer models of both winning and losing teams, and provides specific point estimates. To test likely candidates for inclusion in these prediction algorithms, the authors developed a computer model, and test …


Bayesian Topological Machine Learning, Christopher A. Oballe Aug 2020

Doctoral Dissertations

Topological data analysis encompasses a broad set of ideas and techniques that address 1) how to rigorously define and summarize the shape of data, and 2) how to use these constructs for inference. This dissertation addresses the second problem by developing new inferential tools for topological data analysis and applying them to solve real-world data problems. First, a Bayesian framework to approximate probability distributions of persistence diagrams is established. The key insight underpinning this framework is that persistence diagrams may be viewed as Poisson point processes with prior intensities. With this assumption in hand, one may compute posterior intensities by adopting techniques …


Using Stability To Select A Shrinkage Method, Dean Dustin May 2020

Department of Statistics: Dissertations, Theses, and Student Work

Shrinkage methods are estimation techniques based on optimizing expressions to find which variables to include in an analysis, typically a linear regression. The general form of these expressions is the sum of an empirical risk plus a complexity penalty based on the number of parameters. Many shrinkage methods are known to satisfy an ‘oracle’ property meaning that asymptotically they select the correct variables and estimate their coefficients efficiently. In Section 1.2, we show oracle properties in two general settings. The first uses a log likelihood in place of the empirical risk and allows a general class of penalties. The second …


Analysis Of Gas Mileage Of A Car, Joshua Ballard-Myer Apr 2020

Georgia College Student Research Events

The objective of this work is to analyze a data set, Auto, from the R package ISLR: Introduction to Statistical Learning in R. The data set includes information for 392 observations on 9 variables including gas mileage, horsepower, weight in pounds, and engine displacement in cubic inches. The data set was taken from the StatLib library maintained at Carnegie Mellon University. The primary response variable will be gas mileage in miles per gallon, with all other variables serving as predictors, but other relationships with other response variables such as acceleration will be explored. Results were similar to expected; traits desirable …


The Importance Of Type I Error Rates When Studying Bias In Monte Carlo Studies In Statistics, Michael Harwell Feb 2020

Journal of Modern Applied Statistical Methods

Two common outcomes of Monte Carlo studies in statistics are bias and Type I error rate. Several versions of bias statistics exist but all employ arbitrary cutoffs for deciding when bias is ignorable or non-ignorable. This article argues Type I error rates should be used when assessing bias.


The Author’s Reflections On No B.S. (Bad Stats): Black People Need People Who Believe In Black People Enough Not To Believe Every Bad Thing They Hear About Black People, Ivory A. Toldson Jan 2020

Numeracy

Toldson, Ivory A. 2019. No BS (Bad Stats): Black People Need People Who Believe in Black People Enough Not to Believe Every Bad Thing They Hear About Black People (Boston, MA: Brill-Sense) 194 pp. ISBN 978-9004397026.

This essay provides an introduction to No BS (Bad Stats): Black People Need People Who Believe in Black People Enough Not to Believe Every Bad Thing They Hear About Black People. In the essay, the author discusses how cynical views about the educational potential of Black children motivated him to write a book that challenges negative statistics. The essay also outlines the harmful …


Power Analysis On A Pilot Study Of The Caloric Intake Of Children Helping Prepare Meals Versus Children Not, Danielle Clifford Jan 2020

Student Research Poster Presentations 2020

The purpose of this analysis is to determine the sample size needed for a study that will be used to discover if there is a difference in the caloric intake of children who help with meal preparation and children who do not help with meal preparation.


Predicting Diabetes Diagnoses, Sarah Netchert Jan 2020

Student Research Poster Presentations 2020

This study explored the traits and health state of African Americans in central Virginia in order to determine what traits put people at a higher probability of being diagnosed with diabetes. We also want to know which traits will generate the highest probability a person will be diagnosed with diabetes. Traits that were included and used in this study were cholesterol, stabilized glucose, high density lipoprotein levels, age (years), gender, height (inches), weight (pounds), systolic blood pressure, diastolic blood pressure, waist size (inches), and hip size (inches). There were 403 individuals included in the study, since they were the only ones screened for diabetes out of 1,046 …


The Role Of Topography, Soil, And Remotely Sensed Vegetation Condition Towards Predicting Crop Yield, Trenton E. Franz, Sayli Pokal, Justin P. Gibson, Yuzhen Zhou, Hamed Gholizadeh, Fatima Amor Tenorio, Daran Rudnick, Derek M. Heeren, Matthew F. McCabe, Matteo Ziliani, Zhenong Jin, Kaiyu Guan, Ming Pan, John Gates, Brian Wardlow Jan 2020

School of Natural Resources: Faculty Publications

Foreknowledge of the spatiotemporal drivers of crop yield would provide a valuable source of information to optimize on-farm inputs and maximize profitability. In recent years, an abundance of spatial data providing information on soils, topography, and vegetation condition has become available from both proximal and remote sensing platforms. Given the wide range of data costs (USD $0 to $50/ha), it is important to understand where often limited financial resources should be directed to optimize field production. Two key questions arise. First, will these data actually aid in better fine-resolution yield prediction to help optimize crop management and farm economics? Second, what …