Open Access. Powered by Scholars. Published by Universities.®

Physical Sciences and Mathematics Commons

Statistics

2022

Articles 1 - 30 of 32

Full-Text Articles in Physical Sciences and Mathematics

Regression Modeling Of Complex Survival Data Based On Pseudo-Observations, Rong Rong Dec 2022

Statistical Science Theses and Dissertations

The restricted mean survival time (RMST) is a clinically meaningful summary measure in studies with survival outcomes. Statistical methods have been developed for regression analysis of RMST to investigate impacts of covariates on RMST, which is a useful alternative to the Cox regression analysis. However, existing methods for regression modeling of RMST are not applicable to left-truncated right-censored data that arise frequently in prevalent cohort studies, for which the sampling bias due to left truncation and informative censoring induced by the prevalent sampling scheme must be properly addressed. Meanwhile, statistical methods have been developed for regression modeling of the cumulative …
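For readers unfamiliar with the quantity, the RMST up to a horizon tau is the area under the survival curve on [0, tau]. The sketch below is a minimal illustration for right-censored data only, using a Kaplan-Meier estimate; the function name `km_rmst` and its arguments are hypothetical, and it does not handle the left truncation or the pseudo-observation regression that the thesis addresses.

```python
def km_rmst(times, events, tau):
    """Area under the Kaplan-Meier curve on [0, tau].

    times  : observed follow-up times
    events : 1 if the event was observed, 0 if right-censored
    tau    : restriction horizon
    Illustrative only: right censoring, no left truncation.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    surv, last_t, area = 1.0, 0.0, 0.0
    i = 0
    while i < len(data) and data[i][0] <= tau:
        t = data[i][0]
        # Count events and all subjects leaving the risk set at time t.
        deaths = leaving = 0
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        area += surv * (t - last_t)  # rectangle before the step at t
        if deaths:
            surv *= 1.0 - deaths / n_at_risk
        n_at_risk -= leaving
        last_t = t
    area += surv * (tau - last_t)  # final rectangle up to tau
    return area
```

With no censoring, the result equals the sample mean of min(T, tau), which is a quick sanity check on the estimate.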


AutoMS: Automatic Model Selection For Novelty Detection With Error Rate Control, Yifan Zhang, Haiyan Jiang, Haojie Ren, Changliang Zou, Dejing Dou Dec 2022

Machine Learning Faculty Publications

Given an unsupervised novelty detection task on a new dataset, how can we automatically select a “best” detection model while simultaneously controlling the error rate of the best model? For novelty detection analysis, numerous detectors have been proposed to detect outliers on a new unseen dataset based on a score function trained on available clean data. However, due to the absence of labeled anomalous data for model evaluation and comparison, there is a lack of systematic approaches that are able to select the “best” model/detector (i.e., the algorithm as well as its hyperparameters) and achieve certain error rate control simultaneously. …


Bayesian Estimation Of The Intensity Function Of A Non-Homogeneous Poisson Process, James Jensen Oct 2022

Theses

In this paper we explore Bayesian inference and its application to the problem of estimating the intensity function of a non-homogeneous Poisson process. These processes model the behavior of phenomena in which one or more events, known as arrivals, occur independently of one another over a certain period of time. We are concerned with the number of events occurring during particular time intervals across several realizations of the process. We show that given sufficient data, we are able to construct a piecewise-constant function which accurately estimates the mean rates on particular intervals. Further, we show that as we reduce these …
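One simple way to picture a piecewise-constant Bayesian estimate is a conjugate Gamma-Poisson update on each interval. The sketch below assumes an independent Gamma(a, b) prior (shape a, rate b) on the rate in each bin; the prior, the function name `posterior_rates`, and the data layout are assumptions made for illustration, since the abstract does not state the model used in the thesis.

```python
def posterior_rates(counts, bin_width, a=1.0, b=1.0):
    """Posterior mean rate per bin for a piecewise-constant intensity.

    counts[r][k] : number of arrivals in bin k of realization r
    bin_width    : common length of every bin
    a, b         : shape and rate of the assumed Gamma prior
    """
    n_real = len(counts)
    n_bins = len(counts[0])
    exposure = n_real * bin_width  # total observed time per bin
    means = []
    for k in range(n_bins):
        total = sum(row[k] for row in counts)
        # Gamma(a, b) prior with Poisson counts gives
        # a Gamma(a + total, b + exposure) posterior.
        means.append((a + total) / (b + exposure))
    return means
```

As the number of realizations grows, the posterior mean approaches the observed count per unit time in each bin, so the prior matters less with more data.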


Distance Based Image Classification: A Solution To Generative Classification’s Conundrum?, Wen-Yan Lin, Siying Liu, Bing Tian Dai, Hongdong Li Sep 2022

Research Collection School Of Computing and Information Systems

Most classifiers rely on discriminative boundaries that separate instances of each class from everything else. We argue that discriminative boundaries are counter-intuitive as they define semantics by what-they-are-not; and should be replaced by generative classifiers which define semantics by what-they-are. Unfortunately, generative classifiers are significantly less accurate. This may be caused by the tendency of generative models to focus on easy to model semantic generative factors and ignore non-semantic factors that are important but difficult to model. We propose a new generative model in which semantic factors are accommodated by shell theory’s [25] hierarchical generative process and non-semantic factors by …


The Microscopical Evidence Traces Analysis Of Household Dust And Its Statistical Significance As A Definitive Identification Technique, Stephanie Polifroni Sep 2022

Dissertations, Theses, and Capstone Projects

Evidence found at crime scenes is used to help create a link between the suspect, the victim, and the scene. As stated by Locard’s Principle, every contact leaves a trace, and that evidence can be used to link together an investigation. Traces are collected in hopes that they can be identified and associated with an individual or individuals to help solve that particular crime. However, the strongest conclusion for evidence traces is an association to a source, and unless a physical match of some kind is found, an individualization cannot be established even when a known sample is available. However, having …


Dynamic Prediction For Alternating Recurrent Events Using A Semiparametric Joint Frailty Model, Jaehyeon Yun Aug 2022

Statistical Science Theses and Dissertations

Alternating recurrent events data arise commonly in health research; examples include hospital admissions and discharges of diabetes patients; exacerbations and remissions of chronic bronchitis; and quitting and restarting smoking. Recent work has involved formulating and estimating joint models for the recurrent event times considering non-negligible event durations. However, prediction models for transition between recurrent events are lacking. We consider the development and evaluation of methods for predicting future events within these models. Specifically, we propose a tool for dynamically predicting transition between alternating recurrent events in real time. Under a flexible joint frailty model, we derive the predictive probability of …


Efficient Approaches To Steady State Detection In Multivariate Systems, Honglun Xu Aug 2022

Open Access Theses & Dissertations

Steady state detection is critically important in many engineering fields such as fault detection and diagnosis, process monitoring and control. However, most of the existing methods are designed for univariate signals. In this dissertation, we propose an efficient online steady state detection method for multivariate systems through a sequential Bayesian partitioning approach. The signal is modeled by a Bayesian piecewise constant mean and covariance model, and a recursive updating method is developed to calculate the posterior distributions analytically. The duration of the current segment is utilized to test the steady state. Insightful guidance is provided for hyperparameter selection. The effectiveness …


Characterizing Wildfire In The Frank Church Wilderness, Idaho, Between 1972-2012, Abigail Christine Axness Aug 2022

Boise State University Theses and Dissertations

I examined wildfire characteristics in the Frank Church Wilderness, central Idaho, between 1972 and 2012. Studying fire characteristics in the Frank Church Wilderness provides an opportunity to understand the history of wildfires in a federally designated wilderness area, largely devoid of management impacts with limited human access and activity. The ~958,000-hectare Frank Church Wilderness area encompasses the Middle Fork Salmon River. Vegetation cover ranges from high-elevation (~2500-3200 meters) mixed conifer forests in the headwaters to low-elevation (~600-1000 meters) sagebrush-steppe and ponderosa pine (Pinus ponderosa) forests. The Frank Church Wilderness is defined as unmanaged because effective fire suppression (e.g., vehicle …


Spurious Correlation Sestina, Jules Nyquist Jul 2022

Journal of Humanistic Mathematics

This is a sestina poem about Spurious Correlations with a magical realism angle for beginning students learning statistics for the first time during the COVID pandemic.


Reconstructing Historical Earthquake-Induced Tsunamis: Case Study Of 1820 Event Near South Sulawesi, Indonesia, Taylor Jole Paskett Jul 2022

Theses and Dissertations

We build on the method introduced by Ringer, et al., applying it to an 1820 event that happened near South Sulawesi, Indonesia. We utilize other statistical models to aid our Metropolis-Hastings sampler, including a Gaussian process which informs the prior. We apply the method to multiple possible fault zones to determine which fault is the most likely source of the earthquake and tsunami. After collecting nearly 80,000 samples, we find that between the two most likely fault zones, the Walanae fault zone matches the anecdotal accounts much better than Flores. However, to support the anecdotal data, both samplers tend toward …


Applications Of Machine Learning Algorithms In Materials Science And Bioinformatics, Mohammed Quazi Jun 2022

Mathematics & Statistics ETDs

The piezoelectric response has been a measure of interest in density functional theory (DFT) for micro-electromechanical systems (MEMS) since the inception of MEMS technology. Piezoelectric-based MEMS devices find wide applications in automobiles, mobile phones, healthcare devices, and silicon chips for computers, to name a few. Piezoelectric properties of doped aluminum nitride (AlN) have been under investigation in materials science for piezoelectric thin films because of its wide range of device applicability. In this research, using rigorous DFT calculations, high-throughput ab initio simulations for 23 AlN alloys are generated.

This research is the first to report strong enhancements of piezoelectric properties …


Generating A Dataset For Comparing Linear Vs. Non-Linear Prediction Methods In Education Research, Jack Mauro, Elena Martinez, Anna Bargagliotti May 2022

Honors Thesis

Machine learning is often used to build predictive models by extracting patterns from large data sets. Such techniques are increasingly being utilized to predict outcomes in the social sciences. One such application is predicting student success. Machine learning can be applied to predicting student acceptance and success in academia. Using these tools for education-related data analysis may enable the evaluation of programs, resources, and curriculum. Currently, research is needed to examine application, admissions, and retention data in order to address equity in college computer science programs. However, most student-level data sets contain sensitive data that cannot be made public. To …


The Efficacy Of The COVID-19 Vaccine In Mississippi, Ilyse Miriam Levy May 2022

Honors Theses

(Under the direction of Dr. Xin Dang)

By tracking and analyzing fifty-three weeks of COVID-19 data, this thesis analyzes the efficacy of the COVID-19 vaccine within the State of Mississippi. Over the course of these fifty-three weeks, I have also been able to calculate the confidence intervals for vaccination efficacy and the risk reduction due to vaccination by using data regarding the correlations between deaths and vaccination status, provided to me by the Mississippi Office of Epidemiology. My analysis demonstrates that the COVID-19 vaccine is effective not only in Mississippi but also …


An Examination Of The Statistics And Risk Management Concepts Behind The Patient Protection And Affordable Care Act (PPACA) Of 2010, Scott Sinclair May 2022

Undergraduate Honors Thesis Collection

The Patient Protection and Affordable Care Act (PPACA) is the overarching federal law that has impacted the intricacies of the health insurance market for more than a decade. Using the supervised learning method of multiple linear regression, the relationship between the medical loss ratio rebates and predictor variables such as the state, health insurance market, and the number of insurance companies owing rebates will be analyzed, along with the actuarial value of metal tiers and geographic rating area factors in terms of their relationship to the insurance premium for a standard family of four, defined as a forty-year-old couple with …


Forecasting Razorback Baseball Game Outcomes, Austin Raabe May 2022

Information Systems Undergraduate Honors Theses

Despite the disappointing end to the 2021 Arkansas Razorback baseball year, the team’s success provided hog fans something to look forward to next season. While they will be without the 2021 Golden Spikes Award winner, Kevin Kopps, and four All-SEC team selections, the 2022 roster has promising new and returning talent. With fifty percent of the players who played significant time last year coming back (minimum ten hits or ten innings pitched), the arrival of several impact transfers from major conferences, and a recruiting class ranked in the top five according to Perfect Game, there is reason to believe that …


Understanding And Improving The System: The Effects Of Weighting On The Accuracy Of Political Polling In Arkansas, Beck Williams May 2022

Political Science Undergraduate Honors Theses

In an effort to increase the accuracy of statewide political polling in Arkansas, we explore the statistical strategy of weighting with a focus on one yearly opinion poll: The Arkansas Poll. We conduct over 70 weighting experiments on the 2016 and 2020 Arkansas Polls using a variety of variables and opinion questions. From these experiments, we find that while some weighted variables tend to create larger changes, weighting typically results in a single-digit percentage change that does not substantially shift or “flip” the majorities. Due to a greater rate of change through weighting in the 2020 Poll compared to the …


CausalModels: An R Library For Estimating Causal Effects, Joshua Wolff Anderson May 2022

Computational and Data Sciences (MS) Theses

Free and open-source software for statistical modeling and machine learning has advanced productivity in data science significantly. Packages such as SciPy in Python and caret in R provide fundamental tools for statistical modeling and machine learning in the two most popular programming languages used by data scientists. Unfortunately, robust tools similar to these are limited in terms of causal inference. The tools in R that exist lack consistent and standardized methodologies and inputs. R lacks a comprehensive package that offers traditional causal inference methods such as standardization, IP weighting, G-estimation, outcome regression, and propensity matching in one common package. …


On Misuses Of The Kolmogorov–Smirnov Test For One-Sample Goodness-Of-Fit, Anthony Zeimbekakis Apr 2022

Honors Scholar Theses

The Kolmogorov–Smirnov (KS) test is one of the most popular goodness-of-fit tests for comparing a sample with a hypothesized parametric distribution. Nevertheless, it has often been misused. The standard one-sample KS test applies to independent, continuous data with a hypothesized distribution that is completely specified. It is not uncommon, however, to see in the literature that it was applied to dependent, discrete, or rounded data, with hypothesized distributions containing estimated parameters. For example, it has been "discovered" multiple times that the test is too conservative when the parameters are estimated. We demonstrate misuses of the one-sample KS test in three …
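As a reminder of what the one-sample test actually computes, the sketch below evaluates the KS statistic, the largest gap between the empirical CDF and a hypothesized CDF. The function name `ks_statistic` is hypothetical, and the code assumes the setting the abstract describes as standard: continuous data and a CDF that is completely specified in advance, not fitted to the same sample.

```python
def ks_statistic(sample, cdf):
    """One-sample Kolmogorov-Smirnov statistic D_n.

    sample : iterable of observations (assumed continuous, no ties)
    cdf    : fully specified hypothesized CDF, a callable x -> F(x)
    """
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x,
        # so check the gap on both sides of the jump.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d
```

If the parameters of `cdf` were instead estimated from `sample`, the usual critical values for this statistic would no longer apply, which is one of the misuses the abstract mentions.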


Analytical Study To Determine Significant Causes Of Increased No-Hitters In The 2021 Major League Baseball Season, Joel Robison Apr 2022

Honors Projects

Why were there so many no-hitters in the 2021 MLB season? This project focuses on possible significant causes of the record-breaking number of no-hitters pitched in the 2021 Major League Baseball season. Specifically, this project takes an analytical look at the recent trends in launch angles and spin rates to determine whether there are any significant causes of the increased number of no-hitters in baseball. The random nature and unpredictability of the game of baseball make it almost impossible to come to any solid conclusions.


Einstein-Roscoe Regression For The Slag Viscosity Prediction Problem In Steelmaking, Hiroto Saigo, Dukka KC, Noritaka Saito Apr 2022

Michigan Tech Publications

In classical machine learning, regressors are trained without attempting to gain insight into the mechanism connecting inputs and outputs. The natural sciences, however, are interested in finding a robust, interpretable function for the target phenomenon that can return predictions even outside of the training domains. This paper focuses on the viscosity prediction problem in steelmaking and proposes Einstein-Roscoe regression (ERR), which learns the coefficients of the Einstein-Roscoe equation and is able to extrapolate to unseen domains. In addition, it is often the case in the natural sciences that some measurements are unavailable or more expensive than others due to physical constraints. To this …


A Monte Carlo Analysis Of Seven Dichotomous Variable Confidence Interval Equations, Morgan Juanita Dubose Apr 2022

Masters Theses & Specialist Projects

Department of Psychological Sciences, Western Kentucky University

There are two options to estimate a range of likely values for the population mean of a continuous variable: one for when the population standard deviation is known and another for when the population standard deviation is unknown. There are seven proposed equations to calculate the confidence interval for the population mean of a dichotomous variable: normal approximation interval, Wilson interval, Jeffreys interval, Clopper-Pearson, Agresti-Coull, arcsine transformation, and logit transformation. In this study, I compared the percent effectiveness of each equation using a Monte Carlo analysis and the interval range over a range …
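Two of the seven intervals named above have short closed forms and can be sketched directly. The code below is an illustration using the standard textbook formulas for the normal approximation (Wald) and Wilson intervals at a nominal 95% level; the function names and the constant `Z95` are hypothetical and are not taken from the thesis.

```python
import math

Z95 = 1.959963984540054  # standard normal quantile for a two-sided 95% interval


def wald_interval(x, n, z=Z95):
    """Normal approximation interval for a proportion (x successes in n trials)."""
    p = x / n
    half = z * math.sqrt(p * (1 - p) / n)
    return p - half, p + half


def wilson_interval(x, n, z=Z95):
    """Wilson score interval for a proportion (x successes in n trials)."""
    p = x / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half
```

With zero successes the normal approximation collapses to a single point at 0, while the Wilson interval still has positive width; differences of this kind are what a Monte Carlo coverage comparison would be expected to pick up.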


Mixture Models In Machine Learning, Soumyabrata Pal Mar 2022

Doctoral Dissertations

Modeling with mixtures is a powerful method in the statistical toolkit that can be used for representing the presence of sub-populations within an overall population. In many applications ranging from financial models to genetics, a mixture model is used to fit the data. The primary difficulty in learning mixture models is that the observed data set does not identify the sub-population to which an individual observation belongs. Despite being studied for more than a century, the theoretical guarantees of mixture models remain unknown for several important settings. In this thesis, we look at three groups of problems. The first part …


Split Classification Model For Complex Clustered Data, Katherine Gerot Mar 2022

Honors Theses

Classification in high-dimensional data has generated tremendous interest in a multitude of fields. Data in higher dimensions often tend to reside in non-Euclidean metric space. This prevents Euclidean-based classification methodologies, such as regression, from reliably modeling the data. Many proposed models rely on computationally-complex embedding to convert the data to a more usable format. Others, namely the Support Vector Machine, rely on kernel manipulation to implicitly describe the "feature space" to arrive at a non-linear decision boundary. The proposed methodology in this paper seeks to classify complex data in a relatively computationally-simple and explainable manner.


So Long My Friend, Bryan Mcnair Jan 2022

Journal of Humanistic Mathematics

No abstract provided.


Many-Objective Evolutionary Algorithms: Objective Reduction, Decomposition And Multi-Modality., Monalisa Pal Dr. Jan 2022

Doctoral Theses

Evolutionary Algorithms (EAs) for Many-Objective Optimization (MaOO) problems are challenging in nature due to the requirement of a large population size, the difficulty of maintaining selection pressure towards global optima, and the inability to accurately visualize the high-dimensional Pareto-optimal Set (in decision space) and Pareto-Front (in objective space). The quality of the estimated set of Pareto-optimal solutions, resulting from the EAs for MaOO problems, is assessed in terms of proximity to the true surface (convergence) and uniformity and coverage of the estimated set over the true surface (diversity). With a larger number of objectives, the challenges become more profound. Thus, better strategies have …


Mathematical Formulations For Complex Resource Scheduling Problems., T. R. Lalita Dr. Jan 2022

Doctoral Theses

This thesis deals with the development of effective models for large-scale real-world resource scheduling problems. Efficient utilization of resources is crucial for any organization or industry as resources are often scarce. Scheduling them optimally can not only address the scarcity but also bring potential economic benefits. Optimal utilization of resources reduces costs and thereby provides a competitive edge in the business world. Resources can be of different types such as human (personnel, skilled and unskilled), financial (budgets), materials, infrastructure (airports and seaports with designed facilities, windmills, warehouse area, hotel rooms, etc.) and equipment (microprocessors, cranes, machinery, aircraft simulators for …


Analyzing Marriage Statistics As Recorded In The Journal Of The American Statistical Association From 1889 To 2012, Annalee Soohoo Jan 2022

CMC Senior Theses

The United States has been tracking American marriage statistics since its founding. According to the United States Census Bureau, “marital status and marital history data help federal agencies understand marriage trends, forecast future needs of programs that have spousal benefits, and measure the effects of policies and programs that focus on the well-being of families, including tax policies and financial assistance programs.”[1] With such a wide scope of applications, it is understandable why marriage statistics are so highly studied and well-documented.

This thesis will analyze American marriage patterns over the past 100 years as documented in the Journal of …


Finding The Best Predictors For Foot Traffic In US Seafood Restaurants, Isabel Paige Beaulieu Jan 2022

Honors Theses and Capstones

COVID-19 caused state and nationwide lockdowns, which altered human foot traffic, especially in restaurants. The seafood sector in particular suffered greatly: illegal fishing increased, its goods are perishable, it is seasonal in some places, and imports and exports slowed. Foot traffic data helps business owners decide how much to order, how many employees to schedule, etc. One issue is that the data is very expensive, hard to get, and not available until months after it is recorded. Our goal is to not only find covariates that …


Mary Eleanor Spear's Importance To The History Of Statistical Visualization, Melanie Williams Jan 2022

CMC Senior Theses

This paper will demonstrate why Mary Eleanor Spear (1897-1986) is an important figure in the history of statistical visualization. She led an impressive career working in the federal government as a data analyst before "data analyst" became a thing. She wrote and illustrated two comprehensive textbooks which furthered the art of statistical visualization. Her textbooks cover extensive graphing knowledge still valuable to statisticians and viewers today. Most notable of her works is her development of the box plot. In addition to Spear's career and contributions, this paper will also address the lack of female representation in science, technology, engineering, and …


A Monte Carlo Simulation Of Rat Choice Behavior With Interdependent Outcomes, Michelle A. Frankot Jan 2022

Graduate Theses, Dissertations, and Problem Reports

Preclinical behavioral neuroscience often uses choice paradigms to capture psychiatric symptoms. In particular, the subfield of operant research produces nested datasets with many discrete choices in a session. The standard analytic practice is to aggregate choice into a continuous variable and analyze using ANOVA or linear regression. However, choice data often have multiple interdependent outcomes of interest, violating an assumption of general linear models. The aim of the current study was to quantify the accuracy of linear mixed-effects regression (LMER) for analyzing data from a 4-choice operant task called the Rodent Gambling Task (RGT), which measures decision-making in the context …