Open Access. Powered by Scholars. Published by Universities.®

Statistics and Probability Commons

Statistical Theory

Articles 1 - 30 of 376

Full-Text Articles in Statistics and Probability

Machine Learning Approaches For Cyberbullying Detection, Roland Fiagbe Jan 2024

Data Science and Data Mining

Cyberbullying refers to the act of bullying using electronic means and the internet. In recent years, this act has been identified as a major problem among young people and even adults. It can negatively impact one’s emotions and lead to adverse outcomes like depression, anxiety, harassment, and suicide, among others. This has led to the need to employ machine learning techniques to automatically detect cyberbullying and prevent it on various social media platforms. In this study, we want to analyze the combination of some Natural Language Processing (NLP) algorithms (such as Bag-of-Words and TFIDF) with some popular machine learning …


Predicting Superconducting Critical Temperature Using Regression Analysis, Roland Fiagbe Jan 2024

Data Science and Data Mining

This project estimates a regression model to predict the superconducting critical temperature based on variables extracted from the superconductor’s chemical formula. The regression model, together with stepwise variable selection, gives a reasonably good predictive model with a lower prediction error (MSE). Variables extracted based on atomic radius, valence, atomic mass, and thermal conductivity appeared to contribute the most to the predictive model.


UConn Baseball Batting Order Optimization, Gavin Rublewski May 2023

Honors Scholar Theses

Challenging conventional wisdom is at the very core of baseball analytics. Using data and statistical analysis, the sets of rules by which coaches make decisions can be justified, or possibly refuted. One of those sets of rules relates to the construction of a batting order. Through data collection, data adjustment, the construction of a baseball simulator, and the use of a Monte Carlo Simulation, I have assessed thousands of possible batting orders to determine the roster-specific strategies that lead to optimal run production for the 2023 UConn baseball team. This paper details a repeatable process in which basic player statistics …


On Misuses Of The Kolmogorov–Smirnov Test For One-Sample Goodness-Of-Fit, Anthony Zeimbekakis Apr 2022

Honors Scholar Theses

The Kolmogorov–Smirnov (KS) test is one of the most popular goodness-of-fit tests for comparing a sample with a hypothesized parametric distribution. Nevertheless, it has often been misused. The standard one-sample KS test applies to independent, continuous data with a hypothesized distribution that is completely specified. It is not uncommon, however, to see in the literature that it was applied to dependent, discrete, or rounded data, with hypothesized distributions containing estimated parameters. For example, it has been "discovered" multiple times that the test is too conservative when the parameters are estimated. We demonstrate misuses of the one-sample KS test in three …


A Simple Algorithm For Generating A New Two Sample Type-II Progressive Censoring With Applications, E. M. Shokr, Rashad Mohamed El-Sagheer, Mahmoud Mansour, H. M. Faied, B. S. El-Desouky Jan 2022

Basic Science Engineering

In this article, we introduce a simple algorithm for generating a new type-II progressive censoring scheme for two samples. It is observed that the proposed algorithm can be applied to any continuous probability distribution. Moreover, the model description and necessary assumptions are discussed. In addition, the steps of the simple generation algorithm, along with programming steps, are constructed on a real example. Inference for two Weibull Frechet populations is discussed under the proposed algorithm. Both classical and Bayesian inferential approaches for the distribution parameters are discussed. Furthermore, approximate confidence intervals are constructed based on the asymptotic distribution of the maximum …


Using Stability To Select A Shrinkage Method, Dean Dustin May 2020

Department of Statistics: Dissertations, Theses, and Student Work

Shrinkage methods are estimation techniques based on optimizing expressions to find which variables to include in an analysis, typically a linear regression. The general form of these expressions is the sum of an empirical risk plus a complexity penalty based on the number of parameters. Many shrinkage methods are known to satisfy an ‘oracle’ property meaning that asymptotically they select the correct variables and estimate their coefficients efficiently. In Section 1.2, we show oracle properties in two general settings. The first uses a log likelihood in place of the empirical risk and allows a general class of penalties. The second …


On Arnold–Villasenor Conjectures For Characterizing Exponential Distribution Based On Sample Of Size Three, George Yanev May 2020

School of Mathematical and Statistical Sciences Faculty Publications and Presentations

Arnold and Villasenor [4] obtain a series of characterizations of the exponential distribution based on random samples of size two. These results were already applied in constructing goodness-of-fit tests. Extending the techniques from [4], we prove some of Arnold and Villasenor’s conjectures for samples of size three. An example with simulated data is discussed.


Personal Foul: How Head Trauma And The Insurance Industry Are Threatening Sports, Zachary Cooler Apr 2020

Senior Honors Theses

This thesis will investigate the growing problem of head trauma in contact sports like football, hockey, and soccer through medical studies, implications to the insurance industry, and ongoing litigation. The thesis will investigate medical studies that are finding more evidence to support the claim that contact sports players are more likely to receive head trauma symptoms such as memory loss, mood swings, and even Lou Gehrig’s disease in extreme cases. The thesis will also demonstrate that these medical symptoms and monetary losses from medical claims are convincing insurance companies to withdraw insurance coverage for sports leagues, which they are justifying …


A Monte Carlo Analysis Of Standard Error-Based Methods For Computing Confidence Intervals, Elayna Wichert Apr 2020

Masters Theses & Specialist Projects

The objective of this study is to empirically test existing techniques to calculate the likely range of values for a Classical Test Theory true score given an observed score. The traditional method for forming these confidence intervals has used the standard error of measurement (SEM) as the basis for this confidence interval. An alternate equation, the standard error of estimate (SEE), has been recommended in place of the SEM for this purpose, yet it remains overlooked in the field of psychometrics. It is important that the correct equation be used in various applications in personnel psychology. Monte Carlo analyses were …


Inferences For Weibull-Gamma Distribution In Presence Of Partially Accelerated Life Test, Mahmoud Mansour, M A W Mahmoud Prof., Rashad El-Sagheer Mar 2020

Basic Science Engineering

This paper considers point and interval estimation for the parameters of the Weibull-Gamma distribution (WGD) using a progressively Type-II censored (PROG-II-C) sample under a step-stress partially accelerated life test (SSPALT) model. The maximum likelihood (ML), Bayes, and four parametric bootstrap methods are used to obtain point estimates of the distribution parameters and the acceleration factor. Furthermore, approximate confidence intervals (ACIs), four bootstrap confidence intervals, and credible intervals of the estimators are obtained. The Bayes estimates are computed under the squared error loss (SEL) function using Markov Chain Monte Carlo (MCMC) …


Generalized Matrix Decomposition Regression: Estimation And Inference For Two-Way Structured Data, Yue Wang, Ali Shojaie, Tim Randolph, Jing Ma Dec 2019

UW Biostatistics Working Paper Series

Analysis of two-way structured data, i.e., data with structures among both variables and samples, is becoming increasingly common in ecology, biology, and neuroscience. Classical dimension-reduction tools, such as the singular value decomposition (SVD), may perform poorly for two-way structured data. The generalized matrix decomposition (GMD, Allen et al., 2014) extends the SVD to two-way structured data and thus constructs singular vectors that account for both structures. While the GMD is a useful dimension-reduction tool for exploratory analysis of two-way structured data, it is unsupervised and cannot be used to assess the association between such data and an outcome of interest. …


Statistical Inference For Networks Of High-Dimensional Point Processes, Xu Wang, Mladen Kolar, Ali Shojaie Dec 2019

UW Biostatistics Working Paper Series

Fueled in part by recent applications in neuroscience, high-dimensional Hawkes processes have become a popular tool for modeling the network of interactions among multivariate point process data. While evaluating the uncertainty of the network estimates is critical in scientific applications, existing methodological and theoretical work has focused only on estimation. To bridge this gap, this paper proposes a high-dimensional statistical inference procedure with theoretical guarantees for multivariate Hawkes processes. Key to this inference procedure is a new concentration inequality on the first- and second-order statistics for integrated stochastic processes, which summarizes the entire history of the process. We apply this …


Optimal Design For A Causal Structure, Zaher Kmail Aug 2019

Department of Statistics: Dissertations, Theses, and Student Work

Linear models and mixed models are important statistical tools. But in many natural phenomena, there is more than one endogenous variable involved and these variables are related in a sophisticated way. Structural Equation Modeling (SEM) is often used to model the complex relationships between the endogenous and exogenous variables. It was first implemented in research to estimate the strength and direction of direct and indirect effects among variables and to measure the relative magnitude of each causal factor.

Historically, traditional optimal design theory focuses on univariate linear, nonlinear, and mixed models. There is no current literature on the subject of …


Unified Methods For Feature Selection In Large-Scale Genomic Studies With Censored Survival Outcomes, Lauren Spirko-Burns, Karthik Devarajan Mar 2019

COBRA Preprint Series

One of the major goals in large-scale genomic studies is to identify genes with a prognostic impact on time-to-event outcomes which provide insight into the disease's process. With rapid developments in high-throughput genomic technologies in the past two decades, the scientific community is able to monitor the expression levels of tens of thousands of genes and proteins resulting in enormous data sets where the number of genomic features is far greater than the number of subjects. Methods based on univariate Cox regression are often used to select genomic features related to survival outcome; however, the Cox model assumes proportional hazards …


Non Parametric Test For Testing Exponentiality Against Exponential Better Than Used In Laplace Transform Order, Mahmoud Mansour, M A W Mahmoud Prof. Mar 2019

Basic Science Engineering

In this paper, a test statistic for testing exponentiality against exponential better than used in Laplace transform order (EBUL), based on the Laplace transform technique, is proposed. Pitman’s asymptotic efficiency of our test is calculated and compared with other tests. The percentiles of this test are tabulated. The powers of the test are estimated for distributions commonly used in aging problems. In the case of censored data, our test is applied and the percentiles are also calculated and tabulated. Finally, real examples in different areas are utilized as practical applications for the proposed test.


Controlling For Confounding Via Propensity Score Methods Can Result In Biased Estimation Of The Conditional AUC: A Simulation Study, Hadiza I. Galadima, Donna K. McClish Jan 2019

Community & Environmental Health Faculty Publications

In the medical literature, there has been an increased interest in evaluating the association between exposure and outcomes using nonrandomized observational studies. However, because assignments to exposure are not random in observational studies, comparisons of outcomes between exposed and nonexposed subjects must account for the effect of confounders. Propensity score methods have been widely used to control for confounding when estimating the exposure effect. Previous studies have shown that conditioning on the propensity score results in biased estimation of the conditional odds ratio and hazard ratio. However, research is lacking on the performance of propensity score methods for covariate adjustment when estimating the …


On The Performance Of Some Poisson Ridge Regression Estimators, Cynthia Zaldivar Mar 2018

FIU Electronic Theses and Dissertations

Multiple regression models play an important role in analyzing and making predictions about data. Prediction accuracy becomes lower when two or more explanatory variables in the model are highly correlated. One solution is to use ridge regression. The purpose of this thesis is to study the performance of available ridge regression estimators for Poisson regression models in the presence of moderately to highly correlated variables. As performance criteria, we use mean square error (MSE), mean absolute percentage error (MAPE), and percentage of times the maximum likelihood (ML) estimator produces a higher MSE than the ridge regression estimator. A Monte Carlo …


Gilmore Girls And Instagram: A Statistical Look At The Popularity Of The Television Show Through The Lens Of An Instagram Page, Brittany Simmons May 2017

Student Scholar Symposium Abstracts and Posters

After going on the Warner Brothers Tour in December of 2015, I created a Gilmore Girls Instagram account. This account, which started off as a way for me to create edits of the show and post my photos from the tour, turned into something bigger than I ever could have imagined. In just over a year, I have gained over 55,000 followers. I post content including revival news, merchandise, and edits of the show that have been featured in Entertainment Weekly, Bustle, E! News, People Magazine, Yahoo News, & GilmoreNews.

I created a dataset of qualitative and quantitative outcomes from my …


Inference On The Stress-Strength Model From Weibull Gamma Distribution, Mahmoud Mansour, Rashad El-Sagheer, M. A. W. Mahmoud Prof. May 2017

Basic Science Engineering

No abstract provided.


High-Dimensional Repeated Measures, Martin Happ, Solomon W. Harrar, Arne C. Bathke Apr 2017

Statistics Faculty Publications

Recently, new tests for main and simple treatment effects, time effects, and treatment by time interactions in possibly high-dimensional multigroup repeated-measures designs with unequal covariance matrices have been proposed. Technical details for using more than one between-subject and more than one within-subject factor are presented in this article. Furthermore, application to electroencephalography (EEG) data of a neurological study with two whole-plot factors (diagnosis and sex) and two subplot factors (variable and region) is shown with the R package HRM (high-dimensional repeated measures).


Evaluation Of Progress Towards The UNAIDS 90-90-90 HIV Care Cascade: A Description Of Statistical Methods Used In An Interim Analysis Of The Intervention Communities In The SEARCH Study, Laura Balzer, Joshua Schwab, Mark J. van der Laan, Maya L. Petersen Feb 2017

U.C. Berkeley Division of Biostatistics Working Paper Series

WHO guidelines call for universal antiretroviral treatment, and UNAIDS has set a global target to virally suppress most HIV-positive individuals. Accurate estimates of population-level coverage at each step of the HIV care cascade (testing, treatment, and viral suppression) are needed to assess the effectiveness of "test and treat" strategies implemented to achieve this goal. The data available to inform such estimates, however, are susceptible to informative missingness: the number of HIV-positive individuals in a population is unknown; individuals tested for HIV may not be representative of those whom a testing intervention fails to reach, and HIV-positive individuals with a viral …


Stochastic Optimization Of Adaptive Enrichment Designs For Two Subpopulations, Aaron Fisher, Michael Rosenblum Dec 2016

Johns Hopkins University, Dept. of Biostatistics Working Papers

An adaptive enrichment design is a randomized trial that allows enrollment criteria to be modified at interim analyses, based on a preset decision rule. When there is prior uncertainty regarding treatment effect heterogeneity, these trial designs can provide improved power for detecting treatment effects in subpopulations. We present a simulated annealing approach to search over the space of decision rules and other parameters for an adaptive enrichment design. The goal is to minimize the expected number enrolled or expected duration, while preserving the appropriate power and Type I error rate. We also explore the benefits of parallel computation in the …


On Some Test Statistics For Testing The Population Skewness And Kurtosis: An Empirical Study, Yawen Guo Aug 2016

FIU Electronic Theses and Dissertations

The purpose of this thesis is to propose some test statistics for testing the skewness and kurtosis parameters of a distribution, not limited to a normal distribution. Since a theoretical comparison is not possible, a simulation study has been conducted to compare the performance of the test statistics. We have compared both parametric methods (classical method with normality assumption) and non-parametric methods (bootstrap in Bias Corrected Standard Method, Efron’s Percentile Method, Hall’s Percentile Method and Bias Corrected Percentile Method). Our simulation results for testing the skewness parameter indicate that the power of the tests differs significantly across sample sizes, the …


Conditional Screening For Ultra-High Dimensional Covariates With Survival Outcomes, Hyokyoung Grace Hong, Jian Kang, Yi Li Mar 2016

The University of Michigan Department of Biostatistics Working Paper Series

Identifying important biomarkers that are predictive for cancer patients' prognosis is key in gaining better insights into the biological influences on the disease and has become a critical component of precision medicine. The emergence of large-scale biomedical survival studies, which typically involve an excessive number of biomarkers, has brought high demand for designing efficient screening tools for selecting predictive biomarkers. The vast number of biomarkers defies any existing variable selection methods via regularization. The recently developed variable screening methods, though powerful in many practical settings, fail to incorporate prior information on the importance of each biomarker and are less powerful in …


Models For HSV Shedding Must Account For Two Levels Of Overdispersion, Amalia Magaret Jan 2016

UW Biostatistics Working Paper Series

We have frequently implemented crossover studies to evaluate new therapeutic interventions for genital herpes simplex virus infection. The outcome measured to assess the efficacy of interventions on herpes disease severity is the viral shedding rate, defined as the frequency of detection of HSV on the genital skin and mucosa. We performed a simulation study to ascertain whether our standard model, which we have used previously, was appropriately considering all the necessary features of the shedding data to provide correct inference. We simulated shedding data under our standard, validated assumptions and assessed the ability of 5 different models to reproduce the …


Inequality In Treatment Benefits: Can We Determine If A New Treatment Benefits The Many Or The Few?, Emily Huang, Ethan Fang, Daniel Hanley, Michael Rosenblum Dec 2015

Johns Hopkins University, Dept. of Biostatistics Working Papers

The primary analysis in many randomized controlled trials focuses on the average treatment effect and does not address whether treatment benefits are widespread or limited to a select few. This problem affects many disease areas, since it stems from how randomized trials, often the gold standard for evaluating treatments, are designed and analyzed. Our goal is to learn about the fraction who benefit from a treatment, based on randomized trial data. We consider the case where the outcome is ordinal, with binary outcomes as a special case. In general, the fraction who benefit is a non-identifiable parameter, and the best …


C-Learning: A New Classification Framework To Estimate Optimal Dynamic Treatment Regimes, Baqun Zhang, Min Zhang Aug 2015

The University of Michigan Department of Biostatistics Working Paper Series

Personalizing treatment to accommodate patient heterogeneity and the evolving nature of a disease over time has received considerable attention lately. A dynamic treatment regime is a set of decision rules, each corresponding to a decision point, that determine the next treatment based on each individual’s own available characteristics and treatment history up to that point. We show that identifying the optimal dynamic treatment regime can be recast as a sequential classification problem and is equivalent to sequentially minimizing a weighted expected misclassification error. This general classification perspective targets the exact goal of optimally individualizing treatments and is new and fundamentally …


Bootstrapping Vs. Asymptotic Theory In Property And Casualty Loss Reserving, Andrew J. DiFronzo Jr. Apr 2015

Honors Projects in Mathematics

One of the key functions of a property and casualty (P&C) insurance company is loss reserving, which calculates how much money the company should retain in order to pay out future claims. Most P&C insurance companies use non-stochastic (non-random) methods to estimate these future liabilities. However, future loss data can also be projected using generalized linear models (GLMs) and stochastic simulation. Two simulation methods that will be the focus of this project are: bootstrapping methodology, which resamples the original loss data (creating pseudo-data in the process) and fits the GLM parameters based on the new data to estimate the sampling …


Best Practice Recommendations For Data Screening, Justin A. DeSimone, Peter D. Harms, Alice J. DeSimone Feb 2015

Department of Management: Faculty Publications

Survey respondents differ in their levels of attention and effort when responding to items. There are a number of methods researchers may use to identify respondents who fail to exert sufficient effort in order to increase the rigor of analysis and enhance the trustworthiness of study results. Screening techniques are organized into three general categories, which differ in impact on survey design and potential respondent awareness. Assumptions and considerations regarding appropriate use of screening techniques are discussed along with descriptions of each technique. The utility of each screening technique is a function of survey design and administration. Each technique has …


Statistical Inference For The Mean Outcome Under A Possibly Non-Unique Optimal Treatment Strategy, Alexander R. Luedtke, Mark J. van der Laan Dec 2014

U.C. Berkeley Division of Biostatistics Working Paper Series

We consider challenges that arise in the estimation of the value of an optimal individualized treatment strategy defined as the treatment rule that maximizes the population mean outcome, where the candidate treatment rules are restricted to depend on baseline covariates. We prove a necessary and sufficient condition for the pathwise differentiability of the optimal value, a key condition needed to develop a regular asymptotically linear (RAL) estimator of this parameter. The stated condition is slightly more general than the previous condition implied in the literature. We then describe an approach to obtain root-n rate confidence intervals for the optimal value …