Open Access. Powered by Scholars. Published by Universities.®

Statistical Methodology Commons


960 Full-Text Articles · 1,368 Authors · 242,218 Downloads · 76 Institutions

All Articles in Statistical Methodology


960 full-text articles. Page 1 of 27.

A Comparison Of R, SAS, And Python Implementations Of Random Forests, Breckell Soifua 2018 Utah State University


All Graduate Plan B and other Reports

The Random Forest method is a useful machine learning tool developed by Leo Breiman. Many implementations exist across programming languages; the most popular are in R, SAS, and Python. In this paper, we conduct a comprehensive comparison of these implementations with regard to accuracy, variable importance measurements, and timing. The comparison was carried out on a variety of real and simulated data with different classification difficulty levels, numbers of predictors, and sample sizes, and it reveals unexpectedly different results across the three implementations.
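The kind of accuracy, importance, and timing benchmark described can be illustrated with a minimal Python sketch using scikit-learn (an illustration only — the simulated-data settings below are assumptions, and the report's actual R, SAS, and Python configurations are not reproduced):

```python
# Illustrative benchmark of one random forest implementation
# (scikit-learn); data sizes and settings are assumptions,
# not those of the report.
import time
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20,
                           n_informative=5, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          random_state=42)

start = time.perf_counter()
rf = RandomForestClassifier(n_estimators=200, random_state=42)
rf.fit(X_tr, y_tr)
elapsed = time.perf_counter() - start          # timing comparison point

accuracy = rf.score(X_te, y_te)                # accuracy comparison point
importances = rf.feature_importances_          # variable importance point
print(f"fit time {elapsed:.2f}s, test accuracy {accuracy:.3f}")
```

Repeating the same split and metrics in the other two languages (e.g. R's randomForest package and a SAS forest procedure) would give the three-way comparison the report performs.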


Improving Shewhart Control Chart Performance In The Presence Of Measurement Error Using Multiple Measurements And Two-Stage Sampling, Kenneth W. Linna 2018 Auburn University Montgomery


Journal of International & Interdisciplinary Business Research

The usual Shewhart control chart efficiently detects large shifts in the mean of a quality characteristic and has been extensively studied in the literature. Most proposed alternatives to the Shewhart chart aim either to improve signal performance for smaller mean shifts or to reduce the sampling effort required to detect a larger shift. Measurement error has been shown in the literature to reduce the power to detect process shifts. The combination of multiple measurements and two-stage sampling is considered here as a strategy both for regaining power lost to measurement error and for specifically tuning the charts for shifts ...
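The first ingredient, multiple measurements per item, can be sketched in a short simulation (the numbers are illustrative assumptions; the two-stage sampling component and the paper's actual tuning are not reproduced): averaging m repeat measurements shrinks the measurement-error variance by a factor of m, restoring some detection power.

```python
# Shewhart chart power under measurement error: each plotted point is
# the mean of m repeat measurements of one item. Illustrative only.
import numpy as np

rng = np.random.default_rng(0)
sigma_p, sigma_m = 1.0, 1.0      # process and measurement-error std devs
shift, n_points = 2.0, 20000     # true mean shift; simulated chart points

def signal_rate(m):
    """Fraction of points beyond the 3-sigma limits when the process
    mean has shifted, using m measurements per item."""
    true = rng.normal(shift, sigma_p, size=n_points)
    meas = true[:, None] + rng.normal(0.0, sigma_m, size=(n_points, m))
    x = meas.mean(axis=1)
    sd = np.sqrt(sigma_p**2 + sigma_m**2 / m)   # in-control std of a point
    return np.mean(np.abs(x) > 3 * sd)          # limits centered at 0

r1, r5 = signal_rate(1), signal_rate(5)
print(r1, r5)   # more measurements per item -> higher detection rate
```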


Inversion Copulas From Nonlinear State Space Models With An Application To Inflation Forecasting, Michael S. Smith, Worapree Ole Maneesoonthorn 2018 Melbourne Business School


Michael Stanley Smith

We propose the construction of copulas through the inversion of nonlinear state space models. These copulas allow for new time series models that have the same serial dependence structure as a state space model, but with an arbitrary marginal distribution, and flexible density forecasts. We examine the time series properties of the copulas, outline serial dependence measures, and estimate the models using likelihood-based methods. Copulas constructed from three example state space models are considered: a stochastic volatility model with an unobserved component, a Markov switching autoregression, and a Gaussian linear unobserved component model. We show that all three inversion copulas ...
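For the simplest of the three cases, a Gaussian linear latent process, the inversion idea can be sketched in a few lines (a toy illustration with assumed parameters, not the authors' estimation machinery): simulate the latent series, map it to uniforms through its marginal CDF, then impose any desired margin.

```python
# Copula by inversion of a stationary Gaussian AR(1): the uniforms keep
# the AR(1) serial dependence while the final margin is arbitrary
# (exponential here). Parameters are illustrative assumptions.
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(1)
phi, n = 0.8, 5000
z = np.empty(n)
z[0] = rng.normal(0.0, 1.0 / np.sqrt(1 - phi**2))  # stationary start
for t in range(1, n):
    z[t] = phi * z[t - 1] + rng.normal()

std_normal_cdf = np.vectorize(lambda t: 0.5 * (1 + erf(t / sqrt(2))))
u = std_normal_cdf(z * np.sqrt(1 - phi**2))  # copula uniforms
y = -np.log1p(-u)                            # arbitrary margin: exponential(1)
acf = np.corrcoef(u[:-1], u[1:])[0, 1]       # serial dependence survives
```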


Discrete Ranked Set Sampling, Heng Cui 2018 Southern Methodist University


Statistical Science Theses and Dissertations

Ranked set sampling (RSS) is an efficient data collection framework compared to simple random sampling (SRS). It is widely used in application areas such as agriculture, the environment, sociology, and medicine, especially in situations where measurement is expensive but ranking is less costly. Most past research in RSS has focused on situations where the underlying distribution is continuous. However, it is not unusual to have a discrete data generation mechanism. Estimating statistical functionals is challenging, as ties may genuinely exist in discrete RSS. In this thesis, we started with estimating the cumulative distribution function (CDF) in discrete RSS. We proposed two ...
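For background, here is a minimal continuous-RSS sketch under the common perfect-ranking assumption (an illustration only; the thesis's discrete-RSS CDF estimators are not reproduced). For each rank i, a set of k units is drawn and only the i-th smallest is measured; the RSS mean is then less variable than an SRS mean of the same size:

```python
# Ranked set sampling vs simple random sampling, perfect ranking assumed.
import numpy as np

rng = np.random.default_rng(2)
k, cycles, reps = 5, 4, 2000      # set size, cycles per sample, replicates

def rss_sample():
    """One RSS sample of size k*cycles from a standard normal."""
    out = []
    for _ in range(cycles):
        for i in range(k):
            s = np.sort(rng.normal(size=k))
            out.append(s[i])      # measure only the i-th order statistic
    return np.array(out)

n = k * cycles
rss_means = [rss_sample().mean() for _ in range(reps)]
srs_means = [rng.normal(size=n).mean() for _ in range(reps)]
print(np.var(rss_means), np.var(srs_means))  # RSS mean is less variable
```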


The Perfect Professor, Samuel Legere 2018 Bryant University


Honors Projects in Mathematics

The purpose of this project is to identify the similarities and differences between student and faculty perspectives of "The Perfect Professor". Student preferences for professor characteristics may vary, but what is it about the students that causes them to think or feel this way? These predictors may include a student's major, age, gender, race, or even whether or not they are an athlete. I conducted a survey of both students and faculty to identify these correlated qualities and gather data, in hopes of taking the first steps that data analysts may then use to match ...


Evaluation Of Using The Bootstrap Procedure To Estimate The Population Variance, Nghia Trong Nguyen 2018 Stephen F Austin State University


Electronic Theses and Dissertations

The bootstrap procedure is widely used in nonparametric statistics to generate an empirical sampling distribution for a statistic of interest from a given sample data set. Generally, the results are good for location parameters such as the population mean and median, and even for estimating a population correlation. However, the results for a population variance, which is a spread parameter, are not as good, due to the resampling nature of the bootstrap method. Bootstrap samples are constructed using sampling with replacement; consequently, groups of repeated observations with zero variance appear in these samples. As a result, a bootstrap variance estimator will carry a ...
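The downward bias being described is easy to reproduce in a small simulation (the normal data and sizes below are illustrative assumptions, not the thesis's full evaluation): averaging the plug-in variance over bootstrap resamples of a small sample lands well under the true population variance.

```python
# Bootstrap plug-in variance on small samples: repeated observations in
# the resamples shrink the spread, biasing the estimator downward.
import numpy as np

rng = np.random.default_rng(3)
n, B, reps = 10, 500, 400         # sample size, resamples, replicates

est = []
for _ in range(reps):
    x = rng.normal(0.0, 1.0, size=n)        # true population variance = 1
    boots = rng.choice(x, size=(B, n), replace=True)
    est.append(boots.var(axis=1).mean())    # mean plug-in variance
boot_var_est = float(np.mean(est))
print(boot_var_est)   # noticeably below the true value of 1
```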


Analysis Challenges For High Dimensional Data, Bangxin Zhao 2018 The University of Western Ontario


Electronic Thesis and Dissertation Repository

In this thesis, we propose new methodologies targeting the areas of high-dimensional variable screening, influence measures, and post-selection inference. We propose a new estimator for the correlation between the response and high-dimensional predictor variables, and based on this estimator we develop a new screening technique termed Dynamic Tilted Current Correlation Screening (DTCCS) for high-dimensional variable screening. DTCCS is capable of picking up the relevant predictor variables within a finite number of steps. The DTCCS method includes the popularly used sure independence screening (SIS) method and the high-dimensional ordinary least squares projection (HOLP) approach as special cases.
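DTCCS itself is the thesis's contribution and is not reproduced here, but its stated special case, sure independence screening, is simple to sketch (the simulated sizes and coefficients are assumptions): rank predictors by absolute marginal correlation with the response and keep the top d.

```python
# Sure independence screening (SIS): marginal-correlation ranking.
import numpy as np

rng = np.random.default_rng(4)
n, p, d = 300, 1000, 20           # samples, predictors, screening size
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:3] = [3.0, -2.0, 2.0]       # only the first 3 predictors matter
y = X @ beta + rng.normal(size=n)

Xc = X - X.mean(axis=0)
yc = y - y.mean()
corr = np.abs(Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
keep = np.argsort(corr)[::-1][:d]  # top-d predictors by |marginal corr|
print(sorted(int(i) for i in keep[:3]))
```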

Two methods ...


Initial Evidence Of Construct Validity Of Data From A Self-Assessment Instrument Of Technological Pedagogical Content Knowledge (TPACK) In 2-Year Public College Faculty In Texas, Kristin C. Scott 2018 University of Texas at Tyler


Human Resource Development Theses and Dissertations

Technological pedagogical content knowledge (TPACK) has been studied in K-12 faculty in the U.S. and around the world using survey methodology. Very few studies of TPACK in post-secondary faculty have been conducted, and no peer-reviewed studies in U.S. post-secondary faculty have been published to date. The present study is the first examination of the reliability and validity of data from a TPACK survey conducted with a large sample of U.S. post-secondary faculty. The professorate of 2-year public college faculty in Texas will help their institutions meet the goals of the state’s higher education strategic plan, 60x30TX. In ...


Using Random Forests To Describe Equity In Higher Education: A Critical Quantitative Analysis Of Utah’s Postsecondary Pipelines, Tyler McDaniel 2018 University of Utah


Butler Journal of Undergraduate Research

The following work examines the Random Forest (RF) algorithm as a tool for predicting student outcomes and interrogating the equity of postsecondary education pipelines. The RF model, created using longitudinal data on 41,303 students from Utah's 2008 high school graduation cohort, is compared to logistic and linear models, which are commonly used to predict college access and success. Substantively, this work finds high school GPA to be the best predictor of postsecondary GPA, whereas commonly used ACT and AP test scores are not nearly as important. Each model identified several demographic disparities in higher education access, most significantly ...


The Devil You Don’t Know: A Spatial Analysis Of Crime At Newark’s Prudential Center On Hockey Game Days, Justin Kurland, Eric Piza 2018 Institute for Security and Crime Science - University of Waikato


Journal of Sport Safety and Security

Inspired by empirical research on spatial crime patterns in and around sports venues in the United Kingdom, this paper sought to measure the criminogenic extent of 216 hockey games that took place at the Prudential Center in Newark, NJ between 2007 and 2016. Do games generate patterns of crime in the areas beyond the arena, and if so, for what type of crime and how far? Police-recorded data for Newark are examined using a variety of exploratory methods and non-parametric permutation tests to visualize differences in crime patterns between game and non-game days across all of Newark and the downtown area. Change ...


The Influence Of A Proposed Margin Criterion On The Accuracy Of Parallel Analysis In Conditions Engendering Underextraction, Justin M. Jones 2018 Western Kentucky University


Masters Theses & Specialist Projects

One of the most important decisions to make when performing an exploratory factor or principal component analysis regards the number of factors to retain. Parallel analysis is considered to be the best course of action in these circumstances as it consistently outperforms other factor extraction methods (Zwick & Velicer, 1986). Even so, parallel analysis could benefit from further research and refinement to improve its accuracy. Characteristics such as factor loadings, correlations between factors, and number of variables per factor all have been shown to adversely impact the effectiveness of parallel analysis as a means of identifying the number of factors (Pearson ...
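For readers unfamiliar with the procedure, Horn's parallel analysis retains the components whose sample eigenvalues exceed those obtained from random uncorrelated data of the same dimensions; a minimal sketch follows (the one-factor data below is an illustrative assumption, not the thesis's study design):

```python
# Parallel analysis: compare sample eigenvalues of the correlation
# matrix with mean eigenvalues from random data of the same shape.
import numpy as np

rng = np.random.default_rng(5)
n, p, n_sims = 300, 6, 200

f = rng.normal(size=(n, 1))               # one common factor
X = 0.8 * f + 0.6 * rng.normal(size=(n, p))

eig = np.sort(np.linalg.eigvalsh(np.corrcoef(X.T)))[::-1]
rand = np.array([
    np.sort(np.linalg.eigvalsh(np.corrcoef(rng.normal(size=(n, p)).T)))[::-1]
    for _ in range(n_sims)
])
threshold = rand.mean(axis=0)             # mean random eigenvalues
n_retained = int(np.sum(eig > threshold)) # components to retain
print(n_retained)
```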


Developing Statistical Methods For Data From Platforms Measuring Gene Expression, Gaoxiang Jia 2018 Southern Methodist University


Statistical Science Theses and Dissertations

This research contains two topics: (1) PBNPA: a permutation-based non-parametric analysis of CRISPR screen data; (2) RCRnorm: an integrated system of random-coefficient hierarchical regression models for normalizing NanoString nCounter data from FFPE samples.

Clustered regularly-interspaced short palindromic repeats (CRISPR) screens are usually implemented in cultured cells to identify genes with critical functions. Although several methods have been developed or adapted to analyze CRISPR screening data, no single specific algorithm has gained popularity. Thus, rigorous procedures are needed to overcome the shortcomings of existing algorithms. We developed a Permutation-Based Non-Parametric Analysis (PBNPA) algorithm, which computes p-values at the gene level ...
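PBNPA's gene-level details are not reproduced here; the underlying permutation p-value idea can be sketched generically (the simulated readouts, effect size, and group sizes are assumptions):

```python
# Permutation p-value: compare an observed group difference with its
# null distribution from label-shuffled data.
import numpy as np

rng = np.random.default_rng(6)
treated = rng.normal(-2.0, 1.0, size=30)   # e.g. guide-level readouts
control = rng.normal(0.0, 1.0, size=30)

obs = treated.mean() - control.mean()
pooled = np.concatenate([treated, control])
n_perm, n_t = 5000, len(treated)

null = np.empty(n_perm)
for b in range(n_perm):
    perm = rng.permutation(pooled)
    null[b] = perm[:n_t].mean() - perm[n_t:].mean()

# two-sided p-value, with +1 so it can never be exactly zero
p = (1 + np.sum(np.abs(null) >= abs(obs))) / (n_perm + 1)
print(p)
```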


Robust Estimation Of The Average Treatment Effect In Alzheimer's Disease Clinical Trials, Michael Rosenblum, Aidan McDermont, Elizabeth Colantuoni 2018 Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health


Johns Hopkins University, Dept. of Biostatistics Working Papers

The primary analysis of Alzheimer's disease clinical trials often involves a mixed-model repeated measures (MMRM) approach. We consider another estimator of the average treatment effect, called targeted minimum loss-based estimation (TMLE). This estimator is more robust to violations of assumptions about missing data than MMRM.

We compare TMLE versus MMRM by analyzing data from a completed Alzheimer's disease trial data set and by simulation studies. The simulations involved different missing data distributions, where loss to followup at a given visit could depend on baseline variables, treatment assignment, and the outcome measured at previous visits. The TMLE generally ...


Multivariate Spectral Analysis Of Crism Data To Characterize The Composition Of Mawrth Vallis, Melissa Luna 2018 Wesleyan University


Melissa Luna

No abstract provided.


Incorporating Historical Models With Adaptive Bayesian Updates, Philip S. Boonstra, Ryan P. Barbaro 2018 The University Of Michigan


The University of Michigan Department of Biostatistics Working Paper Series

This paper considers Bayesian approaches for incorporating information from a historical model into a current analysis when the historical model includes only a subset of covariates currently of interest. The statistical challenge is two-fold. First, the parameters in the nested historical model are not generally equal to their counterparts in the larger current model, neither in value nor interpretation. Second, because the historical information will not be equally informative for all parameters in the current analysis, additional regularization may be required beyond that provided by the historical information. We propose several novel extensions of the so-called power prior that adaptively ...
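The adaptive machinery is the paper's contribution; the basic fixed-weight power prior it extends can be shown in a conjugate toy example (normal data with known unit variance and a flat initial prior are simplifying assumptions): the historical likelihood is raised to a power a0 in [0, 1], so the posterior mean is a precision-weighted average of the two samples.

```python
# Fixed-weight power prior, conjugate normal-mean toy example.
import numpy as np

def power_prior_posterior(y_cur, y_hist, a0):
    """Posterior mean and sd for a normal mean (known sd = 1, flat
    initial prior) when the historical likelihood is weighted by a0."""
    prec = len(y_cur) + a0 * len(y_hist)        # posterior precision
    mean = (y_cur.sum() + a0 * y_hist.sum()) / prec
    return mean, 1.0 / np.sqrt(prec)

rng = np.random.default_rng(7)
y_hist = rng.normal(1.0, 1.0, size=50)          # historical data
y_cur = rng.normal(0.0, 1.0, size=20)           # current data

m0, s0 = power_prior_posterior(y_cur, y_hist, a0=0.0)  # ignore history
m1, s1 = power_prior_posterior(y_cur, y_hist, a0=1.0)  # pool fully
print(m0, m1)   # m1 is pulled toward the historical mean
```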


Building A Better Risk Prevention Model, Steven Hornyak 2018 Houston County Schools


National Youth-At-Risk Conference Savannah

This presentation chronicles the work of Houston County Schools in developing a risk prevention model built on more than ten years of longitudinal student data. In its second year of implementation, Houston At-Risk Profiles (HARP) has proven effective in identifying the students most in need of support and linking them to interventions and supports that lead to improved outcomes and significantly reduce the risk of failure.


Detection Of Porcine Reproductive And Respiratory Syndrome Virus (PRRSV)-Specific IgM-IgA In Oral Fluid Samples Reveals PRRSV Infection In The Presence Of Maternal Antibody, Marisa L. Rotolo, Luis Giménez-Lirola, Ju Ji, Ronaldo Magtoto, Yuly A. Henao-Díaz, Chong Wang, David H. Baum, Karen M. Harmon, Rodger G. Main, Jeffrey J. Zimmerman 2018 Iowa State University


Statistics Publications

The ontogeny of PRRSV antibody in oral fluids has been described using isotype-specific ELISAs. Mirroring the serum response, IgM appears in oral fluid by 7 days post inoculation (DPI), IgA after 7 DPI, and IgG by 9 to 10 DPI. Commercial PRRSV ELISAs target the detection of IgG because the higher concentration of IgG relative to other isotypes provides the best diagnostic discrimination. Oral fluids are increasingly used for PRRSV surveillance in commercial herds, but in younger pigs, a positive ELISA result may be due either to maternal antibody or to antibody produced by the pigs in response to infection ...


Optimized Adaptive Enrichment Designs For Multi-Arm Trials: Learning Which Subpopulations Benefit From Different Treatments, Jon Arni Steingrimsson, Joshua Betz, Tiachen Qian, Michael Rosenblum 2018 Department of Biostatistics, Brown School of Public Health


Johns Hopkins University, Dept. of Biostatistics Working Papers

We consider the problem of designing a randomized trial for comparing two treatments versus a common control in two disjoint subpopulations. The subpopulations could be defined in terms of a biomarker or disease severity measured at baseline. The goal is to determine which treatments benefit which subpopulations. We develop a new class of adaptive enrichment designs tailored to solving this problem. Adaptive enrichment designs involve a preplanned rule for modifying enrollment based on accruing data in an ongoing trial. The proposed designs have preplanned rules for stopping accrual of treatment by subpopulation combinations, either for efficacy or futility. The motivation ...


Phase II Adaptive Enrichment Design To Determine The Population To Enroll In Phase III Trials, By Selecting Thresholds For Baseline Disease Severity, Yu Du, Gary L. Rosner, Michael Rosenblum 2018 Johns Hopkins Bloomberg School of Public Health, Department of Biostatistics


Johns Hopkins University, Dept. of Biostatistics Working Papers

We propose and evaluate a two-stage, phase 2, adaptive clinical trial design. Its goal is to determine whether future phase 3 (confirmatory) trials should be conducted, and if so, which population should be enrolled. The population selected for phase 3 enrollment is defined in terms of a disease severity score measured at baseline. We optimize the phase 2 trial design and analysis in a decision theory framework. Our utility function represents a combination of the cost of conducting phase 3 trials and, if the phase 3 trials are successful, the improved health of the future population minus the cost of ...


Comparing Various Machine Learning Statistical Methods Using Variable Differentials To Predict College Basketball, Nicholas Bennett 2018 The University of Akron


Honors Research Projects

The purpose of this Senior Honors Project is to research, study, and demonstrate newfound knowledge of various machine learning statistical techniques that are not covered in the University of Akron’s statistics major curriculum. This report provides an overview of three machine-learning methods that were used to predict NCAA Basketball results, specifically the March Madness tournament. The variables used for these methods, models, and tests include numerous variables tracked throughout the season for each team, along with a couple of variables that are used by the selection committee when tournament teams are being picked. The end goal is to ...


Digital Commons powered by bepress