A Comparison Of R, SAS, And Python Implementations Of Random Forests, 2018 Utah State University
A Comparison Of R, SAS, And Python Implementations Of Random Forests, Breckell Soifua
All Graduate Plan B and other Reports
The Random Forest method is a useful machine learning tool developed by Leo Breiman. There are many existing implementations across different programming languages; the most popular of which exist in R, SAS, and Python. In this paper, we conduct a comprehensive comparison of these implementations with regard to accuracy, variable importance measurements, and timing. This comparison was done on a variety of real and simulated data with different classification difficulty levels, numbers of predictors, and sample sizes. The comparison shows unexpectedly different results between the three implementations.
Improving Shewhart Control Chart Performance In The Presence Of Measurement Error Using Multiple Measurements And Two-Stage Sampling, 2018 Auburn University Montgomery
Improving Shewhart Control Chart Performance In The Presence Of Measurement Error Using Multiple Measurements And Two-Stage Sampling, Kenneth W. Linna
Journal of International & Interdisciplinary Business Research
The usual Shewhart control chart efficiently detects large shifts in the mean of a quality characteristic and has been extensively studied in the literature. Most proposed alternatives to the Shewhart chart aim either to improve the signal performance for smaller mean shifts or to reduce the sampling effort required to detect a larger shift. Measurement error has been shown in the literature to result in reduced power to detect process shifts. The combination of multiple measurements and two-stage sampling is considered here as a strategy for both regaining power lost due to measurement error and specifically tuning the charts for shifts ...
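The mechanism this abstract describes lends itself to a quick numerical sketch: measurement error inflates the standard deviation of the plotted subgroup mean, lowering the chance that a given mean shift breaches the 3-sigma limits, while averaging m repeat measurements per unit shrinks the error term back. The shift size, sigma values, and subgroup size below are illustrative assumptions, not figures from the paper.

```python
import math

def detection_power(shift, sigma_p, sigma_m, n, m):
    """One-sided probability that the first post-shift subgroup mean
    exceeds the upper 3-sigma control limit.

    shift   -- process mean shift, in units of sigma_p
    sigma_p -- process standard deviation
    sigma_m -- measurement-error standard deviation
    n       -- subgroup size
    m       -- repeat measurements per unit (averaged before charting)
    Limits are set at 3 standard deviations of the plotted statistic,
    as they would be when estimated from in-control data that already
    contain measurement error.
    """
    total_sd = math.sqrt((sigma_p ** 2 + sigma_m ** 2 / m) / n)
    z = (shift * sigma_p - 3 * total_sd) / total_sd
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))  # standard normal CDF

p_clean = detection_power(shift=2.0, sigma_p=1.0, sigma_m=0.0, n=5, m=1)
p_noisy = detection_power(shift=2.0, sigma_p=1.0, sigma_m=1.0, n=5, m=1)
p_multi = detection_power(shift=2.0, sigma_p=1.0, sigma_m=1.0, n=5, m=4)
# Measurement error costs power; averaging four measurements recovers much of it.
print(p_noisy < p_multi < p_clean)
```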
Inversion Copulas From Nonlinear State Space Models With An Application To Inflation Forecasting, 2018 Melbourne Business School
Inversion Copulas From Nonlinear State Space Models With An Application To Inflation Forecasting, Michael S. Smith, Worapree Ole Maneesoonthorn
Michael Stanley Smith
Discrete Ranked Set Sampling, 2018 Southern Methodist University
Discrete Ranked Set Sampling, Heng Cui
Statistical Science Theses and Dissertations
Ranked set sampling (RSS) is an efficient data collection framework compared to simple random sampling (SRS). It is widely used in various application areas such as agriculture, environment, sociology, and medicine, especially in situations where measurement is expensive but ranking is less costly. Most past research in RSS has focused on situations where the underlying distribution is continuous. However, it is not unusual to have a discrete data generation mechanism. Estimating statistical functionals is challenging, as ties may truly exist in discrete RSS. In this thesis, we started with estimating the cumulative distribution function (CDF) in discrete RSS. We proposed two ...
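The efficiency claim behind RSS is easy to demonstrate in a simulation sketch, under the idealized assumption of perfect (error-free) ranking; the set size, replication count, and standard normal population below are illustrative choices, not from the thesis.

```python
import random
import statistics

random.seed(1)

def srs_mean(k):
    """Mean of a simple random sample of k measured units."""
    return statistics.mean(random.gauss(0, 1) for _ in range(k))

def rss_mean(k):
    """One RSS cycle of set size k: draw k sets of k units, rank each set
    (here by the true value, i.e., perfect ranking), and measure only the
    i-th ranked unit from the i-th set -- k measurements total."""
    measured = []
    for i in range(k):
        ranked_set = sorted(random.gauss(0, 1) for _ in range(k))
        measured.append(ranked_set[i])
    return statistics.mean(measured)

k, reps = 5, 3000
srs_var = statistics.variance([srs_mean(k) for _ in range(reps)])
rss_var = statistics.variance([rss_mean(k) for _ in range(reps)])
# With the same number of measured units, the RSS mean is less variable.
print(rss_var < srs_var)
```

Both estimators use exactly k measured units per replicate, so the variance gap reflects the information carried by the ranking step, which is the "expensive measurement, cheap ranking" trade-off the abstract describes.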
The Perfect Professor, 2018 Bryant University
The Perfect Professor, Samuel Legere
Honors Projects in Mathematics
The purpose of this project is to identify the similarities and differences between student and faculty perspectives of "The Perfect Professor". Student preferences for professor characteristics may vary, but what is it about the students that causes them to think or feel this way? These predictors may be something such as a student's major, age, gender, race, or even whether or not they are an athlete. I have conducted a survey for both students and faculty to identify these correlating qualities and gather data in hopes of taking the first steps that data analysts may then use to match ...
Evaluation Of Using The Bootstrap Procedure To Estimate The Population Variance, 2018 Stephen F. Austin State University
Evaluation Of Using The Bootstrap Procedure To Estimate The Population Variance, Nghia Trong Nguyen
Electronic Theses and Dissertations
The bootstrap procedure is widely used in nonparametric statistics to generate an empirical sampling distribution from a given sample data set for a statistic of interest. Generally, the results are good for location parameters such as population mean, median, and even for estimating a population correlation. However, the results for a population variance, which is a spread parameter, are not as good due to the resampling nature of the bootstrap method. Bootstrap samples are constructed using sampling with replacement; consequently, groups of observations with zero variance manifest in these samples. As a result, a bootstrap variance estimator will carry a ...
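The downward bias the abstract points to can be reproduced in a few lines: because bootstrap resamples draw with replacement and repeat observations, the average resample variance lands below the unbiased sample variance. The sample size, number of resamples, and seed below are illustrative.

```python
import random
import statistics

random.seed(42)
n = 30
sample = [random.gauss(0, 1) for _ in range(n)]  # one observed sample

# Naive bootstrap estimate of the variance: resample with replacement,
# take the plug-in (divide-by-n) variance of each resample, and average.
B = 2000
boot_vars = []
for _ in range(B):
    resample = [random.choice(sample) for _ in range(n)]
    boot_vars.append(statistics.pvariance(resample))
boot_est = sum(boot_vars) / B

sample_var = statistics.variance(sample)  # unbiased (n - 1 divisor) estimator

# Repeated observations in resamples shrink the spread, so the bootstrap
# average sits systematically below the unbiased sample variance
# (in expectation, by a factor of roughly ((n - 1) / n) ** 2).
print(boot_est < sample_var)
```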
Analysis Challenges For High Dimensional Data, 2018 The University of Western Ontario
Analysis Challenges For High Dimensional Data, Bangxin Zhao
Electronic Thesis and Dissertation Repository
In this thesis, we propose new methodologies targeting the areas of high-dimensional variable screening, influence measures, and post-selection inference. We propose a new estimator of the correlation between the response and high-dimensional predictor variables, and based on this estimator we develop a new screening technique, termed Dynamic Tilted Current Correlation Screening (DTCCS), for high-dimensional variable screening. DTCCS is capable of picking up the relevant predictor variables within a finite number of steps. The DTCCS method includes the popular sure independence screening (SIS) method and the high-dimensional ordinary least squares projection (HOLP) approach as special cases.
Two methods ...
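For context on the special case named in the abstract, plain sure independence screening (SIS) simply ranks predictors by absolute marginal correlation with the response and keeps the top few. The sketch below implements that baseline on toy data; it is not the thesis's DTCCS refinement, and the data dimensions and coefficients are illustrative.

```python
import random

def sis_screen(X, y, d):
    """Return indices of the d predictors with the largest absolute
    marginal correlation with y (a plain SIS sketch, not DTCCS)."""
    n = len(y)
    ym = sum(y) / n

    def abs_corr(j):
        col = [row[j] for row in X]
        xm = sum(col) / n
        sxy = sum((a - xm) * (b - ym) for a, b in zip(col, y))
        sxx = sum((a - xm) ** 2 for a in col)
        syy = sum((b - ym) ** 2 for b in y)
        return abs(sxy) / (sxx * syy) ** 0.5 if sxx > 0 and syy > 0 else 0.0

    p = len(X[0])
    return sorted(range(p), key=abs_corr, reverse=True)[:d]

# Toy data: only predictors 0 and 1 drive the response among p = 50.
rng = random.Random(7)
n, p = 200, 50
X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y = [3 * row[0] - 2 * row[1] + rng.gauss(0, 1) for row in X]
kept = sis_screen(X, y, d=5)
print(0 in kept and 1 in kept)  # the two signal variables survive screening
```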
Initial Evidence Of Construct Validity Of Data From A Self-Assessment Instrument Of Technological Pedagogical Content Knowledge (TPACK) In 2-Year Public College Faculty In Texas, 2018 University of Texas at Tyler
Initial Evidence Of Construct Validity Of Data From A Self-Assessment Instrument Of Technological Pedagogical Content Knowledge (TPACK) In 2-Year Public College Faculty In Texas, Kristin C. Scott
Human Resource Development Theses and Dissertations
Technological pedagogical content knowledge (TPACK) has been studied in K-12 faculty in the U.S. and around the world using survey methodology. Very few studies of TPACK in post-secondary faculty have been conducted, and no peer-reviewed studies in U.S. post-secondary faculty have been published to date. The present study is the first examination of the reliability and validity of data from a TPACK survey conducted with a large sample of U.S. post-secondary faculty. The professorate of 2-year public college faculty in Texas will help their institutions meet the goals of the state’s higher education strategic plan, 60x30TX. In ...
Using Random Forests To Describe Equity In Higher Education: A Critical Quantitative Analysis Of Utah's Postsecondary Pipelines, Tyler McDaniel
Butler Journal of Undergraduate Research
The following work examines the Random Forest (RF) algorithm as a tool for predicting student outcomes and interrogating the equity of postsecondary education pipelines. The RF model, created using longitudinal data of 41,303 students from Utah's 2008 high school graduation cohort, is compared to logistic and linear models, which are commonly used to predict college access and success. Substantively, this work finds High School GPA to be the best predictor of postsecondary GPA, whereas commonly used ACT and AP test scores are not nearly as important. Each model identified several demographic disparities in higher education access, most significantly ...
The Devil You Don’t Know: A Spatial Analysis Of Crime At Newark’s Prudential Center On Hockey Game Days, 2018 Institute for Security and Crime Science - University of Waikato
The Devil You Don’t Know: A Spatial Analysis Of Crime At Newark’s Prudential Center On Hockey Game Days, Justin Kurland, Eric Piza
Journal of Sport Safety and Security
Inspired by empirical research on spatial crime patterns in and around sports venues in the United Kingdom, this paper sought to measure the criminogenic extent of 216 hockey games that took place at the Prudential Center in Newark, NJ between 2007 and 2016. Do games generate patterns of crime in the areas beyond the arena, and if so, for what type of crime and how far? Police-recorded data for Newark are examined using a variety of exploratory methods and non-parametric permutation tests to visualize differences in crime patterns between game and non-game days across all of Newark and the downtown area. Change ...
The Influence Of A Proposed Margin Criterion On The Accuracy Of Parallel Analysis In Conditions Engendering Underextraction, 2018 Western Kentucky University
The Influence Of A Proposed Margin Criterion On The Accuracy Of Parallel Analysis In Conditions Engendering Underextraction, Justin M. Jones
Masters Theses & Specialist Projects
One of the most important decisions to make when performing an exploratory factor or principal component analysis regards the number of factors to retain. Parallel analysis is considered to be the best course of action in these circumstances as it consistently outperforms other factor extraction methods (Zwick & Velicer, 1986). Even so, parallel analysis could benefit from further research and refinement to improve its accuracy. Characteristics such as factor loadings, correlations between factors, and number of variables per factor all have been shown to adversely impact the effectiveness of parallel analysis as a means of identifying the number of factors (Pearson ...
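Parallel analysis, the retention method this thesis refines, compares the observed eigenvalues of a correlation matrix against eigenvalues obtained from random data of the same shape. The sketch below shows the basic Horn procedure on simulated one-factor data; the sample size, loadings, and mean-eigenvalue criterion are illustrative assumptions, not the thesis's proposed margin criterion.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 300, 6

# Toy data with a single common factor (loadings of 0.7 are illustrative).
factor = rng.normal(size=(n, 1))
X = factor @ np.full((1, p), 0.7) + rng.normal(size=(n, p))

def sorted_eigs(data):
    """Eigenvalues of the correlation matrix, largest first."""
    return np.sort(np.linalg.eigvalsh(np.corrcoef(data, rowvar=False)))[::-1]

observed = sorted_eigs(X)

# Horn's parallel analysis: average the sorted eigenvalues over many
# random-normal datasets of the same shape, then retain only factors
# whose observed eigenvalue exceeds the corresponding random average.
reps = 200
random_eigs = np.array([sorted_eigs(rng.normal(size=(n, p))) for _ in range(reps)])
threshold = random_eigs.mean(axis=0)

n_retained = int(np.sum(observed > threshold))
print(n_retained)  # the single simulated factor is the only one retained
```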
Developing Statistical Methods For Data From Platforms Measuring Gene Expression, 2018 Southern Methodist University
Developing Statistical Methods For Data From Platforms Measuring Gene Expression, Gaoxiang Jia
Statistical Science Theses and Dissertations
This research contains two topics: (1) PBNPA: a permutation-based non-parametric analysis of CRISPR screen data; (2) RCRnorm: an integrated system of random-coefficient hierarchical regression models for normalizing NanoString nCounter data from FFPE samples.
Clustered regularly-interspaced short palindromic repeats (CRISPR) screens are usually implemented in cultured cells to identify genes with critical functions. Although several methods have been developed or adapted to analyze CRISPR screening data, no single specific algorithm has gained popularity. Thus, rigorous procedures are needed to overcome the shortcomings of existing algorithms. We developed a Permutation-Based Non-Parametric Analysis (PBNPA) algorithm, which computes p-values at the gene level ...
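The permutation idea underlying methods like PBNPA can be illustrated with a generic two-sample permutation test: shuffle the group labels many times and ask how often a shuffled test statistic is at least as extreme as the observed one. This is a minimal sketch of the general technique, not the authors' gene-level algorithm, and the example readouts are made-up numbers.

```python
import random

def permutation_pvalue(treated, control, reps=10000, seed=0):
    """Two-sided permutation p-value for the difference in means."""
    rng = random.Random(seed)
    observed = sum(treated) / len(treated) - sum(control) / len(control)
    pooled = treated + control
    n = len(treated)
    count = 0
    for _ in range(reps):
        rng.shuffle(pooled)  # random reassignment of group labels
        diff = sum(pooled[:n]) / n - sum(pooled[n:]) / (len(pooled) - n)
        if abs(diff) >= abs(observed):
            count += 1
    return (count + 1) / (reps + 1)  # add-one correction keeps p > 0

# Hypothetical normalized readouts for one gene's guides in two conditions.
p = permutation_pvalue([5.1, 4.8, 5.5, 5.2], [3.9, 4.1, 3.7, 4.0])
print(p < 0.05)  # a clear separation yields a small p-value
```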
Robust Estimation Of The Average Treatment Effect In Alzheimer's Disease Clinical Trials, 2018 Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health
Robust Estimation Of The Average Treatment Effect In Alzheimer's Disease Clinical Trials, Michael Rosenblum, Aidan McDermont, Elizabeth Colantuoni
Johns Hopkins University, Dept. of Biostatistics Working Papers
The primary analysis of Alzheimer's disease clinical trials often involves a mixed-model repeated measures (MMRM) approach. We consider another estimator of the average treatment effect, called targeted minimum loss-based estimation (TMLE). This estimator is more robust to violations of assumptions about missing data than MMRM.
We compare TMLE versus MMRM by analyzing data from a completed Alzheimer's disease trial and by simulation studies. The simulations involved different missing-data distributions, where loss to follow-up at a given visit could depend on baseline variables, treatment assignment, and the outcome measured at previous visits. The TMLE generally ...
Multivariate Spectral Analysis Of Crism Data To Characterize The Composition Of Mawrth Vallis, 2018 Wesleyan University
Multivariate Spectral Analysis Of Crism Data To Characterize The Composition Of Mawrth Vallis, Melissa Luna
No abstract provided.
Incorporating Historical Models With Adaptive Bayesian Updates, 2018 The University Of Michigan
Incorporating Historical Models With Adaptive Bayesian Updates, Philip S. Boonstra, Ryan P. Barbaro
The University of Michigan Department of Biostatistics Working Paper Series
This paper considers Bayesian approaches for incorporating information from a historical model into a current analysis when the historical model includes only a subset of covariates currently of interest. The statistical challenge is two-fold. First, the parameters in the nested historical model are not generally equal to their counterparts in the larger current model, neither in value nor interpretation. Second, because the historical information will not be equally informative for all parameters in the current analysis, additional regularization may be required beyond that provided by the historical information. We propose several novel extensions of the so-called power prior that adaptively ...
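The power prior the abstract builds on raises the historical likelihood to a discount exponent a in [0, 1] before combining it with the current data. The conjugate sketch below shows that basic mechanism for a normal mean with known unit variance; it is not the authors' adaptive extension, and all data values are illustrative.

```python
def power_prior_posterior(y_current, y_hist, a, prior_mean=0.0, prior_prec=1e-3):
    """Posterior mean and variance for a normal mean (known variance 1)
    under a power prior: posterior precision gains a * n_hist from the
    downweighted historical likelihood."""
    prec = prior_prec + len(y_current) + a * len(y_hist)
    mean = (prior_prec * prior_mean + sum(y_current) + a * sum(y_hist)) / prec
    return mean, 1.0 / prec

y_cur = [1.2, 0.8, 1.1, 0.9]       # current data, mean 1.0
y_old = [2.1, 1.9, 2.0, 2.2, 1.8]  # historical data, mean 2.0

m_ignore, _ = power_prior_posterior(y_cur, y_old, a=0.0)  # history discarded
m_half, _ = power_prior_posterior(y_cur, y_old, a=0.5)    # partial borrowing
m_pool, _ = power_prior_posterior(y_cur, y_old, a=1.0)    # full pooling
# Increasing a pulls the posterior mean from the current-data estimate
# toward the historical estimate.
print(m_ignore < m_half < m_pool)
```

Choosing a adaptively, rather than fixing it, is exactly the regularization question the paper addresses: historical information should not be equally informative for every parameter.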
Building A Better Risk Prevention Model, 2018 Houston County Schools
Building A Better Risk Prevention Model, Steven Hornyak
National Youth-At-Risk Conference Savannah
This presentation chronicles the work of Houston County Schools in developing a risk prevention model built on more than ten years of longitudinal student data. In its second year of implementation, Houston At-Risk Profiles (HARP) has proven effective in identifying the students most in need of support and linking them to interventions and supports that lead to improved outcomes and significantly reduce the risk of failure.
Detection Of Porcine Reproductive And Respiratory Syndrome Virus (PRRSV)-Specific IgM-IgA In Oral Fluid Samples Reveals PRRSV Infection In The Presence Of Maternal Antibody, 2018 Iowa State University
Detection Of Porcine Reproductive And Respiratory Syndrome Virus (PRRSV)-Specific IgM-IgA In Oral Fluid Samples Reveals PRRSV Infection In The Presence Of Maternal Antibody, Marisa L. Rotolo, Luis Giménez-Lirola, Ju Ji, Ronaldo Magtoto, Yuly A. Henao-Díaz, Chong Wang, David H. Baum, Karen M. Harmon, Rodger G. Main, Jeffrey J. Zimmerman
The ontogeny of PRRSV antibody in oral fluids has been described using isotype-specific ELISAs. Mirroring the serum response, IgM appears in oral fluid by 7 days post inoculation (DPI), IgA after 7 DPI, and IgG by 9 to 10 DPI. Commercial PRRSV ELISAs target the detection of IgG because the higher concentration of IgG relative to other isotypes provides the best diagnostic discrimination. Oral fluids are increasingly used for PRRSV surveillance in commercial herds, but in younger pigs, a positive ELISA result may be due either to maternal antibody or to antibody produced by the pigs in response to infection ...
Optimized Adaptive Enrichment Designs For Multi-Arm Trials: Learning Which Subpopulations Benefit From Different Treatments, 2018 Department of Biostatistics, Brown School of Public Health
Optimized Adaptive Enrichment Designs For Multi-Arm Trials: Learning Which Subpopulations Benefit From Different Treatments, Jon Arni Steingrimsson, Joshua Betz, Tiachen Qian, Michael Rosenblum
Johns Hopkins University, Dept. of Biostatistics Working Papers
We consider the problem of designing a randomized trial for comparing two treatments versus a common control in two disjoint subpopulations. The subpopulations could be defined in terms of a biomarker or disease severity measured at baseline. The goal is to determine which treatments benefit which subpopulations. We develop a new class of adaptive enrichment designs tailored to solving this problem. Adaptive enrichment designs involve a preplanned rule for modifying enrollment based on accruing data in an ongoing trial. The proposed designs have preplanned rules for stopping accrual of treatment by subpopulation combinations, either for efficacy or futility. The motivation ...
Phase II Adaptive Enrichment Design To Determine The Population To Enroll In Phase III Trials, By Selecting Thresholds For Baseline Disease Severity, 2018 Johns Hopkins Bloomberg School of Public Health, Department of Biostatistics
Phase II Adaptive Enrichment Design To Determine The Population To Enroll In Phase III Trials, By Selecting Thresholds For Baseline Disease Severity, Yu Du, Gary L. Rosner, Michael Rosenblum
Johns Hopkins University, Dept. of Biostatistics Working Papers
We propose and evaluate a two-stage, phase 2, adaptive clinical trial design. Its goal is to determine whether future phase 3 (confirmatory) trials should be conducted, and if so, which population should be enrolled. The population selected for phase 3 enrollment is defined in terms of a disease severity score measured at baseline. We optimize the phase 2 trial design and analysis in a decision theory framework. Our utility function represents a combination of the cost of conducting phase 3 trials and, if the phase 3 trials are successful, the improved health of the future population minus the cost of ...
Comparing Various Machine Learning Statistical Methods Using Variable Differentials To Predict College Basketball, 2018 The University of Akron
Comparing Various Machine Learning Statistical Methods Using Variable Differentials To Predict College Basketball, Nicholas Bennett
Honors Research Projects
The purpose of this Senior Honors Project is to research, study, and demonstrate newfound knowledge of various machine learning statistical techniques that are not covered in the University of Akron’s statistics major curriculum. This report will be an overview of three machine-learning methods that were used to predict NCAA Basketball results, specifically, the March Madness tournament. The variables used for these methods, models, and tests will include numerous variables kept throughout the season for each team, along with a couple variables that are used by the selection committee when tournament teams are being picked. The end goal is to ...