Open Access. Powered by Scholars. Published by Universities.®

Statistics and Probability

Model selection

Articles 1 - 30 of 41

Full-Text Articles in Physical Sciences and Mathematics

A Novel Correction For The Adjusted Box-Pierce Test, Sidy Danioko, Jianwei Zheng, Kyle Anderson, Alexander Barrett, Cyril S. Rakovski May 2022

Mathematics, Physics, and Computer Science Faculty Articles and Research

The classical Box-Pierce and Ljung-Box tests for autocorrelation of residuals suffer severe deviations from nominal type I error rates. Previous studies have attempted to address this issue by either revising existing tests or designing new techniques. The adjusted Box-Pierce test achieves the best results with respect to attaining type I error rates close to nominal values. This research paper proposes a further correction to the adjusted Box-Pierce test that possesses near-perfect type I error rates. The approach is based on an inflation of the rejection region for all sample sizes and lags, calculated via a linear model applied to simulated …
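The Ljung-Box statistic discussed in this abstract can be computed directly; below is a minimal NumPy/SciPy sketch, not the paper's corrected test. The sine test series and variable names are illustrative only.

```python
import numpy as np
from scipy import stats

def ljung_box(x, lags):
    """Ljung-Box Q statistic and chi-squared p-value for lags 1..lags."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    x = x - x.mean()
    denom = x @ x
    q = 0.0
    for k in range(1, lags + 1):
        r_k = (x[:-k] @ x[k:]) / denom      # sample autocorrelation at lag k
        q += r_k ** 2 / (n - k)
    q *= n * (n + 2)
    p_value = stats.chi2.sf(q, df=lags)     # small p => residuals look autocorrelated
    return q, p_value

# A strongly autocorrelated series should be rejected decisively.
y = np.sin(np.linspace(0, 20 * np.pi, 200))
q, p = ljung_box(y, lags=10)
```

The paper's correction then inflates the rejection region of this chi-squared reference distribution as a function of sample size and lag.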


Sparse Model Selection Using Information Complexity, Yaojin Sun May 2022

Doctoral Dissertations

This dissertation studies the application of information complexity to statistical model selection through three different projects. Specifically, we design statistical models that incorporate sparsity features to make the models more explanatory and computationally efficient.

In the first project, we propose a Sparse Bridge Regression model for variable selection when the number of variables is much greater than the number of observations and model misspecification may occur. The model is demonstrated to have excellent explanatory power in high-dimensional data analysis through numerical simulations and real-world data analysis.

The second project proposes a novel hybrid modeling method that utilizes a mixture …


Beta Mixture And Contaminated Model With Constraints And Application With Micro-Array Data, Ya Qi Jan 2022

Theses and Dissertations--Statistics

This dissertation research concentrates on the Contaminated Beta (CB) model and its application to micro-array data analysis. The Modified Likelihood Ratio Test (MLRT) introduced by [Chen et al., 2001] is used for testing the omnibus null hypothesis of no contamination of Beta(1,1) ([Dai and Charnigo, 2008]). We design constraints for the two-component CB model that push the mode toward the left end of the distribution, reflecting the abundance of small p-values in micro-array data, to increase the test power. A three-component CB model may be useful for distinguishing highly differentially expressed genes from moderately differentially expressed genes. If the null hypothesis above …


Contrasting Cumulative Risk And Multiple Individual Risk Models Of The Relationship Between Adverse Childhood Experiences (Aces) And Adult Health Outcomes, Marianna Lanoue, Brandon George, Deborah L Helitzer, Scott W Keith Sep 2020

College of Population Health Faculty Papers

BACKGROUND: A very large body of research documents relationships between self-reported Adverse Childhood Experiences (srACEs) and adult health outcomes. Despite multiple assessment tools that use the same or similar questions, there is a great deal of inconsistency in the operationalization of self-reported childhood adversity for use as a predictor variable. Alternative conceptual models are rarely used and very limited evidence directly contrasts conceptual models to each other. Also, while a cumulative numeric 'ACE Score' is normative, there are differences in the way it is calculated and used in statistical models. We investigated differences in model fit and performance between the …


Joint Models Of Longitudinal Outcomes And Informative Time, Jangdong Seo Jun 2020

Journal of Modern Applied Statistical Methods

Longitudinal data analyses commonly assume that time intervals are predetermined and carry no information about the outcomes. However, time intervals may be irregular, and the timing itself may be informative. Joint models are presented along with the asymptotic behavior of the parameter estimates, and the models are applied to real data sets.


Forecasting Daily Stock Market Return With Multiple Linear Regression, Shengxuan Chen May 2020

Mathematics Senior Capstone Papers

The purpose of this project is to use data mining and big data analytic techniques to forecast daily stock market returns with multiple linear regression. Using mathematical and statistical models to analyze the stock market is important and challenging. The accuracy of the final results relies on the quality of the input data and the validity of the methodology. In this report, eleven financial and economic features are observed and recorded on each trading day over a 5-year period. After preprocessing the raw data with statistical methods, we use multiple linear regression to predict the daily return …
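The workflow described, several features recorded per trading day, a multiple linear regression fit, and a one-step-ahead return forecast, can be sketched with ordinary least squares. Everything below, including the eleven synthetic features and coefficients, is illustrative and not the paper's data.

```python
import numpy as np

rng = np.random.default_rng(0)
n_days, n_features = 250, 11                        # roughly one trading year, 11 features
X = rng.normal(size=(n_days, n_features))           # stand-ins for the financial features
true_beta = rng.normal(size=n_features)
y = X @ true_beta + 0.01 * rng.normal(size=n_days)  # synthetic daily returns

# Fit OLS with an intercept column.
A = np.column_stack([np.ones(n_days), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

# One-step-ahead forecast for a new day's feature vector.
x_new = rng.normal(size=n_features)
forecast = coef[0] + x_new @ coef[1:]
```

In practice the forecast quality depends entirely on how predictive the features are; the synthetic setup above is deliberately easy.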


Serial Testing For Detection Of Multilocus Genetic Interactions, Zaid T. Al-Khaledi Jan 2019

Theses and Dissertations--Statistics

A method to detect relationships between disease susceptibility and multilocus genetic interactions is the Multifactor-Dimensionality Reduction (MDR) technique pioneered by Ritchie et al. (2001). Since its introduction, many extensions have been pursued to deal with non-binary outcomes and/or account for multiple interactions simultaneously. Studying the effects of multilocus genetic interactions on continuous traits (blood pressure, weight, etc.) is one case that MDR does not handle. Culverhouse et al. (2004) and Gui et al. (2013) proposed two different methods to analyze such a case. In their research, Gui et al. (2013) introduced the Quantitative Multifactor-Dimensionality Reduction (QMDR) that uses the overall …


Examining The Confirmatory Tetrad Analysis (Cta) As A Solution Of The Inadequacy Of Traditional Structural Equation Modeling (Sem) Fit Indices, Hangcheng Liu Jan 2018

Theses and Dissertations

Structural Equation Modeling (SEM) is a framework of statistical methods that allows us to represent complex relationships between variables. SEM is widely used in economics, genetics and the behavioral sciences (e.g. psychology, psychobiology, sociology and medicine). Model complexity is defined as a model's ability to fit different data patterns, and it plays an important role in model selection when applying SEM. As in linear regression, the number of free model parameters is typically used in traditional SEM model fit indices as a measure of model complexity. However, using only the number of free model parameters to indicate SEM model complexity …


Information Metrics For Predictive Modeling And Machine Learning, Kostantinos Gourgoulias Jul 2017

Doctoral Dissertations

The ever-increasing complexity of the models used in predictive modeling and data science and their use for prediction and inference has made the development of tools for uncertainty quantification and model selection especially important. In this work, we seek to understand the various trade-offs associated with the simulation of stochastic systems. Some trade-offs are computational, e.g., execution time of an algorithm versus accuracy of simulation. Others are analytical: whether or not we are able to find tractable substitutes for quantities of interest, e.g., distributions, ergodic averages, etc. The first two chapters of this thesis deal with the study of the …


Approximate Statistical Solutions To The Forensic Identification Of Source Problem, Danica M. Ommen Jan 2017

Electronic Theses and Dissertations

Currently in forensic science, the statistical methods for solving the identification of source problems are inherently subjective and generally ad-hoc. The formal Bayesian decision framework provides the most statistically rigorous foundation for these problems to date. However, computing a solution under this framework, which relies on a Bayes Factor, tends to be computationally intensive and highly sensitive to the subjective choice of prior distributions for the parameters. Therefore, this dissertation aims to develop statistical solutions to the forensic identification of source problems which are less subjective, but which retain the statistical rigor of the Bayesian solution. First, this dissertation focuses …


Inference Using Bhattacharyya Distance To Model Interaction Effects When The Number Of Predictors Far Exceeds The Sample Size, Sarah A. Janse Jan 2017

Theses and Dissertations--Statistics

In recent years, statistical analyses, algorithms, and modeling of big data have been constrained due to computational complexity. Further, the added complexity of relationships among response and explanatory variables, such as higher-order interaction effects, make identifying predictors using standard statistical techniques difficult. These difficulties are only exacerbated in the case of small sample sizes in some studies. Recent analyses have targeted the identification of interaction effects in big data, but the development of methods to identify higher-order interaction effects has been limited by computational concerns. One recently studied method is the Feasible Solutions Algorithm (FSA), a fast, flexible method that …


What Affects Parents’ Choice Of Milk? An Application Of Bayesian Model Averaging, Yingzhe Cheng Dec 2016

Mathematics & Statistics ETDs

This study identifies the factors that influence parents’ choice of milk for their children, using data from a unique survey administered in 2013 in Hunan province, China. In this survey, we identified two brands of milk, which differ in their prices and safety claims by the producer. Data were collected on parents’ choice of milk between the two brands, demographics, attitude towards food safety and behaviors related to food. Stepwise model selection and Bayesian model averaging (BMA) are used to search for influential factors. The two approaches consistently select the same factors suggested by an economic theoretical model, including price …
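The Bayesian model averaging side of the comparison is often implemented via the standard BIC approximation to posterior model probabilities. The sketch below assumes a small synthetic design (four candidate predictors, two truly influential); it is not the survey data or the authors' exact implementation.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
n, p_feat = 300, 4
X = rng.normal(size=(n, p_feat))
y = 1.5 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(size=n)   # only features 0 and 1 matter

def bic_of(subset):
    """BIC of the OLS model using the given feature subset (plus intercept)."""
    A = np.column_stack([np.ones(n)] + [X[:, j] for j in subset])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    rss = np.sum((y - A @ beta) ** 2)
    return n * np.log(rss / n) + A.shape[1] * np.log(n)

# Enumerate all 2^4 feature subsets and weight them by exp(-BIC/2).
models = [s for r in range(p_feat + 1) for s in combinations(range(p_feat), r)]
bics = np.array([bic_of(s) for s in models])
w = np.exp(-(bics - bics.min()) / 2)
w /= w.sum()                                   # approximate posterior model probabilities

# Posterior inclusion probability of each feature across all models.
incl = np.array([sum(w[i] for i, s in enumerate(models) if j in s)
                 for j in range(p_feat)])
```

Unlike a single stepwise pick, the inclusion probabilities express how strongly the data support each factor across the whole model space.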


Selecting Spatial Scale Of Area-Level Covariates In Regression Models, Lauren Grant Jan 2016

Theses and Dissertations

Studies have found that the level of association between an area-level covariate and an outcome can vary depending on the spatial scale (SS) of a particular covariate. However, covariates used in regression models are customarily modeled at the same spatial unit. In this dissertation, we developed four SS model selection algorithms that select the best spatial scale for each area-level covariate. The SS forward stepwise, SS incremental forward stagewise, SS least angle regression (LARS), and SS lasso algorithms allow for the selection of different area-level covariates at different spatial scales, while constraining each covariate to enter at most one spatial …


Variable Selection In Single Index Varying Coefficient Models With Lasso, Peng Wang Nov 2015

Doctoral Dissertations

The single index varying coefficient model is a very attractive statistical model due to its ability to reduce dimension and its ease of interpretation. There are many theoretical studies and practical applications of it, but typically without variable selection features, and no public software is available for fitting it. Here we propose a new algorithm to fit the single index varying coefficient model and to carry out variable selection in the index part with LASSO. The core idea is a two-step scheme that alternates between estimating coefficient functions and selecting-and-estimating the single index. Both in simulation and in an application to a geoscience dataset, we …


Model Selection For Gaussian Mixture Models For Uncertainty Qualification, Yiyi Chen, Guang Lin, Xuan Liu Aug 2015

The Summer Undergraduate Research Fellowship (SURF) Symposium

Clustering is the task of assigning objects into groups so that objects within a group are more similar to each other than to those in other groups. The Gaussian mixture model fitted with the Expectation-Maximization (EM) method is one of the most general ways to cluster large data sets. However, this method needs the number of Gaussian components (clusters) as input in order to approximate the original data set. Developing a method to automatically determine the number of component distributions would help apply this method in much larger contexts. In the original algorithm, there is a variable representing the weight of …
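Automatically determining the number of components, as the abstract proposes, is commonly done by fitting the mixture for several candidate counts and comparing an information criterion such as BIC. Below is a minimal one-dimensional EM sketch under that standard approach; the two-cluster data, quantile initialization, and variance floor are illustrative choices, not the authors' method.

```python
import numpy as np
from scipy.stats import norm

def fit_gmm_1d(x, k, n_iter=200):
    """EM for a 1-D Gaussian mixture; returns (weights, means, sds, loglik)."""
    n = len(x)
    w = np.full(k, 1.0 / k)
    mu = np.quantile(x, (np.arange(k) + 1) / (k + 1))   # spread initial means over the data
    sd = np.full(k, x.std())
    for _ in range(n_iter):
        dens = w * norm.pdf(x[:, None], mu, sd)          # (n, k) weighted densities
        resp = dens / dens.sum(axis=1, keepdims=True)    # E-step: responsibilities
        nk = resp.sum(axis=0)                            # M-step: update parameters
        w = nk / n
        mu = (resp * x[:, None]).sum(axis=0) / nk
        sd = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
        sd = np.maximum(sd, 0.05)                        # variance floor avoids singularities
    dens = w * norm.pdf(x[:, None], mu, sd)
    loglik = np.log(dens.sum(axis=1)).sum()
    return w, mu, sd, loglik

def bic(loglik, k, n):
    return (3 * k - 1) * np.log(n) - 2 * loglik          # a 1-D k-mixture has 3k-1 free parameters

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-3, 1, 300), rng.normal(3, 1, 300)])
scores = {k: bic(fit_gmm_1d(x, k)[3], k, len(x)) for k in (1, 2, 3)}
best_k = min(scores, key=scores.get)
```

For two well-separated components, the BIC curve bottoms out at k = 2: the third component buys too little likelihood to pay its parameter penalty.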


The Information Criterion, Masume Ghahramani Nov 2014

Journal of Modern Applied Statistical Methods

The Akaike information criterion, AIC, is widely used for model selection. The AIC is an asymptotically unbiased estimator of the second term of the Kullback-Leibler risk, which measures the divergence between the true model and candidate models. However, it is an inconsistent estimator. A proposed approach to this problem is the use of A'IC, a consistent information criterion. Model selection for classical and linear models is examined by Monte Carlo simulation.
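For a Gaussian linear model the AIC discussed here has the familiar closed form n·log(RSS/n) + 2k; a small sketch comparing nested models (the data-generating setup is illustrative):

```python
import numpy as np

def aic(rss, n, k):
    """AIC for a Gaussian linear model with k fitted coefficients (up to a constant)."""
    return n * np.log(rss / n) + 2 * k

rng = np.random.default_rng(0)
n = 100
X = rng.normal(size=(n, 3))
y = 2.0 * X[:, 0] + rng.normal(size=n)        # only the first predictor is real

scores = {}
for k in (1, 2, 3):                           # nested candidate models
    A = X[:, :k]
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    rss = float(np.sum((y - A @ beta) ** 2))
    scores[k] = aic(rss, n, k)
best = min(scores, key=scores.get)
```

Because the 2k penalty does not grow with n, AIC keeps a nonvanishing chance of picking an overfitted model even as n grows, which is exactly the inconsistency the abstract describes.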


Seasonal Decomposition For Geographical Time Series Using Nonparametric Regression, Hyukjun Gweon Apr 2013

Electronic Thesis and Dissertation Repository

A time series often contains various systematic effects such as trends and seasonality. These different components can be determined and separated by decomposition methods. In this thesis, we discuss a time series decomposition process using nonparametric regression. A method based on both loess and harmonic regression is suggested, and an optimal model selection method is discussed. We then compare the process with seasonal-trend decomposition by loess (STL; Cleveland, 1979). While STL works well when proper parameters are used, the method we introduce is also competitive: it makes parameter choice more automatic and less complex. The decomposition process often requires that …
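The harmonic-regression ingredient of such a decomposition can be sketched in a few lines: regress the series on a trend term plus Fourier terms at the seasonal period, and read off fitted trend and seasonal components. The monthly toy series below is illustrative; the thesis combines this with loess and a model selection step.

```python
import numpy as np

rng = np.random.default_rng(0)
n, period = 120, 12                            # ten years of monthly observations
t = np.arange(n)
y = 0.05 * t + 2.0 * np.sin(2 * np.pi * t / period) + 0.1 * rng.normal(size=n)

# Harmonic regression: intercept, linear trend, and one Fourier pair at the period.
X = np.column_stack([np.ones(n), t,
                     np.sin(2 * np.pi * t / period),
                     np.cos(2 * np.pi * t / period)])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

trend_hat = X[:, :2] @ coef[:2]                # fitted trend component
seasonal_hat = X[:, 2:] @ coef[2:]             # fitted seasonal component
resid = y - trend_hat - seasonal_hat           # remainder
```

Adding further Fourier pairs (harmonics) flexibly sharpens the seasonal shape; choosing how many to include is itself a model selection problem.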


Derivative Estimation With Local Polynomial Fitting, Kris De Brabanter, Jos De Brabanter, Bart De Moor, Irene Gijbels Jan 2013

Kris De Brabanter

We present a fully automated framework to estimate derivatives nonparametrically without estimating the regression function. Derivative estimation plays an important role in the exploration of structures in curves (jump detection and discontinuities), comparison of regression curves, analysis of human growth data, etc. Hence, the study of estimating derivatives is equally important as regression estimation itself. Via empirical derivatives we approximate the qth order derivative and create a new data set which can be smoothed by any nonparametric regression estimator. We derive L1 and L2 rates and establish consistency of the estimator. The new data sets created by this technique are …
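The empirical-derivative idea — form difference quotients from the raw data, then treat them as a new data set to be smoothed by any nonparametric regression estimator — can be sketched as follows. A simple moving average stands in for the generic smoother, and the sine test function is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 400
x = np.linspace(0, 2 * np.pi, n)
y = np.sin(x) + 0.01 * rng.normal(size=n)     # noisy observations of sin(x)

# Empirical first derivatives via symmetric difference quotients.
d_emp = (y[2:] - y[:-2]) / (x[2:] - x[:-2])
x_mid = x[1:-1]

# Treat (x_mid, d_emp) as a new data set and smooth it; a moving average
# stands in here for a generic nonparametric regression estimator.
w = 31
d_smooth = np.convolve(d_emp, np.ones(w) / w, mode="same")
```

Away from the boundaries, the smoothed empirical derivatives track the true derivative cos(x) closely, without ever estimating the regression function itself.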


A Systematic Selection Method For The Development Of Cancer Staging Systems, Yunzhi Lin, Richard Chappell, Mithat Gonen Jan 2012

Memorial Sloan-Kettering Cancer Center, Dept. of Epidemiology & Biostatistics Working Paper Series

The tumor-node-metastasis (TNM) staging system has been the anchor of cancer diagnosis, treatment, and prognosis for many years. For meaningful clinical use, an orderly, progressive condensation of the T and N categories into an overall staging system needs to be defined, usually with respect to a time-to-event outcome. This can be considered as a cutpoint selection problem for a censored response partitioned with respect to two ordered categorical covariates and their interaction. The aim is to select the best grouping of the TN categories. A novel bootstrap cutpoint/model selection method is proposed for this task by maximizing bootstrap estimates of …


Extracting Information From Functional Connectivity Maps Via Function-On-Scalar Regression, Philip T. Reiss, Maarten Mennes, Eva Petkova, Lei Huang, Matthew J. Hoptman, Bharat B. Biswal, Stanley J. Colcombe, Xi-Nian Zuo, Michael P. Milham Dec 2010

Extracting Information From Functional Connectivity Maps Via Function-On-Scalar Regression, Philip T. Reiss, Maarten Mennes, Eva Petkova, Lei Huang, Matthew J. Hoptman, Bharat B. Biswal, Stanley J. Colcombe, Xi-Nian Zuo, Michael P. Milham

Lei Huang

Functional connectivity of an individual human brain is often studied by acquiring a resting state functional magnetic resonance imaging scan, and mapping the correlation of each voxel's BOLD time series with that of a seed region. As large collections of such maps become available, including multisite data sets, there is an increasing need for ways to distill the information in these maps in a readily visualized form. Here we propose a two-step analytic strategy. First, we construct connectivity-distance profiles, which summarize the connectivity of each voxel in the brain as a function of distance from the seed, a functional relationship …


Model Selection With Information Criteria, Changjiang Xu Oct 2010

Electronic Thesis and Dissertation Repository

This thesis is on model selection using information criteria. The information criteria include the generalized information criterion and a family of Bayesian information criteria. The properties and improvement of these information criteria are investigated.

We analyze nonasymptotic and asymptotic properties of the information criteria for linear models, probabilistic models, and high-dimensional models, respectively. We derive the probability of selecting a model and compute it by Monte Carlo methods. We derive the conditions under which the criteria are consistent, underfitting, or overfitting.

We further propose new model selection procedures to improve the information criteria. The procedures combine the information criteria with …
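The probability that a criterion selects each candidate model, computed by Monte Carlo as the abstract describes, can be sketched as follows for nested linear models under AIC and BIC. The sample size, number of replications, and data-generating model are illustrative assumptions.

```python
import numpy as np

def ic_select(X, y, penalty):
    """Return the nested-model size minimizing n*log(RSS/n) + penalty*k."""
    n = len(y)
    best_k, best_score = None, np.inf
    for k in range(1, X.shape[1] + 1):
        A = X[:, :k]
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        rss = np.sum((y - A @ beta) ** 2)
        score = n * np.log(rss / n) + penalty * k
        if score < best_score:
            best_k, best_score = k, score
    return best_k

rng = np.random.default_rng(0)
n, reps, true_k = 200, 300, 1
aic_hits = bic_hits = 0
for _ in range(reps):
    X = rng.normal(size=(n, 3))
    y = X[:, 0] + rng.normal(size=n)          # true model uses only the first column
    aic_hits += ic_select(X, y, 2.0) == true_k        # AIC penalty
    bic_hits += ic_select(X, y, np.log(n)) == true_k  # BIC penalty
aic_rate, bic_rate = aic_hits / reps, bic_hits / reps
```

The estimated selection probabilities make the consistency contrast concrete: BIC's log(n) penalty recovers the true model more often than AIC's fixed penalty at this sample size.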


A Comparative Study Of Bayesian Model Selection Criteria For Capture-Recapture Models For Closed Populations, Ross M. Gosky, Sujit K. Ghosh May 2009

Journal of Modern Applied Statistical Methods

Capture-recapture models estimate unknown population sizes. Eight standard closed population models exist, allowing for time, behavioral, and heterogeneity effects. Bayesian versions of these models are presented, and the use of Akaike's Information Criterion (AIC) and the Deviance Information Criterion (DIC) as model selection tools is explored through simulation and real dataset analysis.


Practical Unit-Root Analysis Using Information Criteria: Simulation Evidence, Kosei Fukuda May 2007

Journal of Modern Applied Statistical Methods

An information-criterion-based model selection method for detecting a unit root is proposed. The simulation results suggest that the performance of the proposed method is usually comparable to, and sometimes better than, that of the conventional unit-root tests. The advantages of the proposed method in practical applications are also discussed.
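A minimal version of information-criterion-based unit-root detection compares a random-walk model (unit root imposed, no free AR coefficient) against a fitted stationary AR(1) by BIC. This single-lag setup and the synthetic series are a simplification for illustration, not the paper's procedure.

```python
import numpy as np

def bic_gauss(rss, n, k):
    return n * np.log(rss / n) + k * np.log(n)

def unit_root_by_ic(y):
    """Choose between a unit-root model and a stationary AR(1) by BIC."""
    y_lag, y_cur = y[:-1], y[1:]
    n = len(y_cur)
    rss_ur = np.sum((y_cur - y_lag) ** 2)        # y_t = y_{t-1} + e_t: no estimated coefficient
    phi = (y_lag @ y_cur) / (y_lag @ y_lag)      # least-squares AR(1) coefficient
    rss_ar = np.sum((y_cur - phi * y_lag) ** 2)
    return "unit root" if bic_gauss(rss_ur, n, 0) < bic_gauss(rss_ar, n, 1) else "stationary"

rng = np.random.default_rng(0)
rw = np.cumsum(rng.normal(size=400))             # random walk: true unit root
ar = np.zeros(400)
for t in range(1, 400):
    ar[t] = 0.5 * ar[t - 1] + rng.normal()       # stationary AR(1) with phi = 0.5
```

Model selection replaces the hypothesis test: whichever specification attains the lower BIC is retained, with no explicit significance level.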


Selecting The Best Linear Mixed Model Using Predictive Approaches, Jun Wang Jan 2007

Theses and Dissertations

The linear mixed model is widely implemented in the analysis of longitudinal data. Inference techniques and information criteria are available and well-studied for goodness-of-fit within the linear mixed model setting. Predictive approaches such as R-squared, PRESS, and CCC are available for the linear mixed model but require more research (Edward, 2005). This project used simulation to investigate the performance of R-squared, PRESS, CCC, the pseudo F-test, and information criteria for goodness-of-fit within the linear mixed model framework. Marginal and conditional approaches for these predictive statistics were studied under different variance-covariance structures. For the compound symmetry structure, the success rates for all 17 …


A Logistic Regression Analysis Of Utah Colleges Exit Poll Response Rates Using Sas Software, Clint W. Stevenson Oct 2006

Theses and Dissertations

In this study I examine voter response at an interview level using a dataset of 7562 voter contacts (including responses and nonresponses) in the 2004 Utah Colleges Exit Poll. In 2004, 4908 of the 7562 voters approached responded to the exit poll for an overall response rate of 65 percent. Logistic regression is used to estimate factors that contribute to a success or failure of each interview attempt. This logistic regression model uses interviewer characteristics, voter characteristics (both respondents and nonrespondents), and exogenous factors as independent variables. Voter characteristics such as race, gender, and age are strongly associated with response. …
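Estimating how voter characteristics relate to the probability of response, as described here, is a standard logistic regression. Below is a self-contained Newton-Raphson (IRLS) sketch on synthetic data; the covariates and coefficients are made-up stand-ins, not the exit poll data, and the study itself used SAS.

```python
import numpy as np

def fit_logistic(X, y, n_iter=25):
    """Logistic regression by Newton-Raphson; X must include an intercept column."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        grad = X.T @ (y - p)                        # score vector
        hess = X.T @ (X * (p * (1 - p))[:, None])   # observed information
        beta += np.linalg.solve(hess, grad)
    return beta

rng = np.random.default_rng(0)
n = 5000
age = rng.normal(size=n)                     # hypothetical standardized voter age
female = rng.integers(0, 2, size=n).astype(float)
X = np.column_stack([np.ones(n), age, female])
true_beta = np.array([0.6, 0.5, 0.3])        # baseline log-odds plus two effects
p = 1 / (1 + np.exp(-X @ true_beta))
y = (rng.random(n) < p).astype(float)        # 1 = responded, 0 = refused
beta_hat = fit_logistic(X, y)
```

The fitted coefficients are log-odds ratios: exp(beta_hat[2]) is the multiplicative change in response odds for the indicator covariate.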


Model-Selection-Based Monitoring Of Structural Change, Kosei Fukuda May 2005

Journal of Modern Applied Statistical Methods

Monitoring of structural change is performed not by hypothesis testing but by model selection using a modified Bayesian information criterion. It is found that, in terms of detection accuracy and detection speed, the proposed method shows better performance than the hypothesis-testing method. Two advantages of the proposed method are also discussed.


The False Discovery Rate: A Variable Selection Perspective, Debashis Ghosh, Wei Chen, Trivellore E. Raghunathan Jun 2004

The University of Michigan Department of Biostatistics Working Paper Series

In many scientific and medical settings, large-scale experiments are generating large quantities of data that lead to inferential problems involving multiple hypotheses. This has led to tremendous recent interest in statistical methods regarding the false discovery rate (FDR). Several authors have studied the properties involving FDR in a univariate mixture model setting. In this article, we turn the problem on its side and show that FDR is a by-product of a Bayesian analysis of the variable selection problem for a hierarchical linear regression model. This equivalence gives many Bayesian insights as to why FDR is a natural quantity to …


Multiple Testing Methods For Chip-Chip High Density Oligonucleotide Array Data, Sunduz Keles, Mark J. Van Der Laan, Sandrine Dudoit, Simon E. Cawley Jun 2004

U.C. Berkeley Division of Biostatistics Working Paper Series

Cawley et al. (2004) have recently mapped the locations of binding sites for three transcription factors along human chromosomes 21 and 22 using ChIP-Chip experiments. ChIP-Chip experiments are a new approach to the genome-wide identification of transcription factor binding sites and consist of chromatin (Ch) immunoprecipitation (IP) of transcription factor-bound genomic DNA followed by high density oligonucleotide hybridization (Chip) of the IP-enriched DNA. We investigate the ChIP-Chip data structure and propose methods for inferring the location of transcription factor binding sites from these data. The proposed methods involve testing for each probe whether it is part of a bound sequence …


Loss-Based Cross-Validated Deletion/Substitution/Addition Algorithms In Estimation, Sandra E. Sinisi, Mark J. Van Der Laan Mar 2004

U.C. Berkeley Division of Biostatistics Working Paper Series

In van der Laan and Dudoit (2003) we propose and theoretically study a unified loss function based statistical methodology, which provides a road map for estimation and performance assessment. Given a parameter of interest which can be described as the minimizer of the population mean of a loss function, the road map involves as important ingredients cross-validation for estimator selection and minimizing over subsets of basis functions the empirical risk of the subset-specific estimator of the parameter of interest, where the basis functions correspond to a parameterization of a specified subspace of the complete parameter space. In this article we …