A Robust Calibration-Assisted Method For Linear Mixed Effects Model Under Cluster-Specific Nonignorable Missingness, 2019 Seoul National University

#### A Robust Calibration-Assisted Method For Linear Mixed Effects Model Under Cluster-Specific Nonignorable Missingness, Yongchan Kwon, Jae Kwang Kim, Myunghee Cho Paik, Hongsoo Kim

*Jae Kwang Kim*

We propose a method for linear mixed effects models when the covariates are completely observed but the outcome of interest is missing under a cluster-specific nonignorable (CSNI) missingness mechanism. Our strategy is to replace missing quantities in the full-data objective function with unbiased predictors derived from inverse probability weighting and a calibration technique. The proposed approach can be applied to estimating equations or likelihood functions with a modified E-step, and does not require numerical integration as do previous methods. Unlike usual inverse probability weighting, the proposed method does not require correct specification of the response model as long as the CSNI assumption ...

Combining Non-Probability And Probability Survey Samples Through Mass Imputation, 2019 Iowa State University

#### Combining Non-Probability And Probability Survey Samples Through Mass Imputation, Jae Kwang Kim, Seho Park, Yilin Chen, Changbao Wu

*Jae Kwang Kim*

This paper presents theoretical results on combining non-probability and probability survey samples through mass imputation, an approach originally proposed by Rivers (2007) as sample matching without rigorous theoretical justification. Under suitable regularity conditions, we establish the consistency of the mass imputation estimator and derive its asymptotic variance formula. Variance estimators are developed using either linearization or bootstrap. The finite-sample performance of the mass imputation estimator is investigated through simulation studies and an application to a non-probability sample collected by the Pew Research Center.
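The mass imputation idea described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the linear working model, the weight-normalized (Hájek-type) mean, the function name `mass_imputation_mean`, and the toy data are all assumptions made for illustration. The outcome is observed only in the non-probability sample, a model fitted there is used to impute the outcome for every unit in the probability sample, and the design weights of the probability sample are then applied to the imputed values.

```python
import numpy as np

def mass_imputation_mean(x_np, y_np, x_p, w_p):
    """Mass imputation estimate of a population mean (illustrative sketch).

    x_np, y_np: covariate and outcome from the non-probability sample.
    x_p, w_p:   covariate and design weights from the probability sample,
                where the outcome is not observed.
    """
    x_np, y_np, x_p, w_p = (np.asarray(v, dtype=float) for v in (x_np, y_np, x_p, w_p))

    # Fit an assumed working model y = b0 + b1 * x on the non-probability sample.
    design_np = np.column_stack([np.ones(len(x_np)), x_np])
    beta, *_ = np.linalg.lstsq(design_np, y_np, rcond=None)

    # Impute the outcome for every unit in the probability sample.
    design_p = np.column_stack([np.ones(len(x_p)), x_p])
    y_imputed = design_p @ beta

    # Design-weighted mean of the imputed values (normalized by the weight sum).
    return float(np.sum(w_p * y_imputed) / np.sum(w_p))

# Toy data (hypothetical): the outcome follows y = 2 + 3x exactly.
x_np = [0.0, 1.0, 2.0, 3.0]
y_np = [2.0, 5.0, 8.0, 11.0]
x_p = [1.0, 3.0]
w_p = [10.0, 30.0]

estimate = mass_imputation_mean(x_np, y_np, x_p, w_p)
print(estimate)
```

In this toy case the imputed values are 5 and 11, so the weighted mean is (10·5 + 30·11) / 40 = 9.5. The sketch covers only the point estimate and leaves out the linearization and bootstrap variance estimators mentioned in the abstract.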

An Approximate Bayesian Approach To Regression Estimation With Many Auxiliary Variables, 2019 The University of Tokyo

#### An Approximate Bayesian Approach To Regression Estimation With Many Auxiliary Variables, Shonosuke Sugasawa, Jae Kwang Kim

*Jae Kwang Kim*

Model-assisted estimation with complex survey data is an important practical problem in survey sampling. When there are many auxiliary variables, selecting significant variables associated with the study variable would be necessary to achieve efficient estimation of population parameters of interest. In this paper, we formulate a regularized regression estimator in the framework of Bayesian inference using the penalty function as the shrinkage prior for model selection. The proposed Bayesian approach enables us to get not only efficient point estimates but also reasonable credible intervals for population means. Results from two limited simulation studies are presented to facilitate comparison with existing ...

Hypotheses Testing From Complex Survey Data Using Bootstrap Weights: A Unified Approach, 2019 Iowa State University

#### Hypotheses Testing From Complex Survey Data Using Bootstrap Weights: A Unified Approach, Jae Kwang Kim, J. N. K. Rao, Zhonglei Wang

*Jae Kwang Kim*

Standard statistical methods that do not take proper account of the complexity of survey design can lead to erroneous inferences when applied to survey data due to unequal selection probabilities, clustering, and other design features. In particular, the actual type I error rates of tests of hypotheses based on standard tests can be much bigger than the nominal significance level. Methods that take account of survey design features in testing hypotheses have been proposed, including Wald tests and quasi-score tests that involve the estimated covariance matrices of parameter estimates. Bootstrap methods designed for survey data are often applied to estimate ...

Accounting For Model Uncertainty In Multiple Imputation Under Complex Sampling, 2019 Kansas State University

#### Accounting For Model Uncertainty In Multiple Imputation Under Complex Sampling, Gyuhyeong Goh, Jae Kwang Kim

*Jae Kwang Kim*

Multiple imputation provides an effective way to handle missing data. When several possible models are under consideration for the data, multiple imputation is typically performed under a single best model selected from the candidate models. This single-model selection approach ignores the uncertainty associated with the model selection and so leads to underestimation of the variance of the multiple imputation estimator. In this paper, we propose a new multiple imputation procedure incorporating model uncertainty in the final inference. The proposed method incorporates possible candidate models for the data into the imputation procedure using the idea of Bayesian Model Averaging (BMA). The ...

Fully Bayesian Analysis Of Allele-Specific RNA-Seq Data, 2019 Universidad de la Republica

#### Fully Bayesian Analysis Of Allele-Specific RNA-Seq Data, Ignacio Alvarez-Castro, Jarad Niemi

*Statistics Publications*

Diploid organisms have two copies of each gene, called alleles, that can be separately transcribed. The RNA abundance associated with any particular allele is known as allele-specific expression (ASE). When two alleles have polymorphisms in transcribed regions, ASE can be studied using RNA-seq read count data. ASE has characteristics different from regular RNA-seq expression: ASE cannot be assessed for every gene, measures of ASE can be biased towards one of the alleles (the reference allele), and ASE provides two measures of expression for a single gene for each biological sample, which leads to additional complications for single-gene models. We present ...

Machine Learning In Support Of Electric Distribution Asset Failure Prediction, 2019 Southern Methodist University

#### Machine Learning In Support Of Electric Distribution Asset Failure Prediction, Robert D. Flamenbaum, Thomas Pompo, Christopher Havenstein, Jade Thiemsuwan

*SMU Data Science Review*

In this paper, we present novel approaches to predicting asset failure in the electric distribution system. Failures in overhead power lines and their associated equipment, in particular, pose significant financial and environmental threats to electric utilities. Electric device failure furthermore poses a burden on customers and can pose serious risk to life and livelihood. Working with asset data acquired from an electric utility in Southern California, and incorporating environmental and geospatial data from around the region, we applied a Random Forest methodology to predict which overhead distribution lines are most vulnerable to failure. Our results provide evidence ...

Principal Component Neural Networks For Modeling, Prediction, And Optimization Of Hot Mix Asphalt Dynamics Modulus, 2019 Iowa State University

#### Principal Component Neural Networks For Modeling, Prediction, And Optimization Of Hot Mix Asphalt Dynamics Modulus, Parnian Ghasemi, Mohamad Aslani, Derrick K. Rollins, R. Christopher Williams

*Derrick K Rollins, Sr.*

The dynamic modulus of hot mix asphalt (HMA) is a fundamental material property that defines the stress-strain relationship based on viscoelastic principles and is a function of HMA properties, loading rate, and temperature. Because of the large number of efficacious predictors (factors) and their nonlinear interrelationships, developing predictive models for dynamic modulus can be a challenging task. In this research, results obtained from a series of laboratory tests including mixture dynamic modulus, aggregate gradation, dynamic shear rheometer (on asphalt binder), and mixture volumetric are used to create a database. The created database is used to develop a model for estimating ...

Optimizing Ensemble Weights And Hyperparameters Of Machine Learning Models For Regression Problems, 2019 Iowa State University

#### Optimizing Ensemble Weights And Hyperparameters Of Machine Learning Models For Regression Problems, Mohsen Shahhosseini, Guiping Hu, Hieu Pham

*Guiping Hu*

Aggregating multiple learners through an ensemble of models aims to make better predictions by capturing the underlying distribution more accurately. Different ensembling methods, such as bagging, boosting and stacking/blending, have been studied and adopted extensively in research and practice. While bagging and boosting intend to reduce variance and bias, respectively, blending approaches target both by finding the optimal way to combine base learners to find the best trade-off between bias and variance. In blending, ensembles are created from weighted averages of multiple base learners. In this study, a systematic approach is proposed to find the optimal weights to create ...
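The statement that blended ensembles are weighted averages of base learners can be made concrete with a small sketch. It is a simplification and not the systematic approach proposed in the paper: it assumes only two base learners, squared-error loss, and a convex combination, in which case the best weight has a closed form. The function name `two_learner_blend_weight` and the toy predictions are hypothetical.

```python
import numpy as np

def two_learner_blend_weight(pred_a, pred_b, y):
    """Weight w in [0, 1] minimizing the MSE of w * pred_a + (1 - w) * pred_b."""
    pred_a, pred_b, y = (np.asarray(v, dtype=float) for v in (pred_a, pred_b, y))
    diff = pred_a - pred_b
    denom = float(diff @ diff)
    if denom == 0.0:
        # Identical predictions: every weight gives the same blend.
        return 0.5
    # Unconstrained least-squares solution for w.
    w = float((y - pred_b) @ diff) / denom
    # Clip so the blend stays a convex combination of the two learners.
    return min(1.0, max(0.0, w))

# Toy predictions (hypothetical): one learner overshoots, the other undershoots.
y = np.array([1.0, 2.0, 3.0, 4.0])
pred_a = y + 0.5
pred_b = y - 0.5

w = two_learner_blend_weight(pred_a, pred_b, y)
blended = w * pred_a + (1 - w) * pred_b
print(w, blended)
```

Here the two errors cancel, so the weight is 0.5 and the blend reproduces `y`. In practice the weight would be estimated on held-out predictions rather than on the data used to train the base learners, so that the blend does not simply reward overfitting.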

Towards Using Model Averaging To Construct Confidence Intervals In Logistic Regression Models, 2019 The University of Western Ontario

#### Towards Using Model Averaging To Construct Confidence Intervals In Logistic Regression Models, Artem Uvarov

*Electronic Thesis and Dissertation Repository*

Regression analyses in epidemiological and medical research typically begin with a model selection process, followed by inference assuming the selected model has generated the data at hand. It is well-known that this two-step procedure can yield biased estimates and invalid confidence intervals for model coefficients due to the uncertainty associated with the model selection. To account for this uncertainty, multiple models may be selected as a basis for inference. This method, commonly referred to as model-averaging, is increasingly becoming a viable approach in practice.

Previous research has demonstrated the advantage of model-averaging in reducing bias of parameter estimates. However, there ...

Identifying Undervalued Players In Fantasy Football, 2019 Southern Methodist University

#### Identifying Undervalued Players In Fantasy Football, Christopher D. Morgan, Caroll Rodriguez, Korey Macvittie, Robert Slater, Daniel W. Engels

*SMU Data Science Review*

In this paper we present a model to predict player performance in fantasy football. In particular, identifying high-performance players can prove to be a difficult problem, as there are on occasion players capable of high performance whose past metrics give no indication of this capacity. These "sleepers" are often undervalued, and the acquisition of such players can have notable impact on a fantasy football team's overall performance. We constructed a regression model that accounts for players' past performance and athletic metrics to predict their future performance. The model we built performs favorably in predicting athlete performance in relation to ...

Neutral Diagnosis: An Innovative Concept For Medical Device Clinical Trials, 2019 University of Massachusetts Medical School

#### Neutral Diagnosis: An Innovative Concept For Medical Device Clinical Trials, Bo Zhang, Shangyuan Ye, Sravya B. Shankara, Hui Zhang, Qingfeng Zheng

*Open Access Articles*

Study design and statistical analysis are crucial in pivotal clinical trials to evaluate the effectiveness and safety of new medical devices under investigation. In recent years, innovative intraoperative in vivo breast tumor diagnostic devices have been proposed to improve the accuracy and surgical outcomes of breast tumor patients undergoing resection. Although such technologies are promising, investigators need to obtain statistical evidence for the effectiveness and safety of these devices by conducting valid clinical trials. However, the study design and statistical analysis for these clinical trials are complicated. While these trials are designed to provide real-time intraoperative diagnosis of cancerous tissue ...

Declining Liquidity In Iowa Farms: 2014–2017, 2019 Iowa State University

#### Declining Liquidity In Iowa Farms: 2014–2017, Alejandro Plastina

*Alejandro Plastina*

The goal of the present study is to describe the evolution of financial liquidity in Iowa farms for 2014–2017, using a unique panel of 220 mid-scale commercial farms. Farms with vulnerable liquidity ratings increased from 33.2 percent in December 2014 to 45.0 percent in December 2017. On average, farms lost $244 of working capital per acre over that period, but farms with vulnerable liquidity ratings in December 2017 lost almost 60 percent more than that, or $388. Average farm size, machinery investment per acre, farm net worth per acre, debt-to-asset ratio, and age of operator were not ...

Semiparametric Fractional Imputation Using Gaussian Mixture Models For Handling Multivariate Missing Data, 2019 Google Inc

#### Semiparametric Fractional Imputation Using Gaussian Mixture Models For Handling Multivariate Missing Data, Hejian Sang, Jae Kwang Kim

*Jae Kwang Kim*

Item nonresponse is frequently encountered in practice. Ignoring missing data can lose efficiency and lead to misleading inference. Fractional imputation is a frequentist approach of imputation for handling missing data. However, the parametric fractional imputation of Kim (2011) may be subject to bias under model misspecification. In this paper, we propose a novel semiparametric fractional imputation method using Gaussian mixture models. The proposed method is computationally efficient and leads to robust estimation. The proposed method is further extended to incorporate the categorical auxiliary information. The asymptotic model consistency and √n-consistency of the semiparametric fractional imputation estimator are also established ...

Integration Of Survey Data And Big Observational Data For Finite Population Inference Using Mass Imputation, 2019 North Carolina State University

#### Integration Of Survey Data And Big Observational Data For Finite Population Inference Using Mass Imputation, Shu Yang, Jae Kwang Kim

*Jae Kwang Kim*

Multiple data sources are becoming increasingly available for statistical analyses in the era of big data. As an important example in finite-population inference, we consider an imputation approach to combining a probability sample with big observational data. Unlike the usual imputation for missing data analysis, we create imputed values for all elements in the probability sample. Such mass imputation is attractive in the context of survey data integration (Kim and Rao, 2012). We extend mass imputation as a tool for data integration of survey data and big non-survey data. The mass imputation methods and their statistical properties are presented ...

Bootstrap Inference For The Finite Population Total Under Complex Sampling Designs, 2019 Xiamen University

#### Bootstrap Inference For The Finite Population Total Under Complex Sampling Designs, Zhonglei Wang, Jae Kwang Kim, Liuhua Peng

*Jae Kwang Kim*

Bootstrap is a useful tool for making statistical inference, but it may provide erroneous results under complex survey sampling. Most studies about bootstrap-based inference are developed under simple random sampling and stratified random sampling. In this paper, we propose a unified bootstrap method applicable to some complex sampling designs, including Poisson sampling and probability-proportional-to-size sampling. Two main features of the proposed bootstrap method are that studentization is used to make inference, and the finite population is bootstrapped based on a multinomial distribution by incorporating the sampling information. We show that the proposed bootstrap method is second-order accurate using the Edgeworth ...

Bottom-Up Estimation And Top-Down Prediction: Solar Energy Prediction Combining Information From Multiple Sources, 2019 Sungkyunkwan University

#### Bottom-Up Estimation And Top-Down Prediction: Solar Energy Prediction Combining Information From Multiple Sources, Youngdeok Hwang, Siyuan Lu, Jae Kwang Kim

*Jae Kwang Kim*

Accurately forecasting solar power using data from multiple sources is an important but challenging problem. Our goal is to combine two different physics model forecasting outputs with real measurements from an automated monitoring network so as to better predict solar power in a timely manner. To this end, we propose a new approach to analyzing large-scale multilevel models with great computational efficiency, requiring minimum monitoring and intervention. This approach features a division of the large-scale data set into smaller ones of manageable size, based on their physical locations, and the fitting of a local model in each area. The local ...

Machine Learning Predicts Aperiodic Laboratory Earthquakes, 2019 Southern Methodist University

#### Machine Learning Predicts Aperiodic Laboratory Earthquakes, Olha Tanyuk, Daniel Davieau, Charles South, Daniel W. Engels

*SMU Data Science Review*

In this paper we find a pattern of aperiodic seismic signals that precede earthquakes at any time in a laboratory earthquake’s cycle using a small window of time. We use a data set that comes from a classic laboratory experiment having several stick-slip displacements (earthquakes), a type of experiment which has been studied as a simulation of seismologic faults for decades. This data exhibits similar behavior to natural earthquakes, so the same approach may work in predicting their timing. Here we show that by applying a random forest machine learning technique to the acoustic signal emitted by a ...

Longitudinal Analysis With Modes Of Operation For AES, 2019 Southern Methodist University

#### Longitudinal Analysis With Modes Of Operation For AES, Dana Geislinger, Cory Thigpen, Daniel W. Engels

*SMU Data Science Review*

In this paper, we present an empirical evaluation of the randomness of the ciphertext blocks generated by the Advanced Encryption Standard (AES) cipher in Counter (CTR) mode and in Cipher Block Chaining (CBC) mode. Vulnerabilities have been found in the AES cipher that may lead to a reduction in the randomness of the generated ciphertext blocks that can result in a practical attack on the cipher. We evaluate the randomness of the AES ciphertext using the standard key length and NIST randomness tests. We evaluate the randomness through a longitudinal analysis on 200 billion ciphertext blocks using logistic regression and ...

Bayesian And Positive Matrix Factorization Approaches To Pollution Source Apportionment, 2019 Brigham Young University - Provo

#### Bayesian And Positive Matrix Factorization Approaches To Pollution Source Apportionment, Jeff William Lingwall

*Jeff Lingwall*

The use of Positive Matrix Factorization (PMF) in pollution source apportionment (PSA) is examined and illustrated. A study of its settings is conducted in order to optimize them in the context of PSA. The use of a priori information in PMF is examined, in the form of target factor profiles and pulling profile elements to zero. A Bayesian model using lognormal prior distributions for source profiles and source contributions is fit and examined.