Open Access. Powered by Scholars. Published by Universities.®

Physical Sciences and Mathematics Commons

Statistics and Probability

Missing data

Articles 1 - 30 of 53

Full-Text Articles in Physical Sciences and Mathematics

Statistical Challenges And Methods For Missing And Imbalanced Data, Rose Adjei Dec 2022

All Graduate Theses and Dissertations, Spring 1920 to Summer 2023

Missing data remains a prevalent issue in every area of research. The impact of missing data, if not carefully handled, can be detrimental to any statistical analysis. Statistical challenges associated with missing data include loss of information, reduced statistical power, and non-generalizability of a study's findings. It is therefore crucial that researchers pay close attention when dealing with missing data. This multi-paper dissertation provides insight into missing data across different fields of study and addresses some of the above-mentioned challenges through simulation studies and application to real datasets. The first paper of …


Performance Comparison Of Imputation Methods For Mixed Data Missing At Random With Small And Large Sample Data Set With Different Variability, Kyei Afari Aug 2021

Electronic Theses and Dissertations

One of the concerns in the field of statistics is the presence of missing data, which leads to bias in parameter estimation and inaccurate results. However, the multiple imputation procedure is a remedy for handling missing data. This study looked at the best multiple imputation methods used to handle mixed variable datasets with different sample sizes and variability along with different levels of missingness. The study employed the predictive mean matching, classification and regression trees, and the random forest imputation methods. For each dataset, the multiple regression parameter estimates for the complete datasets were compared to the multiple regression parameter …


Compare And Contrast Maximum Likelihood Method And Inverse Probability Weighting Method In Missing Data Analysis, Scott Sun May 2021

Mathematical Sciences Technical Reports (MSTR)

Data can be lost for different reasons, but sometimes the missingness is a part of the data collection process. Unbiased and efficient estimation of the parameters governing the response mean model requires the missing data to be appropriately addressed. This paper compares and contrasts the Maximum Likelihood and Inverse Probability Weighting estimators in an Outcome-Dependent Sampling design that deliberately generates incomplete observations. We demonstrate the comparison through numerical simulations under varied conditions: different coefficients of determination, and whether or not the mean model is misspecified.
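
As a rough illustration of the inverse probability weighting idea named in this abstract, the sketch below weights each observed value by the reciprocal of its probability of being observed and then normalizes the weights. It is a minimal example for a population mean, not the estimator studied in the report; the function name `ipw_mean` and the assumption that the observation probabilities are known are introduced here for illustration only.

```python
def ipw_mean(values, observed, probs):
    """Normalized inverse-probability-weighted mean (illustrative sketch).

    values   -- outcomes; entries for unobserved units are ignored
    observed -- True where the outcome was actually observed
    probs    -- assumed-known probability that each unit is observed
    """
    # Each observed unit stands in for 1/p units of the full sample.
    weights = [1.0 / p for o, p in zip(observed, probs) if o]
    kept = [v for v, o in zip(values, observed) if o]
    weighted_sum = sum(w * v for w, v in zip(weights, kept))
    return weighted_sum / sum(weights)


values = [1.0, 2.0, None, 4.0]
observed = [True, True, False, True]
probs = [0.5, 0.5, 0.25, 0.25]
print(ipw_mean(values, observed, probs))
```

Units that were unlikely to be observed receive larger weights, so the observed sample is rebalanced toward the full sample it was drawn from.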


Performance Comparison Of Multiple Imputation Methods For Quantitative Variables For Small And Large Data With Differing Variability, Vincent Onyame May 2021

Electronic Theses and Dissertations

Missing data continues to be one of the main problems in data analysis, as it reduces sample representativeness and consequently causes biased estimates. Multiple imputation has been established as an effective method of handling missing data. In this study, we examined multiple imputation methods for quantitative variables on twelve data sets of varied sizes and variability that were pseudo-generated from an original data set. The multiple imputation methods examined are predictive mean matching, Bayesian linear regression, and non-Bayesian linear regression in the MICE (Multivariate Imputation by Chained Equations) package in the statistical software R. The parameter estimates generated from …


Imputation, Modelling And Optimal Sampling Design For Digital Camera Data In Recreational Fisheries Monitoring, Ebenezer Afrifa-Yamoah Jan 2021

Theses: Doctorates and Masters

Digital camera monitoring has evolved as an active application-oriented scheme to help address questions in areas such as fisheries, ecology, computer vision, artificial intelligence, and criminology. In recreational fisheries research, digital camera monitoring has become a viable option for probability-based survey methods, and is also used for corroborative and validation purposes. In comparison to onsite surveys (e.g. boat ramp surveys), digital cameras provide a cost-effective method of monitoring boating activity and fishing effort, including night-time fishing activities. However, there are challenges in the use of digital camera monitoring that need to be resolved. Notably, missing data problems and the cost …


Almost All Missing Data Are MNAR, Thomas R. Knapp Sep 2020

Journal of Modern Applied Statistical Methods

Rubin (1976, and elsewhere) claimed that there are three kinds of “missingness”: missing completely at random; missing at random; and missing not at random. He gave examples of each. The article that now follows takes an opposing view by arguing that almost all missing data are missing not at random.
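
The three kinds of missingness listed in this abstract can be made concrete with a small simulation. The sketch below is only an illustration of the definitions, not material from the article: the function name `observed_means`, the sample size, and the retention probabilities are all assumptions chosen for the example.

```python
import random


def observed_means(n=20000, seed=0):
    """Mean of the observed y values under three missingness mechanisms."""
    rng = random.Random(seed)
    x = [rng.gauss(0, 1) for _ in range(n)]  # covariate, always observed
    y = [xi + rng.gauss(0, 1) for xi in x]   # outcome, sometimes missing

    def mean_of_kept(keep):
        kept = [yi for yi, k in zip(y, keep) if k]
        return sum(kept) / len(kept)

    # MCAR: every y has the same chance of being observed.
    mcar = [rng.random() < 0.5 for _ in range(n)]
    # MAR: the chance of observing y depends only on the observed x.
    mar = [rng.random() < (0.8 if xi < 0 else 0.2) for xi in x]
    # MNAR: the chance of observing y depends on y itself.
    mnar = [rng.random() < (0.8 if yi < 0 else 0.2) for yi in y]

    return {
        "true": sum(y) / n,
        "mcar": mean_of_kept(mcar),
        "mar": mean_of_kept(mar),
        "mnar": mean_of_kept(mnar),
    }


print(observed_means())
```

In this setup the naive mean of the observed values stays close to the full-data mean only under MCAR. Under MAR the bias can in principle be removed by modelling y given the observed x, whereas under MNAR the observed data alone do not carry enough information to do so.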


Semiparametric And Nonparametric Methods For Comparing Biomarker Levels Between Groups, Yuntong Li Jan 2020

Theses and Dissertations--Statistics

Comparing the distribution of biomarker measurements between two groups under either an unpaired or paired design is a common goal in many biomarker studies. However, analyzing biomarker data is sometimes challenging because the data may not be normally distributed and contain a large fraction of zero values or missing values. Although several statistical methods have been proposed, they either require data normality assumption, or are inefficient. We proposed a novel two-part semiparametric method for data under an unpaired setting and a nonparametric method for data under a paired setting. The semiparametric method considers a two-part model, a logistic regression for …


Nonparametric Analysis Of Clustered And Multivariate Data, Yue Cui Jan 2020

Theses and Dissertations--Statistics

In this dissertation, we investigate three distinct but interrelated problems for nonparametric analysis of clustered data and multivariate data in pre-post factorial design.

In the first project, we propose a nonparametric approach for one-sample clustered data in pre-post intervention design. In particular, we consider the situation where for some clusters all members are only observed at either pre or post intervention but not both. This type of clustered data is referred to as partially complete clustered data. Unlike most of its parametric counterparts, we do not assume specific models for data distributions, intra-cluster dependence structure or variability, in effect …


Multiple Imputation Using Influential Exponential Tilting In Case Of Non-Ignorable Missing Data, Kavita Gohil Jan 2020

Electronic Theses and Dissertations

Modern research strategies rely predominantly on three steps: data collection, data analysis, and inference. If the data are not collected as designed, researchers may face the challenge of incomplete data, especially when the missingness is non-ignorable. These situations affect the subsequent steps of evaluation and make them difficult to perform. Inference with incomplete data is a challenging task in data analysis and clinical trials when the missing data are related to the condition under study. Moreover, results obtained from incomplete data are prone to biases. Parameter estimation with non-ignorable missing data is even more challenging to handle and extract useful …


Evaluation Of Modern Missing Data Handling Methods For Coefficient Alpha, Katerina Matysova Dec 2019

College of Education and Human Sciences: Dissertations, Theses, and Student Research

When assessing a certain characteristic or trait using a multiple item measure, quality of that measure can be assessed by examining the reliability. To avoid multiple time points, reliability can be represented by internal consistency, which is most commonly calculated using Cronbach’s coefficient alpha. Almost every time human participants are involved in research, there is missing data involved. Missing data means that even though complete data were expected to be collected, some data are missing. Missing data can follow different patterns as well as be the result of different mechanisms. One traditional way to deal with missing data is listwise …
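
Coefficient alpha and listwise deletion, both named in this abstract, can be sketched in a few lines. The example below uses the standard formula alpha = k/(k-1) * (1 - sum of item variances / variance of total scores) and is only a minimal illustration; the helper names and the toy data are assumptions, and it does not reproduce the modern missing data methods the dissertation evaluates.

```python
from statistics import variance


def listwise_delete(rows):
    """Keep only respondents with no missing item (None marks missing)."""
    return [r for r in rows if all(v is not None for v in r)]


def cronbach_alpha(rows):
    """Cronbach's coefficient alpha computed on complete cases only."""
    complete = listwise_delete(rows)
    k = len(complete[0])  # number of items
    item_vars = [variance([r[j] for r in complete]) for j in range(k)]
    total_var = variance([sum(r) for r in complete])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Hypothetical responses: 5 respondents, 3 items, one missing answer.
responses = [
    [1, 2, 3],
    [2, 3, 4],
    [3, 4, 5],
    [None, 1, 2],
    [4, 5, 6],
]
print(len(listwise_delete(responses)), cronbach_alpha(responses))
```

Listwise deletion discards the whole fourth respondent because of a single missing item, which is the loss of information that motivates the alternatives discussed in the missing data literature.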


The Estimation Of Missing Values In Rectangular Lattice Designs, Emmanuel Ogochukwu Ossai, Abimibola Victoria Oladugba Sep 2019

Journal of Modern Applied Statistical Methods

Algebraic expressions for estimating missing data when one or more observations are missing in rectangular lattice designs with repetition were derived by minimizing the residual sum of squares. Results showed that the estimated value(s) closely approximated the actual value(s).


Exploring The Estimability Of Mark-Recapture Models With Individual, Time-Varying Covariates Using The Scaled Logit Link Function, Jiaqi Mu Aug 2019

Electronic Thesis and Dissertation Repository

Mark-recapture studies are often used to estimate the survival of individuals in a population and identify factors that affect survival in order to understand how the population might be affected by changing conditions. Factors that vary between individuals and over time, like body mass, present a challenge because they can only be observed when an individual is captured. Several models have been proposed to deal with the missing-covariate problem and commonly impose a logit link function which implies that the survival probability varies between 0 and 1. In this thesis I explore the estimability of four possible models when survival …


Comparison Of Imputation Methods For Mixed Data Missing At Random, Kaitlyn Heidt May 2019

Electronic Theses and Dissertations

A statistician's job is to produce statistical models. When these models are precise and unbiased, we can relate them to new data appropriately. However, when data sets have missing values, the assumptions of statistical methods are violated, which produces biased results. The statistician's objective is to implement methods that produce unbiased and accurate results. Research in missing data is becoming popular as modern methods that produce unbiased and accurate results are emerging, such as MICE in the statistical software R. Using real data, we compare four common imputation methods in the MICE package in R at different levels of missingness. The …


Fixed Choice Design And Augmented Fixed Choice Design For Network Data With Missing Observations, Miles Q. Ott, Matthew T. Harrison, Krista J. Gile, Nancy P. Barnett, Joseph W. Hogan Jan 2019

Statistical and Data Sciences: Faculty Publications

The statistical analysis of social networks is increasingly used to understand social processes and patterns. The association between social relationships and individual behaviors is of particular interest to sociologists, psychologists, and public health researchers. Several recent network studies make use of the fixed choice design (FCD), which induces missing edges in the network data. Because of the complex dependence structure inherent in networks, missing data can pose very difficult problems for valid statistical inference. In this article, we introduce novel methods for accounting for the FCD censoring and introduce a new survey design, which we call the augmented fixed choice …


Forecasting Crashes, Credit Card Default, And Imputation Analysis On Missing Values By The Use Of Neural Networks, Jazmin Quezada Jan 2019

Open Access Theses & Dissertations

A neural network is a system of hardware and/or software patterned after the operation of neurons in the human brain. Neural networks, also called artificial neural networks, are a variety of deep learning technology, which also falls under the umbrella of artificial intelligence, or AI. Recent studies show that artificial neural networks have the highest coefficient of determination (i.e., a measure of how well a model explains and predicts future outcomes) in comparison to K-nearest neighbor classifiers, logistic regression, discriminant analysis, naive Bayesian classifiers, and classification trees. In this work, the theoretical description of the neural network methodology …


Bayesian Nonparametric Analysis Of Longitudinal Data With Non-Ignorable Non-Monotone Missingness, Yu Cao Jan 2019

Theses and Dissertations

In longitudinal studies, outcomes are measured repeatedly over time, but in reality clinical studies are full of missing data points of both monotone and non-monotone nature. Often this missingness is related to the unobserved data, so that it is non-ignorable. In such contexts, the pattern-mixture model (PMM) is a popular tool for analyzing the joint distribution of outcome and missingness patterns. The unobserved outcomes are then imputed using the distribution of observed outcomes, conditioned on missingness patterns. However, the existing methods suffer from model identification issues if data are sparse in specific missingness patterns, which is very likely to happen with a …


Statistical Tools For Assessment Of Spatial Properties Of Mutations Observed Under The Microarray Platform, Bin Luo Sep 2018

Electronic Thesis and Dissertation Repository

Mutations are alterations of the DNA nucleotide sequence of the genome. Analyses of spatial properties of mutations are critical for understanding certain mutational mechanisms relevant to genetic disease, diversity, and evolution. The studies in this thesis focus on two types of mutations: point mutations, i.e., single nucleotide polymorphism (SNP) genotype differences, and mutations in segments, i.e., copy number variations (CNVs). The microarray platform, such as the Mouse Diversity Genotyping Array (MDGA), detects these mutations genome-wide with lower cost compared to whole genome sequencing, and thus is considered for suitability as a screening tool for large populations. Yet it provides observation …


Handling Missing Data In Single-Case Studies, Chao-Ying Joanne Peng, Li-Ting Chen Jun 2018

Journal of Modern Applied Statistical Methods

Multiple imputation is illustrated for dealing with missing data in a published SCED study. Results were compared to those obtained from available data. Merits and issues of implementation are discussed. Recommendations are offered on primal/advanced readings, statistical software, and future research.


Missing Data In Longitudinal Surveys: A Comparison Of Performance Of Modern Techniques, Paola Zaninotto, Amanda Sacker Dec 2017

Journal of Modern Applied Statistical Methods

Using a simulation study, the performance of complete case analysis, full information maximum likelihood, multivariate normal imputation, multiple imputation by chained equations, and two-fold fully conditional specification in handling missing data was compared in longitudinal surveys with continuous and binary outcomes, missing covariates, and an interaction term.


Impact Of Home Visit Capacity On Genetic Association Studies Of Late-Onset Alzheimer's Disease, David W. Fardo, Laura E. Gibbons, Shubhabrata Mukherjee, M. Maria Glymour, Wayne McCormick, Susan M. McCurry, James D. Bowen, Eric B. Larson, Paul K. Crane Aug 2017

Biostatistics Faculty Publications

INTRODUCTION—Findings for genetic correlates of late-onset Alzheimer's disease (LOAD) in studies that rely solely on clinic visits may differ from those with capacity to follow participants unable to attend clinic visits.

METHODS—We evaluated previously identified LOAD-risk single nucleotide variants in the prospective Adult Changes in Thought study, comparing hazard ratios (HRs) estimated using the full data set of both in-home and clinic visits (n = 1697) to HRs estimated using only data that were obtained from clinic visits (n = 1308). Models were adjusted for age, sex, principal components to account for ancestry, and additional health indicators.

RESULTS …


JMASM 44: Implementing Multiple Ratio Imputation By The EMB Algorithm (R), Masayoshi Takahashi May 2017

Journal of Modern Applied Statistical Methods

Although single ratio imputation is often used to deal with missing values in practice, there is a paucity of discussion regarding multiple ratio imputation. Code in the R statistical environment is presented to execute multiple ratio imputation by the Expectation-Maximization with Bootstrapping (EMB) algorithm.
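
For readers unfamiliar with the single ratio imputation mentioned in this abstract, the sketch below fills each missing y with the ratio of observed y to the matching x, multiplied by that unit's x. It is a minimal illustration of single ratio imputation only and does not implement the EMB algorithm or the article's R code; the function name `ratio_impute` and the toy data are assumptions.

```python
def ratio_impute(x, y):
    """Single ratio imputation: replace missing y_i with ratio * x_i.

    The ratio is sum(observed y) / sum(x for the same units).
    None marks a missing y; x is assumed fully observed.
    """
    pairs = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    ratio = sum(yi for _, yi in pairs) / sum(xi for xi, _ in pairs)
    return [yi if yi is not None else ratio * xi for xi, yi in zip(x, y)]


x = [10, 20, 30, 40]
y = [20, 40, None, 80]
print(ratio_impute(x, y))
```

Because every missing value receives a single deterministic fill-in, this approach does not reflect the uncertainty of the imputation, which is the gap that multiple ratio imputation is meant to address.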


Multiple Ratio Imputation By The EMB Algorithm: Theory And Simulation, Masayoshi Takahashi May 2017

Journal of Modern Applied Statistical Methods

Although multiple imputation is the gold standard of treating missing data, single ratio imputation is often used in practice. Based on Monte Carlo simulation, the Expectation-Maximization with Bootstrapping (EMB) algorithm to create multiple ratio imputation is used to fill in the gap between theory and practice.


Multiple Imputation Of Missing Data In Structural Equation Models With Mediators And Moderators Using Gradient Boosted Machine Learning, Robert J. Milletich II Oct 2016

Psychology Theses & Dissertations

Mediation and moderated mediation models are two commonly used models for indirect effects analysis. In practice, missing data is a pervasive problem in structural equation modeling with psychological data. Multiple imputation (MI) is one method used to estimate model parameters in the presence of missing data, while accounting for uncertainty due to the missing data. Unfortunately, commonly used MI methods are not equipped to handle categorical variables or nonlinear variables such as interactions. In this study, we introduce a general MI framework that uses the Bayesian bootstrap (BB) method to generate posterior inferences for indirect effects and gradient boosted machine …
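
One common way multiple imputation accounts for uncertainty due to the missing data is to pool the estimates from the imputed data sets with Rubin's rules. The sketch below shows that standard pooling step as an assumption about what the abstract refers to; it does not reproduce the Bayesian bootstrap or gradient boosting framework of the study, and the function name `pool_rubin` is hypothetical.

```python
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool m point estimates and their variances with Rubin's rules.

    Returns (pooled estimate, total variance), where
    total variance = within + (1 + 1/m) * between.
    """
    m = len(estimates)
    pooled = mean(estimates)
    within = mean(variances)       # average sampling variance
    between = variance(estimates)  # spread across imputations (m - 1 denominator)
    total = within + (1 + 1 / m) * between
    return pooled, total


# Hypothetical results from m = 3 imputed data sets.
print(pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5]))
```

The between-imputation term is what a single imputation cannot supply: it grows when the imputed data sets disagree, which widens the resulting confidence intervals.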


CRTgeeDR: An R Package For Doubly Robust Generalized Estimating Equations Estimations In Cluster Randomized Trials With Missing Data, Melanie Prague, Rui Wang, Victor De Gruttola Feb 2016

Harvard University Biostatistics Working Paper Series

No abstract provided.


Correction Of Verification Bias Using Log-Linear Models For A Single Binary-Scale Diagnostic Tests, Haresh Rochani, Hani M. Samawi, Robert L. Vogel, Jingjing Yin Dec 2015

Biostatistics Faculty Publications

In diagnostic medicine, the test that determines the true disease status without error is referred to as the gold standard. Even when a gold standard exists, it is extremely difficult to verify each patient due to issues of cost-effectiveness and the invasive nature of the procedures. In practice, some of the patients with test results are not selected for verification of the disease status, which results in verification bias for diagnostic tests. The ability of the diagnostic test to correctly identify the patients with and without the disease can be evaluated by measures such as sensitivity, specificity and predictive …


The Effects Of A Planned Missingness Design On Examinee Motivation And Psychometric Quality, Matthew S. Swain May 2015

Dissertations, 2014-2019

Assessment practitioners in higher education face increasing demands to collect assessment and accountability data to make important inferences about student learning and institutional quality. The validity of these high-stakes decisions is jeopardized, particularly in low-stakes testing contexts, when examinees do not expend sufficient motivation to perform well on the test. This study introduced planned missingness as a potential solution. In planned missingness designs, data on all items are collected but each examinee only completes a subset of items, thus increasing data collection efficiency, reducing examinee burden, and potentially increasing data quality. The current scientific reasoning test served as the Long …


Integrating Data Transformation In Principal Components Analysis, Mehdi Maadooliat, Jianhua Z. Huang, Jianhua Hu Mar 2015

Mathematics, Statistics and Computer Science Faculty Research and Publications

Principal component analysis (PCA) is a popular dimension-reduction method to reduce the complexity and obtain the informative aspects of high-dimensional datasets. When the data distribution is skewed, data transformation is commonly used prior to applying PCA. Such transformation is usually obtained from previous studies, prior knowledge, or trial-and-error. In this work, we develop a model-based method that integrates data transformation in PCA and finds an appropriate data transformation using the maximum profile likelihood. Extensions of the method to handle functional data and missing values are also developed. Several numerical algorithms are provided for efficient computation. The proposed method is illustrated …


Some General Guidelines For Choosing Missing Data Handling Methods In Educational Research, Jehanzeb R. Cheema Nov 2014

Journal of Modern Applied Statistical Methods

The effects of a number of factors, such as the choice of analytical method, the handling method for missing data, sample size, and proportion of missing data, were examined to evaluate the effect of missing data treatment on accuracy of estimation. A methodological approach involving simulated data was adopted. One outcome of the statistical analyses undertaken in this study is the formulation of easy-to-implement guidelines for educational researchers that allow one to choose one of the following factors when all others are given: sample size, proportion of missing data in the sample, method of analysis, and missing data handling method.


Phylogenetic Linkage Among HIV-Infected Village Residents In Botswana: Estimation Of Clustering Rates In The Presence Of Missing Data, Nicole Bohme Carnegie, Rui Wang, Vladimir Novitsky, Victor G. DeGruttola Jun 2013

Harvard University Biostatistics Working Paper Series

No abstract provided.


JMASM 32: Multiple Imputation Of Missing Multilevel, Longitudinal Data: A Case When Practical Considerations Trump Best Practices?, Jennifer E. V. Lloyd, Jelena Obradović, Richard M. Carpiano, Frosso Motti-Stefanidi May 2013

Journal of Modern Applied Statistical Methods

A pedagogical tool is presented for applied researchers dealing with incomplete multilevel, longitudinal data. It explains why such data pose special challenges regarding missingness. Syntax created to perform a multiply-imputed growth modeling procedure in Stata Version 11 (StataCorp, 2009) is also described.