Open Access. Powered by Scholars. Published by Universities.®

Physical Sciences and Mathematics Commons

Open Access. Powered by Scholars. Published by Universities.®

Articles 1 - 19 of 19

Full-Text Articles in Physical Sciences and Mathematics

Statistical Challenges And Methods For Missing And Imbalanced Data, Rose Adjei Dec 2022

Statistical Challenges And Methods For Missing And Imbalanced Data, Rose Adjei

All Graduate Theses and Dissertations, Spring 1920 to Summer 2023

Missing data remains a prevalent issue in every area of research. The impact of missing data, if not carefully handled, can be detrimental to any statistical analysis. Some statistical challenges associated with missing data include, loss of information, reduced statistical power and non-generalizability of findings in a study. It is therefore crucial that researchers pay close and particular attention when dealing with missing data. This multi-paper dissertation provides insight into missing data across different fields of study and addresses some of the above mentioned challenges of missing data through simulation studies and application to real datasets. The first paper of …


Performance Comparison Of Imputation Methods For Mixed Data Missing At Random With Small And Large Sample Data Set With Different Variability, Kyei Afari Aug 2021

Performance Comparison Of Imputation Methods For Mixed Data Missing At Random With Small And Large Sample Data Set With Different Variability, Kyei Afari

Electronic Theses and Dissertations

One of the concerns in the field of statistics is the presence of missing data, which leads to bias in parameter estimation and inaccurate results. However, the multiple imputation procedure is a remedy for handling missing data. This study looked at the best multiple imputation methods used to handle mixed variable datasets with different sample sizes and variability along with different levels of missingness. The study employed the predictive mean matching, classification and regression trees, and the random forest imputation methods. For each dataset, the multiple regression parameter estimates for the complete datasets were compared to the multiple regression parameter …


Conditional And Marginal Imputation Models For Multilevel Data, Gang Liu Aug 2021

Conditional And Marginal Imputation Models For Multilevel Data, Gang Liu

Legacy Theses & Dissertations (2009 - 2024)

This dissertation study extends sequential hierarchical regression imputation (SHRIMP) methods to multilevel datasets with three levels of nesting and proposes a marginal method based on marginalized multilevel model (MMM) framework. Specifically, the proposed model consists of two levels such that the first level relates the marginal mean of responses with covariates through a generalized regression model and the second level includes subject specific random effects within the same generalized regression model. To draw the inference on the population-averaged or subject-specified coefficients, the hierarchical regression and/or MMM is applied as the imputation and estimation models. We employ Markov Chain Monte Carlo …


Performance Comparison Of Multiple Imputation Methods For Quantitative Variables For Small And Large Data With Differing Variability, Vincent Onyame May 2021

Performance Comparison Of Multiple Imputation Methods For Quantitative Variables For Small And Large Data With Differing Variability, Vincent Onyame

Electronic Theses and Dissertations

Missing data continues to be one of the main problems in data analysis as it reduces sample representativeness and consequently, causes biased estimates. Multiple imputation methods have been established as an effective method of handling missing data. In this study, we examined multiple imputation methods for quantitative variables on twelve data sets with varied sizes and variability that were pseudo generated from an original data. The multiple imputation methods examined are the predictive mean matching, Bayesian linear regression and linear regression, non-Bayesian in the MICE (Multiple Imputation Chain Equation) package in the statistical software, R. The parameter estimates generated from …


Imputation, Modelling And Optimal Sampling Design For Digital Camera Data In Recreational Fisheries Monitoring, Ebenezer Afrifa-Yamoah Jan 2021

Imputation, Modelling And Optimal Sampling Design For Digital Camera Data In Recreational Fisheries Monitoring, Ebenezer Afrifa-Yamoah

Theses: Doctorates and Masters

Digital camera monitoring has evolved as an active application-oriented scheme to help address questions in areas such as fisheries, ecology, computer vision, artificial intelligence, and criminology. In recreational fisheries research, digital camera monitoring has become a viable option for probability-based survey methods, and is also used for corroborative and validation purposes. In comparison to onsite surveys (e.g. boat ramp surveys), digital cameras provide a cost-effective method of monitoring boating activity and fishing effort, including night-time fishing activities. However, there are challenges in the use of digital camera monitoring that need to be resolved. Notably, missing data problems and the cost …


Multiple Imputation Using Influential Exponential Tilting In Case Of Non-Ignorable Missing Data, Kavita Gohil Jan 2020

Multiple Imputation Using Influential Exponential Tilting In Case Of Non-Ignorable Missing Data, Kavita Gohil

Electronic Theses and Dissertations

Modern research strategies rely predominantly on three steps, data collection, data analysis, and inference. In research, if the data is not collected as designed, researchers may face challenges of having incomplete data, especially when it is non-ignorable. These situations affect the subsequent steps of evaluation and make them difficult to perform. Inference with incomplete data is a challenging task in data analysis and clinical trials when missing data related to the condition under the study. Moreover, results obtained from incomplete data are prone to biases. Parameter estimation with non-ignorable missing data is even more challenging to handle and extract useful …


Nonparametric Analysis Of Clustered And Multivariate Data, Yue Cui Jan 2020

Nonparametric Analysis Of Clustered And Multivariate Data, Yue Cui

Theses and Dissertations--Statistics

In this dissertation, we investigate three distinct but interrelated problems for nonparametric analysis of clustered data and multivariate data in pre-post factorial design.

In the first project, we propose a nonparametric approach for one-sample clustered data in pre-post intervention design. In particular, we consider the situation where for some clusters all members are only observed at either pre or post intervention but not both. This type of clustered data is referred to us as partially complete clustered data. Unlike most of its parametric counterparts, we do not assume specific models for data distributions, intra-cluster dependence structure or variability, in effect …


Semiparametric And Nonparametric Methods For Comparing Biomarker Levels Between Groups, Yuntong Li Jan 2020

Semiparametric And Nonparametric Methods For Comparing Biomarker Levels Between Groups, Yuntong Li

Theses and Dissertations--Statistics

Comparing the distribution of biomarker measurements between two groups under either an unpaired or paired design is a common goal in many biomarker studies. However, analyzing biomarker data is sometimes challenging because the data may not be normally distributed and contain a large fraction of zero values or missing values. Although several statistical methods have been proposed, they either require data normality assumption, or are inefficient. We proposed a novel two-part semiparametric method for data under an unpaired setting and a nonparametric method for data under a paired setting. The semiparametric method considers a two-part model, a logistic regression for …


Exploring The Estimability Of Mark-Recapture Models With Individual, Time-Varying Covariates Using The Scaled Logit Link Function, Jiaqi Mu Aug 2019

Exploring The Estimability Of Mark-Recapture Models With Individual, Time-Varying Covariates Using The Scaled Logit Link Function, Jiaqi Mu

Electronic Thesis and Dissertation Repository

Mark-recapture studies are often used to estimate the survival of individuals in a population and identify factors that affect survival in order to understand how the population might be affected by changing conditions. Factors that vary between individuals and over time, like body mass, present a challenge because they can only be observed when an individual is captured. Several models have been proposed to deal with the missing-covariate problem and commonly impose a logit link function which implies that the survival probability varies between 0 and 1. In this thesis I explore the estimability of four possible models when survival …


Comparison Of Imputation Methods For Mixed Data Missing At Random, Kaitlyn Heidt May 2019

Comparison Of Imputation Methods For Mixed Data Missing At Random, Kaitlyn Heidt

Electronic Theses and Dissertations

A statistician's job is to produce statistical models. When these models are precise and unbiased, we can relate them to new data appropriately. However, when data sets have missing values, assumptions to statistical methods are violated and produce biased results. The statistician's objective is to implement methods that produce unbiased and accurate results. Research in missing data is becoming popular as modern methods that produce unbiased and accurate results are emerging, such as MICE in R, a statistical software. Using real data, we compare four common imputation methods, in the MICE package in R, at different levels of missingness. The …


Forecasting Crashes, Credit Card Default, And Imputation Analysis On Missing Values By The Use Of Neural Networks, Jazmin Quezada Jan 2019

Forecasting Crashes, Credit Card Default, And Imputation Analysis On Missing Values By The Use Of Neural Networks, Jazmin Quezada

Open Access Theses & Dissertations

A neural network is a system of hardware and/or software patterned after the operation of neurons in the human brain. Neural networks,- also called Artificial Neural Networks - are a variety of deep learning technology, which also falls under the umbrella of artificial intelligence, or AI. Recent studies shows that Artificial Neural Network has the highest coefficient of determination (i.e. measure to assess how well a model explains and predicts future outcomes.) in comparison to the K-nearest neighbor classifiers, logistic regression, discriminant analysis, naive Bayesian classifier, and classification trees. In this work, the theoretical description of the neural network methodology …


Bayesian Nonparametric Analysis Of Longitudinal Data With Non-Ignorable Non-Monotone Missingness, Yu Cao Jan 2019

Bayesian Nonparametric Analysis Of Longitudinal Data With Non-Ignorable Non-Monotone Missingness, Yu Cao

Theses and Dissertations

In longitudinal studies, outcomes are measured repeatedly over time, but in reality clinical studies are full of missing data points of monotone and non-monotone nature. Often this missingness is related to the unobserved data so that it is non-ignorable. In such context, pattern-mixture model (PMM) is one popular tool to analyze the joint distribution of outcome and missingness patterns. Then the unobserved outcomes are imputed using the distribution of observed outcomes, conditioned on missing patterns. However, the existing methods suffer from model identification issues if data is sparse in specific missing patterns, which is very likely to happen with a …


Statistical Tools For Assessment Of Spatial Properties Of Mutations Observed Under The Microarray Platform, Bin Luo Sep 2018

Statistical Tools For Assessment Of Spatial Properties Of Mutations Observed Under The Microarray Platform, Bin Luo

Electronic Thesis and Dissertation Repository

Mutations are alterations of the DNA nucleotide sequence of the genome. Analyses of spatial properties of mutations are critical for understanding certain mutational mechanisms relevant to genetic disease, diversity, and evolution. The studies in this thesis focus on two types of mutations: point mutations, i.e., single nucleotide polymorphism (SNP) genotype differences, and mutations in segments, i.e., copy number variations (CNVs). The microarray platform, such as the Mouse Diversity Genotyping Array (MDGA), detects these mutations genome-wide with lower cost compared to whole genome sequencing, and thus is considered for suitability as a screening tool for large populations. Yet it provides observation …


Multiple Imputation Of Missing Data In Structural Equation Models With Mediators And Moderators Using Gradient Boosted Machine Learning, Robert J. Milletich Ii Oct 2016

Multiple Imputation Of Missing Data In Structural Equation Models With Mediators And Moderators Using Gradient Boosted Machine Learning, Robert J. Milletich Ii

Psychology Theses & Dissertations

Mediation and moderated mediation models are two commonly used models for indirect effects analysis. In practice, missing data is a pervasive problem in structural equation modeling with psychological data. Multiple imputation (MI) is one method used to estimate model parameters in the presence of missing data, while accounting for uncertainty due to the missing data. Unfortunately, commonly used MI methods are not equipped to handle categorical variables or nonlinear variables such as interactions. In this study, we introduce a general MI framework that uses the Bayesian bootstrap (BB) method to generate posterior inferences for indirect effects and gradient boosted machine …


The Effects Of A Planned Missingness Design On Examinee Motivation And Psychometric Quality, Matthew S. Swain May 2015

The Effects Of A Planned Missingness Design On Examinee Motivation And Psychometric Quality, Matthew S. Swain

Dissertations, 2014-2019

Assessment practitioners in higher education face increasing demands to collect assessment and accountability data to make important inferences about student learning and institutional quality. The validity of these high-stakes decisions is jeopardized, particularly in low-stakes testing contexts, when examinees do not expend sufficient motivation to perform well on the test. This study introduced planned missingness as a potential solution. In planned missingness designs, data on all items are collected but each examinee only completes a subset of items, thus increasing data collection efficiency, reducing examinee burden, and potentially increasing data quality. The current scientific reasoning test served as the Long …


A Comparison For Longitudinal Data Missing Due To Truncation, Rong Liu Jan 2006

A Comparison For Longitudinal Data Missing Due To Truncation, Rong Liu

Theses and Dissertations

Many longitudinal clinical studies suffer from patient dropout. Often the dropout is nonignorable and the missing mechanism needs to be incorporated in the analysis. The methods handling missing data make various assumptions about the missing mechanism, and their utility in practice depends on whether these assumptions apply in a specific application. Ramakrishnan and Wang (2005) proposed a method (MDT) to handle nonignorable missing data, where missing is due to the observations exceeding an unobserved threshold. Assuming that the observations arise from a truncated normal distribution, they suggested an EM algorithm to simplify the estimation.In this dissertation the EM algorithm is …


Autologous Stem Cell Transplant: Factors Predicting The Yield Of Cd34+ Cells, Elizabeth Anne Lawson Dec 2005

Autologous Stem Cell Transplant: Factors Predicting The Yield Of Cd34+ Cells, Elizabeth Anne Lawson

Theses and Dissertations

Stem cell transplant is often considered the last hope for the survival for many cancer patients. The CD34+ cell content of a collection of stem cells has appeared as the most reliable indicator of the quantity of desired cells in a peripheral blood stem cell harvest and is used as a surrogate measure of the sample quality. Factors predicting the yield of CD34+ cells in a collection are not yet fully understood. Throughout the literature, there has been conflicting evidence with regards to age, gender, disease status, and prior radiation. In addition to the factors that have already been explored, …


Program For Missing Data In The Multivariate Normal Distribution, Chi-Ping Lu Jan 1975

Program For Missing Data In The Multivariate Normal Distribution, Chi-Ping Lu

All Graduate Plan B and other Reports, Spring 1920 to Spring 2023

Missing data can often cause many problems in research work. Therefore for carrying out analysis, some procedure for obtaining estimates in the presence of missing data should be applied. Various theories and techniques have been developed for different types of problems.

Analysis of the Multivariate Normal Distribution with missing data is one of the areas studied. It has been discussed earlier by Wilkes (1932), Lord (1955), Edgett (1956) and Hartley (1958). They have established some basic concepts and an outline in the way of estimation.

In the last ten years, A. A. Afifi and R. M. Elasfoff also have contributed …


The Evaluation Of Glasser's Maximum Likelihood Method On Missing Data In Regression, Gayle M. Yamasaki Jan 1973

The Evaluation Of Glasser's Maximum Likelihood Method On Missing Data In Regression, Gayle M. Yamasaki

All Graduate Plan B and other Reports, Spring 1920 to Spring 2023

Missing data in regression is often a problem to research workers because standard regression methods are applicable only to complete data sets. At present there are three general methods for solving the problem of missing data.

At first, the reduced data method, reduces the incomplete data set to a complete data set before analyzing. Although this method is very simple to apply, substantial amounts of information are sometimes lost when data is eliminated. This results in less precise estimates of the regression parameters.

The second method, generalized least squares, estimates the missing values through least squares techniques, thus obtaining a …