Open Access. Powered by Scholars. Published by Universities.®

Digital Commons Network

Full-Text Articles in Entire DC Network

Interpretable Word-Level Sentiment Analysis With Attention-Based Multiple Instance Classification Models, Chenyu Yang Dec 2023

Statistical Science Theses and Dissertations

In this study, our main objective is to tackle the black-box nature of popular machine learning models in sentiment analysis and enhance model interpretability. We aim to gain more insight into the decision-making process of sentiment analysis models, which is often obscure in those complex models. To achieve this goal, we introduce two word-level sentiment analysis models.

The first model is called the attention-based multiple instance classification (AMIC) model. It combines the transparent model structure of multiple instance classification and the self-attention mechanism in deep learning to incorporate the contextual information from documents. As demonstrated by a wine review dataset …


Bayesian Statistical Modeling Of Spatially Resolved Transcriptomics Data, Xi Jiang Oct 2023

Statistical Science Theses and Dissertations

Spatially resolved transcriptomics (SRT) quantifies expression levels at different spatial locations, providing a new and powerful tool for gaining novel biological insights. As experimental technologies improve in both capacity and efficiency, there is a growing demand for the development of analytical methodologies.

One question in SRT data analysis is how to identify genes whose expression exhibits spatially correlated patterns, called spatially variable (SV) genes. Most current methods for identifying SV genes are built upon a geostatistical model with a Gaussian process, which could limit the models' ability to identify complex spatial patterns. In order to overcome this challenge and capture more types …


A Comparison Of Confidence Intervals In State Space Models, Jinyu Du Jul 2023

Statistical Science Theses and Dissertations

This thesis develops general procedures for constructing confidence intervals (CIs) for the error disturbance parameters (standard deviations), and for transformations of those parameters, in time-invariant state space models (SSMs). With only a set of observations, accurately estimating individual error disturbance parameters in the presence of other unknown parameters in SSMs is a very challenging problem. We attempted to construct four different types of confidence intervals (Wald, likelihood ratio, score, and higher-order asymptotic intervals) for both the simple local level model and general time-invariant SSMs. We show that for a simple local level model, both the …


Optimal Experimental Planning Of Reliability Experiments Based On Coherent Systems, Yang Yu Jul 2023

Statistical Science Theses and Dissertations

In industrial engineering and manufacturing, assessing the reliability of a product or system is an important topic. Life-testing and reliability experiments are commonly used reliability assessment methods to gain sound knowledge about product or system lifetime distributions. Usually, a sample of items of interest is subjected to stresses and environmental conditions that characterize the normal operating conditions. During the life-test, successive times to failure are recorded and lifetime data are collected. Life-testing is useful in many industrial environments, including the automobile, materials, telecommunications, and electronics industries.

There are different kinds of life-testing experiments that can be applied for different purposes. …


Contributions To Causal Inference In Observational Studies, Jenny Park, Daniel F. Heitjan, Christy Boling Turer May 2023

Statistical Science Theses and Dissertations

The electronic health record (EHR) is a digital version of the patient chart. All clinically relevant patient information can be accessed from the EHR by professionals involved in the patient’s care. For researchers, the EHR is a rich, convenient source of data for addressing a vast range of medical research questions.

In observational studies with EHR data, it is common to define the treatment/exposure status as a binary indicator reflecting whether the patient was documented to receive a particular medication or procedure. The outcome can be any type of information on patient status documented in the EHR after the treatment has …


Empirical Likelihood Ratio Tests For Homogeneity Of Distributions Of Component Lifetimes From System Lifetime Data With Known System Structures, Jingjing Qu May 2023

Statistical Science Theses and Dissertations

In system reliability, practitioners may be interested in testing the homogeneity of the component lifetime distributions based on system lifetimes from multiple data sources for various reasons, such as identifying the component supplier that provides the most reliable components.

In the first part of the dissertation, we develop distribution-free hypothesis testing procedures for the homogeneity of the component lifetime distributions based on system lifetime data when the system structures are known. Several nonparametric testing statistics based on the empirical likelihood method are proposed for testing the homogeneity of two or more component lifetime distributions. The computational approaches to obtain the …


Development Of Bayesian Hierarchical Methods Involving Meta-Analysis, Jackson Barth May 2023

Statistical Science Theses and Dissertations

When conducting statistical analysis in the Bayesian paradigm, the most critical decision made by the researcher is the identification of a prior distribution for a parameter. Despite the mathematical soundness of the Bayesian approach, a wrongly specified prior can lead to biased and incorrect results. To avoid this, prior distributions should be based on real data, which are easily accessible in the "big data" era. This dissertation explores two applications of Bayesian hierarchical modelling that incorporate information obtained from a meta-analysis.

The first of these applications is in the normalization of genomics data, specifically for NanoString nCounter datasets. A meta-analysis …


Optimizing Tumor Xenograft Experiments Using Bayesian Linear And Nonlinear Mixed Modelling And Reinforcement Learning, Mary Lena Bleile May 2023

Statistical Science Theses and Dissertations

Tumor xenograft experiments are a popular tool of cancer biology research. In a typical such experiment, one implants a set of animals with an aliquot of the human tumor of interest, applies various treatments of interest, and observes the subsequent response. Efficient analysis of the data from these experiments is therefore of utmost importance. This dissertation proposes three methods for optimizing cancer treatment and data analysis in the tumor xenograft context. The first of these is applicable to tumor xenograft experiments in general, and the second two seek to optimize the combination of radiotherapy with immunotherapy in the tumor xenograft …


Influence Diagnostics For Generalized Estimating Equations Applied To Correlated Categorical Data, Louis Vazquez Apr 2023

Statistical Science Theses and Dissertations

Influence diagnostics in regression analysis allow analysts to identify observations that have a strong influence on model fitted probabilities and parameter estimates. The most common influence diagnostics, such as Cook’s Distance for linear regression, are based on a deletion approach where the results of a model with and without observations of interest are compared. Here, deletion-based influence diagnostics are proposed for generalized estimating equations (GEE) for correlated, or clustered, nominal multinomial responses. The proposed influence diagnostics focus on GEEs with the baseline-category logit link function and a local odds ratio parameterization of the association structure. Formulas for both observation- and …


Bayesian Methods For Random-Effects Meta-Analysis Of Rare Binary Events In Biomedical Research, Ming Zhang Apr 2023

Statistical Science Theses and Dissertations

Rare binary events data arise frequently in medical research. Due to lack of statistical power in individual studies involving such data, meta-analysis has become an increasingly important tool for combining results from multiple independent studies. However, traditional meta-analysis methods often report severely biased estimates in such rare-event settings. Moreover, many rely on models assuming a pre-specified direction for variability between control and treatment groups for mathematical convenience, which may be violated in practice. In Chapter 1, based on a flexible random-effects model that removes the assumption about the direction, we propose new Bayesian procedures for estimating and testing the overall …


Regression Modeling Of Complex Survival Data Based On Pseudo-Observations, Rong Rong Dec 2022

Statistical Science Theses and Dissertations

The restricted mean survival time (RMST) is a clinically meaningful summary measure in studies with survival outcomes. Statistical methods have been developed for regression analysis of RMST to investigate impacts of covariates on RMST, which is a useful alternative to the Cox regression analysis. However, existing methods for regression modeling of RMST are not applicable to left-truncated right-censored data that arise frequently in prevalent cohort studies, for which the sampling bias due to left truncation and informative censoring induced by the prevalent sampling scheme must be properly addressed. Meanwhile, statistical methods have been developed for regression modeling of the cumulative …


Dynamic Prediction For Alternating Recurrent Events Using A Semiparametric Joint Frailty Model, Jaehyeon Yun Aug 2022

Statistical Science Theses and Dissertations

Alternating recurrent events data arise commonly in health research; examples include hospital admissions and discharges of diabetes patients; exacerbations and remissions of chronic bronchitis; and quitting and restarting smoking. Recent work has involved formulating and estimating joint models for the recurrent event times considering non-negligible event durations. However, prediction models for transition between recurrent events are lacking. We consider the development and evaluation of methods for predicting future events within these models. Specifically, we propose a tool for dynamically predicting transition between alternating recurrent events in real time. Under a flexible joint frailty model, we derive the predictive probability of …


Compositional Datasets And The Nested Dirichlet Distribution, Bianca Luedeker Jan 2022

Statistical Science Theses and Dissertations

Compositional data is a type of multivariate data where each component of a vector lies between 0 and 1 and the sum of the components is 1. For example, the proportion of time that each of 7 mice spends in one of four quadrants of a circular water maze is between 0 and 1, and the total proportion of time spent in the maze is 1. If there are two sets of mice, one set of normal mice and one set of cognitively impaired mice, the experiment has a two-sample design. Such data is frequently analyzed incorrectly by comparing …
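
As a minimal illustration of the constraint described in this abstract (each component between 0 and 1, all components summing to 1), the sketch below rescales raw times into a composition. The function name `closure` and the sample times are hypothetical and are not taken from the dissertation.

```python
def closure(raw_amounts):
    """Scale non-negative raw amounts so that they sum to 1 (a composition)."""
    if any(a < 0 for a in raw_amounts):
        raise ValueError("raw amounts must be non-negative")
    total = sum(raw_amounts)
    if total <= 0:
        raise ValueError("at least one raw amount must be positive")
    return [a / total for a in raw_amounts]


# Hypothetical seconds that one mouse spent in each of four maze quadrants.
composition = closure([15, 30, 45, 30])
print(composition)
```

Because the components are tied together by the sum-to-one constraint, they cannot vary independently, which is one reason such data need dedicated models.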


Differential Methods In Modern Biological Data Analysis, Micah Thornton Dec 2021

Statistical Science Theses and Dissertations

Analysis of biological data for differentiation of organisms/cells within and across species or even the same organism is important to a wide variety of applications. This work considers three different biological data sets at the genome, proteome, and epigenome levels: respectively, DNA sequences, glycosylation data, and DNA methylation. We explore some statistical modeling approaches for handling these modern datasets, and provide a relevant set of experiments for explanation and illustration.

First, genomic Fourier coefficients, which capture information about the harmonics of genetic sequences in terms of nucleotide pattern recurrence, are investigated as summary metrics for medium-sized virus genomes from …


Exact Inference For Meta-Analysis Of Rare Events And Its Application In Human Genetics, Yanqiu Shao Dec 2021

Statistical Science Theses and Dissertations

Meta-analysis is a statistical approach that integrates data from multiple studies. By aggregating information, it enhances the power to detect the effects of interest and provides an estimate of the effect size with both accuracy and precision. Both fixed-effect and random-effect models are developed and widely used in biomedical research including clinical trials and genomic studies. In the case of rare events data, conventional meta-analysis methods that rely on large sample approximation may not be able to make reliable inferences. There have been various approaches proposed to deal with this situation, in particular, rare binary adverse events in clinical studies. …
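
For orientation, the conventional fixed-effect approach mentioned in this abstract can be sketched as inverse-variance pooling. This is a generic textbook illustration, not the exact inference method developed in the dissertation; the function name `fixed_effect_pool` and the input numbers are hypothetical.

```python
def fixed_effect_pool(estimates, variances):
    """Inverse-variance weighted pooled estimate and its variance."""
    if not estimates or len(estimates) != len(variances):
        raise ValueError("need one positive variance per study estimate")
    if any(v <= 0 for v in variances):
        raise ValueError("variances must be positive")
    weights = [1.0 / v for v in variances]  # more precise studies weigh more
    total_weight = sum(weights)
    pooled = sum(w * e for w, e in zip(weights, estimates)) / total_weight
    return pooled, 1.0 / total_weight


# Hypothetical effect estimates and variances from two studies.
pooled, pooled_var = fixed_effect_pool([0.2, 0.4], [0.01, 0.04])
print(pooled, pooled_var)
```

The variance formula relies on a large-sample approximation, which is the assumption the abstract says may be unreliable for rare events.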


Ultra-High Dimensional Bayesian Variable Selection With Lasso-Type Priors, Can Xu Oct 2021

Statistical Science Theses and Dissertations

With the rapid development of new data collection and acquisition techniques, high-dimensional data have emerged from various fields. Consequently, new variable selection methods, especially for ultra-high dimensional problems, are in demand.

The first part of this dissertation focuses on developing a new Bayesian variable selection method for a differential expression analysis using raw NanoString nCounter data. The medium-throughput mRNA abundance platform NanoString nCounter has gained great popularity in the past decade, due to its high sensitivity and technical reproducibility as well as remarkable applicability to ubiquitous formalin fixed paraffin embedded (FFPE) tissue samples. Based on RCRnorm developed for normalizing NanoString nCounter …


Modified Degradation Process Models And Statistical Methods For Assessing Robustness And Reliability Of Complex Networks, Yuzhou Chen Aug 2021

Statistical Science Theses and Dissertations

In this thesis, we develop a novel stochastic modeling approach based on multiple interdependent topological measures of complex networks. The key engine behind our approach is to evaluate the dynamics of multiple network motifs as descriptors of the underlying network topology. Under a framework of the gamma degradation model, we develop a formal statistical framework for the analysis of reliability and robustness of a single complex network as well as for assessing differences in reliability properties exhibited by two different networks. We validate the proposed methodology with Monte Carlo simulation studies and illustrate the utility of the proposed approach by …


Bayesian Statistical Modeling Of Metagenomics Sequencing Data, Shuang Jiang Aug 2021

Statistical Science Theses and Dissertations

Microbiome count data are high-dimensional and usually suffer from uneven sampling depth, over-dispersion, and zero-inflation. In this thesis, we develop specialized analytical models for analyzing such count data. In Chapter 2, I develop a bi-level Bayesian hierarchical framework for microbiome differential abundance analysis. The bottom level is a multivariate count-generating process that links the observed counts to their latent normalized abundances. The top level is a mixture of Gaussian distributions with a feature selection scheme for differential abundance analysis. A simulation study on both simulated and synthetic data is conducted. A colorectal cancer case study demonstrates that a resulting diagnostic …


Estimation Of Parameters Of Gamma And Generalized Gamma Distributions Based On Censored Experimental Data, Xiangwen Shang Aug 2021

Statistical Science Theses and Dissertations

In time-to-event data analysis, censoring is one of the unique features that restricts our ability to observe the time-to-events and poses difficulties for statistical analysis. Censoring occurs when the exact time-to-event cannot be observed for some or all observations. In this thesis, we study the parameter estimation methods for a two-parameter gamma distribution and a three-parameter generalized gamma distribution based on different kinds of censored data arising from life-testing experiments.

We first study the parameter estimation of a three-parameter generalized gamma distribution based on left-truncated and right-censored data. It is well known that the maximum likelihood estimates of the parameters …
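
To make the idea of censoring concrete, here is a minimal sketch of maximum likelihood estimation from right-censored data. It deliberately swaps in the one-parameter exponential distribution, whose estimate has a closed form, in place of the gamma and generalized gamma models studied in the thesis; the function name and the data are hypothetical.

```python
def exponential_rate_mle(times, observed):
    """MLE of an exponential failure rate from right-censored data.

    times[i] is the failure time if observed[i] is True, otherwise the
    time at which unit i was censored (still working when last seen).
    """
    if len(times) != len(observed):
        raise ValueError("times and observed must have the same length")
    failures = sum(1 for flag in observed if flag)
    total_time_on_test = sum(times)
    if failures == 0 or total_time_on_test <= 0:
        raise ValueError("need at least one failure and positive total time")
    # Censored units add exposure time but no failure to the likelihood.
    return failures / total_time_on_test


# Hypothetical life test: the third unit was still working at time 5.
rate = exponential_rate_mle([2.0, 3.0, 5.0, 10.0], [True, True, False, True])
print(rate)
```

Dropping the censored unit, or treating its censoring time as a failure time, would both bias the estimate, which is why censored observations need their own likelihood contribution.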


Bayesian Semi-Supervised Keyphrase Extraction And Jackknife Empirical Likelihood For Assessing Heterogeneity In Meta-Analysis, Guanshen Wang Dec 2020

Statistical Science Theses and Dissertations

This dissertation investigates: (1) A Bayesian Semi-supervised Approach to Keyphrase Extraction with Only Positive and Unlabeled Data, (2) Jackknife Empirical Likelihood Confidence Intervals for Assessing Heterogeneity in Meta-analysis of Rare Binary Events.

In the big data era, people are blessed with a huge amount of information. However, the availability of information may also pose great challenges. One big challenge is how to extract useful yet succinct information in an automated fashion. As one of the first few efforts, keyphrase extraction methods summarize an article by identifying a list of keyphrases. Many existing keyphrase extraction methods focus on the unsupervised setting, …


Examining Multiple Imputation For Measurement Error Correction In Count Data With Excess Zeros, Shalima Zalsha Dec 2020

Statistical Science Theses and Dissertations

Measurement error and missing data are two common problems in wildlife population surveys. These data are collected from the environment and may be missing or measured with error when the observer’s ability to see the animal is obscured. Methods such as video transects for estimating red snapper abundance and aerial surveys for estimating moose population sizes are highly affected by these problems since total abundance will be underestimated if missing/mismeasured counts are ignored. We shall refer to this problem as visibility bias; it occurs when the true counts are observed when visibility is high, partially observed when visibility is low …


Integrating Different Data Sources For Estimation Of Total With Unknown Population Size, Zhaoce Liu Dec 2020

Statistical Science Theses and Dissertations

Probability sampling has served as the gold standard in survey practice for many decades. However, as many new data collection methods become available, it is possible to improve the quality and efficiency of traditional survey practices by integrating different sample sources. Web-based surveys from so-called opt-in panels are one type of nonprobability sample that has become popular in recent years. They often come with large sample sizes to yield efficient estimates, but selection bias may compromise the generalizability of results to the broader population.

Our motivating example is a survey conducted by the National Marine Fisheries Service (NMFS), which collects data to …


Statistical Modeling Of High-Throughput Sequencing Data And Spatially Resolved Transcriptomic Data, Shen Yin Dec 2020

Statistical Science Theses and Dissertations

Recent studies have shown that RNA sequencing (RNA-seq) can be used to measure mRNA of sufficient quality extracted from Formalin-Fixed Paraffin-Embedded (FFPE) tissues to provide whole-genome transcriptome analysis. However, little attention has been given to the normalization of FFPE RNA-seq data. In Chapters 1 and 2, we propose a new normalization method, labeled MIXnorm, and its simplified version SMIXnorm, for FFPE RNA-seq data. MIXnorm relies on a two-component mixture model, which models non-expressed genes by zero-inflated Poisson distributions and models expressed genes by truncated normal distributions. To obtain maximum likelihood estimates, we develop a nested EM algorithm, in which closed-form …


Improved Statistical Methods For Time-Series And Lifetime Data, Xiaojie Zhu Dec 2020

Statistical Science Theses and Dissertations

In this dissertation, improved statistical methods for time-series and lifetime data are developed. First, an improved trend test for time series data is presented. Then, robust parametric estimation methods based on system lifetime data with known system signatures are developed.

In the first part of this dissertation, we consider a test for the monotonic trend in time series data proposed by Brillinger (1989). It has been shown that when there are highly correlated residuals or short record lengths, Brillinger’s test procedure tends to have a significance level much higher than the nominal level. This could be related to the discrepancy between …


Causal Inference And Prediction On Observational Data With Survival Outcomes, Xiaofei Chen Jul 2020

Statistical Science Theses and Dissertations

Infants with hypoplastic left heart syndrome require an initial Norwood operation, followed some months later by a stage 2 palliation (S2P). The timing of S2P is critical for the operation’s success and the infant’s survival, but the optimal timing, if one exists, is unknown. We attempt to estimate the optimal timing of S2P by analyzing data from the Single Ventricle Reconstruction Trial (SVRT), which randomized patients between two different types of Norwood procedure. In the SVRT, the timing of the S2P was chosen by the medical team; thus with respect to this exposure, the trial constitutes an observational study, and …


Statistical Models And Analysis Of Univariate And Multivariate Degradation Data, Lochana Palayangoda May 2020

Statistical Science Theses and Dissertations

For degradation data in reliability analysis, estimation of the first-passage time (FPT) distribution to a threshold provides valuable information on reliability characteristics. Recently, Balakrishnan and Qin (2019; Applied Stochastic Models in Business and Industry, 35:571-590) studied a nonparametric method to approximate the FPT distribution of such degradation processes if the underlying process type is unknown. In this thesis, we propose improved techniques based on saddlepoint approximation, which improve upon their suggested methods. Numerical examples and Monte Carlo simulation studies are used to illustrate the advantages of the proposed techniques. Limitations of the improved techniques are discussed and some possible solutions …


Sensitivity Analysis For Incomplete Data And Causal Inference, Heng Chen May 2020

Statistical Science Theses and Dissertations

In this dissertation, we explore sensitivity analyses under three different types of incomplete data problems: missing outcomes, missing outcomes and missing predictors, and potential outcomes in the Rubin causal model (RCM). The first sensitivity analysis is conducted for the missing completely at random (MCAR) assumption in frequentist inference; the second is conducted for the missing at random (MAR) assumption in likelihood inference; the third is conducted for a novel assumption, the "sixth assumption", proposed for the robustness of the instrumental variable estimand in causal inference.


Inference Of Heterogeneity In Meta-Analysis Of Rare Binary Events And Rss-Structured Cluster Randomized Studies, Chiyu Zhang Dec 2019

Statistical Science Theses and Dissertations

This dissertation contains two topics: (1) A Comparative Study of Statistical Methods for Quantifying and Testing Between-study Heterogeneity in Meta-analysis with Focus on Rare Binary Events; (2) Estimation of Variances in Cluster Randomized Designs Using Ranked Set Sampling.

Meta-analysis, the statistical procedure for combining results from multiple studies, has been widely used in medical research to evaluate intervention efficacy and safety. In many practical situations, the variation of treatment effects among the collected studies, often measured by the heterogeneity parameter, may exist and can greatly affect the inference about effect sizes. Comparative studies have been done for only one or …


Sample Size Calculation Of Clinical Trials With Correlated Outcomes, Dateng Li Aug 2019

Statistical Science Theses and Dissertations

In this thesis, we investigate sample size calculation for three kinds of clinical trials: (1) randomized controlled trials (RCTs) with longitudinal count outcomes; (2) cluster randomized trials (CRTs) with count outcomes; and (3) CRTs with multiple binary co-primary endpoints.


Clinical Trial Design And Analysis, Shuang Li Aug 2019

Statistical Science Theses and Dissertations

Clinical trials are experiments conducted on humans to compare the effects of certain interventions. In early-stage trials, a smaller number of patients is enrolled to obtain preliminary information on safety and efficacy. In late-stage trials, a larger number of patients is randomized to further confirm efficacy and safety.

In Chapter 2, we propose a family of designs for phase I oncology trials. In these trials, oncologists assign different patients at a varying range of dose levels to find the dose that gives the highest acceptable rate of dose-limiting toxicities, which will be the recommended dose for phase II trials. Our proposed …