Open Access. Powered by Scholars. Published by Universities.®

Digital Commons Network

Full-Text Articles in Entire DC Network

Bayesian Variational Inference In Keyword Identification And Multiple Instance Classification, Yaofang Hu Aug 2024

Statistical Science Theses and Dissertations

This dissertation investigates (1) Variational Bayesian Semi-supervised Keyword Extraction and (2) Variational Bayesian Multimodal Multiple Instance Classification.

The expansion of textual data, stemming from various sources such as online product reviews and scholarly publications on scientific discoveries, has created a demand for the extraction of succinct yet comprehensive information. As a result, considerable effort has gone into developing novel methodologies for keyword extraction in recent years. Although many methods have been proposed to automatically extract keywords in the contexts of both unsupervised and fully supervised learning, how to effectively use partially observed keywords, such as author-specified keywords, remains an under-explored …


Interpretable Word-Level Sentiment Analysis With Attention-Based Multiple Instance Classification Models, Chenyu Yang Dec 2023

Statistical Science Theses and Dissertations

In this study, our main objective is to tackle the black-box nature of popular machine learning models in sentiment analysis and enhance model interpretability. We aim to gain more insight into the decision-making process of sentiment analysis models, which is often obscure in those complex models. To achieve this goal, we introduce two word-level sentiment analysis models.

The first model is called the attention-based multiple instance classification (AMIC) model. It combines the transparent model structure of multiple instance classification and the self-attention mechanism in deep learning to incorporate the contextual information from documents. As demonstrated by a wine review dataset …


Differentiation Of Human, Dog, And Cat Hair Fibers Using DART TOFMS And Machine Learning, Laura Ahumada, Erin R. McClure-Price, Chad Kwong, Edgard O. Espinoza, John Santerre Dec 2023

SMU Data Science Review

Hair is found in over 90% of crime scenes and has long been analyzed as trace evidence. However, recent reviews of traditional hair fiber analysis techniques, primarily morphological examination, have cast doubt on its reliability. To address these concerns, this study employed machine learning algorithms, specifically Linear Discriminant Analysis (LDA) and Random Forest, on Direct Analysis in Real Time time-of-flight mass spectra collected from human, cat, and dog hair samples. The objective was to develop a chemistry- and statistics-based classification method for unbiased taxonomic identification of hair. The results of the study showed that LDA and Random Forest were highly …


A Prompt Engineering Approach To Creating Automated Commentary For Microsoft Self-Help Documentation Metric Reports Using ChatGPT, Ryan Herrin, Luke Stodgel, Brian Raffety Dec 2023

SMU Data Science Review

Microsoft collects an immense amount of data from the users of its product self-help documentation. Employees use this data to identify performance trends in these self-help articles and measure their impact on business Key Performance Indicators (KPIs). Microsoft uses various tools like Power BI and Python to analyze this data. The problem is that the analysis and findings are summarized manually. Therefore, this research will improve upon the current analysis methods by applying the latest prompt engineering practices and the power of ChatGPT's large language models (LLMs). Using VBA code, Microsoft Excel, and the ChatGPT API as an Excel add-in, this research …


Bayesian Statistical Modeling Of Spatially Resolved Transcriptomics Data, Xi Jiang Oct 2023

Statistical Science Theses and Dissertations

Spatially resolved transcriptomics (SRT) quantifies expression levels at different spatial locations, providing a new and powerful tool for gaining novel biological insights. As experimental technologies improve in both capacity and efficiency, there is a growing demand for new analytical methodologies.

One question in SRT data analysis is how to identify genes whose expression exhibits spatially correlated patterns, called spatially variable (SV) genes. Most current methods to identify SV genes are built upon a geostatistical model with a Gaussian process, which could limit the models' ability to identify complex spatial patterns. In order to overcome this challenge and capture more types …


Traditional Vs Machine Learning Approaches: A Comparison Of Time Series Modeling Methods, Miguel E. Bonilla Jr., Jason McDonald, Tamas Toth, Bivin Sadler Aug 2023

SMU Data Science Review

In recent years, various new Machine Learning and Deep Learning algorithms have been introduced, claiming to offer better performance than traditional statistical approaches when forecasting time series. Studies seeking evidence to support the usage of ML/DL over statistical approaches have been limited to comparing the forecasting performance of univariate, linear time series data. This research compares the performance of traditional statistical-based and ML/DL methods for forecasting multivariate and nonlinear time series.


A Comparison Of Confidence Intervals In State Space Models, Jinyu Du Jul 2023

Statistical Science Theses and Dissertations

This thesis develops general procedures for constructing confidence intervals (CIs) for the error disturbance parameters (standard deviations) and transformations of the error disturbance parameters in time-invariant state space models (SSMs). With only a set of observations, accurately estimating individual error disturbance parameters in the presence of other unknown parameters in an SSM is a very challenging problem. We attempted to construct four different types of confidence intervals (Wald, likelihood ratio, score, and higher-order asymptotic intervals) for both the simple local level model and general time-invariant SSMs. We show that for a simple local level model, both the …


Optimal Experimental Planning Of Reliability Experiments Based On Coherent Systems, Yang Yu Jul 2023

Statistical Science Theses and Dissertations

In industrial engineering and manufacturing, assessing the reliability of a product or system is an important topic. Life-testing and reliability experiments are commonly used reliability assessment methods to gain sound knowledge about product or system lifetime distributions. Usually, a sample of items of interest is subjected to stresses and environmental conditions that characterize the normal operating conditions. During the life-test, successive times to failure are recorded and lifetime data are collected. Life-testing is useful in many industrial environments, including the automobile, materials, telecommunications, and electronics industries.

There are different kinds of life-testing experiments that can be applied for different purposes. …


Empirical Likelihood Ratio Tests For Homogeneity Of Distributions Of Component Lifetimes From System Lifetime Data With Known System Structures, Jingjing Qu May 2023

Statistical Science Theses and Dissertations

In system reliability, practitioners may be interested in testing the homogeneity of the component lifetime distributions based on system lifetimes from multiple data sources for various reasons, such as identifying the component supplier that provides the most reliable components.

In the first part of the dissertation, we develop distribution-free hypothesis testing procedures for the homogeneity of the component lifetime distributions based on system lifetime data when the system structures are known. Several nonparametric testing statistics based on the empirical likelihood method are proposed for testing the homogeneity of two or more component lifetime distributions. The computational approaches to obtain the …


Optimizing Tumor Xenograft Experiments Using Bayesian Linear And Nonlinear Mixed Modelling And Reinforcement Learning, Mary Lena Bleile May 2023

Statistical Science Theses and Dissertations

Tumor xenograft experiments are a popular tool of cancer biology research. In a typical such experiment, one implants a set of animals with an aliquot of the human tumor of interest, applies various treatments of interest, and observes the subsequent response. Efficient analysis of the data from these experiments is therefore of utmost importance. This dissertation proposes three methods for optimizing cancer treatment and data analysis in the tumor xenograft context. The first of these is applicable to tumor xenograft experiments in general, and the second two seek to optimize the combination of radiotherapy with immunotherapy in the tumor xenograft …


Development Of Bayesian Hierarchical Methods Involving Meta-Analysis, Jackson Barth May 2023

Statistical Science Theses and Dissertations

When conducting statistical analysis in the Bayesian paradigm, the most critical decision made by the researcher is the identification of a prior distribution for a parameter. Despite the mathematical soundness of the Bayesian approach, a wrongly specified prior can lead to biased and incorrect results. To avoid this, prior distributions should be based on real data, which are easily accessible in the "big data" era. This dissertation explores two applications of Bayesian hierarchical modelling that incorporate information obtained from a meta-analysis.

The first of these applications is in the normalization of genomics data, specifically for nanostring nCounter datasets. A meta-analysis …


Contributions To Causal Inference In Observational Studies, Jenny Park, Daniel F. Heitjan, Christy Boling Turer May 2023

Statistical Science Theses and Dissertations

The electronic health record (EHR) is a digital version of the patient chart. All clinically relevant patient information can be accessed from the EHR by professionals involved in the patient’s care. For researchers, the EHR is a rich, convenient source for data to address a vast range of medical research questions.

In observational studies with EHR data, it is common to define the treatment/exposure status as a binary indicator reflecting whether the patient was documented to receive a particular medication or procedure. The outcome can be any type of information on patient status documented in the EHR after the treatment has …


Nonconvex Optimization For Statistical Learning With Structured Sparsity, Chengyu Ke Apr 2023

Operations Research and Engineering Management Theses and Dissertations

Sparse learning problems, also known as feature selection problems or variable selection problems, are a popular branch in the field of statistical learning. When faced with a dataset with only a few observations but a large number of features, we are interested in extracting the most useful features automatically by solving an optimization problem. In this dissertation, we start by introducing a novel penalty function as well as an iterative reweighted algorithm to solve the group sparsity problem, a special type of feature selection problem. The penalty function, named group LOG, shows a better ability to recover the ground truth compared to …


Bayesian Methods For Random-Effects Meta-Analysis Of Rare Binary Events In Biomedical Research, Ming Zhang Apr 2023

Statistical Science Theses and Dissertations

Rare binary events data arise frequently in medical research. Due to lack of statistical power in individual studies involving such data, meta-analysis has become an increasingly important tool for combining results from multiple independent studies. However, traditional meta-analysis methods often report severely biased estimates in such rare-event settings. Moreover, many rely on models assuming a pre-specified direction for variability between control and treatment groups for mathematical convenience, which may be violated in practice. In Chapter 1, based on a flexible random-effects model that removes the assumption about the direction, we propose new Bayesian procedures for estimating and testing the overall …


Influence Diagnostics For Generalized Estimating Equations Applied To Correlated Categorical Data, Louis Vazquez Apr 2023

Statistical Science Theses and Dissertations

Influence diagnostics in regression analysis allow analysts to identify observations that have a strong influence on model fitted probabilities and parameter estimates. The most common influence diagnostics, such as Cook’s Distance for linear regression, are based on a deletion approach where the results of a model with and without observations of interest are compared. Here, deletion-based influence diagnostics are proposed for generalized estimating equations (GEE) for correlated, or clustered, nominal multinomial responses. The proposed influence diagnostics focus on GEEs with the baseline-category logit link function and a local odds ratio parameterization of the association structure. Formulas for both observation- and …


Character Evidence As A Conduit For Implicit Bias, Hillel J. Bavli Jan 2023

Faculty Journal Articles and Book Chapters

The Federal Rules of Evidence purport to prohibit character evidence, or evidence regarding a defendant’s past bad acts or propensities offered to suggest that the defendant acted in accordance with a certain character trait on the occasion in question. However, courts regularly admit character evidence through an expanding set of legislative and judicial exceptions that have all but swallowed the rule. In the usual narrative, character evidence is problematic because jurors place excessive weight on it or punish the defendant for past behavior. Lawmakers rely on this narrative when they create exceptions. However, this account arguably misses a highly troublesome …


Regression Modeling Of Complex Survival Data Based On Pseudo-Observations, Rong Rong Dec 2022

Statistical Science Theses and Dissertations

The restricted mean survival time (RMST) is a clinically meaningful summary measure in studies with survival outcomes. Statistical methods have been developed for regression analysis of RMST to investigate impacts of covariates on RMST, which is a useful alternative to the Cox regression analysis. However, existing methods for regression modeling of RMST are not applicable to left-truncated right-censored data that arise frequently in prevalent cohort studies, for which the sampling bias due to left truncation and informative censoring induced by the prevalent sampling scheme must be properly addressed. Meanwhile, statistical methods have been developed for regression modeling of the cumulative …


Study Of Stochastic Market Clearing Problems In Power Systems With High Renewable Integration, Saumya Sakitha Sashrika Ariyarathne Oct 2022

Operations Research and Engineering Management Theses and Dissertations

Integrating large-scale renewable energy resources into the power grid poses several operational and economic problems due to their inherently stochastic nature. The lack of predictability of renewable outputs degrades the power grid’s reliability. Power system operators have recognized the need to account for this uncertainty when making operational decisions and setting electricity prices. In this regard, this dissertation studies three aspects that aid large-scale renewable integration into power systems. 1. We develop a nonparametric change point-based statistical model to generate scenarios that accurately capture the renewable generation stochastic processes; 2. We design new pricing mechanisms derived from alternative stochastic programming …


Dynamic Prediction For Alternating Recurrent Events Using A Semiparametric Joint Frailty Model, Jaehyeon Yun Aug 2022

Statistical Science Theses and Dissertations

Alternating recurrent events data arise commonly in health research; examples include hospital admissions and discharges of diabetes patients; exacerbations and remissions of chronic bronchitis; and quitting and restarting smoking. Recent work has involved formulating and estimating joint models for the recurrent event times considering non-negligible event durations. However, prediction models for transition between recurrent events are lacking. We consider the development and evaluation of methods for predicting future events within these models. Specifically, we propose a tool for dynamically predicting transition between alternating recurrent events in real time. Under a flexible joint frailty model, we derive the predictive probability of …


Equity Of Urban Neighborhood Infrastructure: A Data-Driven Assessment, Zheng Li May 2022

Civil and Environmental Engineering Theses and Dissertations

Neighborhood infrastructure, such as sidewalks, medical facilities, public transit, community gathering places, and tree canopy, provides essential support for safe, healthy, and resilient communities. This thesis proposes, develops, and implements an innovative approach to thoroughly examine the presence and condition of neighborhood infrastructure. It demonstrates the necessity of considering multiple infrastructure types when studying neighborhood infrastructure and its equity. This thesis provides an automated assessment framework as well as case studies among four major metropolitan cities across the United States, which expands the research opportunities for future infrastructure-related research.


Compositional Datasets And The Nested Dirichlet Distribution, Bianca Luedeker Jan 2022

Statistical Science Theses and Dissertations

Compositional data is a type of multivariate data where each component of a vector lies between 0 and 1 and the sum of the components is 1. For example, the proportion of time that each of 7 mice spends in one of four quadrants of a circular water maze is between 0 and 1, and the total proportion of time spent in the maze is 1. If there are two sets of mice, one set of normal mice and one set of cognitively impaired mice, the experiment has a two-sample design. Such data is frequently analyzed incorrectly by comparing …


Differential Methods In Modern Biological Data Analysis, Micah Thornton Dec 2021

Statistical Science Theses and Dissertations

Analysis of biological data for differentiation of organisms/cells within and across species or even the same organism is important to a wide variety of applications. This work considers three different biological data sets at the genome, proteome, and epigenome levels: respectively, DNA sequences, glycosylation data, and DNA methylation. We explore some statistical modeling approaches for handling these modern datasets, and provide a relevant set of experiments for explanation and illustration.

First, genomic Fourier coefficients, which capture information about the harmonics of genetic sequences in terms of nucleotide pattern recurrence, are investigated as summary metrics for medium-sized virus genomes from …


Exact Inference For Meta-Analysis Of Rare Events And Its Application In Human Genetics, Yanqiu Shao Dec 2021

Statistical Science Theses and Dissertations

Meta-analysis is a statistical approach that integrates data from multiple studies. By aggregating information, it enhances the power to detect the effects of interest and provides an estimate of the effect size with both accuracy and precision. Both fixed-effect and random-effect models have been developed and are widely used in biomedical research, including clinical trials and genomic studies. In the case of rare events data, conventional meta-analysis methods that rely on large-sample approximations may not be able to make reliable inferences. Various approaches have been proposed to deal with this situation, in particular for rare binary adverse events in clinical studies. …


Ultra-High Dimensional Bayesian Variable Selection With Lasso-Type Priors, Can Xu Oct 2021

Statistical Science Theses and Dissertations

With the rapid development of new data collection and acquisition techniques, high-dimensional data have emerged from various fields. Consequently, new variable selection methods, especially for ultra-high dimensional problems, are in demand.

The first part of this dissertation focuses on developing a new Bayesian variable selection method for differential expression analysis using raw NanoString nCounter data. The medium-throughput mRNA abundance platform NanoString nCounter has gained great popularity in the past decade, due to its high sensitivity and technical reproducibility as well as its remarkable applicability to ubiquitous formalin-fixed paraffin-embedded (FFPE) tissue samples. Based on RCRnorm developed for normalizing NanoString nCounter …


Bayesian Statistical Modeling Of Metagenomics Sequencing Data, Shuang Jiang Aug 2021

Statistical Science Theses and Dissertations

Microbiome count data are high-dimensional and usually suffer from uneven sampling depth, over-dispersion, and zero-inflation. In this thesis, we develop specialized analytical models for analyzing such count data. In Chapter 2, we develop a bi-level Bayesian hierarchical framework for microbiome differential abundance analysis. The bottom level is a multivariate count-generating process that links the observed counts to their latent normalized abundances. The top level is a mixture of Gaussian distributions with a feature selection scheme for differential abundance analysis. A simulation study on both simulated and synthetic data is conducted. A colorectal cancer case study demonstrates that a resulting diagnostic …


Estimation Of Parameters Of Gamma And Generalized Gamma Distributions Based On Censored Experimental Data, Xiangwen Shang Aug 2021

Statistical Science Theses and Dissertations

In time-to-event data analysis, censoring is one of the unique features that restricts our ability to observe the event times and poses difficulties for statistical analysis. Censoring occurs when the exact time-to-event cannot be observed for some or all observations. In this thesis, we study parameter estimation methods for a two-parameter gamma distribution and a three-parameter generalized gamma distribution based on different kinds of censored data arising from life-testing experiments.

We first study the parameter estimation of a three-parameter generalized gamma distribution based on left-truncated and right-censored data. It is well known that the maximum likelihood estimates of the parameters …


Modified Degradation Process Models And Statistical Methods For Assessing Robustness And Reliability Of Complex Networks, Yuzhou Chen Aug 2021

Statistical Science Theses and Dissertations

In this thesis, we develop a novel stochastic modeling approach based on multiple interdependent topological measures of complex networks. The key engine behind our approach is to evaluate the dynamics of multiple network motifs as descriptors of the underlying network topology. Under a framework of the gamma degradation model, we develop a formal statistical framework for the analysis of reliability and robustness of a single complex network as well as for assessing differences in reliability properties exhibited by two different networks. We validate the proposed methodology with Monte Carlo simulation studies and illustrate the utility of the proposed approach by …


Examining Multiple Imputation For Measurement Error Correction In Count Data With Excess Zeros, Shalima Zalsha Dec 2020

Statistical Science Theses and Dissertations

Measurement error and missing data are two common problems in wildlife population surveys. These data are collected from the environment and may be missing or measured with error when the observer’s ability to see the animal is obscured. Methods such as video transects for estimating red snapper abundance and aerial surveys for estimating moose population sizes are highly affected by these problems since total abundance will be underestimated if missing/mismeasured counts are ignored. We shall refer to this problem as visibility bias; it occurs when the true counts are observed when visibility is high, partially observed when visibility is low …


Integrating Different Data Sources For Estimation Of Total With Unknown Population Size, Zhaoce Liu Dec 2020

Statistical Science Theses and Dissertations

Probability sampling has served as the gold standard in survey practice for many decades. However, as many new data collection methods become available, it is possible to improve the quality and efficiency of traditional survey practices by integrating different sample sources. Web-based surveys from so-called opt-in panels are one type of nonprobability sample that has become popular in recent years. They often come with large sample sizes to yield efficient estimates, but selection bias may compromise the generalizability of results to the broader population.

Our motivating example is a survey conducted by the National Marine Fisheries Service (NMFS), which collects data to …


Statistical Modeling Of High-Throughput Sequencing Data And Spatially Resolved Transcriptomic Data, Shen Yin Dec 2020

Statistical Science Theses and Dissertations

Recent studies have shown that RNA sequencing (RNA-seq) can be used to measure mRNA of sufficient quality extracted from Formalin-Fixed Paraffin-Embedded (FFPE) tissues to provide whole-genome transcriptome analysis. However, little attention has been given to the normalization of FFPE RNA-seq data. In Chapters 1 and 2, we propose a new normalization method, labeled MIXnorm, and its simplified version SMIXnorm, for FFPE RNA-seq data. MIXnorm relies on a two-component mixture model, which models non-expressed genes by zero-inflated Poisson distributions and models expressed genes by truncated normal distributions. To obtain maximum likelihood estimates, we develop a nested EM algorithm, in which closed-form …