Open Access. Powered by Scholars. Published by Universities.®

Physical Sciences and Mathematics Commons

Full-Text Articles in Physical Sciences and Mathematics

A Review Of Recent Gene Expression-Based And DNA Methylation-Based Mathematical Cell Type Deconvolution Methods, Chenxiao Tian Aug 2023

Arts & Sciences Electronic Theses and Dissertations

In recent years, many cell type deconvolution methods based on DNA methylation data and gene expression data have been developed. Each of these two approaches has its own advantages and disadvantages. For example, the data source for DNA methylation-based methods is usually more stable than gene expression, and DNA methylation is easier to measure in formalin-fixed paraffin-embedded (FFPE) tissues, while some gene expression data, such as scRNA-seq data, are usually costly and complex to obtain. On the other hand, many more gene expression-based deconvolution methods are currently available than DNA methylation-based deconvolution methods, which leads to DNA methylation-based methods in many cases can learn …


Effects Of Functional Network Model Definition On Biomarker Outcome Prediction, Xinyang Feng May 2023

Arts & Sciences Electronic Theses and Dissertations

Machine learning (ML) models are widely used to investigate the human connectome and to predict and understand behavior, emotion, and cognition. Prior research has organized pediatric connectome data using adult functional network models. However, this assumes that adult functional network models are appropriate and useful for predicting developmental outcomes from pediatric connectome data. We hypothesize that the application of adult brain network models could result in poor model fit, limiting the generalizability of results. Here, we test whether prediction of biological age is improved by concordant brain network models matching underlying functional connectome data. To quantify the difference in age …


Dealing With Dimensionality: Problems And Techniques In High-Dimensional Statistics, Cezareo Rodriguez Dec 2022

Arts & Sciences Electronic Theses and Dissertations

In modern data analysis, problems involving high-dimensional data with more variables than subjects are increasingly common. Two such cases are mediation analysis and distributed optimization. In Chapter 2 we start with an overview of high-dimensional statistics and mediation analysis. In Chapter 3 we motivate and prove properties for a new marginal screening procedure for performing high-dimensional mediation analysis. This screening procedure is shown via simulation to perform better than benchmark approaches and is applied to a DNA methylation study. In Chapter 4 we construct a cryptosystem that accurately performs distributed penalized quantile regression in the high-dimensional setting …


Kernel Estimation Of Spot Volatility And Its Application In Volatility Functional Estimation, Bei Wu Dec 2022

Arts & Sciences Electronic Theses and Dissertations

Itô semimartingale models for the dynamics of asset returns have been widely studied in financial econometrics. A key component of the model, spot volatility, plays a crucial role in option pricing, portfolio management, and financial risk assessment. In this dissertation, we consider three problems related to the estimation of spot volatility using high-frequency asset returns. We first revisit the problem of estimating the spot volatility of an Itô semimartingale using a kernel estimator. We prove a Central Limit Theorem with an optimal convergence rate for a general two-sided kernel under quite mild assumptions, which includes leverage effects and jumps of …


Contribution To Data Science: Time Series, Uncertainty Quantification And Applications, Dhrubajyoti Ghosh Dec 2022

Arts & Sciences Electronic Theses and Dissertations

Time series analysis is an essential tool in modern statistical analysis, with a myriad of real data problems having temporal components that need to be studied to gain a better understanding of the temporal dependence structure in the data. For example, in the stock market, it is of significant importance to identify the ups and downs of stock prices, for which time series analysis is crucial. Most of the existing literature on time series deals with linear time series, or assumes Gaussianity. However, there are multiple instances where the time series shows nonlinear trends, or when the …


Dataset Evaluation For Data Trading Using Expected Loss And Homomorphic Encryption, Minsung Joo May 2022

Senior Honors Papers / Undergraduate Theses

Supervised machine learning suffers from the "garbage-in garbage-out" phenomenon where the performance of a model is limited by the quality of the data. While a myriad of data is collected every second, there is no general rigorous method of evaluating the quality of a given dataset. This hinders fair pricing of data in scenarios where a buyer may look to buy data for use with machine learning. In this work, I propose using the expected loss corresponding to a dataset as a measure of its quality, relying on Bayesian methods for uncertainty quantification. Furthermore, I present a secure multi-party computation …


Association Of Structural Variation (SV) With Cardiometabolic Traits In Finns, Lei Chen Aug 2021

Arts & Sciences Electronic Theses and Dissertations

Cardiovascular diseases (CVDs) are known to be associated with a variety of quantitative risk factors such as cholesterol, metabolites, and insulin. Understanding the genetic basis of these quantitative traits can shed light on the etiology, prevention, diagnosis, and treatment of disease. However, most prior trait-mapping studies have focused on single nucleotide variants (SNVs) and indels, with the contribution of structural variation (SV) remaining unknown. In this thesis, we present the results of a study examining genetic association between SVs and cardiometabolic traits in the Finnish population. In the first chapter, we used sensitive methods to identify and genotype 129,166 high-confidence …


Market Making In A Limit Order Book: Classical Optimal Control And Reinforcement Learning Approaches, Chuyi Yu Aug 2021

Arts & Sciences Electronic Theses and Dissertations

Over the last decade, algorithmic trading has become one of the most significant developments in electronic security markets. Several types of problems and practices have been studied, such as optimal execution, market making, statistical arbitrage, and latency arbitrage. Among these, high-frequency market making plays a crucial role since it provides large liquidity to the market, which makes trading and investing cheaper for other market participants, and also creates sizable profits for high-frequency market makers (HFM) from the large quantity of round-trip executions involved in such practices. In this thesis, we discuss two approaches to solve the high-frequency market …


Smooth ICA Under Time Pattern Assumptions, Jiayi Fu Aug 2021

Arts & Sciences Electronic Theses and Dissertations

Independent component analysis (ICA) is widely used in different areas. As traditional ICA models make no assumptions on time pattern, they do not take time domain information into consideration. In this thesis, we introduced new assumptions that allow local dependence over time, and we built smooth ICA models to utilize the smoothness information for source signals. Based on the local dependence assumptions, constrained optimization problems with smoothing penalty were discussed. Then we introduced smooth ICA estimators and estimating equations. Under local dependence assumptions, we gave proofs about the consistency and asymptotic normality of these estimators. We derived the Newton iterative …


Adaptive Optimal Market Making Strategies With Inventory Liquidation Cost, Yi Zhang May 2021

Arts & Sciences Electronic Theses and Dissertations

Along the lines of the paper \cite{zoe}, we find a general form of the optimal market making strategy for a high-frequency market maker (HFM) in a discrete-time Limit Order Book (LOB) model. Unlike \cite{zoe}, the optimal market making strategy is adaptive depending on the arrival of Market Order (MO) in the previous time intervals. We provide a method to make each placement of Limit Orders (LO) dependent on previous information in the same trading day and prove the admissibility of the optimal market making strategy under some general assumptions. Empirical study shows the adaptive optimal strategies outperform the non-adaptive strategy …


Machine Learning Morphisms: A Framework For Designing And Analyzing Machine Learning Workflows, Applied To Separability, Error Bounds, And 30-Day Hospital Readmissions, Eric Zenon Cawi Jan 2021

McKelvey School of Engineering Theses & Dissertations

A machine learning workflow is the sequence of tasks necessary to implement a machine learning application, including data collection, preprocessing, feature engineering, exploratory analysis, and model training/selection. In this dissertation we propose the Machine Learning Morphism (MLM) as a mathematical framework to describe the tasks in a workflow. The MLM is a tuple consisting of: Input Space, Output Space, Learning Morphism, Parameter Prior, Empirical Risk Function. This contains the information necessary to learn the parameters of the learning morphism, which represents a workflow task. In chapter 1, we give a short review of typical tasks present in a workflow, as …


Genetics Of Pediatric Musculoskeletal Disorders, Lilian Antunes Jan 2021

Arts & Sciences Electronic Theses and Dissertations

Pediatric musculoskeletal disorders are an extremely broad category of diseases that are often inherited. While individually rare, collectively these disorders are common, affecting around 3% of live births in the US. Despite the mounting clinical and molecular evidence for a genetic etiology, the cause for many patients with pediatric musculoskeletal disorders remains largely unknown. Major challenges in rare pediatric diseases include recruiting large numbers of patients and determining the significance and functional impacts of variants associated with disease within individuals or families. Whole exome sequencing (WES) is a powerful tool to identify coding variants that are associated with rare pediatric …


Wavelet Coherence Analysis With An Application Of Brain Images, Yiqian Fang Aug 2020

Arts & Sciences Electronic Theses and Dissertations

Wavelet analysis has become an emerging method in a wide range of applications with non-stationary data. In this work, we apply wavelets to tackle the problem of estimating dynamic association in a collection of multivariate non-stationary time series. Coherence is a common metric for linear dependence across signals. However, it assumes static dependence and does not sufficiently model many biological processes with time-evolving dependence structures. We explore continuous wavelet analysis for modeling and estimating such dynamic dependence under the replicated multivariate time series settings. Wavelet transformation provides a decomposition of signals that localizes in both time and frequency domains, hence …


Multi-Omics Integration For Gene Fusion Discovery And Somatic Mutation Haplotyping In Cancer, Steven Mason Foltz May 2020

Arts & Sciences Electronic Theses and Dissertations

Cancer is a disease caused by changes to the genome and dysregulation of gene expression. Among many types of mutations, including point mutations, small insertions and deletions, large scale structural variants, and copy number changes, gene fusions are another category of genomic and transcriptomic alteration that can lead to cancer and which can serve as therapeutic targets. We studied gene fusion events using data from The Cancer Genome Atlas, including over 9,000 patients from 33 cancer types, finding patterns of gene fusion events and dysregulation of gene expression within and across cancer types. With data from the CoMMpass study (Multiple …


Bayesian Posterior Inference And LAN For Lévy Models Under High-Frequency Data, Qi Wang May 2020

Arts & Sciences Electronic Theses and Dissertations

Parameter estimation and inference for Lévy models under high-frequency data has been an exciting and important task in the field of financial mathematics and has been found practically useful when analyzing real financial data. One feature of Lévy models is the allowance of jumps to model the abrupt changes sometimes observed in the market. In this thesis, we discuss some problems related to the statistical inference of Lévy models based on high-frequency data, emphasizing the presence of jumps. The first problem we consider focuses on the estimation of the volatility, which is critical to measure and control the …


Bayesian Variable Selection And Post-Selection Inference, Qiyiwen Zhang May 2020

Arts & Sciences Electronic Theses and Dissertations

In this dissertation, we first develop a novel perspective to compare Bayesian variable selection procedures in terms of their selection criteria as well as their finite-sample properties. Secondly, we investigate Bayesian post-selection inference in two types of selection problems: linear regression and population selection. We will demonstrate that both inference problems are susceptible to selection effects since the selection procedure is data-dependent. Before comparing Bayesian variable selection procedures, we first classify the current Bayesian variable selection procedures into two classes: those with selection criteria defined on the space of candidate models, and those with selection criteria not explicitly formulated on …


Predicting Disease Progression Using Deep Recurrent Neural Networks And Longitudinal Electronic Health Record Data, Seunghwan Kim May 2020

McKelvey School of Engineering Theses & Dissertations

Electronic Health Records (EHR) are widely adopted and used throughout healthcare systems and are able to collect and store longitudinal information data that can be used to describe patient phenotypes. From the underlying data structures used in the EHR, discrete data can be extracted and analyzed to improve patient care and outcomes via tasks such as risk stratification and prospective disease management. Temporality in EHR is innately present given the nature of these data, however, and traditional classification models are limited in this context by the cross-sectional nature of training and prediction processes. Finding temporal patterns in EHR is …


Variational Inference For Quantile Regression, Bufei Guo May 2019

Arts & Sciences Electronic Theses and Dissertations

Quantile regression (QR) (Koenker and Bassett, 1978) is an alternative to classic linear regression with extensive applications in many fields. This thesis studies Bayesian quantile regression (Yu and Moyeed, 2001) using variational inference, which is one of the alternative methods to Markov chain Monte Carlo (MCMC) for approximating intractable posterior distributions. Lasso regularization has been shown to be effective in improving the accuracy of quantile regression (Li and Zhu, 2008). This thesis develops variational inference for quantile regression and regularized quantile regression with the lasso penalty. Simulation results show that variational inference is a computationally more efficient alternative …


Essays On Econometrics And Rational Choice, Junnan He May 2019

Arts & Sciences Electronic Theses and Dissertations

Decision and choice theory is a topic of interest in both econometrics and microeconomic theory. We contribute to the theory of decision under both contexts, that is, the theory of model selection in econometrics, and the theory of rational decision in microeconomics.

There is a long-lasting theoretical interest in model selection. More recently, research on sparse estimators, a class of estimation methods that select and estimate important parameters simultaneously, has been the central focus of model selection. The methods become especially relevant when the problem is of a high-dimensional nature. Theoretically, sparse methods can perform well when the true data generating …


Quantifying Lithochemical Diversity Of Martian Materials Using Hierarchical Clustering And A Similarity Index For Classification, Michael Conner Bouchard May 2019

Arts & Sciences Electronic Theses and Dissertations

We are currently living in the golden age of robotic exploration of Mars, with a continued robotic presence there since 1997. Next to Earth, Mars is the planet about which we have gathered the most geologic information. Unlike Earth, Mars does not appear to have plate tectonics, and the planet’s primary and secondary crust is dominated by basalts. Understanding the compositional diversity of the materials that make up the martian crust will give us a better insight into the geologic processes that formed the planet and its subsequent evolution. One large and growing source of martian surface compositions is the …


Mechanics Of Phenotypic Aging Trajectories In C. Elegans And Humans, William Zhang May 2019

Arts & Sciences Electronic Theses and Dissertations

Overall, my dissertation integrates longitudinal measurements of physiology to investigate the aging process. In the first half, I examine the surprising and largely unexplained degree of variation in lifespan within even homogeneous populations. I sought to understand how physiological aging differs between long- and short-lived individuals within a population of genetically identical C. elegans reared in a homogeneous environment. Using a novel culture apparatus, I longitudinally monitored aspects of aging physiology across a large population of isolated individuals. Aggregating several measures into an overall estimate of senescence, I find that long- and short-lived individuals start adulthood on an equal physiological …


Topics In Complex And Large-Scale Data Analysis, Guanshengrui Hao May 2019

Arts & Sciences Electronic Theses and Dissertations

The past few decades have witnessed the rapid development of modern technologies. As a result, data collected from modern technologies are evolving toward more complicated structure and larger scale, driving traditional data analysis methods to develop and adapt. In this dissertation, we study three statistical issues arising in data with complicated structure and/or large scale. In Chapter 2, we propose a Bayesian framework via exponential random graph models (ERGM) to estimate the model parameters and network structures for networks with measurement errors. In Chapter 3, we design a novel network sampling algorithm for large-scale networks with community …


Grammar And Variation: Understanding How Cis-Regulatory Information Is Encoded In Mammalian Genomes, Dana Michele King Dec 2018

Arts & Sciences Electronic Theses and Dissertations

Understanding how genotype leads to phenotype is key to understanding both the development and dysfunction of complex organisms. In the context of regulating the gene expression patterns that contribute to cell identity and function, the goal of my thesis research is to understand how changes in genome sequence may impact gene expression by determining how sequence features contribute to regulatory potential. To accomplish this goal, I first leveraged the key regulatory role of pluripotency transcription factors (TFs) in mouse embryonic stem cells (mESCs) and tested synthetically generated and genomically identified combinations of binding sites for four TFs, OCT4, SOX2, KLF4, …


Different Estimation Methods For The Basic Independent Component Analysis Model, Zhenyi An Dec 2018

Arts & Sciences Electronic Theses and Dissertations

Inspired by the classic cocktail-party problem, the basic Independent Component Analysis (ICA) model was created. What distinguishes ICA from other kinds of analysis is the intrinsic non-Gaussian assumption on the data. Several approaches have been proposed based on maximizing the non-Gaussianity of the data, which is measured by kurtosis, mutual information, and other quantities. With each estimation, we need to optimize functions of expectations of non-quadratic functions, since this gives access to the higher-order statistics of the non-Gaussian part of the data. In this thesis, our goal is to review one of the most efficient estimation methods, …


Generalized Non-Inferential Approach To Modeling Restricted Discrete Choice For The Case Of The Spatial Random Utility, Elena Labzina Aug 2018

Arts & Sciences Electronic Theses and Dissertations

The multinomial logistic regression model (MNL) is a powerful and easily tractable way of measuring the probabilistic impact of input variables on individual categorical choices. Crucially, the standard MNL assumes that all subjects of the study have the same choice sets. However, especially in political science and economics, this condition is frequently violated. Probably the most vivid example of varying choice sets (VCS) is partially contested elections. Furthermore, the MNL implicitly implies the Independence of Irrelevant Alternatives (IIA) assumption by requiring i.i.d. errors, which distinguishes the MNL from the multinomial probit (MNP) and mixed logit (MXL) models. In …


Deep Learning Analysis Of Limit Order Book, Xin Xu May 2018

Arts & Sciences Electronic Theses and Dissertations

In this paper, we build a deep neural network for modeling spatial structure in the limit order book and predict the future best ask or best bid price, based on the ideas of (Sirignano 2016). We propose an intuitive data processing method to approximate data that are not available to us, based only on level I data, which are more widely available. The model is based on the idea that there is local dependence for the best ask or best bid price and the sizes of related orders. First we use logistic regression to show that this approach is reasonable. To show the advantages …


Algorithmic Trading With Prior Information, Xinyi Cai May 2018

Arts & Sciences Electronic Theses and Dissertations

Traders generate profits using strategies that mix market and limit orders. There are different types of traders in the market: some have prior information and can learn from changes in prices to continuously adjust their trading strategies (Informed Traders), some have no prior information but can learn (Uninformed Learners), and some have no prior information and cannot learn (Uninformed Traders). In this thesis. Alvaro C, Sebastian J and Damir K \cite{AL} proposed a model for algorithmic traders to assess the impact of dynamic learning on profit and loss in 2014. The traders can employ the model to decide which …


Variable Selection Via Lasso With High-Dimensional Proteomic Data, Hongxuan Zhai May 2018

Arts & Sciences Electronic Theses and Dissertations

Multiclass classification with high-dimensional data is an applied topic both in statistics and machine learning. The classification procedure could be done in various ways. In this thesis, we review the theory of the Lasso procedure which provides a parameter estimator while simultaneously achieving dimension reduction due to a property of the L1 norm. Lasso with elastic net penalty and sparse group lasso are also reviewed. Our data is high-dimensional proteomic data (iTRAQ ratios) of breast cancer patients with four subtypes of breast cancer. We use the multinomial logistic regression to train our classifier and use the false classification rates obtained …


Distributed Quantile Regression Analysis And A Group Variable Selection Method, Liqun Yu May 2018

Arts & Sciences Electronic Theses and Dissertations

This dissertation develops novel methodologies for distributed quantile regression analysis for big data by utilizing a distributed optimization algorithm called the alternating direction method of multipliers (ADMM). Specifically, we first write the penalized quantile regression into a specific form that can be solved by the ADMM and propose numerical algorithms for solving the ADMM subproblems. This results in the distributed QR-ADMM algorithm. Then, to further reduce the computational time, we formulate the penalized quantile regression into another equivalent ADMM form in which all the subproblems have exact closed-form solutions and hence avoid iterative numerical methods. This results in the single-loop …


Allocating Interventions Based On Counterfactual Predictions: A Case Study On Homelessness Services, Amanda R. Kube May 2018

McKelvey School of Engineering Theses & Dissertations

Modern statistical and machine learning methods are increasingly capable of modeling individual or personalized treatment effects by predicting counterfactual outcomes. These counterfactual predictions could be used to allocate different interventions across populations based on individual characteristics. In many domains, like social services, the availability of possible interventions can be severely resource limited. This thesis considers possible improvements to the allocation of such services in the context of homelessness service provision in a major metropolitan area. Using data from the homeless system, I show potential for substantial predicted benefits in terms of reducing the number of families who experience repeat episodes …