Open Access. Powered by Scholars. Published by Universities.®

Physical Sciences and Mathematics Commons

Articles 1 - 30 of 107

Full-Text Articles in Physical Sciences and Mathematics

Integrating Machine Learning Methods For Medical Diagnosis, Jazmin Quezada Dec 2023

Open Access Theses & Dissertations

Abstract: The rapid advancement of machine learning techniques has revolutionized the field of medical diagnosis by offering powerful tools to analyze complex data sets and make accurate predictions. We present a novel approach that integrates machine learning and optimization models to enhance the accuracy of medical diagnoses. Our method focuses on fine-tuning and optimizing the parameters of machine learning algorithms commonly used in medical diagnosis, such as logistic regression, support vector machines, and neural networks. By employing optimization techniques, we systematically explore the parameter space of these algorithms to discover optimal configurations. Moreover, by representing …


Metrics For Comparison Of Complex Networks, Clarissa Reyes Dec 2023

Open Access Theses & Dissertations

Heuristic network statistics are used as a preliminary approach to identify change across networks. In networks where there is known node correspondence (KNC), conventional network comparison methods include taking a norm of the difference matrix, or calculating dissimilarity measures like DeltaCon and cut distance. Since different KNC measures provide varying insight into the network comparison problem, we propose employing Rank Score Characteristic Functions (RSCFs) and the rank-score process as a method for reaching a consensus when ranking quantified change across multiple pairs of networks, which is particularly useful for ranking change across subpopulations or subgraphs. Additionally, we propose a …
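As a rough illustration of the conventional KNC comparison mentioned in this abstract, the sketch below takes the Frobenius norm of the difference between two adjacency matrices on the same node set. The function name `difference_norm` and the toy graphs are hypothetical and are not taken from the thesis; the choice of the Frobenius norm is also an assumption, since the abstract only says "a norm".

```python
import numpy as np

def difference_norm(adj_a, adj_b):
    """Frobenius norm of the difference between two adjacency matrices
    defined on the same, identically ordered node set (hypothetical helper)."""
    a = np.asarray(adj_a, dtype=float)
    b = np.asarray(adj_b, dtype=float)
    if a.shape != b.shape:
        raise ValueError("known node correspondence requires equal shapes")
    return float(np.linalg.norm(a - b, ord="fro"))

# Toy example: a 3-node path versus a 3-node triangle.
path = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
triangle = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
distance = difference_norm(path, triangle)  # the graphs differ in one undirected edge
```

A norm of this kind collapses all edge-level change into a single number, which is one reason several measures may be worth combining.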


Robust Penalized Density Power Divergence Regression With SCAD Penalty For High Dimensional Data Analysis, Maxwell Kwesi Mac-Ocloo Aug 2023

Open Access Theses & Dissertations

Amidst the exponential surge in big data, managing high-dimensional datasets across diverse fields and industries has emerged as a significant challenge. Conventional statistical methods struggle to handle their complexity, making analysis intricate. In response, we've formulated a robust estimator tailored to counter outliers and heavy-tailed errors. Our approach integrates the SCAD penalty into the Density Power Divergence method, effectively reducing insignificant coefficients to zero. This enhances analysis precision and result reliability. We benchmark our robust and penalized model against existing techniques like Huber, Tukey, LASSO, LAD, and LAD-LASSO. Employing both simulated and UCI machine learning repository datasets, we assess method performance …


Single-Index Multinomial Model For Analyzing Crime Data, Kwabena Gyamfi Duodu Aug 2023

Open Access Theses & Dissertations

We develop a flexible single-index multinomial model for analyzing crime data. In addition to the number of crimes reported, the data also includes covariates such as location, time of day, weather, and other demographic factors. We provide an estimation algorithm and develop R code for the single-index multinomial model. Using simulations, we evaluate the performance of the proposed estimation algorithm. When applied to crime data, the single-index multinomial model provides important insights into crime trends and risk variables, assisting in the development of tailored crime prevention programs. Policymakers and law enforcement organizations can use the model's projections to more efficiently allocate …


Robust Mahalanobis K-Means Algorithm In Comparison With Other Existing Clustering Methods., Eleazer Tabi Serebour Aug 2023

Open Access Theses & Dissertations

This study enhances K-means Mahalanobis clustering using Density Power Divergence (DPD) for outlier handling and detection. Through the utilization of simulations and the analysis of real-world data, our approach consistently outperforms standard K-means, Mahalanobis K-means, Fuzzy C-means, and others in clustering datasets with outliers. While our method performs similarly to others on spherical datasets, it ranks second to DBSCAN for arbitrary shapes. We showcase its superiority on real-life datasets (Iris flower and wheat seed), demonstrating resilient outlier identification. By navigating various structures and cluster characteristics, our Modified Mahalanobis K-means method proves adaptable and robust, offering insights into diverse clustering scenarios. …


Comparative Study Of Supervised Classification Techniques With A Modified KNN Algorithm, Noah Owusu Aug 2023

Open Access Theses & Dissertations

The goal of classification is to develop a model that can be used to accurately assign new observations to labeled classes based on the patterns learned from the training data. The K-nearest Neighbors (KNN) algorithm is a popular and widely used algorithm for classification; however, its performance can be adversely affected by the presence of outliers in a dataset. In this study, we modify the existing KNN algorithm to alleviate the effect of outliers in a dataset, thereby improving its performance. We compared the performances of the Modified KNN method and the Existing KNN algorithm …


Comparison Of Different Robust Methods In Linear Regression And Applications In Cardiovascular Data, Jagannath Das May 2023

Open Access Theses & Dissertations

Due to advanced technology and a wide range of data sources, high-dimensional data is available in several fields, including healthcare, bioinformatics, medicine, epidemiology, economics, finance, sociology, and climatology. In those datasets, outliers are generally encountered due to technical errors, heterogeneous sources, or the effect of some confounding variables. As outliers are often difficult to detect in high-dimensional data, the standard approaches may fail to model such data and produce misleading information. In this thesis, we studied Huber and Tukey's M-estimators for linear regression that automatically down-weight outliers and provide a good fit. We also investigated two variable selection methods -- LASSO …


Nonparametric Estimation Of Elliptical Copulas, Panfeng Liang May 2023

Open Access Theses & Dissertations

Elliptical copulas provide flexibility in modeling the dependence structure of a random vector. They are often parameterized with a correlation matrix and a scalar function, called the generator. The estimation of the generator can be challenging because it is a functional parameter. In this dissertation, we provide a rigorous approach to estimating the generator in a Bayesian framework, which is simpler, more robust, and outperforms existing estimation methods in the literature. Based on the framework proposed in this dissertation, other researchers may modify the model for other types of generators in their own research.


Theoretical And Computational Aspects Of Robust Cluster Analysis For Multivariate And High-Dimensional Datasets, Andrews Tawiah Anum May 2023

Open Access Theses & Dissertations

Multivariate and high-dimensional datasets typically contain subgroups that may not be immediately apparent. To reveal these groups, cluster analysis is performed. Cluster analysis is an unsupervised machine learning technique commonly employed to partition a dataset into distinct categories referred to as clusters. The k-means algorithm is a prominent distance-based clustering method. Despite overwhelming popularity, the algorithm is not invariant under non-singular affine transformations and is not robust, i.e., can be unduly influenced by outliers. To address these deficiencies, we propose an alternative model-based clustering procedure by minimizing a “trimmed” variant of the negative log-likelihood function. We develop a “concentration step”, …


Performance Classification Of Ornstein-Uhlenbeck-Type Models Using Fractal Analysis Of Time Series Data., Peter Kwadwo Asante May 2023

Open Access Theses & Dissertations

This dissertation aims to assess the performance of Ornstein-Uhlenbeck-type models by examining the fractal characteristics of time series data from various sources, including finance, volcanic and earthquake events, US COVID-19 reported cases and deaths, and two simulated time series with differing properties. The time series data is categorized as either a Gaussian or a Lévy process (Lévy walk or Lévy flight) by using three scaling methods: Rescaled range analysis, Detrended fluctuation analysis, and Diffusion entropy analysis. The outcomes of this analysis indicate that the financial indices are classified as Lévy walks, while the volcanic, earthquake, and COVID-19 data are classified …


Flexible Models For The Estimation Of Treatment Effect, Habeeb Abolaji Bashir May 2023

Open Access Theses & Dissertations

Estimation of treatment effect is an important problem which is well studied in the literature. While the regression models are one of the most commonly used techniques for the estimation of treatment effect, they are prone to model misspecification. To minimize the model misspecification bias, flexible nonparametric models are introduced for the estimation. Continuing this line of research, we propose two flexible nonparametric models that allow the treatment effect to vary across different levels of covariates. We provide estimation algorithms for both these models. Using simulations and data analysis, we illustrate the usefulness of the proposed methods.


Developing A Risk Assessment Instrument For Immigration Cases Under Federal Supervision, Mayra Eydie Pacheco May 2023

Open Access Theses & Dissertations

No abstract provided.


Outlier Detection In Multivariate And High-Dimensional Datasets, Yuanhong Wu May 2023

Open Access Theses & Dissertations

Accurate detection of outliers is crucial in the field of statistical analysis. Using classical statistical models without considering the presence of outliers in the data can lead to misleading outcomes. There exist a myriad of procedures to detect outliers in statistics. We concentrate on the statistical techniques that can robustly identify outliers in data sets. To this end, we pursue two aims. First, we give an extensive overview of robust statistical methods which are still popular in recent years for outlier detection. We provide the definitions and algorithms, and also discuss some important properties of these methods. Second, two real examples are …


Spatially Adaptive Estimation Of Spectrum, Yi Xie May 2023

Open Access Theses & Dissertations

A time series may be analyzed either in the time or in the frequency domain. When working in the frequency domain, the main objective is to estimate the underlying spectrum. Various approaches have been proposed to this end, but most are based on smoothing the periodogram using a single smoothing parameter across all Fourier frequencies. Such a global smoothing parameter may result in a biased estimate. To improve the estimation, in this paper, we smooth the log periodogram by placing a dynamic shrinkage prior, such that varying degrees of smoothing may be applied to different regions of the Fourier frequencies, …
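To make the baseline in this abstract concrete, the sketch below computes a raw periodogram and smooths it with one moving-average bandwidth shared by every Fourier frequency. It illustrates only the single-global-parameter approach that the abstract contrasts against; it does not implement the dynamic shrinkage prior on the log periodogram, and the function names, the bandwidth, and the simulated white-noise series are all assumptions made for illustration.

```python
import numpy as np

def periodogram(x):
    """Raw periodogram at the positive Fourier frequencies (frequency 0 dropped)."""
    x = np.asarray(x, dtype=float)
    n = x.size
    x = x - x.mean()                    # remove the mean before transforming
    dft = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(n)[1:]      # cycles per observation
    pgram = np.abs(dft[1:]) ** 2 / n
    return freqs, pgram

def smooth_global(values, half_width):
    """Moving average that applies the same bandwidth at every frequency."""
    width = 2 * half_width + 1
    kernel = np.ones(width) / width
    padded = np.pad(values, half_width, mode="edge")
    return np.convolve(padded, kernel, mode="valid")

rng = np.random.default_rng(0)
series = rng.standard_normal(256)       # simulated white noise, flat true spectrum
freqs, pgram = periodogram(series)
smoothed = smooth_global(pgram, half_width=5)
```

With a single `half_width`, a spectrum that is flat in one band and sharply peaked in another gets the same amount of averaging in both, which is the source of the bias the abstract refers to.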


Generalized Additive Model Using Marginal Integration Estimation Techniques With Interactions, Tahiru Mahama May 2023

Open Access Theses & Dissertations

Marginal Integration (MI) is a statistical method that is extensively employed to estimate component functions of nonparametric additive models. The shortcoming of the purely additive model is that interaction between predictor variables is often ignored, and it may produce poor performance in some real applications. As a result, this research considers the second-order interactions in the regression models. The primary objective is to use marginal integration techniques to estimate the nonparametric additive functions. We compare this model with other models/estimators such as the Generalized Additive Model (GAM), Generalized Additive Model with Selection (GAMSEL), Robust Marginal Integration (RMI), Ordinary Least Squares …


Evaluation Of Effect Of Preprocessing Algorithms On Resting State fMRI Data, Hortencia Josefina Hernandez Dec 2022

Open Access Theses & Dissertations

Graph theory modeling is a common modeling approach in neurobiology research studies. These models are useful since they describe patterns of connection for regions of interest in the brain using resting state fMRI images. The standard rule of thumb is to threshold the observed activation levels prior to model building. It is reasonable to assume that the use of this threshold affects the statistical distribution of commonly reported centrality metrics from the graph theory model, such as degree, betweenness, and closeness. In this study we examine the differential effect of using the standard approaches versus alternative direct thresholds and incorporation …
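The dependence of centrality metrics on the threshold can be seen in a very small example. The sketch below assumes, purely for illustration, that connectivity between regions is summarized by a symmetric correlation matrix and that an edge is kept when the absolute correlation meets the threshold; the abstract does not specify either choice, and the helper name and the numbers are hypothetical.

```python
import numpy as np

def degree_after_threshold(corr, threshold):
    """Degree of each region after keeping edges with |correlation| >= threshold."""
    corr = np.asarray(corr, dtype=float)
    adjacency = np.abs(corr) >= threshold
    np.fill_diagonal(adjacency, False)   # ignore self-connections
    return adjacency.sum(axis=1)

# Hypothetical 3-region correlation matrix.
corr = [[1.0, 0.8, 0.2],
        [0.8, 1.0, 0.5],
        [0.2, 0.5, 1.0]]

strict = degree_after_threshold(corr, threshold=0.6)
loose = degree_after_threshold(corr, threshold=0.4)
```

Here the third region is isolated under the stricter threshold and connected under the looser one, so its degree changes with a preprocessing choice rather than with the data.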


A Computationally Efficient Wald Test In M-Estimation, Denisse Urenda Castañeda Aug 2022

Open Access Theses & Dissertations

Under the maximum likelihood framework, three asymptotic overall tests have been well developed in generalized linear models (GLM) for testing the single null hypothesis $H_0: \theta = \theta_0$, namely, the Wald test, the Likelihood Ratio Test (LRT), and the Score test, also known as the Lagrange Multiplier (LM) test. Modified versions of the Wald, LR, and LM tests can also be found for testing the significance of a portion of the parameter $\theta$, i.e., if $\theta = (\theta_1^T, \theta_2^T)^T$, it is of interest to test $H_0: \theta_2 = 0$. However, with the constant increase …


Efficient Approaches To Steady State Detection In Multivariate Systems, Honglun Xu Aug 2022

Open Access Theses & Dissertations

Steady state detection is critically important in many engineering fields such as fault detection and diagnosis, process monitoring and control. However, most of the existing methods are designed for univariate signals. In this dissertation, we proposed an efficient online steady state detection method for multivariate systems through a sequential Bayesian partitioning approach. The signal is modeled by a Bayesian piecewise constant mean and covariance model, and a recursive updating method is developed to calculate the posterior distributions analytically. The duration of the current segment is utilized to test the steady state. Insightful guidance is provided for hyperparameter selection. The effectiveness …


Developing And Applying Computational Algorithms To Reveal Health-Related Biomolecular Interactions, Yixin Xie May 2022

Open Access Theses & Dissertations

Computational biology is an interdisciplinary area that applies computational approaches to biological big data, including protein amino acid sequences, genetic sequences, etc., and is widely used to analyze protein-protein interactions, make predictions in drug discovery, develop vaccines, etc. Popular methods include mathematical modeling, molecular dynamics simulations, data science methodology, etc. With the help of computational algorithms and applications, drug development is much faster than traditional processes, as it reduces risks early on in a drug discovery process and helps researchers select target candidates that have the highest potential for success. In my doctoral research, I applied multi-scale computational approaches to …


A Machine Learning Approach To Stochastic Optimal Control, Pablo Ever Avalos May 2022

Open Access Theses & Dissertations

Merton's portfolio optimization problem is a well-known problem in financial mathematics which seeks to optimize the investment decision for an investor. In the simplest situation, the market consists of a risk-less asset (e.g., a bond) that pays back a relatively low interest rate, and a risky asset (e.g., a stock) that follows a geometric Brownian motion. The optimal allocation strategy of the investor's wealth is found by optimizing the expected utility along the stochastic evolution of the market. This thesis focuses on several different applications of this optimization problem. We look at pre-constructed analytical solutions and showcase the results. We …
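For the simplest setting described in this abstract, a well-known closed-form answer exists under one additional assumption that the abstract does not state: constant relative risk aversion (CRRA) utility with coefficient gamma. In that case the optimal share of wealth in the risky asset is the constant (mu - r) / (gamma * sigma^2). The sketch below evaluates it; the function name and the parameter values are hypothetical.

```python
def merton_fraction(mu, r, sigma, gamma):
    """Optimal constant share of wealth in the risky asset, assuming CRRA utility.

    mu: drift of the risky asset, r: risk-free rate,
    sigma: volatility of the risky asset, gamma: relative risk aversion.
    """
    if sigma <= 0 or gamma <= 0:
        raise ValueError("sigma and gamma must be positive")
    return (mu - r) / (gamma * sigma ** 2)

# Hypothetical market: 8% drift, 2% risk-free rate, 20% volatility, gamma = 3.
fraction = merton_fraction(mu=0.08, r=0.02, sigma=0.2, gamma=3.0)  # about half of wealth
```

The share rises with the excess return and falls with both risk aversion and variance, and it does not depend on the current wealth level or on time.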


The Physiological Factors Of Diabetes And Their Effect On The Cognitive And Emotional Functioning In Older Populations: A Secondary Data Analysis, Celeste Anahi Alvidrez Dec 2021

Open Access Theses & Dissertations

Background: The rates of Type 2 Diabetes (T2D) have increased over the past 20 years in all age groups. The physiological factors that underlie T2D could have an impact on specific brain pathways that support cognitive and emotional functioning. Aims and Objective: The goal of this study was to examine whether older Mexican American individuals with a history of T2D were more likely to develop later cognitive impairment and/or depression. Hypotheses: It was predicted that elderly participants (mean age at time of interview = 87.87 years) with a history of T2D onset prior to age 65 are more likely to have …


A New Algorithm For Robust Affine-Invariant Clustering, Andrews Tawiah Anum Dec 2021

Open Access Theses & Dissertations

Cluster analysis is an unsupervised machine learning technique commonly employed to partition a dataset into distinct categories referred to as clusters. The k-means algorithm is a prominent distance-based clustering method. Despite its overwhelming popularity, the algorithm is not invariant under non-singular linear transformations and is not robust, i.e., can be unduly influenced by outliers. To address these deficiencies, we propose an alternative clustering procedure based on minimizing a “trimmed” variant of the negative log-likelihood function. We develop a “concentration step”, vaguely reminiscent of the classical Lloyd’s algorithm, that can iteratively reduce the objective function. Multiple real and synthetic datasets are …


Statistical Analysis Of Genetic Sequence Variants In Whole Exome Sequencing Data From Patients With Prostate Cancer, Kelvin Ofori-Minta Aug 2021

Open Access Theses & Dissertations

A single variation in the genetic sequence within the DNA of an organism could easily lead to beneficial, detrimental, or neutral effects. More often than not, these effects are detrimental rather than beneficial. Many biomedical and bioinformatics studies have been conducted to determine the genetic cause of prostate cancer (PrCa), which is still the second leading cause of cancer-related death among men in the United States. An appreciable effort in statistical bioinformatics research has been directed towards this aim. Through statistical analyses of a set of whole exome sequencing data from patients with PrCa obtained via The Cancer Genome …


The Hybridizing Ions Treatment (HIT) Method Development And Computational Study On SARS-CoV-2 E Protein., Shengjie Sun May 2021

Open Access Theses & Dissertations

Fast and accurate calculations of the electrostatic features for highly charged biomolecules such as DNA, RNA, highly charged proteins, are crucial but challenging tasks. Traditional implicit solvent methods calculate the electrostatic features fast, but they are not able to balance the high net charges in the biomolecules effectively. Explicit solvent methods add unbalanced ions to neutralize the highly charged biomolecules in molecular dynamic simulations, which require more expensive computing resources. Here we developed a novel method, the Hybridizing Ions Treatment (HIT) method, which hybridizes the implicit solvent method with the explicit method to realistically calculate the electrostatic potential for highly …


Making Valid Inferences With Decision Tree, George Ekow Quaye May 2021

Open Access Theses & Dissertations

Hypothesis testing and Confidence Interval (CI) estimates are key statistics in predicting future values in data analysis. Most often, CI estimates are directly obtained from the summary statistics of a particular statistical methodology output. However, when it comes to the summary of decision tree outputs, these CI estimates are not directly obtained. So a naïve way of making node-level inference is to construct a $(1-\alpha) \times 100\%$ confidence interval for a node mean $\bar{y}_t$ using the relation: $\bar{y}_t \, \pm \, z_{1-\alpha/2} \, \frac{s_t}{\sqrt{n_t}}$, where $\bar{y}_t$ is the node mean and $s_t$ is the standard deviation estimates from the decision …
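The naïve node-level interval given in this abstract can be computed directly from the responses that fall in a terminal node. The sketch below is a minimal illustration using only the Python standard library; the function name and the example values are hypothetical, and it reproduces only the naïve formula as stated, not whatever refinement the thesis goes on to develop.

```python
import math
from statistics import NormalDist, mean, stdev

def naive_node_ci(node_values, alpha=0.05):
    """Naive (1 - alpha) * 100% interval for a node mean:
    y_bar_t +/- z_{1 - alpha/2} * s_t / sqrt(n_t)."""
    n_t = len(node_values)
    if n_t < 2:
        raise ValueError("need at least two observations in the node")
    y_bar_t = mean(node_values)
    s_t = stdev(node_values)                      # sample standard deviation
    z = NormalDist().inv_cdf(1 - alpha / 2)       # standard normal quantile
    half_width = z * s_t / math.sqrt(n_t)
    return y_bar_t - half_width, y_bar_t + half_width

# Hypothetical responses that landed in one terminal node.
lower, upper = naive_node_ci([2.0, 4.0, 6.0, 8.0])
```

Because the same data are typically used both to choose the splits and to compute the node mean, an interval of this form treats the node as if it had been fixed in advance.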


Refined Moderation Analysis With Binary Outcomes, Eric Anto May 2021

Open Access Theses & Dissertations

With the growing interest in personalized or precision medicine, it is indispensable that moderation analysis, which is primarily concerned with the study of differential treatment effects among patients with different characteristics and also serves as the bedrock for precision medicine, is taken more seriously. Concerning moderation analysis with binary outcomes, we start with an interesting observation, which shows that heterogeneous treatment effects could be equivalently estimated via a role exchange between the outcome and the treatment variable. The result holds for both experimental data and observational data, yet with an important difference in interpretation. Two estimators of moderating effects corresponding to two …


Robust Variable Selection In Multiple Linear Regression Via Penalized Least Trimmed Squares., Reagan Kesseku May 2021

Open Access Theses & Dissertations

Variable selection has been studied using different approaches. Its growing importance lies in numerous applications to high-dimensional data from experiments and natural phenomena. Often, models are to be constructed from such data based on significant variables for estimation or prediction purposes. This demands not just any variable selection method, but one that is robust, computationally efficient and with other desirable statistical properties. Besides the high-dimensionality of such data, the presence of outliers is common due to heterogeneous sources. Though outliers often contain useful information, they can unduly influence non-robust estimators to produce misleading results. This is the case for ordinary least …


Gene Selection And Classification In High-Throughput Biological Data With Integrated Machine Learning Algorithms And Bioinformatics Approaches, Abhijeet R Patil May 2021

Open Access Theses & Dissertations

With the rise of high-throughput technologies in biomedical research, large volumes of expression profiling, methylation profiling, and RNA-sequencing data are being generated. These high-dimensional data have a large number of features with a small number of samples, a characteristic called the "curse of dimensionality." The selection of optimal features, which largely affects the performance of classification algorithms in machine learning models, has led to challenging problems in bioinformatics analyses of such high-dimensional datasets. In this work, I focus on the design of two-stage frameworks of feature selection and classification and their applications in multiple sets of colorectal cancer data. The first …


High-Dimensional Random Forests, Roland Fiagbe May 2021

Open Access Theses & Dissertations

The significant advances in technology have enabled easy collection and management of high-dimensional data in many fields; however, the process of modeling these data poses a huge problem in the field of data science. Dealing with high-dimensional data is one of the significant challenges that degrade the performance and precision of most classification and regression algorithms, e.g., random forests. Random Forest (RF) is among the few methods that can be extended to model high-dimensional data; nevertheless, its performance and precision, like others, are highly affected by high dimensions, especially when the dataset contains a huge number of noise or noninformative …


Two Pens In A Pocket Must Be Different: A Nerd-Oriented Lesson From Statistics, Olga Kosheleva, Vladik Kreinovich Jul 2020

Departmental Technical Reports (CS)

Some people always carry a pen with them, so that if an idea comes to mind, they will always be able to write it down. Pens sometimes run out of ink. So, just in case, people carry two pens. The problem is that often, when one carries two identical pens, they seem to run out of ink at about the same time -- which defeats the whole purpose of carrying two pens. In this paper, we provide a simple statistics-based explanation of this phenomenon, and show that a seemingly natural idea of carrying three pens will not help. The only …