Open Access. Powered by Scholars. Published by Universities.®

Physical Sciences and Mathematics Commons

Articles 1 - 12 of 12

Full-Text Articles in Physical Sciences and Mathematics

Quantification Of Various Types Of Biases In Large Language Models, Sudhashree Sayenju Apr 2023

Doctor of Data Science and Analytics Dissertations

Natural Language Processing (NLP) systems are used everywhere on the internet, from search engines and language translation to more advanced systems like voice assistants and customer service. Since humans are always on the receiving end of NLP technologies, it is very important to analyze whether the Large Language Models (LLMs) in use are biased and therefore unfair. The majority of the research in NLP bias has focused on societal stereotype biases embedded in LLMs. However, our research focuses on various types of bias, namely model class level bias, stereotype bias and domain bias present in LLMs. Model class …


Debiasing Cyber Incidents – Correcting For Reporting Delays And Under-Reporting, Seema Sangari Aug 2022

Doctor of Data Science and Analytics Dissertations

This research addresses two key problems in the cyber insurance industry: reporting delays and under-reporting of cyber incidents. Both problems must be addressed to understand the true picture of cyber incident rates. Reporting delays arise when incidents are not detected in a timely manner, while under-reporting arises because cyber incidents frequently go unreported due to brand damage, reputation risk and eventual financial impacts.

The problem of reporting delays in cyber incidents is resolved by generating the distribution of reporting delays and fitting modeled parametric distributions on the given domain. The reporting delay distribution was found to …


Novel Instance-Level Weighted Loss Function For Imbalanced Learning, Trent Geisler May 2022

Doctor of Data Science and Analytics Dissertations

Binary classification using imbalanced datasets remains a challenge. Typically, supervised learning algorithms minimize the binary cross-entropy objective function to determine the final parameter estimates. This objective function assumes an equal class distribution between the minority (i.e. events) and majority (i.e. non-events) classes, which almost never exists in real-world modeling. In the imbalanced data setting, the equal class distribution is grossly violated, and the resulting parameter estimates are biased toward the majority class. To overcome the bias and improve model generalization, we focus on modifying the original binary cross-entropy objective function by uniquely weighting each minority class observation. We base our …
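The idea of weighting individual observations inside the binary cross-entropy can be illustrated with a minimal sketch. The abstract is truncated before it describes how the weights are derived, so the weights below are hypothetical values chosen by hand, and the function name `weighted_bce` is an assumption for illustration, not the dissertation's method.

```python
import math

def weighted_bce(y_true, p_pred, weights, eps=1e-12):
    """Mean binary cross-entropy with one (hypothetical) weight per observation."""
    total = 0.0
    for y, p, w in zip(y_true, p_pred, weights):
        # Clip the probability so that log() never receives 0.
        p = min(max(p, eps), 1.0 - eps)
        total += -w * (y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

# One minority-class event (label 1) among three non-events.
y = [1, 0, 0, 0]
p = [0.3, 0.2, 0.1, 0.4]

# Unit weights recover the ordinary binary cross-entropy.
uniform = weighted_bce(y, p, [1.0] * 4)

# Up-weighting the single minority observation (weight 3.0 is an arbitrary
# illustrative choice) increases its contribution to the loss.
upweighted = weighted_bce(y, p, [3.0, 1.0, 1.0, 1.0])
```

With unit weights the function reduces to the standard objective, so any difference between `uniform` and `upweighted` comes only from the extra weight on the minority observation.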


Integrated Machine Learning Approaches To Improve Classification Performance And Feature Extraction Process For EEG Dataset, Mohammad Masum Jul 2021

Doctor of Data Science and Analytics Dissertations

Epileptic seizure or epilepsy is a chronic neurological disorder that occurs due to brain neurons' abnormal activities and has affected approximately 50 million people worldwide. Epilepsy can affect patients’ health and lead to life-threatening emergencies. Early detection of epilepsy is highly effective in avoiding seizures by intervening in treatment. The electroencephalogram (EEG) signal, which contains valuable information of electrical activity in the brain, is a standard neuroimaging tool used by clinicians to monitor and diagnose epilepsy. Visually inspecting the EEG signal is an expensive, tedious, and error-prone practice. Moreover, the result varies with different neurophysiologists for an identical reading. Thus, …


Quantitatively Motivated Model Development Framework: Downstream Analysis Effects Of Normalization Strategies, Jessica M. Rudd Jul 2020

Doctor of Data Science and Analytics Dissertations

Through a review of epistemological frameworks in social sciences, history of frameworks in statistics, as well as the current state of research, we establish that there appears to be no consistent, quantitatively motivated model development framework in data science, and the downstream analysis effects of various modeling choices are not uniformly documented. Examples are provided which illustrate that analytic choices, even if justifiable and statistically valid, have a downstream analysis effect on model results. This study proposes a unified model development framework that allows researchers to make statistically motivated modeling choices within the development pipeline. Additionally, a simulation study is …


Attack And Defense In Security Analytics, Yiyun Zhou May 2020

Doctor of Data Science and Analytics Dissertations

The security problem has gained increasing awareness due to the various kinds of global threats. Security analytics is the process of using streaming data acquisition, collection, and artificial intelligence algorithms for security monitoring and threat disclosure. In this dissertation work, we utilize practical data-driven security analytics to identify potential threats and explore the robustness of machine learning models. We focus on two aspects: (1) Security Analytics: utilize machine learning and statistical analytics tools to identify and resolve real-world threats, such as cybersecurity and abnormal activities. (2) Analytic Security: Explore the security issues of the machine learning …


Data-Driven Investment Decisions In P2P Lending: Strategies Of Integrating Credit Scoring And Profit Scoring, Yan Wang Apr 2020

Doctor of Data Science and Analytics Dissertations

In this dissertation, we develop and discuss several loan evaluation methods to guide the investment decisions for peer-to-peer (P2P) lending. In evaluating loans, credit scoring and profit scoring are the two widely utilized approaches. Credit scoring aims at minimizing the risk while profit scoring aims at maximizing the profit. This dissertation addresses the strengths and weaknesses of each scoring method by integrating them in various ways in order to provide the optimal investment suggestions for different investors. Before developing the methods for loan evaluation at the individual level, we applied the state-of-the-art method called the Long Short Term Memory (LSTM) …


A Credit Analysis Of The Unbanked And Underbanked: An Argument For Alternative Data, Edwin Baidoo Apr 2020

Doctor of Data Science and Analytics Dissertations

The purpose of this study is to ascertain the statistical and economic significance of non-traditional credit data for individuals who do not have sufficient economic data, collectively known as the unbanked and underbanked. The consequences of not having sufficient economic information often determine whether unbanked and underbanked individuals will receive a higher price of credit or be denied entirely. In terms of regulation, there is a strong interest in credit models that will inform policies on how to gradually move sections of the unbanked and underbanked population into the general financial network.

In Chapter 2 of the dissertation, I establish the …


A Novel Penalized Log-Likelihood Function For Class Imbalance Problem, Lili Zhang Mar 2020

Doctor of Data Science and Analytics Dissertations

The log-likelihood function is the optimization objective in the maximum likelihood method for estimating models (e.g., logistic regression, neural network). However, its formulation is based on assumptions that the target classes are equally distributed and the overall accuracy is maximized, which do not apply to class imbalance problems (e.g., fraud detection, rare disease diagnoses, customer conversion prediction, cybersecurity, predictive maintenance). When trained on imbalanced data, the resulting models tend to be biased towards the majority class (i.e. non-event), which can bring great loss in practice. One strategy for mitigating such bias is to penalize the misclassification costs of observations differently …


Ordinal Hyperplane Loss, Bob Vanderheyden Dec 2019

Doctor of Data Science and Analytics Dissertations

This research presents the development of a new framework for analyzing ordered class data, commonly called “ordinal class” data. The focus of the work is the development of classifiers (predictive models) that predict classes from available data. Ratings scales, medical classification scales, socio-economic scales, meaningful groupings of continuous data, facial emotional intensity and facial age estimation are examples of ordinal data for which data scientists may be asked to develop predictive classifiers. It is possible to treat ordinal classification like any other classification problem that has more than two classes. Specifying a model with this strategy does not fully utilize …


One And Two-Step Estimation Of Time Variant Parameters And Nonparametric Quantiles, Bogdan Gadidov Jul 2019

Doctor of Data Science and Analytics Dissertations

This dissertation develops and discusses several one-step and two-step smoothing methods of time variant nonparametric quantiles and time variant parameters from probability models. First, we investigate and develop nonparametric techniques for measuring extreme quantiles. The method involves aggregating data by an explanatory variable such as time and smoothing the resulting data with a nonparametric method like kernel, local polynomial or spline smoothing. We demonstrate both in application and simulation that this two-step procedure of quantile estimation is superior to the parametric quantile regression. We then develop a one-step method which combines the strength of maximum likelihood estimation with a local …
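The two-step procedure described above (aggregate by time, then smooth) can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the linear-interpolation quantile rule, and the Gaussian kernel with a fixed bandwidth are all choices made here for the example and are not taken from the dissertation.

```python
import math
from collections import defaultdict

def empirical_quantile(values, q):
    """Empirical quantile with linear interpolation between order statistics."""
    xs = sorted(values)
    pos = q * (len(xs) - 1)
    lo = int(math.floor(pos))
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (pos - lo) * (xs[hi] - xs[lo])

def two_step_quantile(times, values, q, bandwidth):
    """Step 1: per-time empirical quantiles. Step 2: Gaussian kernel smoothing."""
    # Step 1: aggregate the observations by their time value.
    groups = defaultdict(list)
    for t, v in zip(times, values):
        groups[t].append(v)
    ts = sorted(groups)
    raw = [empirical_quantile(groups[t], q) for t in ts]

    # Step 2: Nadaraya-Watson style weighted average of the raw quantiles.
    smoothed = []
    for t0 in ts:
        ws = [math.exp(-0.5 * ((t - t0) / bandwidth) ** 2) for t in ts]
        smoothed.append(sum(w * r for w, r in zip(ws, raw)) / sum(ws))
    return ts, raw, smoothed

# Hypothetical toy data: three time points with a few observations each.
times = [0, 0, 0, 1, 1, 1, 2, 2, 2]
values = [1.0, 2.0, 9.0, 2.0, 3.0, 4.0, 3.0, 4.0, 20.0]
ts, raw, smoothed = two_step_quantile(times, values, q=0.9, bandwidth=1.0)
```

The kernel step could be replaced by local polynomial or spline smoothing, which the abstract also mentions; the kernel version is used here only because it fits in a few lines.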


Deep Embedding Kernel, Linh Le Apr 2019

Doctor of Data Science and Analytics Dissertations

Kernel methods and deep learning are two major branches of machine learning that have achieved numerous successes in both analytics and artificial intelligence. While having their own unique characteristics, both branches work through mapping data to a feature space that is supposedly more favorable towards the given task. This dissertation addresses the strengths and weaknesses of each mapping method through combining them and forming a family of novel deep architectures that center around the Deep Embedding Kernel (DEK). In short, DEK is a realization of a kernel function through a new deep architecture. The mapping in DEK is both implicit …