Open Access. Powered by Scholars. Published by Universities.®

Physical Sciences and Mathematics Commons

Applied Statistics

South Dakota State University

Conference

Articles 1 - 13 of 13

Full-Text Articles in Physical Sciences and Mathematics

Session 6: Model-Based Clustering Analysis On The Spatial-Temporal And Intensity Patterns Of Tornadoes, Yana Melnykov, Yingying Zhang, Rong Zheng Feb 2024

SDSU Data Science Symposium

Tornadoes are among nature’s most violent windstorms and can occur all over the world except Antarctica. Previous scientific efforts have studied this natural hazard from facets such as genesis, dynamics, detection, forecasting, warning, measuring, and assessing. We instead want to model the tornado datasets using modern, sophisticated statistical and computational techniques. The goal of the paper is to develop novel finite mixture models and perform clustering analysis on the spatial-temporal and intensity patterns of tornadoes. To analyze the tornado dataset, we first try a Gaussian distribution with the mean vector and variance-covariance matrix represented as …


Session 6: The Size-Biased Lognormal Mixture With The Entropy Regularized Algorithm, Tatjana Miljkovic, Taehan Bae Feb 2024

SDSU Data Science Symposium

A size-biased left-truncated Lognormal (SB-ltLN) mixture is proposed as a robust alternative to the Erlang mixture for modeling left-truncated insurance losses with a heavy tail. The weak denseness property of the weighted Lognormal mixture is studied along with the tail behavior. Explicit analytical solutions are derived for moments and Tail Value at Risk based on the proposed model. An extension of the regularized expectation–maximization (REM) algorithm with Shannon's entropy weights (ewREM) is introduced for parameter estimation and variability assessment. The left-truncated internal fraud data set from the Operational Riskdata eXchange is used to illustrate applications of the proposed model. Finally, …


Two-Stage Approach For Forensic Handwriting Analysis, Ashlan J. Simpson, Danica M. Ommen Feb 2023

SDSU Data Science Symposium

Trained experts currently perform the handwriting analysis required in the criminal justice field, but this can create biases, delays, and expenses, leaving room for improvement. Prior research has sought to address this by analyzing handwriting through feature-based and score-based likelihood ratios for assessing evidence within a probabilistic framework. However, error rates are not well defined within this framework, which makes the method difficult to evaluate and can lead to a greater-than-expected number of errors when the approach is applied. This research explores a method for assessing handwriting within the Two-Stage framework, which allows for quantifying error rates as recommended by …


A Characterization Of Bias Introduced Into Forensic Source Identification When There Is A Subpopulation Structure In The Relevant Source Population, Dylan Borchert, Semhar Michael, Christopher Saunders Feb 2023

SDSU Data Science Symposium

In forensic source identification, the forensic expert is responsible for providing a summary of the evidence that allows a decision maker to make a logical and coherent decision concerning the source of some trace evidence of interest. The academic consensus is usually that this summary should take the form of a likelihood ratio (LR) that summarizes the likelihood of the trace evidence arising under two competing propositions. These competing propositions are usually referred to as the prosecution’s proposition, that the specified source is the actual source of the trace evidence, and the defense’s proposition, that another source in a …


Application Of Gaussian Mixture Models To Simulated Additive Manufacturing, Jason Hasse, Semhar Michael, Anamika Prasad Feb 2023

SDSU Data Science Symposium

Additive manufacturing (AM) is the process of building components through an iterative process of adding material in specific designs. AM has a wide range of process parameters that influence the quality of the component. This work applies Gaussian mixture models to detect clusters of similar stress values within and across components manufactured with varying process parameters. Further, a mixture of regression models is considered to simultaneously find groups and fit a regression within each group. The results are compared with a previous naive approach.
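
As a rough illustration of the clustering idea in this abstract, the sketch below fits a one-dimensional Gaussian mixture by expectation-maximization and assigns each value to its most probable component. It is not the authors' implementation: the function names (`fit_gmm_1d`, `assign_clusters`), the initialization scheme, the fixed iteration count, and the synthetic "stress" values are all assumptions made for the example.

```python
import math
import random


def normal_pdf(x, mean, var):
    """Density of a univariate normal distribution with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def fit_gmm_1d(values, n_components=2, n_iter=100):
    """Fit a 1-D Gaussian mixture by EM; returns (weights, means, variances)."""
    n = len(values)
    ordered = sorted(values)
    # Assumed initialization: evenly spaced order statistics as means,
    # the overall variance for every component, and equal weights.
    means = [ordered[int((k + 0.5) * n / n_components)] for k in range(n_components)]
    overall_mean = sum(values) / n
    overall_var = sum((v - overall_mean) ** 2 for v in values) / n
    variances = [overall_var] * n_components
    weights = [1.0 / n_components] * n_components

    for _ in range(n_iter):
        # E-step: posterior probability of each component for each value.
        resp = []
        for v in values:
            dens = [w * normal_pdf(v, m, s) for w, m, s in zip(weights, means, variances)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: update weights, means, and variances from the responsibilities.
        for k in range(n_components):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / n
            means[k] = sum(r[k] * v for r, v in zip(resp, values)) / nk
            var_k = sum(r[k] * (v - means[k]) ** 2 for r, v in zip(resp, values)) / nk
            variances[k] = max(var_k, 1e-6)  # floor to avoid a degenerate component
    return weights, means, variances


def assign_clusters(values, weights, means, variances):
    """Label each value with the index of its most probable component."""
    labels = []
    for v in values:
        dens = [w * normal_pdf(v, m, s) for w, m, s in zip(weights, means, variances)]
        labels.append(max(range(len(dens)), key=dens.__getitem__))
    return labels


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical stress values drawn from two well-separated regimes.
    stress = [rng.gauss(0.0, 1.0) for _ in range(200)] + [rng.gauss(10.0, 1.0) for _ in range(200)]
    weights, means, variances = fit_gmm_1d(stress)
    print(weights, means, variances)
```

A mixture of regressions, as mentioned in the abstract, would replace each component mean with a regression on the process parameters; that extension is not shown here.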


Finite Mixture Modeling For Hierarchically Structured Data With Application To Keystroke Dynamics, Andrew Simpson, Semhar Michael Feb 2023

SDSU Data Science Symposium

Keystroke dynamics has been used both to authenticate users of computer systems and to detect unauthorized users who attempt to access the system. Monitoring keystroke dynamics adds another level to computer security as passwords are often compromised. For added security, keystrokes can also be continuously monitored long after a password has been entered and the user is accessing the system. Many of the currently proposed methods are supervised in that they assume that the true user of each keystroke is known a priori. This is not always true, for example, with businesses and government agencies which have internal …


Session 8: Ensemble Of Score Likelihood Ratios For The Common Source Problem, Federico Veneri, Danica M. Ommen Feb 2023

SDSU Data Science Symposium

Machine learning-based Score Likelihood Ratios have been proposed as an alternative to traditional Likelihood Ratios and Bayes Factors to quantify the value of evidence when contrasting two opposing propositions.

Under the common source problem, the opposing propositions relate to the inferential problem of assessing whether two items come from the same source. Machine learning techniques can be used to construct a (dis)similarity score for complex data when developing a traditional model is infeasible, and density estimation is used to estimate the likelihood of the scores under both propositions.

In practice, the metric and its distribution are developed using pairwise comparisons …


An Alpha-Based Prescreening Methodology For A Common But Unknown Source Likelihood Ratio With Different Subpopulation Structures, Dylan Borchert, Semhar Michael, Christopher Saunders, Andrew Simpson Feb 2022

SDSU Data Science Symposium

Prescreening is a commonly used methodology in which the forensic examiner includes sources from the background population that meet a certain degree of similarity to the given piece of evidence. The goal of prescreening is to find the sources closest to the given piece of evidence in an alternative source population for further analysis. This paper discusses the behavior of an $\alpha$-based prescreening methodology in the form of a Hotelling $T^2$ test on the background population for a common but unknown source likelihood ratio. An extensive simulation study with synthetic and real data was conducted. We find that prescreening helps …


Identifying Subpopulations Of A Hierarchical Structured Data Using A Semi-Supervised Mixture Modeling Approach, Andrew Simpson, Semhar Michael, Christopher Saunders, Dylan Borchert Feb 2022

SDSU Data Science Symposium

The field of forensic statistics offers a unique hierarchical data structure in which a population is composed of several subpopulations of sources and a sample is collected from each source. This subpopulation structure creates a hierarchical layer. We propose using a semi-supervised mixture modeling approach to model the subpopulation structure, which leverages the fact that we know the collection of samples came from the same, yet unknown, source. A simulation study based on a well-known glass data set was conducted and shows that this method performs better than other unsupervised approaches that have previously been used in practice.


Session 5: Equipment Finance Credit Risk Modeling - A Case Study In Creative Model Development & Nimble Data Engineering, Edward Krueger, Landon Thompson, Josh Moore Feb 2022

SDSU Data Science Symposium

This presentation will focus first on providing an overview of Channel and the Risk Analytics team that performed this case study. Given that context, we’ll then dive into our approach for building the modeling development data set, techniques and tools used to develop and implement the model into a production environment, and some of the challenges faced upon launch. Then, the presentation will pivot to the data engineering pipeline. During this portion, we will explore the application process and what happens to the data we collect. This will include how we extract & store the data along with how it …


Session 11 - Methods: Bootstrap Control Chart For Pareto Percentiles, Ruth Burkhalter Feb 2020

SDSU Data Science Symposium

A lifetime percentile is an important indicator of product reliability. However, the sampling distribution of a percentile estimator for any lifetime distribution is not bell-shaped. As a result, the well-known Shewhart-type control chart cannot be applied to monitor product lifetime percentiles. In this presentation, Bootstrap control charts based on the maximum likelihood estimator (MLE) are proposed for monitoring Pareto percentiles. An intensive simulation study is conducted to compare the performance of the proposed MLE Bootstrap control chart and the Shewhart-type control chart.
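
The abstract does not give the construction, so the following is only a minimal sketch of one common way to build such a chart: a parametric bootstrap for a Pareto type I distribution. The parameterization (scale and shape), the function names, the subgroup size, and the default false-alarm rate of 0.0027 are assumptions chosen for illustration, not details taken from the presentation.

```python
import math
import random


def pareto_mle(sample):
    """MLE of a Pareto type I distribution: returns (scale, shape)."""
    scale = min(sample)
    log_sum = sum(math.log(x / scale) for x in sample)
    return scale, len(sample) / log_sum


def pareto_percentile(scale, shape, p):
    """The 100p-th percentile of a Pareto type I distribution."""
    return scale * (1.0 - p) ** (-1.0 / shape)


def draw_pareto(rng, scale, shape, size):
    """Inverse-CDF sampling; 1 - rng.random() lies in (0, 1]."""
    return [scale * (1.0 - rng.random()) ** (-1.0 / shape) for _ in range(size)]


def empirical_quantile(sorted_values, q):
    """Order-statistic quantile of an already sorted list."""
    index = math.ceil(q * len(sorted_values)) - 1
    index = min(len(sorted_values) - 1, max(0, index))
    return sorted_values[index]


def bootstrap_limits(phase1, subgroup_size, p=0.1, n_boot=2000,
                     false_alarm=0.0027, seed=0):
    """Lower and upper control limits for the MLE of the 100p-th percentile."""
    rng = random.Random(seed)
    # Step 1: fit the Pareto model to the in-control (Phase I) data.
    scale, shape = pareto_mle(phase1)
    # Step 2: simulate subgroups from the fitted model and re-estimate
    # the percentile on each one.
    stats = []
    for _ in range(n_boot):
        boot = draw_pareto(rng, scale, shape, subgroup_size)
        b_scale, b_shape = pareto_mle(boot)
        stats.append(pareto_percentile(b_scale, b_shape, p))
    stats.sort()
    # Step 3: take the tail quantiles of the bootstrap distribution as limits.
    lower = empirical_quantile(stats, false_alarm / 2.0)
    upper = empirical_quantile(stats, 1.0 - false_alarm / 2.0)
    return lower, upper


if __name__ == "__main__":
    rng = random.Random(1)
    phase1 = draw_pareto(rng, 1.0, 3.0, 200)  # hypothetical in-control lifetimes
    print(bootstrap_limits(phase1, subgroup_size=5))
```

A new subgroup would be flagged as out of control when its estimated percentile falls outside the two limits. Because the limits come from the simulated distribution of the estimator itself, they do not rely on that distribution being bell-shaped.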


Session 4: Multilinear Subspace Learning And Its Applications To Machine Learning, Randy Hoover, Kyle Caudle Dr., Karen Braman Dr. Feb 2019

SDSU Data Science Symposium

Multi-dimensional data analysis has seen increased interest in recent years. With more and more data arriving as 2-dimensional arrays (images) as opposed to 1-dimensional arrays (signals), new methods for dimensionality reduction, data analysis, and machine learning have been pursued. Most notable among these are the Canonical Decompositions/Parallel Factors (commonly referred to as CP) and Tucker decompositions (commonly regarded as a higher-order SVD, or HOSVD). In the current research, we present an alternate method for computing singular value and eigenvalue decompositions on multi-way data through an algebra of circulants and illustrate their application to two well-known machine learning methods: Multi-Linear Principal Component …


Predicting Unplanned Medical Visits Among Patients With Diabetes Using Machine Learning, Arielle Selya, Eric L. Johnson Feb 2019

SDSU Data Science Symposium

Diabetes poses a variety of medical complications to patients, resulting in a high rate of unplanned medical visits, which are costly to patients and healthcare providers alike. However, unplanned medical visits by their nature are very difficult to predict. The current project draws upon electronic health records (EMRs) of adult patients with diabetes who received care at Sanford Health between 2014 and 2017. Various machine learning methods were used to predict which patients have had an unplanned medical visit based on a variety of EMR variables (age, BMI, blood pressure, # of prescriptions, # of diagnoses on problem list, A1C, …