Open Access. Powered by Scholars. Published by Universities.®

Physical Sciences and Mathematics Commons


Machine learning

Statistical Models


Articles 1 - 28 of 28

Full-Text Articles in Physical Sciences and Mathematics

Differentiation Of Human, Dog, And Cat Hair Fibers Using DART TOFMS And Machine Learning, Laura Ahumada, Erin R. Mcclure-Price, Chad Kwong, Edgard O. Espinoza, John Santerre Dec 2023


SMU Data Science Review

Hair is found in over 90% of crime scenes and has long been analyzed as trace evidence. However, recent reviews of traditional hair fiber analysis techniques, primarily morphological examination, have cast doubt on its reliability. To address these concerns, this study employed machine learning algorithms, specifically Linear Discriminant Analysis (LDA) and Random Forest, on Direct Analysis in Real Time time-of-flight mass spectra collected from human, cat, and dog hair samples. The objective was to develop a chemistry- and statistics-based classification method for unbiased taxonomic identification of hair. The results of the study showed that LDA and Random Forest were highly …


Comparison Of Sampling Methods For Predicting Wine Quality Based On Physicochemical Properties, Robert Burigo, Scott Frazier, Eli Kravez, Nibhrat Lohia Apr 2023


SMU Data Science Review

Numerous studies have used the physicochemical properties of wine to predict its quality. Given the nature of these properties, the data is inherently skewed. Previous work has focused on a handful of sampling techniques to balance the data. This research compares multiple sampling techniques for predicting the target with limited data, using an ensemble model to evaluate them. It found no evidence that any specific oversampling method improves a random forest classifier on a multi-class problem.
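As a rough illustration of what balancing skewed data by sampling means, here is a minimal sketch of plain random oversampling. It is not the authors' method: the abstract does not say which techniques were compared, and the function name `random_oversample` and the toy wine rows are hypothetical.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate minority-class rows at random until every class
    has as many rows as the largest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        # Rows that belong to this class in the original data.
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


# Hypothetical toy data: (alcohol %, pH) pairs with a skewed quality label.
rows = [(9.4, 3.5), (9.8, 3.2), (10.1, 3.3), (11.2, 3.1), (12.5, 3.0), (9.5, 3.4)]
labels = ["medium", "medium", "medium", "medium", "high", "low"]

balanced_rows, balanced_labels = random_oversample(rows, labels)
print(Counter(balanced_labels))
```

In practice, oversampling is normally applied to the training split only, so that duplicated rows do not leak into the evaluation data.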


Applications Of Transfer Learning From Malicious To Vulnerable Binaries, Sean Patrick Mcnulty Jan 2023


Graduate Student Theses, Dissertations, & Professional Papers

Malware detection and vulnerability detection are important cybersecurity tasks. Previous research has successfully applied a variety of machine learning methods to both. However, despite their potential synergies, previous research has yet to unite these two tasks. Given the recent success of transfer learning in many domains, such as language modeling and image recognition, this thesis investigated the use of transfer learning to improve vulnerability detection. Specifically, we pre-trained a series of models to detect malicious binaries and used the weights from those models to kickstart the detection of vulnerable binaries. In our study, we also investigated five different data representations …


Applications Of Machine Learning Algorithms In Materials Science And Bioinformatics, Mohammed Quazi Jun 2022


Mathematics & Statistics ETDs

The piezoelectric response has been a measure of interest in density functional theory (DFT) for micro-electromechanical systems (MEMS) since the inception of MEMS technology. Piezoelectric-based MEMS devices find wide applications in automobiles, mobile phones, healthcare devices, and silicon chips for computers, to name a few. Piezoelectric properties of doped aluminum nitride (AlN) have been under investigation in materials science for piezoelectric thin films because of its wide range of device applicability. In this research using rigorous DFT calculations, high throughput ab-initio simulations for 23 AlN alloys are generated.

This research is the first to report strong enhancements of piezoelectric properties …


A Course In Data Science: R And Prediction Modeling, Adam Kapelner May 2022


Open Educational Resources

This is a self-contained course in data science and machine learning using R. It covers philosophy of modeling with data, prediction via linear models, machine learning including support vector machines and random forests, probability estimation and asymmetric costs using logistic regression and probit regression, underfitting vs. overfitting, model validation, handling missingness and much more. There is formal instruction of data manipulation using dplyr and data.table, visualization using ggplot2 and statistical computing.


Early-Warning Alert Systems For Financial-Instability Detection: An HMM-Driven Approach, Xing Gu Apr 2022


Electronic Thesis and Dissertation Repository

Regulators’ early intervention is crucial when the financial system is experiencing difficulties. Financial stability must be preserved to avert bank bailouts, which heavily drain government financial resources. Detecting periods of financial crisis in advance entails the development and customisation of accurate and robust quantitative techniques. The goal of this thesis is to construct automated systems via the interplay of various mathematical and statistical methodologies to signal financial instability episodes in the near-term horizon. These signal alerts could provide regulatory bodies with the capacity to initiate an appropriate response that will thwart or at least minimise the occurrence of a financial crisis. …


Intra-Hour Solar Forecasting Using Cloud Dynamics Features Extracted From Ground-Based Infrared Sky Images, Guillermo Terrén-Serrano Apr 2022


Electrical and Computer Engineering ETDs

Due to the increasing use of photovoltaic systems, power grids are vulnerable to the projection of shadows from moving clouds. An intra-hour solar forecast provides power grids with the capability of automatically controlling the dispatch of energy, reducing the additional cost for a guaranteed, reliable supply of energy (i.e., energy storage). This dissertation introduces a novel sky imager consisting of a long-wave radiometric infrared camera and a visible light camera with a fisheye lens. The imager is mounted on a solar tracker to maintain the Sun in the center of the images throughout the day, reducing the scattering effect produced …


Finding The Best Predictors For Foot Traffic In US Seafood Restaurants, Isabel Paige Beaulieu Jan 2022


Honors Theses and Capstones

COVID-19 caused state and nation-wide lockdowns, which altered human foot traffic, especially in restaurants. The seafood sector in particular suffered greatly: illegal fishing increased, its goods are perishable, it is seasonal in some places, and imports and exports were slowed. Foot traffic data helps business owners decide how much to order, how many employees to schedule, etc. One issue is that the data is very expensive, hard to get, and not available until months after it is recorded. Our goal is to not only find covariates that …


A Non-Deterministic Deep Learning Based Surrogate For Ice Sheet Modeling, Hannah Jordan Jan 2022


Graduate Student Theses, Dissertations, & Professional Papers

Surrogate modeling is a new and expanding field in the world of deep learning, providing a computationally inexpensive way to approximate results from computationally demanding high-fidelity simulations. Ice sheet modeling is one such computationally expensive task; the model used in this study currently requires between 10 and 20 minutes to complete one simulation. While this is adequate for certain applications, using sampling approaches to perform statistical inference becomes infeasible. This issue can be overcome by using a surrogate model to approximate the ice sheet model, bringing the time to produce output down to a tenth …


Comparing Machine Learning Techniques With State-Of-The-Art Parametric Prediction Models For Predicting Soybean Traits, Susweta Ray Dec 2021


Department of Statistics: Dissertations, Theses, and Student Work

Soybean is a significant source of protein and oil, and is also widely used as animal feed. Thus, developing lines that are superior in terms of yield, protein, and oil content is important to feed the ever-growing population. Unlike high-cost phenotyping, genotyping is both cost- and time-efficient for breeders, whereas evaluating new lines in different environments (location-year combinations) can be costly. Several genomic prediction (GP) methods have been developed to use the marker and environment data effectively to predict the yield or other relevant phenotypic traits of crops. Our study compares a conventional GP method (GBLUP), a …


Applications Of Machine Learning In High-Frequency Trade Direction Classification, Jared E. Hansen May 2020


All Graduate Theses and Dissertations, Spring 1920 to Summer 2023

The correct assignment of trades as buyer-initiated or seller-initiated is paramount in many quantitative finance studies. Simple decision rule methods have been used for signing trades since many data sets available to researchers do not include the sign of each trade executed. By utilizing these decision rule methods, as well as engineering new variables from available data, we have demonstrated that machine learning models outperform prior methods for accurately signing trades as buys and sells, achieving state-of-the-art results. The best model developed was 4.5 percentage points more accurate than older methods when predicting on unseen data. Since finance and economics …
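To make the idea of a simple decision rule for signing trades concrete, the sketch below implements the tick test, one widely known rule of this kind. The abstract does not say which rules the thesis used, so treat this as an assumed example; the function name `tick_rule` and the price series are hypothetical.

```python
def tick_rule(prices):
    """Sign each trade from the direction of the most recent price change:
    +1 (buyer-initiated) after an uptick, -1 (seller-initiated) after a
    downtick. Trades before the first price change are left as 0."""
    signs = []
    last_sign = 0
    prev = None
    for price in prices:
        if prev is not None and price > prev:
            last_sign = 1
        elif prev is not None and price < prev:
            last_sign = -1
        # On a zero tick (no price change) the previous sign is reused.
        signs.append(last_sign)
        prev = price
    return signs


# Hypothetical sequence of trade prices.
prices = [10.00, 10.01, 10.01, 10.00, 10.00, 10.02]
signs = tick_rule(prices)
print(signs)
```

Outputs of rules like this can also serve as input features for a learned classifier, which matches the abstract's description of combining decision rules with engineered variables.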


Data-Driven Investment Decisions In P2P Lending: Strategies Of Integrating Credit Scoring And Profit Scoring, Yan Wang Apr 2020


Doctor of Data Science and Analytics Dissertations

In this dissertation, we develop and discuss several loan evaluation methods to guide the investment decisions for peer-to-peer (P2P) lending. In evaluating loans, credit scoring and profit scoring are the two widely utilized approaches. Credit scoring aims at minimizing the risk while profit scoring aims at maximizing the profit. This dissertation addresses the strengths and weaknesses of each scoring method by integrating them in various ways in order to provide the optimal investment suggestions for different investors. Before developing the methods for loan evaluation at the individual level, we applied the state-of-the-art method called the Long Short Term Memory (LSTM) …


Evaluating An Ordinal Output Using Data Modeling, Algorithmic Modeling, And Numerical Analysis, Martin Keagan Wynne Brown Jan 2020


Murray State Theses and Dissertations

Data and algorithmic modeling are two different approaches used in predictive analytics. The models discussed from these two approaches include the proportional odds logit model (POLR), the vector generalized linear model (VGLM), the classification and regression tree model (CART), and the random forests model (RF). Patterns in the data were analyzed using trigonometric polynomial approximations and Fast Fourier Transforms. Predictive modeling is used frequently in statistics and data science to find the relationship between the explanatory (input) variables and a response (output) variable. Both approaches prove advantageous in different cases depending on the data set. In our case, the data …


Habitat Associations And Reproduction Of Fishes On The Northwestern Gulf Of Mexico Shelf Edge, Elizabeth Marie Keller Nov 2019


LSU Doctoral Dissertations

Several of the northwestern Gulf of Mexico (GOM) shelf-edge banks provide critical hard bottom habitat for coral and fish communities, supporting a wide diversity of ecologically and economically important species. These sites may be fish aggregation and spawning sites and provide important habitat for fish growth and reproduction. Already designated as habitat areas of particular concern, many of these banks are also under consideration for inclusion in the expansion of the Flower Garden Banks National Marine Sanctuary. This project aimed to gain a more comprehensive understanding of the communities and fish species on shelf-edge banks by way of gonad histology, …


Semi-Supervised Regression With Generative Adversarial Networks Using Minimal Labeled Data, Greg Olmschenk Sep 2019


Dissertations, Theses, and Capstone Projects

This work studies the generalization of semi-supervised generative adversarial networks (GANs) to regression tasks. A novel feature layer contrasting optimization function, in conjunction with a feature matching optimization, allows the adversarial network to learn from unannotated data and thereby reduce the number of labels required to train a predictive network. An analysis of simulated training conditions is performed to explore the capabilities and limitations of the method. In concert with the semi-supervised regression GANs, an improved label topology and upsampling technique for multi-target regression tasks are shown to reduce data requirements. Improvements are demonstrated on a wide variety of vision …


Machine Learning In Support Of Electric Distribution Asset Failure Prediction, Robert D. Flamenbaum, Thomas Pompo, Christopher Havenstein, Jade Thiemsuwan Aug 2019


SMU Data Science Review

In this paper, we present novel approaches to predicting asset failure in the electric distribution system. Failures in overhead power lines and their associated equipment, in particular, pose significant financial and environmental threats to electric utilities. Electric device failure furthermore poses a burden on customers and can pose serious risk to life and livelihood. Working with asset data acquired from an electric utility in Southern California, and incorporating environmental and geospatial data from around the region, we applied a Random Forest methodology to predict which overhead distribution lines are most vulnerable to failure. Our results provide evidence …


Statistical And Machine Learning Methods Evaluated For Incorporating Soil And Weather Into Corn Nitrogen Recommendations, Curtis J. Ransom, Newell R. Kitchen, James J. Camberato, Paul R. Carter, Richard B. Ferguson, Fabián G. Fernández, David W. Franzen, Carrie A. M. Laboski, D. Brenton Myers, Emerson D. Nafziger, John E. Sawyer, John F. Shanahan Aug 2019


John E. Sawyer

Nitrogen (N) fertilizer recommendation tools could be improved for estimating corn (Zea mays L.) N needs by incorporating site-specific soil and weather information. However, an evaluation of analytical methods is needed to determine the success of incorporating this information. The objectives of this research were to evaluate statistical and machine learning (ML) algorithms for utilizing soil and weather information for improving corn N recommendation tools. Eight algorithms [stepwise, ridge regression, least absolute shrinkage and selection operator (Lasso), elastic net regression, principal component regression (PCR), partial least squares regression (PLSR), decision tree, and random forest] were evaluated using a dataset …


A Data-Driven Approach For Modeling Agents, Hamdi Kavak Apr 2019


Computational Modeling & Simulation Engineering Theses & Dissertations

Agents are commonly created on a set of simple rules driven by theories, hypotheses, and assumptions. Such a modeling premise makes limited use of real-world data and is challenged when modeling real-world systems due to the lack of empirical grounding. Simultaneously, the last decade has witnessed the production and availability of large-scale data from various sensors that carry behavioral signals. These data sources have the potential to change the way we create agent-based models, from relying on simple rules to being driven by data. Despite this opportunity, the literature has neglected to offer a modeling approach to generate granular agent behaviors from data, creating …


Data Patterns Discovery Using Unsupervised Learning, Rachel A. Lewis Jan 2019


Electronic Theses and Dissertations

Self-care activities classification poses significant challenges in identifying children’s unique functional abilities and needs within the exceptional children healthcare system. The accuracy of diagnosing a child’s self-care problem, such as toileting or dressing, is highly influenced by an occupational therapist’s experience and time constraints. Thus, there is a need for objective means to detect and predict in advance the self-care problems of children with physical and motor disabilities. We use clustering to discover interesting information from self-care problems, perform automatic classification of binary data, and discover outliers. The advantages are twofold: the advancement of knowledge on identifying self-care problems in …


Longitudinal Tracking Of Physiological State With Electromyographic Signals, Robert Warren Stallard May 2018


Electronic Theses and Dissertations

Electrophysiological measurements have been used in recent history to classify instantaneous physiological configurations, e.g., hand gestures. This work investigates the feasibility of working with changes in physiological configurations over time (i.e., longitudinally) using a variety of algorithms from the machine learning domain. We demonstrate a high degree of classification accuracy for a binary classification problem derived from electromyography measurements before and after a 35-day bedrest. The problem difficulty is increased with a more dynamic experiment testing for changes in astronaut sensorimotor performance by taking electromyography and force plate measurements before, during, and after a jump from a small platform. A …


Cognitive Virtual Admissions Counselor, Kumar Raja Guvindan Raju, Cory Adams, Raghuram Srinivas Apr 2018


SMU Data Science Review

In this paper, we present a cognitive virtual admissions counselor for the Master of Science in Data Science program at Southern Methodist University. The virtual admissions counselor is a system capable of providing potential students with accurate information when they want it. After the evaluation of multiple technologies, Amazon’s LEX was selected to serve as the core technology for the virtual counselor chatbot. Student surveys were leveraged to collect and generate training data to deploy the natural language capability. The cognitive virtual admissions counselor platform is currently capable of providing an end-to-end conversational dialog to …


Comparing Various Machine Learning Statistical Methods Using Variable Differentials To Predict College Basketball, Nicholas Bennett Jan 2018


Williams Honors College, Honors Research Projects

The purpose of this Senior Honors Project is to research, study, and demonstrate newfound knowledge of various machine learning statistical techniques that are not covered in the University of Akron’s statistics major curriculum. This report will be an overview of three machine-learning methods that were used to predict NCAA Basketball results, specifically, the March Madness tournament. The variables used for these methods, models, and tests will include numerous statistics kept throughout the season for each team, along with a couple of variables that are used by the selection committee when tournament teams are being picked. The end goal is to find …


Classification With Large Sparse Datasets: Convergence Analysis And Scalable Algorithms, Xiang Li Jul 2017


Electronic Thesis and Dissertation Repository

Large and sparse datasets, such as user ratings over a large collection of items, are common in the big data era. Many applications need to classify the users or items based on the high-dimensional and sparse data vectors, e.g., to predict the profitability of a product or the age group of a user, etc. Linear classifiers are popular choices for classifying such datasets because of their efficiency. In order to classify the large sparse data more effectively, the following important questions need to be answered.

1. Sparse data and convergence behavior. How different properties of a dataset, such as …
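The efficiency of linear classifiers on sparse vectors can be illustrated with a small sketch: when each example is stored as a mapping of its nonzero features only, scoring and updating cost time proportional to the number of nonzeros rather than the full dimensionality. The example uses a plain perceptron purely for illustration; the abstract does not say which linear classifiers the thesis analyzes, and the names and toy data are hypothetical.

```python
def predict(weights, x):
    """Classify a sparse vector x (dict: feature index -> value) as +1 or -1.
    Only the nonzero features of x are touched."""
    score = sum(weights.get(i, 0.0) * v for i, v in x.items())
    return 1 if score >= 0 else -1


def train_perceptron(data, epochs=10):
    """Perceptron training: on each mistake, add y * x to the weights,
    updating only the features that are present in x."""
    weights = {}
    for _ in range(epochs):
        for x, y in data:
            if predict(weights, x) != y:
                for i, v in x.items():
                    weights[i] = weights.get(i, 0.0) + y * v
    return weights


# Hypothetical sparse examples: (nonzero features, label).
data = [
    ({0: 1.0, 5: 1.0}, 1),
    ({2: 1.0}, -1),
    ({0: 1.0, 7: 2.0}, 1),
    ({2: 1.0, 9: 1.0}, -1),
]
weights = train_perceptron(data)
print(weights)
```

Storing the weights in a dictionary as well means features that never contribute to a mistake take no memory, which is one reason this style suits very high-dimensional data.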


Audio-Based Productivity Forecasting Of Construction Cyclic Activities, Chris A. Sabillon Jan 2017


Electronic Theses and Dissertations

Because construction heavy equipment is costly, project managers must be able to monitor its performance promptly. This cannot be achieved through traditional management techniques, which are based on direct observation or on estimations from historical data. Some manufacturers have started to integrate their proprietary technologies, but construction contractors are unlikely to have a fleet of entirely new, single-manufacturer equipment for this to represent a solution. Third party automated approaches include the use of active sensors such as accelerometers and gyroscopes, passive technologies such as computer vision and image processing, and audio signal processing. Hitherto, most …


Towards Deeper Understanding In Neuroimaging, Rex Devon Hjelm Nov 2016


Computer Science ETDs

Neuroimaging is a growing domain of research, with advances in machine learning having tremendous potential to expand understanding in neuroscience and improve public health. Deep neural networks have recently and rapidly achieved historic success in numerous domains, and as a consequence have completely redefined the landscape of automated learners, giving promise of significant advances in numerous domains of research. Despite recent advances and advantages over traditional machine learning methods, deep neural networks have yet to permeate significantly into neuroscience studies, particularly as a tool for discovery. This dissertation presents well-established and novel tools for unsupervised learning which aid in …


Incorporating Boltzmann Machine Priors For Semantic Labeling In Images And Videos, Andrew Kae Aug 2014


Doctoral Dissertations

Semantic labeling is the task of assigning category labels to regions in an image. For example, a scene may consist of regions corresponding to categories such as sky, water, and ground, or parts of a face such as eyes, nose, and mouth. Semantic labeling is an important mid-level vision task for grouping and organizing image regions into coherent parts. Labeling these regions allows us to better understand the scene itself as well as properties of the objects in the scene, such as their parts, location, and interaction within the scene. Typical approaches for this task include the conditional random field …


Using Methods From The Data-Mining And Machine-Learning Literature For Disease Classification And Prediction: A Case Study Examining Classification Of Heart Failure Subtypes, Peter C. Austin Jan 2013


Peter Austin

OBJECTIVE: Physicians classify patients into those with or without a specific disease. Furthermore, there is often interest in classifying patients according to disease etiology or subtype. Classification trees are frequently used to classify patients according to the presence or absence of a disease. However, classification trees can suffer from limited accuracy. In the data-mining and machine-learning literature, alternate classification schemes have been developed. These include bootstrap aggregation (bagging), boosting, random forests, and support vector machines.

STUDY DESIGN AND SETTING: We compared the performance of these classification methods with that of conventional classification trees to classify patients with heart failure (HF) …


Bayesian And Related Methods: Techniques Based On Bayes' Theorem, Mehmet Vurkaç May 2012


Systems Science Friday Noon Seminar Series

Bayes' theorem is a simple algebraic consequence of conditional probability. Yet, its consequences are critical to philosophy, society, and technology. Starting from its simple derivation, we will show how its interpretation in terms of base rates (priors) and class-conditional likelihoods illuminates everyday problems in medicine and law, and provides signal processing, communications, machine learning, model selection, and other applications of statistics with powerful classification and estimation tools. Next, we will briefly examine some of the ways in which this theorem can be adapted to include multiple attributes, contexts, hypotheses, and levels of risk. Methods derived from or related to Bayes’ …
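The role of base rates (priors) and class-conditional likelihoods can be seen in a short worked example of the kind often used for medical testing. The numbers below are illustrative assumptions, not taken from the seminar, and the function name `posterior` is hypothetical.

```python
def posterior(prior, sensitivity, false_positive_rate):
    """P(condition | positive test) by Bayes' theorem.

    prior               -- base rate P(condition)
    sensitivity         -- likelihood P(positive | condition)
    false_positive_rate -- likelihood P(positive | no condition)
    """
    # Total probability of a positive test across both classes.
    evidence = sensitivity * prior + false_positive_rate * (1.0 - prior)
    return sensitivity * prior / evidence


# Assumed figures: 1% base rate, 95% sensitivity, 5% false positives.
p = posterior(prior=0.01, sensitivity=0.95, false_positive_rate=0.05)
print(f"P(condition | positive) = {p:.3f}")
```

With these assumed figures the posterior is only about 16%, even though the test is right 95% of the time in each class, because the low base rate means most positive results come from the much larger healthy group.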