Open Access. Powered by Scholars. Published by Universities.®

Physical Sciences and Mathematics Commons

Machine learning

Statistics and Probability

Articles 1 - 22 of 22

Full-Text Articles in Physical Sciences and Mathematics

Statistical And Machine Learning Approaches To Describe Factors Affecting Preweaning Mortality Of Piglets, Md Towfiqur Rahman, Tami M. Brown-Brandl, Gary A. Rohrer, Sudhendu R. Sharma, Vamsi Manthena, Yeyin Shi Oct 2023

Biological Systems Engineering: Papers and Publications

High preweaning mortality (PWM) rates for piglets are a significant concern for the worldwide pork industries, causing economic loss and well-being issues. This study focused on identifying the factors affecting PWM, overlays, and predicting PWM using historical production data with statistical and machine learning models. Data were collected from 1,982 litters from the United States Meat Animal Research Center, Nebraska, over the years 2016 to 2021. Sows were housed in a farrowing building with three rooms, each with 20 farrowing crates, and taken care of by well-trained animal caretakers. A generalized linear model was used to analyze the various sow, …


Prediction Of Rapid Early Progression And Survival Risk With Pre-Radiation MRI In WHO Grade 4 Glioma Patients, Walia Farzana, Mustafa M. Basree, Norou Diawara, Zeina Shboul, Sagel Dubey, Marie M. Lockheart, Mohamed Hamza, Joshua D. Palmer, Khan Iftekharuddin Jan 2023

Electrical & Computer Engineering Faculty Publications

Rapid early progression (REP) has been defined as increased nodular enhancement at the border of the resection cavity, the appearance of new lesions outside the resection cavity, or increased enhancement of the residual disease after surgery and before radiation. Patients with REP have worse survival compared to patients without REP (non-REP). Therefore, a reliable method for differentiating REP from non-REP is hypothesized to assist in personalized treatment planning. A potential approach is to use the radiomics and fractal texture features extracted from brain tumors to characterize morphological and physiological properties. We propose a random sampling-based ensemble classification model. The proposed …


Predictors Of COVID-19 Vaccination Rate In USA: A Machine Learning Approach, Syed M. I. Osman, Ahmed Sabit Dec 2022

WCBT Faculty Publications

In this study, we examine state-level features and policies that are most important in achieving a threshold level vaccination rate to curb the effects of the COVID-19 pandemic. We employ CHAID, a decision tree algorithm, on three different model specifications to answer this question based on a dataset that includes all the states in the United States. Workplace travel emerges as the most important predictor; however, the governors’ political affiliation (PA) replaces it in a more conservative feature set that includes economic features and the growth rate of COVID-19 cases. We also employ several alternative algorithms as a robustness check. …


Volitional Control Of Lower-Limb Prosthesis With Vision-Assisted Environmental Awareness, S M Shafiul Hasan Mar 2022

FIU Electronic Theses and Dissertations

Early and reliable prediction of user’s intention to change locomotion mode or speed is critical for a smooth and natural lower limb prosthesis. Meanwhile, incorporation of explicit environmental feedback can facilitate context aware intelligent prosthesis which allows seamless operation in a variety of gait demands. This dissertation introduces environmental awareness through computer vision and enables early and accurate prediction of intention to start, stop or change speeds while walking. Electromyography (EMG), Electroencephalography (EEG), Inertial Measurement Unit (IMU), and Ground Reaction Force (GRF) sensors were used to predict intention to start, stop or increase walking speed. Furthermore, it was investigated whether …


A Keyword-Enhanced Approach To Handle Class Imbalance In Clinical Text Classification, Andrew E. Blanchard, Shang Gao, Hong Jun Yoon, J. Blair Christian, Eric B. Durbin, Xiao Cheng Wu, Antoinette Stroup, Jennifer Doherty, Stephen M. Schwartz, Charles Wiggins, Linda Coyle, Lynne Penberthy, Georgia D. Tourassi Jan 2022

School of Public Health Faculty Publications

Recent applications of deep learning have shown promising results for classifying unstructured text in the healthcare domain. However, the reliability of models in production settings has been hindered by imbalanced data sets in which a small subset of the classes dominate. In the absence of adequate training data, rare classes necessitate additional model constraints for robust performance. Here, we present a strategy for incorporating short sequences of text (i.e., keywords) into training to boost model accuracy on rare classes. In our approach, we assemble a set of keywords, including short phrases, associated with each class. The keywords are then used as …


Comparing Machine Learning Techniques With State-Of-The-Art Parametric Prediction Models For Predicting Soybean Traits, Susweta Ray Dec 2021

Department of Statistics: Dissertations, Theses, and Student Work

Soybean is a significant source of protein and oil, and is also widely used as animal feed. Thus, developing lines that are superior in terms of yield, protein and oil content is important to feed the ever-growing population. Unlike high-cost phenotyping, genotyping is both cost- and time-efficient for breeders, whereas evaluating new lines in different environments (location-year combinations) can be costly. Several genomic prediction (GP) methods have been developed to use the marker and environment data effectively to predict the yield or other relevant phenotypic traits of crops. Our study compares a conventional GP method (GBLUP), a …


Developing And Improving Risk Models Using Machine-Learning Based Algorithms, Yan Wang, Sherry Ni Jan 2020

Published and Grey Literature from PhD Candidates

The objective of this study is to develop a good risk model for classifying business delinquency by simultaneously exploring several machine learning-based methods including regularization, hyperparameter optimization, and model ensembling algorithms. The rationale behind the analyses is first to obtain good base binary classifiers (including Logistic Regression (LR), K-Nearest Neighbors (KNN), Decision Tree (DT), and Artificial Neural Networks (ANN)) via regularization and appropriate settings of hyperparameters. Then two model ensembling algorithms, bagging and boosting, are performed on the good base classifiers for further model improvement. The models are evaluated using accuracy, Area Under the Receiver Operating Characteristic …


An Analysis Of The Success Of Farmers Markets In Kentucky Using Logistic Regression And Support Vector Machines, Jeron Russell Jan 2020

Mahurin Honors College Capstone Experience/Thesis Projects

The purpose of this research is to look at the relationship that market-specific, economic, and demographic variables have with the success of farmers markets in Kentucky. It additionally seeks to build a tool for predicting farmers market success that could be used by policy makers to aid in decision-making processes concerning farmers markets. Logistic regression and Support Vector Machines (SVMs) are used on data acquired from the Kentucky Department of Agriculture and the American Community Survey in order to analyze the data in a traditional statistical approach as well as a machine learning approach. The results included an SVM model …


The Paradox Of Big Data, Gary N. Smith Jan 2019

Pomona Economics

Data mining is often used to discover patterns in Big Data. It is tempting to believe that because an unearthed pattern is unusual it must be meaningful, but patterns are inevitable in Big Data and usually meaningless. The paradox of Big Data is that data mining is most seductive when there are a large number of variables, but a large number of variables exacerbates the perils of data mining.
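
The claim that patterns are inevitable among many variables can be illustrated with a small simulation. The sketch below is not taken from the paper: the function names, the sample size of 20 observations, and the correlation threshold of 0.5 are all assumptions chosen for illustration. It generates independent Gaussian noise variables and counts how many pairs appear strongly correlated purely by chance.

```python
import random
import statistics


def pearson(xs, ys):
    """Sample Pearson correlation between two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def count_spurious_pairs(n_vars, n_obs=20, threshold=0.5, seed=0):
    """Count pairs of independent noise variables with |r| above threshold.

    Every variable is pure noise, so any pair that passes the threshold
    is a meaningless pattern by construction (illustrative assumption).
    """
    rng = random.Random(seed)
    data = [[rng.gauss(0.0, 1.0) for _ in range(n_obs)] for _ in range(n_vars)]
    return sum(
        1
        for i in range(n_vars)
        for j in range(i + 1, n_vars)
        if abs(pearson(data[i], data[j])) > threshold
    )


if __name__ == "__main__":
    for n_vars in (5, 20, 100):
        n_pairs = n_vars * (n_vars - 1) // 2
        print(n_vars, "variables,", n_pairs, "pairs,",
              count_spurious_pairs(n_vars), "above threshold")
```

The number of candidate pairs grows quadratically with the number of variables, so the count of chance "discoveries" grows with it even though nothing real is there to find.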


A Comparison Of Machine Learning Techniques For Taxonomic Classification Of Teeth From The Family Bovidae, Gregory J. Matthews, Juliet K. Brophy, Maxwell Luetkemeier, Hongie Gu, George K. Thiruvathukal Mar 2018

Mathematics and Statistics: Faculty Publications and Other Works

This study explores the performance of machine learning algorithms on the classification of fossil teeth in the Family Bovidae. Isolated bovid teeth are typically the most common fossils found in southern Africa and they often constitute the basis for paleoenvironmental reconstructions. Taxonomic identification of fossil bovid teeth, however, is often imprecise and subjective. Using modern teeth with known taxa, machine learning algorithms can be trained to classify fossils. Previous work by Brophy et al. [Quantitative morphological analysis of bovid teeth and implications for paleoenvironmental reconstruction of Plovers Lake, Gauteng Province, South Africa, J. Archaeol. Sci. 41 (2014), pp. …


Nonparametric Variable Importance Assessment Using Machine Learning Techniques, Brian D. Williamson, Peter B. Gilbert, Noah Simon, Marco Carone Aug 2017

UW Biostatistics Working Paper Series

In a regression setting, it is often of interest to quantify the importance of various features in predicting the response. Commonly, the variable importance measure used is determined by the regression technique employed. For this reason, practitioners often only resort to one of a few regression techniques for which a variable importance measure is naturally defined. Unfortunately, these regression techniques are often sub-optimal for predicting response. Additionally, because the variable importance measures native to different regression techniques generally have a different interpretation, comparisons across techniques can be difficult. In this work, we study a novel variable importance measure that can …


Identification Of Prognostic Genes And Gene Sets For Early-Stage Non-Small Cell Lung Cancer Using Bi-Level Selection Methods, Suyan Tian, Chi Wang, Howard H. Chang, Jianguo Sun Apr 2017

Biostatistics Faculty Publications

In contrast to feature selection and gene set analysis, bi-level selection is a process of selecting not only important gene sets but also important genes within those gene sets. Depending on the order of selections, a bi-level selection method can be classified into three categories – forward selection, which first selects relevant gene sets followed by the selection of relevant individual genes; backward selection, which takes the reversed order; and simultaneous selection, which performs the two tasks simultaneously, usually with the aid of a penalized regression model. To test the existence of subtype-specific prognostic genes for non-small cell lung cancer …


Application Of Response Surface Methods To Determine Conditions For Optimal Genomic Prediction, Reka Howard, Alicia L. Carriquiry, William D. Beavis Jan 2017

Department of Statistics: Faculty Publications

An epistatic genetic architecture can have a significant impact on prediction accuracies of genomic prediction (GP) methods. Machine learning methods predict traits comprised of epistatic genetic architectures more accurately than statistical methods based on additive mixed linear models. The differences between these types of GP methods suggest a diagnostic for revealing genetic architectures underlying traits of interest. In addition to genetic architecture, the performance of GP methods may be influenced by the sample size of the training population, the number of QTL, and the proportion of phenotypic variability due to genotypic variability (heritability). Possible values for these factors and the …


Biogeographical Patterns Of Soil Microbial Communities: Ecological, Structural, And Functional Diversity And Their Application To Soil Provenance, Natalie Damaso Oct 2016

FIU Electronic Theses and Dissertations

The current ecological hypothesis states that the soil type (e.g., chemical and physical properties) determines which microbes occupy a particular soil and provides the foundation for soil provenance studies. As human profiles are used to determine a match between evidence from a crime scene and a suspect, a soil microbial profile can be used to determine a match between soil found on the suspect’s shoes or clothing to the soil at a crime scene. However, for a robust tool to be applied in forensic application, an understanding of the uncertainty associated with any comparisons and the parameters that can significantly …


Online Cross-Validation-Based Ensemble Learning, David Benkeser, Samuel D. Lendle, Cheng Ju, Mark J. van der Laan Oct 2016

U.C. Berkeley Division of Biostatistics Working Paper Series

Online estimators update a current estimate with a new incoming batch of data without having to revisit past data, thereby providing streaming estimates that are scalable to big data. We develop flexible, ensemble-based online estimators of an infinite-dimensional target parameter, such as a regression function, in the setting where data are generated sequentially by a common conditional data distribution given summary measures of the past. This setting encompasses a wide range of time-series models and, as a special case, models for independent and identically distributed data. Our estimator considers a large library of candidate online estimators and uses online cross-validation to …


Privacy And Accountability In Black-Box Medicine, Roger Allan Ford, W. Nicholson Price II Jan 2016

Law Faculty Scholarship

Black-box medicine—the use of big data and sophisticated machine learning techniques for health-care applications—could be the future of personalized medicine. Black-box medicine promises to make it easier to diagnose rare diseases and conditions, identify the most promising treatments, and allocate scarce resources among different patients. But to succeed, it must overcome two separate, but related, problems: patient privacy and algorithmic accountability. Privacy is a problem because researchers need access to huge amounts of patient health information to generate useful medical predictions. And accountability is a problem because black-box algorithms must be verified by outsiders to ensure they are accurate and …


A Data Science Course For Undergraduates: Thinking With Data, Benjamin Baumer Dec 2015

Mathematics Sciences: Faculty Publications

Data science is an emerging interdisciplinary field that combines elements of mathematics, statistics, computer science, and knowledge in a particular application domain for the purpose of extracting meaningful information from the increasingly sophisticated array of data available in many settings. These data tend to be nontraditional, in the sense that they are often live, large, complex, and/or messy. A first course in statistics at the undergraduate level typically introduces students to a variety of techniques to analyze small, neat, and clean datasets. However, whether they pursue more formal training in statistics or not, many of these students will end up …


Variable Importance And Prediction Methods For Longitudinal Problems With Missing Variables, Ivan Diaz, Alan E. Hubbard, Anna Decker, Mitchell Cohen Oct 2013

U.C. Berkeley Division of Biostatistics Working Paper Series

In this paper we present prediction and variable importance (VIM) methods for longitudinal data sets containing both continuous and binary exposures subject to missingness. We demonstrate the use of these methods for prognosis of medical outcomes of severe trauma patients, a field in which current medical practice involves rules of thumb and scoring methods that only use a few variables and ignore the dynamic and high-dimensional nature of trauma recovery. Well-principled prediction and VIM methods can thus provide a tool to make care decisions informed by the high-dimensional patient’s physiological and clinical history. Our VIM parameters can be causally interpreted …


Asymptotically Unbiased Estimator Of The Informational Energy With KNN, Angel Caţaron, Răzvan Andonie, Chinmei Y. Chueh Oct 2013

All Faculty Scholarship for the College of the Sciences

Motivated by machine learning applications (e.g., classification, function approximation, feature extraction), in previous work, we have introduced a nonparametric estimator of Onicescu’s informational energy. Our method was based on the k-th nearest neighbor distances between the n sample points, where k is a fixed positive integer. In the present contribution, we discuss mathematical properties of this estimator. We show that our estimator is asymptotically unbiased and consistent. We provide further experimental results which illustrate the convergence of the estimator for standard distributions.


Computationally Efficient Confidence Intervals For Cross-Validated Area Under The ROC Curve Estimates, Erin LeDell, Maya L. Petersen, Mark J. van der Laan Dec 2012

U.C. Berkeley Division of Biostatistics Working Paper Series

In binary classification problems, the area under the ROC curve (AUC) is an effective means of measuring the performance of your model. Most often, cross-validation is also used, in order to assess how the results will generalize to an independent data set. In order to evaluate the quality of an estimate for cross-validated AUC, we must obtain an estimate for its variance. For massive data sets, the process of generating a single performance estimate can be computationally expensive. Additionally, when using a complex prediction method, calculating the cross-validated AUC on even a relatively small data set can still require a …
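
As a rough illustration of the quantity discussed above, the following sketch computes a cross-validated AUC point estimate. It does not implement the paper's confidence-interval method, which the truncated abstract does not detail. The function names, the `fit` callback interface, and the deterministic fold assignment are hypothetical choices made for this example; AUC is computed as the fraction of positive-negative pairs ranked correctly, with ties counted as one half.

```python
def auc(labels, scores):
    """AUC as the probability that a positive outscores a negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    # Each correctly ordered pair counts 1, each tie counts 0.5.
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def cross_validated_auc(features, labels, fit, n_folds=5):
    """Average the held-out AUC over n_folds folds.

    `fit(train_features, train_labels)` is a hypothetical callback that
    returns a function mapping one feature value to a score. Folds are
    assigned by index modulo n_folds, so the caller is assumed to have
    shuffled the data beforehand.
    """
    n = len(labels)
    fold_aucs = []
    for k in range(n_folds):
        test = list(range(k, n, n_folds))
        held_out = set(test)
        train = [i for i in range(n) if i not in held_out]
        predict = fit([features[i] for i in train], [labels[i] for i in train])
        fold_aucs.append(
            auc([labels[i] for i in test], [predict(features[i]) for i in test])
        )
    return sum(fold_aucs) / n_folds
```

Each fold requires refitting the model, which is why a single cross-validated estimate can already be expensive on large data sets or with complex learners.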


Empirical Methods For Predicting Student Retention- A Summary From The Literature, Matt Bogard May 2011

Economics Faculty Publications

The vast majority of the literature related to the empirical estimation of retention models includes a discussion of the theoretical retention framework established by Bean, Braxton, Tinto, Pascarella, Terenzini and others (see Bean, 1980; Bean, 2000; Braxton, 2000; Braxton et al., 2004; Chapman and Pascarella, 1983; Pascarella and Terenzini, 1978; St. John and Cabrera, 2000; Tinto, 1975). This body of research provides a starting point for the consideration of which explanatory variables to include in any model specification, as well as identifying possible data sources. The literature separates itself into two major camps including research related to the hypothesis testing …


Quantification Of Artistic Style Through Sparse Coding Analysis In The Drawings Of Pieter Bruegel The Elder, James M. Hughes, Daniel J. Graham, Daniel N. Rockmore Jan 2010

Dartmouth Scholarship

Recently, statistical techniques have been used to assist art historians in the analysis of works of art. We present a novel technique for the quantification of artistic style that utilizes a sparse coding model. Originally developed in vision research, sparse coding models can be trained to represent any image space by maximizing the kurtosis of a representation of an arbitrarily selected image from that space. We apply such an analysis to successfully distinguish a set of authentic drawings by Pieter Bruegel the Elder from another set of well-known Bruegel imitations. We show that our approach, which involves a direct comparison …