Open Access. Powered by Scholars. Published by Universities.®

Physical Sciences and Mathematics Commons

Open Access. Powered by Scholars. Published by Universities.®

Theses/Dissertations

Statistics and Probability

Classification

Institution
Publication Year
Publication

Articles 1 - 29 of 29

Full-Text Articles in Physical Sciences and Mathematics

Comparative Study Of Supervised Classification Techniques With A Modified Knn Algorithm, Noah Owusu Aug 2023

Comparative Study Of Supervised Classification Techniques With A Modified Knn Algorithm, Noah Owusu

Open Access Theses & Dissertations

The goal of classification is to develop a model that can be used to accurately assign new observations to labeled classes based on the patterns learned from the training data. K-nearest Neighbors algorithm (KNN) is a popular and widely used algorithm for classification, however, its performance can be adversely affected by the presence of outliers in a dataset. In this study we have modified this existing KNN algorithm that can alleviate the effect of outliers in a dataset, thereby improving the performance of the KNN algorithm. We compared the performances of the Modified KNN method and the Existing KNN algorithm …


Modeling The Probability Of A Successful Stolen Base Attempt In Major League Baseball, Cade Stanley Apr 2023

Modeling The Probability Of A Successful Stolen Base Attempt In Major League Baseball, Cade Stanley

Senior Theses

In Major League Baseball (MLB), the outcome of a stolen base attempt has important implications. Success moves the runner closer to scoring, while failure records an out and removes the runner from the basepaths altogether. Therefore, it is important that the decision by a coach or player to steal a base is well-informed. In this thesis, I explore a statistical approach to making this decision. I train logistic regression and random forest models, using data about the game situation and about the runner, pitcher, and catcher involved in the stolen base attempt, to estimate the probability that a stolen base …


Analyzing Relationships With Machine Learning, Oscar Ko Feb 2023

Analyzing Relationships With Machine Learning, Oscar Ko

Dissertations, Theses, and Capstone Projects

Procedurally, this project aims to take a dataset, analyze it, and offer insights to the audience in an easy-to-digest format. Conceptually, this project will seek to explore questions like: “Do couples that meet through online dating or dating apps have higher or lower quality relationships?”, “Can any features in this dataset help predict how a subject would rate their relationship quality?”, and “What other insights can I derive from using machine learning for exploratory analysis?” The intended audience for this project is anyone interested in romantic relationships or machine learning.

The dataset is from a Stanford University survey, “How Couples …


Bayesian Nonparametric Model For Functional Data Analysis, Tahmidul Islam Apr 2021

Bayesian Nonparametric Model For Functional Data Analysis, Tahmidul Islam

Theses and Dissertations

Functional data analysis (FDA) experienced a burst of growth after Ramsay and Silverman published their textbook in 1997. Functional data analysis interests researchers because of the challenges it adds to well-established multivariate analysis. Unlike finite dimensional random vectors, we visualize infinite dimensional random functions; for example, curves, images, brain scans, etc. A vast amount of literature have been dedicated to developing models for functional data. The ideas are mostly based on basis function representations and kernel-based nonparametric methods. In this dissertation, we propose a Bayesian treatment of nonparametric functional data analysis by introducing a Gaussian process (GP) over the space …


Machine Learning Approaches For Improving Prediction Performance Of Structure-Activity Relationship Models, Gabriel Idakwo Aug 2020

Machine Learning Approaches For Improving Prediction Performance Of Structure-Activity Relationship Models, Gabriel Idakwo

Dissertations

In silico bioactivity prediction studies are designed to complement in vivo and in vitro efforts to assess the activity and properties of small molecules. In silico methods such as Quantitative Structure-Activity/Property Relationship (QSAR) are used to correlate the structure of a molecule to its biological property in drug design and toxicological studies. In this body of work, I started with two in-depth reviews into the application of machine learning based approaches and feature reduction methods to QSAR, and then investigated solutions to three common challenges faced in machine learning based QSAR studies.

First, to improve the prediction accuracy of learning …


A Study Of The Efficacy Of Machine Learning For Diagnosing Obstructive Coronary Artery Disease In Non-Diabetic Patients, Demond Larae Handley Jul 2020

A Study Of The Efficacy Of Machine Learning For Diagnosing Obstructive Coronary Artery Disease In Non-Diabetic Patients, Demond Larae Handley

Theses and Dissertations

According to the Centers for Disease Control and Prevention, about 18.2 million adults age 20 and older have Coronary Artery Disease in the United States. Early diagnosis is therefore of crucial importance to help prevent debilitating consequences, and principally death for many patients. In this study we use data containing gene expression values from peripheral blood samples in 198 non-diabetic patients, with the goal of developing an age and sex gene expression model for diagnosis of Coronary Artery Disease. We employ machine learning methods to obtain a classification based on genetic information, age and sex. Our implementation uses feed forward …


Gradient Boosting For Survival Analysis With Applications In Oncology, Nam Phuong Nguyen Jan 2020

Gradient Boosting For Survival Analysis With Applications In Oncology, Nam Phuong Nguyen

USF Tampa Graduate Theses and Dissertations

Cancer is one of the most deadly diseases that the world has been fighting against over decades. An enormous number of research has been conducted, via a wide scale of approaches, raging from genetic analysis to mathematical modeling. Survival analysis is a well-performed methodology frequently used to estimate the survival probability of a patient. Although there has been a large number of methods for survival analysis, efficient exploration of a high-dimensional feature space has been challenging due to its computational cost and complexity. This thesis adapts the component-wise gradient boosting algorithms for cancer survival analysis, and also proposes a new …


Fractional Random Weighted Bootstrapping For Classification On Imbalanced Data With Ensemble Decision Tree Methods, Sean Charles Carter Nov 2019

Fractional Random Weighted Bootstrapping For Classification On Imbalanced Data With Ensemble Decision Tree Methods, Sean Charles Carter

USF Tampa Graduate Theses and Dissertations

Ensemble methods are commonly used for building predictive models for classification. Models that are unstable to perturbations in the training set, such as the decision tree, often see considerable reductions in error when grouped, using bootstrapped resamples of the training data to train many models. The non-parametric bootstrap, however, has limited efficacy when used on severely imbalanced data, especially when the number of observations of one or more classes is exceptionally small. We explore the fractional random weighted bootstrap, which randomly assigns fractional weights to observations, as an alternative resampling pro cedure in training machine learning ensembles, particularly decision tree …


Adaptive Feature Engineering Modeling For Ultrasound Image Classification For Decision Support, Hatwib Mugasa Oct 2019

Adaptive Feature Engineering Modeling For Ultrasound Image Classification For Decision Support, Hatwib Mugasa

Doctoral Dissertations

Ultrasonography is considered a relatively safe option for the diagnosis of benign and malignant cancer lesions due to the low-energy sound waves used. However, the visual interpretation of the ultrasound images is time-consuming and usually has high false alerts due to speckle noise. Improved methods of collection image-based data have been proposed to reduce noise in the images; however, this has proved not to solve the problem due to the complex nature of images and the exponential growth of biomedical datasets. Secondly, the target class in real-world biomedical datasets, that is the focus of interest of a biopsy, is usually …


Classification With Measurement Error In Covariates Or Response, With Application To Prostate Cancer Imaging Study, Kexin Luo Aug 2019

Classification With Measurement Error In Covariates Or Response, With Application To Prostate Cancer Imaging Study, Kexin Luo

Electronic Thesis and Dissertation Repository

The research is motivated by the prostate cancer imaging study conducted at the University of Western Ontario to classify cancer status using multiple in-vivo images. The prostate cancer histological image and the in-vivo images are subject to misalignment in the co-registration procedure, which can be viewed as measurement error in covariates or response. We investigate methods to correct this problem.

The first proposed method corrects the predicted class probability when the data has misclassified labels. The correction equation is derived from the relationship between the true response and the error-prone response. The probability for the observed class label is adjusted …


Deep Learning Analysis Of Limit Order Book, Xin Xu May 2018

Deep Learning Analysis Of Limit Order Book, Xin Xu

Arts & Sciences Electronic Theses and Dissertations

In this paper, we build a deep neural network for modeling spatial structure in limit order book and make prediction for future best ask or best bid price based on ideas of (Sirignano 2016). We propose an intuitive data processing method to approximate the data is non-available for us based only on level I data that is more widely available. The model is based on the idea that there is local dependence for best ask or best bid price and sizes of related orders. First we use logistic regression to prove that this approach is reasonable. To show the advantages …


Multiclass Classification Using Support Vector Machines, Duleep Prasanna W. Rathgamage Don Jan 2018

Multiclass Classification Using Support Vector Machines, Duleep Prasanna W. Rathgamage Don

Electronic Theses and Dissertations

In this thesis, we discuss different SVM methods for multiclass classification and introduce the Divide and Conquer Support Vector Machine (DCSVM) algorithm which relies on data sparsity in high dimensional space and performs a smart partitioning of the whole training data set into disjoint subsets that are easily separable. A single prediction performed between two partitions eliminates one or more classes in a single partition, leaving only a reduced number of candidate classes for subsequent steps. The algorithm continues recursively, reducing the number of classes at each step until a final binary decision is made between the last two classes …


Classification Of High-Dimensional Data Based On Multiple Testing Methods, Chong Ma Jan 2018

Classification Of High-Dimensional Data Based On Multiple Testing Methods, Chong Ma

Theses and Dissertations

Supervised and unsupervised classification are common topics in machine learning in both scientific and industrial fields, which usually involve three tasks: prediction, exploration, and explanation. False discovery rate (FDR) theory has a close connection to classical classification theory, which must be employed in a sophisticated way to achieve good performance in various contexts. The study aims to explore novel supervised classifiers and unsupervised classification approaches for functional data and high-dimensional data in genome study by using FDR, respectively. One work develops a novel classifier for functional data by casting the classification problem into a multiple testing task, which involves using …


Data Analysis Methods Using Persistence Diagrams, Andrew Marchese Aug 2017

Data Analysis Methods Using Persistence Diagrams, Andrew Marchese

Doctoral Dissertations

In recent years, persistent homology techniques have been used to study data and dynamical systems. Using these techniques, information about the shape and geometry of the data and systems leads to important information regarding the periodicity, bistability, and chaos of the underlying systems. In this thesis, we study all aspects of the application of persistent homology to data analysis. In particular, we introduce a new distance on the space of persistence diagrams, and show that it is useful in detecting changes in geometry and topology, which is essential for the supervised learning problem. Moreover, we introduce a clustering framework directly …


Real-Time Classification Of Biomedical Signals, Parkinson’S Analytical Model, Abolfazl Saghafi Jun 2017

Real-Time Classification Of Biomedical Signals, Parkinson’S Analytical Model, Abolfazl Saghafi

USF Tampa Graduate Theses and Dissertations

The reach of technological innovation continues to grow, changing all industries as it evolves. In healthcare, technology is increasingly playing a role in almost all processes, from patient registration to data monitoring, from lab tests to self-care tools. The increase in the amount and diversity of generated clinical data requires development of new technologies and procedures capable of integrating and analyzing the BIG generated information as well as providing support in their interpretation.

To that extent, this dissertation focuses on the analysis and processing of biomedical signals, specifically brain and heart signals, using advanced machine learning techniques. That is, the …


Statistical Methods For Assessing Individual Oocyte Viability Through Gene Expression Profiles, Michael O. Bishop May 2017

Statistical Methods For Assessing Individual Oocyte Viability Through Gene Expression Profiles, Michael O. Bishop

All Graduate Plan B and other Reports, Spring 1920 to Spring 2023

Abstract

Statistical Methods for Assessing Individual Oocyte Viability Through Gene Expression Profiles

By

Michael O. Bishop

Utah State University, 2017

Major Professor: Dr. John R. Stevens

Department: Mathematics and Statistics

Oocytes are the precursor cells to the female gamete, or egg. While reproduction may vary from species to species, within humans and most domesticated animals, the oocyte maturation process is fairly similar. As an oocyte matures, there are various processes that take place, all of which have an effect on the viability of the individual oocyte. Barring outside damage that may come to the oocyte, one of the primary reasons …


A Framework For The Statistical Analysis Of Mass Spectrometry Imaging Experiments, Kyle Bemis Dec 2016

A Framework For The Statistical Analysis Of Mass Spectrometry Imaging Experiments, Kyle Bemis

Open Access Dissertations

Mass spectrometry (MS) imaging is a powerful investigation technique for a wide range of biological applications such as molecular histology of tissue, whole body sections, and bacterial films , and biomedical applications such as cancer diagnosis. MS imaging visualizes the spatial distribution of molecular ions in a sample by repeatedly collecting mass spectra across its surface, resulting in complex, high-dimensional imaging datasets. Two of the primary goals of statistical analysis of MS imaging experiments are classification (for supervised experiments), i.e. assigning pixels to pre-defined classes based on their spectral profiles, and segmentation (for unsupervised experiments), i.e. assigning pixels to newly …


Identification Of Slums In Mumbai, India: Unsupervised Classification Techniques, Frankie St. Amand Apr 2014

Identification Of Slums In Mumbai, India: Unsupervised Classification Techniques, Frankie St. Amand

Thinking Matters Symposium Archive

Slums are contiguous settlements. Inhabitants lack access to safe water, sanitation and sewage infrastructure, secure housing tenure, uncrowded living space, and permanent, durable housing. Addressing these problematic trends begins with identifying contiguous settlements within Mumbai’s urban fabric. Classifications can be performed using satellite images and remote sensing techniques to yield accurate results. Through literature reviews, socio-cultural analysis, and examination of high resolution satellite imagery, this project aims to develop a systematic, accessible, and reproducible method of classifying Mumbai’s slums.


Regularization Methods For Predicting An Ordinal Response Using Longitudinal High-Dimensional Genomic Data, Jiayi Hou Nov 2013

Regularization Methods For Predicting An Ordinal Response Using Longitudinal High-Dimensional Genomic Data, Jiayi Hou

Theses and Dissertations

Ordinal scales are commonly used to measure health status and disease related outcomes in hospital settings as well as in translational medical research. Notable examples include cancer staging, which is a five-category ordinal scale indicating tumor size, node involvement, and likelihood of metastasizing. Glasgow Coma Scale (GCS), which gives a reliable and objective assessment of conscious status of a patient, is an ordinal scaled measure. In addition, repeated measurements are common in clinical practice for tracking and monitoring the progression of complex diseases. Classical ordinal modeling methods based on the likelihood approach have contributed to the analysis of data in …


Integrative Biomarker Identification And Classification Using High Throughput Assays, Pan Tong May 2013

Integrative Biomarker Identification And Classification Using High Throughput Assays, Pan Tong

Dissertations & Theses (Open Access)

It is well accepted that tumorigenesis is a multi-step procedure involving aberrant functioning of genes regulating cell proliferation, differentiation, apoptosis, genome stability, angiogenesis and motility. To obtain a full understanding of tumorigenesis, it is necessary to collect information on all aspects of cell activity. Recent advances in high throughput technologies allow biologists to generate massive amounts of data, more than might have been imagined decades ago. These advances have made it possible to launch comprehensive projects such as (TCGA) and (ICGC) which systematically characterize the molecular fingerprints of cancer cells using gene expression, methylation, copy number, microRNA and SNP microarrays …


Enhancement Of Random Forests Using Trees With Oblique Splits, Andrejus Parfionovas May 2013

Enhancement Of Random Forests Using Trees With Oblique Splits, Andrejus Parfionovas

All Graduate Theses and Dissertations, Spring 1920 to Summer 2023

Statistical classification is widely used in many areas where there is a need to make a data-driven decision, or to classify complicated cases or objects. For instance: disease diagnostics (is a patient sick or healthy, based on the blood test results?); weather forecasting (will there be a storm tomorrow, based on today's atmospheric pressure, air temperature, and wind velocity?); speech recognition (what was said over the phone, based on the caller's voice level and articulation); spam detection (can the unsolicited commercial e-mails be identified by their content?); and so on.

Classification trees …


Class Discovery And Prediction Of Tumor With Microarray Data, Bo Liu Jan 2011

Class Discovery And Prediction Of Tumor With Microarray Data, Bo Liu

All Graduate Theses, Dissertations, and Other Capstone Projects

Current microarray technology is able take a single tissue sample to construct an Affymetrix oglionucleotide array containing (estimated) expression levels of thousands of different genes for that tissue. The objective is to develop a more systematic approach to cancer classification based on Affymetrix oglionucleotide microarrays. For this purpose, I studied published colon cancer microarray data. Colon cancer, with 655,000 deaths worldwide per year, has become the fourth most common form of cancer in the United States and the third leading cause of cancer - related death in the Western world. This research has been focuses in two areas: class discovery, …


An Empirical Approach To Evaluating Sufficient Similarity: Utilization Of Euclidean Distance As A Similarity Measure, Scott Marshall May 2010

An Empirical Approach To Evaluating Sufficient Similarity: Utilization Of Euclidean Distance As A Similarity Measure, Scott Marshall

Theses and Dissertations

Individuals are exposed to chemical mixtures while carrying out everyday tasks, with unknown risk associated with exposure. Given the number of resulting mixtures it is not economically feasible to identify or characterize all possible mixtures. When complete dose-response data are not available on a (candidate) mixture of concern, EPA guidelines define a similar mixture based on chemical composition, component proportions and expert biological judgment (EPA, 1986, 2000). Current work in this literature is by Feder et al. (2009), evaluating sufficient similarity in exposure to disinfection by-products of water purification using multivariate statistical techniques and traditional hypothesis testing. The work of …


Cluster And Classification Analysis Of Fossil Invertebrates Within The Bird Spring Formation, Arrow Canyon, Nevada: Implications For Relative Rise And Fall Of Sea-Level, Scott L. Morris Apr 2010

Cluster And Classification Analysis Of Fossil Invertebrates Within The Bird Spring Formation, Arrow Canyon, Nevada: Implications For Relative Rise And Fall Of Sea-Level, Scott L. Morris

Theses and Dissertations

Carbonate strata preserve indicators of local marine environments through time. Such indicators often include microfossils that have relatively unique conditions under which they can survive, including light, nutrients, salinity, and especially water temperature. As such, microfossils are environmental proxies. When these microfossils are preserved in the rock record, they constitute key components of depositional facies. Spence et al. (2004, 2007) has proposed several approaches for determining the facies of a given stratigraphic succession based upon these proxies. Cluster analysis can be used to determine microfossil groups that represent specific environmental conditions. Identifying which microfossil groups exist through time can indicate …


Statistical Learning And Behrens-Fisher Distribution Methods For Heteroscedastic Data In Microarray Analysis, Nabin K. Manandhr-Shrestha Mar 2010

Statistical Learning And Behrens-Fisher Distribution Methods For Heteroscedastic Data In Microarray Analysis, Nabin K. Manandhr-Shrestha

USF Tampa Graduate Theses and Dissertations

The aim of the present study is to identify the di®erentially expressed genes be- tween two di®erent conditions and apply it in predicting the class of new samples using the microarray data. Microarray data analysis poses many challenges to the statis- ticians because of its high dimensionality and small sample size, dubbed as "small n large p problem". Microarray data has been extensively studied by many statisticians and geneticists. Generally, it is said to follow a normal distribution with equal vari- ances in two conditions, but it is not true in general. Since the number of replications is very small, …


Data Mining Methods For Malware Detection, Muazzam Siddiqui Jan 2008

Data Mining Methods For Malware Detection, Muazzam Siddiqui

Electronic Theses and Dissertations

This research investigates the use of data mining methods for malware (malicious programs) detection and proposed a framework as an alternative to the traditional signature detection methods. The traditional approaches using signatures to detect malicious programs fails for the new and unknown malwares case, where signatures are not available. We present a data mining framework to detect malicious programs. We collected, analyzed and processed several thousand malicious and clean programs to find out the best features and build models that can classify a given program into a malware or a clean class. Our research is closely related to information retrieval …


Special Classification Models For Lichens In The Pacific Northwest, Janeen Ardito May 2005

Special Classification Models For Lichens In The Pacific Northwest, Janeen Ardito

All Graduate Plan B and other Reports, Spring 1920 to Spring 2023

A common problem in ecological studies is that of determining where to look for rare species. This paper shows how statistical models, such as classification trees, may be used to assist in the design of probability-based surveys for rare species using information on more abundant species that are associated with the rare species. This model assisted approach to survey design involves first building models for the more abundant species. The models are then used to determine stratifications for the rare species that are associated with the more abundant species. The goal of this approach is to increase the number of …


Ip Algorithm Applied To Proteomics Data, Christopher Lee Green Nov 2004

Ip Algorithm Applied To Proteomics Data, Christopher Lee Green

Theses and Dissertations

Mass spectrometry has been used extensively in recent years as a valuable tool in the study of proteomics. However, the data thus produced exhibits hyper-dimensionality. Reducing the dimensionality of the data often requires the imposition of many assumptions which can be harmful to subsequent analysis. The IP algorithm is a dimension reduction algorithm, similar in purpose to latent variable analysis. It is based on the principle of maximum entropy and therefore imposes a minimum number of assumptions on the data. Partial Least Squares (PLS) is an algorithm commonly used with proteomics data from mass spectrometry in order to reduce the …


Discriminant Function Analysis, Kuo Hsiung Su Jan 1975

Discriminant Function Analysis, Kuo Hsiung Su

All Graduate Plan B and other Reports, Spring 1920 to Spring 2023

The technique of discriminant function analysis was originated by R.A. Fisher and first applied by Barnard (1935). Two very useful summaries of the recent work in this technique can be found in Hodges (1950) and in Tosuoka and Tiedeman (1954). The techniques have been used primarily in the fields of anthropology, psychology, biology, medicine, and education, and have only begun to be applied to other fields in recent years.

Classification and discriminant function analyses are two phases in the attempt to predict which of several populations an observation might be a member of, on the basis of multivariate measurements. Both …