Open Access. Powered by Scholars. Published by Universities.®

Physical Sciences and Mathematics Commons

Theses/Dissertations

Feature selection

Articles 1 - 30 of 45

Full-Text Articles in Physical Sciences and Mathematics

Learning Mortality Risk For Covid-19 Using Machine Learning And Statistical Methods, Shaoshi Zhang Dec 2023

Electronic Thesis and Dissertation Repository

This research investigates the mortality risk of COVID-19 patients across different variant waves, using data from the Centers for Disease Control and Prevention (CDC) websites. By analyzing the available data, including patient medical records, vaccination rates, and hospital capacities, we aim to discern patterns and factors associated with COVID-19-related deaths.

To explore features linked to COVID-19 mortality, we employ different techniques such as Filter, Wrapper, and Embedded methods for feature selection. Furthermore, we apply various machine learning methods, including support vector machines, decision trees, random forests, logistic regression, K-nearest neighbours, naïve Bayes methods, and artificial neural networks, to uncover underlying …


Feature Selection From Clinical Surveys Using Semantic Textual Similarity, Benjamin Warner May 2023

McKelvey School of Engineering Theses & Dissertations

Survey data collected from human subjects can contain a high number of features while having a comparatively low quantity of examples. Machine learning models that attempt to predict outcomes from survey data under these conditions can overfit and result in poor generalizability. One remedy to this issue is feature selection, which attempts to select an optimal subset of features to learn upon. A relatively unexplored source of information in the feature selection process is the usage of textual names of features, which may be semantically indicative of which features are relevant to a target outcome. The relationships between feature names …


Tempering The Adversary: An Exploration Into The Applications Of Game Theoretic Feature Selection And Regression, Stephen Mcgee Aug 2022

All Dissertations

Most modern machine learning algorithms tend to focus on an "average-case" approach, where every data point contributes the same amount of influence towards calculating the fit of a model. This "per-data point" error (or loss) is averaged together into an overall loss and typically minimized with an objective function. However, this can be insensitive to valuable outliers. Inspired by game theory, the goal of this work is to explore the utility of incorporating an optimally-playing adversary into feature selection and regression frameworks. The adversary assigns weights to the data elements so as to degrade the modeler's performance in an optimal …


Efficient Algorithms And Human-In-The-Loop Approaches For Attribute Design And Selection, Md Abdus Salam May 2022

Computer Science and Engineering Dissertations

Feature engineering and feature selection are two important aspects of the data science pipeline. Due to the advancement of data collection techniques in recent years, huge amounts of data are becoming available in different industries. Consequently, the importance of data science is increasing for business analytics purposes. Different tools and techniques are being developed to help data scientists complete their tasks efficiently. One of the main human involvements in the data science task is in feature engineering and selection. These pre-processing steps will prepare the data in the format desired to be fed into various machine learning algorithms to accomplish …


Generalized Robust Feature Selection, Bradford L. Lott Mar 2022

Theses and Dissertations

Feature selection may be summarized as identifying salient features to a given response. Understanding which features affect the response enables, in the future, only collecting consequential data; hence, the feature selection algorithm may lead to saving effort spent collecting data, storage resources, as well as computational resources for making predictions. We propose a generalized approach to select the salient features of data sets. Our approach may also be applied to unsupervised datasets to understand which data streams provide unique information. We contend our approach identifies salient features robust to the subsequent predictive model applied. The proposed algorithm considers all provided …


Local Feature Selection For Multiple Instance Learning With Applications., Aliasghar Shahrjooihaghighi Dec 2021

Electronic Theses and Dissertations

Feature selection is a data processing approach that has been successfully and effectively used in developing machine learning algorithms for various applications. It has been proven to effectively reduce the dimensionality of the data and increase the accuracy and interpretability of machine learning algorithms. Conventional feature selection algorithms assume that there is an optimal global subset of features for the whole sample space. Thus, only one global subset of relevant features is learned. An alternative approach is based on the concept of Local Feature Selection (LFS), where each training sample can have its own subset of relevant features. Multiple Instance …


Enhancing The Performance Of Text Mining, Farah Mahmoud Al Shanik Dec 2021

All Dissertations

The amount of text data produced in science, finance, social media, and medicine is growing at an unprecedented pace. Raw text data typically introduces major computational and analytical obstacles (e.g., extremely high dimensionality) to data mining and machine learning algorithms. In addition, the growth in the size of text data makes the search process more difficult for information retrieval systems, making it challenging to retrieve relevant results that match users’ search queries. Moreover, the availability of text data in different languages creates the need to develop new methods to analyze multilingual topics to help policymakers in governmental and health systems to …


High-Dimensional Feature Selection And Multi-Level Causal Mediation Analysis With Applications To Human Aging And Cluster-Based Intervention Studies, Hachem Saddiki Oct 2021

Doctoral Dissertations

Many questions in public health and medicine are fundamentally causal in that our objective is to learn the effect of some exposure, randomized or not, on an outcome of interest. As a result, causal inference frameworks and methodologies have gained interest as a promising tool to reliably answer scientific questions. However, the tasks of identifying and efficiently estimating causal effects from observed data still pose significant challenges under complex data generating scenarios. We focus on (1) high-dimensional settings where the number of variables is orders of magnitude higher than the number of observations; and (2) multi-level settings, where study participants …


Comparative Study Of Machine Learning Models On Solar Flare Prediction Problem, Nikhil Sai Kurivella Aug 2021

All Graduate Theses and Dissertations, Spring 1920 to Summer 2023

Solar flare events are explosions of energy and radiation from the Sun’s surface. These events occur due to the tangling and twisting of magnetic fields associated with sunspots. When coronal mass ejections accompany solar flares, solar storms could travel towards Earth at very high speeds, disrupting all earthly technologies and posing radiation hazards to astronauts. For this reason, the prediction of solar flares has become a crucial aspect of forecasting space weather. Our thesis utilized time-series data consisting of active solar region magnetic field parameters acquired from SDO that span more than eight years. The classification models take AR …


Designing Targeted Mobile Advertising Campaigns, Kimia Keshanian Jun 2021

USF Tampa Graduate Theses and Dissertations

With the proliferation of smart, handheld devices, there has been a multifold increase in the ability of firms to target and engage with customers through mobile advertising. Not surprisingly, mobile advertising campaigns have therefore become an integral aspect of firms’ brand-building activities, such as improving the awareness and overall visibility of firms’ brands. In addition, retailers are increasingly using mobile advertising for targeted promotional activities that increase in-store visits and eventual sales conversions. However, in recent years, mobile or, more generally, online advertising campaigns have been facing one major challenge and one major threat that can negatively impact the …


Feature Selection On Permissions, Intents And Apis For Android Malware Detection, Fred Guyton Jan 2021

CCE Theses and Dissertations

Malicious applications pose an enormous security threat to mobile computing devices. Currently, 85% of all smartphones run Android, Google’s open-source operating system, making that platform the primary threat vector for malware attacks. Android hosts roughly 99% of known malware to date and is the focus of most research efforts in mobile malware detection due to its open-source nature. One of the main tools used in this effort is supervised machine learning. While a decade of work has made a lot of progress in detection accuracy, there is an obstacle that each stream of research is …


Binary Black Widow Optimization Algorithm For Feature Selection Problems, Ahmed Al-Saedi Jan 2021

Theses and Dissertations (Comprehensive)

This thesis addresses feature selection (FS) problems. FS is a primary stage in data mining: a significant pre-processing step that improves computation cost and accuracy and offers a better comprehension of stored data by removing unnecessary and irrelevant features from the basic dataset. However, because of the size of the problem, FS is known to be very challenging and has been classified as an NP-hard problem. Traditional methods can only be used to solve small problems. Therefore, metaheuristic algorithms (MAs) are becoming powerful methods for addressing FS problems. …


Automation Of Feature Selection And Generation Of Optimal Feature Subsets For Beehive Audio Sample Classification, Aditya Bhouraskar Dec 2020

All Graduate Theses and Dissertations, Spring 1920 to Summer 2023

The last couple of decades have witnessed an abnormal decline in the bee population. This is a serious matter of concern, as three out of four crops available globally have the honey bee as their sole pollinator, so the decline causes significant economic losses and an imbalance in the ecosystem. There have been many theories about the cause of bee colony collapses, such as parasites, pesticides, and poor nutrition; however, conclusive evidence is yet to be identified.

Human inspection of beehives requires precision. It takes an experienced beekeeper to determine the health of a hive by the sounds generated …


Feature Selection And Data Reconstruction Via Robust And Flexible Learning Models, Di Ming May 2020

Computer Science and Engineering Dissertations

Feature selection and data reconstruction are very important topics in the machine learning area. In today's big data environment, much data is high-dimensional and comes with noise, corruption, etc. Thus, we develop robust and flexible learning models so as to select the relevant features from the high-dimensional data spaces and reconstruct the original clean data from the corrupted input data more efficiently and more effectively. To resolve the inflexibility of the widely used class-shared feature selection methods such as L21-norm, we derive LASSO from probabilistic selection on ridge regression which provides an independent point of view from the usual …


Sparsity And Weak Supervision In Quantum Machine Learning, Seyran Saeedi Jan 2020

Theses and Dissertations

Quantum computing is an interdisciplinary field at the intersection of computer science, mathematics, and physics that studies information processing tasks on a quantum computer. A quantum computer is a device whose operations are governed by the laws of quantum mechanics. As building quantum computers is nearing the era of commercialization and quantum supremacy, it is essential to think of potential applications that we might benefit from. Among many applications of quantum computation, one of the emerging fields is quantum machine learning. We focus on predictive models for binary classification and variants of Support Vector Machines that we expect to be …


Image Features For Tuberculosis Classification In Digital Chest Radiographs, Brian Hooper Jan 2020

All Master's Theses

Tuberculosis (TB) is a respiratory disease which affects millions of people each year, accounting for the tenth leading cause of death worldwide, and is especially prevalent in underdeveloped regions where access to adequate medical care may be limited. Analysis of digital chest radiographs (CXRs) is a common and inexpensive method for the diagnosis of TB; however, a trained radiologist is required to interpret the results, and the interpretation is subject to human error. Computer-Aided Detection (CAD) systems are a promising machine-learning-based solution to automate the diagnosis of TB from CXR images. As the dimensionality of a high-resolution CXR image is very …


Sensor - Based Human Activity Recognition Using Smartphones, Mustafa Badshah May 2019

Master's Projects

It is a significant technical and computational task to provide precise information regarding the activity performed by a human and find patterns of their behavior. Countless applications can be molded and various problems in the domains of virtual reality, health and medicine, entertainment, and security can be solved with advancements in human activity recognition (HAR) systems. HAR has been an active field of research for more than a decade, but certain aspects need to be addressed to improve the system and revolutionize the way humans interact with smartphones. This research provides a holistic view of human activity recognition system architecture and discusses …


Streaming Feature Grouping And Selection (Sfgs) For Big Data Classification, Noura Helal Hamad Al Nuaimi Mar 2019

Dissertations

Real-time data has always been an essential element for organizations when the quickness of data delivery is critical to their businesses. Today, organizations understand the importance of real-time data analysis to maintain benefits from their generated data. Real-time data analysis is also known as real-time analytics, streaming analytics, real-time streaming analytics, and event processing. Stream processing is the key to getting results in real-time. It allows us to process the data stream in real-time as it arrives. The concept of streaming data means the data are generated dynamically, and the full stream is unknown or even infinite. This data becomes …


Distributed Multi-Label Learning On Apache Spark, Jorge Gonzalez Lopez Jan 2019

Theses and Dissertations

This thesis proposes a series of multi-label learning algorithms for classification and feature selection implemented on the Apache Spark distributed computing model. Five approaches for determining the optimal architecture to speed up multi-label learning methods are presented. These approaches range from local parallelization using threads to distributed computing using independent or shared memory spaces. It is shown that the optimal approach performs hundreds of times faster than the baseline method. Three distributed multi-label k nearest neighbors methods built on top of the Spark architecture are proposed: an exact iterative method that computes pair-wise distances, an approximate tree-based method that indexes …


Data Patterns Discovery Using Unsupervised Learning, Rachel A. Lewis Jan 2019

Electronic Theses and Dissertations

Self-care activities classification poses significant challenges in identifying children’s unique functional abilities and needs within the exceptional children healthcare system. The accuracy of diagnosing a child's self-care problem, such as toileting or dressing, is highly influenced by an occupational therapist’s experience and time constraints. Thus, there is a need for objective means to detect and predict in advance the self-care problems of children with physical and motor disabilities. We use clustering to discover interesting information from self-care problems, perform automatic classification of binary data, and discover outliers. The advantages are twofold: the advancement of knowledge on identifying self-care problems in …


Feature Set Selection For Improved Classification Of Static Analysis Alerts, Kathleen Goeschel Jan 2019

CCE Theses and Dissertations

With the extreme growth in third party cloud applications, increased exposure of applications to the internet, and the impact of successful breaches, improving the security of software being produced is imperative. Static analysis tools can alert to quality and security vulnerabilities of an application; however, they present developers and analysts with a high rate of false positives and unactionable alerts. This problem may lead to the loss of confidence in the scanning tools, possibly resulting in the tools not being used. The discontinued use of these tools may increase the likelihood of insecure software being released into production. Insecure software …


Improving K-Nn Search And Subspace Clustering Based On Local Intrinsic Dimensionality, Arwa M. Wali Jul 2018

Dissertations

In several novel applications such as multimedia and recommender systems, data is often represented as object feature vectors in high-dimensional spaces. High-dimensional data is always a challenge for state-of-the-art algorithms because of the so-called "curse of dimensionality". As the dimensionality increases, the discriminative ability of similarity measures diminishes to the point where many data analysis algorithms that depend on them, such as similarity search and clustering, lose their effectiveness. One way to handle this challenge is by selecting the most important features, which is essential for providing compact object representations as well as improving the overall search and clustering …


The Impact Of Cost On Feature Selection For Classifiers, Richard Clyde Mccrae Jan 2018

CCE Theses and Dissertations

Supervised machine learning models are increasingly being used for medical diagnosis. The diagnostic problem is formulated as a binary classification task in which trained classifiers make predictions based on a set of input features. In diagnosis, these features are typically procedures or tests with associated costs. The cost of applying a trained classifier for diagnosis may be estimated as the total cost of obtaining values for the features that serve as inputs for the classifier. Obtaining classifiers based on a low cost set of input features with acceptable classification accuracy is of interest to practitioners and researchers. What makes this …


Feature Selection From Large Acoustic Feature Sets In Computational Paralinguistics, Dara Pir Jun 2017

Dissertations, Theses, and Capstone Projects

The burgeoning field of computational paralinguistics deals with the ways in which spoken words are uttered and attempts to recognize the states and traits of the speakers. Many areas of current scientific research, including computational paralinguistics, have started to employ datasets with an ever-increasing number of features. Using large feature sets has helped improve recognition performance. However, processing these large sets has given rise to various problems. Feature selection methods, which reduce the dimensionality of the original feature sets by removing irrelevant and/or redundant features, could be used to address these problems.

The two main methods for feature selection are …


Using Machine Learning To Predict Chemotherapy Response In Cell Lines And Patients Based On Genetic Expression, Dimo Angelov Mar 2017

Electronic Thesis and Dissertation Repository

The goal of this thesis was to examine different machine learning techniques for predicting chemotherapy response in cell lines and patients based on genetic expression. After trying regression, multi-class classification techniques, and binary classification, it was concluded that binary classification was the best method for training models due to the limited size of available cell line data. We found that support vector machine classifiers trained on cell line data were easier to use and produced better results compared to neural networks. Sequential backward feature selection was able to select genes for the models that produced good results; however, the greedy algorithm …
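
As an illustration of the sequential backward feature selection mentioned in the abstract above, the following is a minimal sketch of the general greedy procedure, not the thesis's implementation. The function name, the `score` callback, the stopping rule (stop when every removal lowers the score), and the toy gene names are all assumptions made for this example.

```python
def sequential_backward_selection(features, score, min_features=1):
    """Greedy backward elimination: repeatedly drop the feature whose
    removal gives the best score, stopping when every removal hurts.

    `score` is a caller-supplied function mapping a list of features to a
    number (higher is better), e.g. cross-validated accuracy.
    """
    selected = list(features)
    best = score(selected)
    while len(selected) > min_features:
        # Build every subset obtained by removing exactly one feature.
        candidates = [[f for f in selected if f != drop] for drop in selected]
        cand_scores = [score(c) for c in candidates]
        top = max(range(len(candidates)), key=cand_scores.__getitem__)
        if cand_scores[top] < best:
            break  # assumed stopping rule: no removal keeps or improves the score
        selected, best = candidates[top], cand_scores[top]
    return selected, best


# Hypothetical toy score: reward two "useful" genes, penalize subset size.
USEFUL = {"g1", "g3"}

def toy_score(subset):
    return len(USEFUL & set(subset)) - 0.1 * len(subset)

chosen, chosen_score = sequential_backward_selection(
    ["g1", "g2", "g3", "g4"], toy_score
)
```

Because each step commits to the single best removal, the search can miss subsets that a non-greedy search would find; in practice `score` would wrap model training and validation rather than a toy function.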


Some Issues In Unsupervised Feature Selection Using Similarity., Partha Pratim Kundu Dr. Aug 2015

Doctoral Theses

Pattern recognition is what humans do most of the time, without any conscious effort, and fortunately excel in. Information is received through various sensory organs, processed simultaneously in the brain, and its source is instantaneously identified without any perceptible effort. The interesting issue is that recognition occurs even under non-ideal conditions, i.e., when information is vague, imprecise or incomplete. In reality, most human activities depend on the success in performing various pattern recognition tasks. Let us consider an example. Before boarding a train or bus, we first select the appropriate one by identifying either the route number or its destination …


On Supervised And Unsupervised Methodologies For Mining Of Text Data., Tanmay Basu Dr. Jul 2015

Doctoral Theses

The supervised and unsupervised methodologies of text mining using plain text data in the English language have been discussed. Some new supervised and unsupervised methodologies have been developed for effective mining of the text data after successfully overcoming some limitations of the existing techniques. The problems of unsupervised techniques of text mining, i.e., document clustering methods, are addressed. A new similarity measure between documents has been designed to improve the accuracy of measuring the content similarity between documents. Further, a hierarchical document clustering technique is designed using this similarity measure. The main significance of the clustering algorithm is that the number …


Local Selection Of Features And Its Applications To Image Search And Annotation, Jichao Sun Jan 2015

Dissertations

In multimedia applications, direct representations of data objects typically involve hundreds or thousands of features. Given a query object, the similarity between the query object and a database object can be computed as the distance between their feature vectors. The neighborhood of the query object consists of those database objects that are close to the query object. The semantic quality of the neighborhood, which can be measured as the proportion of neighboring objects that share the same class label as the query object, is crucial for many applications, such as content-based image retrieval and automated image annotation. However, due to …
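
The neighborhood-quality measure described in the abstract above (the proportion of neighboring objects that share the query's class label) can be sketched as follows. This is a minimal illustration only: taking the neighborhood to be the k nearest objects under Euclidean distance, the `(feature_vector, label)` data layout, and the toy data are assumptions for this example, not details from the dissertation.

```python
import math

def neighborhood_quality(query_vec, query_label, database, k):
    """Fraction of the k nearest database objects that share the query's label.

    `database` is a list of (feature_vector, class_label) pairs
    (hypothetical layout chosen for this sketch).
    """
    # Rank database objects by Euclidean distance to the query.
    ranked = sorted(database, key=lambda item: math.dist(query_vec, item[0]))
    neighbors = ranked[:k]
    if not neighbors:
        return 0.0
    matches = sum(1 for _, label in neighbors if label == query_label)
    return matches / len(neighbors)


# Toy data: two nearby "cat" objects, one nearby "dog", one distant "dog".
database = [
    ((0.0, 0.1), "cat"),
    ((0.2, 0.0), "cat"),
    ((0.1, 0.3), "dog"),
    ((5.0, 5.0), "dog"),
]
quality = neighborhood_quality((0.0, 0.0), "cat", database, k=3)
```

With these toy values the three nearest neighbors are two "cat" objects and one "dog", so the quality is 2/3.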


Predictive Analytics For Disease Condition Of Patients In Emergency Department, Azade Tabaie Jan 2015

Wayne State University Theses

Emergency Departments (EDs) in hospitals are experiencing severe crowding and prolonged patient waiting times. The reported crowding in hospitals shows patients in hospital hallways, long waiting times, and full occupancy of ED beds. ED crowding has several potential unfavorable effects, including patient and staff frustration, lower patient satisfaction, and poor health outcomes. The primary motivations behind this study are shortening patients’ waiting time and improving patient satisfaction and level of care.

The initial interaction between clinicians and a patient is recorded in nurse triage notes, which contain details of the reason for the patient’s visit, including specific symptoms and …


Use Of Entropy For Feature Selection With Intrusion Detection System Parameters, Frank Acker Jan 2015

CCE Theses and Dissertations

The metric of entropy provides a measure of the randomness of data and a measure of the information gained by comparing different attributes. Intrusion detection systems can collect very large amounts of data, which are not necessarily manageable by manual means. Collected intrusion detection data often contains redundant, duplicate, and irrelevant entries, which makes analysis computationally intensive, likely leading to unreliable results. Reducing the data to what is relevant and pertinent to the analysis requires the use of data mining techniques and statistics. Identifying patterns in the data is part of analysis for intrusion detection in which the patterns are categorized …
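
To make the two uses of entropy named in the abstract above concrete, here is a minimal sketch of Shannon entropy and information gain for discrete attributes. It shows the standard textbook definitions rather than the dissertation's procedure; the attribute names (`protocol`, `flag`) and the four toy records are hypothetical.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy, in bits, of a sequence of discrete values."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(attribute, labels):
    """Reduction in label entropy obtained by splitting on a discrete attribute."""
    total = len(labels)
    groups = {}
    for value, label in zip(attribute, labels):
        groups.setdefault(value, []).append(label)
    # Weighted average entropy of the labels within each attribute value.
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


# Hypothetical toy records: one attribute separates the classes, one does not.
labels = ["attack", "attack", "normal", "normal"]
protocol = ["udp", "udp", "tcp", "tcp"]
flag = ["S0", "SF", "S0", "SF"]

gain_protocol = information_gain(protocol, labels)
gain_flag = information_gain(flag, labels)
```

In this toy example `protocol` splits the labels perfectly and recovers the full 1 bit of label entropy, while `flag` yields no gain, so an entropy-based selector would keep the former and drop the latter.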