Open Access. Powered by Scholars. Published by Universities.®

Data Science Commons


1,506 Full-Text Articles 3,015 Authors 435,013 Downloads 190 Institutions

All Articles in Data Science



Clustering Of Patients With Heart Disease, Mukadder Cinar 2024 The Graduate Center, City University of New York


Dissertations, Theses, and Capstone Projects

Heart disease, a leading cause of mortality worldwide, presents complex challenges in public health due to its varied manifestations. Accurate diagnosis and patient stratification are essential for effective management and improved outcomes. In response, this study employed machine learning techniques to analyze heart disease data obtained from the UCI Machine Learning Repository, aiming to enhance patient care through advanced data analysis.

The study began with the application of K-Nearest Neighbors (KNN) classification, which categorized patients into 'Disease' and 'No Disease' groups. This preliminary step provided initial insights into the structure of the dataset. Subsequently, K-means clustering was applied in two rounds, …
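
As a rough illustration of the KNN classification step mentioned in the abstract, the sketch below labels a query point by majority vote among its nearest neighbors. It is not the author's implementation: the function name `knn_predict`, the two toy features, and all data values are hypothetical, and the real UCI dataset has more attributes.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Label `query` by majority vote among its k nearest training points.

    `train` is a list of (features, label) pairs; distance is Euclidean.
    """
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical, already-scaled toy features (not taken from the UCI data).
toy_train = [
    ((0.20, 0.10), "No Disease"),
    ((0.30, 0.20), "No Disease"),
    ((0.25, 0.30), "No Disease"),
    ((0.80, 0.90), "Disease"),
    ((0.70, 0.80), "Disease"),
    ((0.90, 0.70), "Disease"),
]

label_a = knn_predict(toy_train, (0.75, 0.85))
label_b = knn_predict(toy_train, (0.20, 0.20))
```

In practice the features would need to be scaled to comparable ranges first, since Euclidean distance is otherwise dominated by the attribute with the largest numeric range.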


The Impact Of Accessible Data On Cyberstalking, Elise Kwan 2024 Purdue University


The Journal of Purdue Undergraduate Research

No abstract provided.


Model Selection Through Cross-Validation For Supervised Learning Tasks With Manifold Data, Derek Brown 2024 Purdue University Fort Wayne


The Journal of Purdue Undergraduate Research

No abstract provided.


Machine Learning Of Big Data: A Gaussian Regression Model To Predict The Spatiotemporal Distribution Of Ground Ozone, Jerry Gu 2024 Purdue University


The Journal of Purdue Undergraduate Research

Tracking pollution levels on the ground is important to the environment and public health. One of the pollutants of concern is ozone, which, at high concentrations, can cause respiratory and cardiovascular problems. The National Center for Atmospheric Research (NCAR) has published valuable ozone data obtained from ground-based sensors installed at selected locations. Because it is infeasible to measure the exact ozone levels everywhere at all times, it would be valuable to predict the spatiotemporal distribution of ozone concentration based on existing data. This would help us better understand the patterns and trends in the data and make better decisions to …


A Computational Profile Of Invasive Lionfish In Belize: A New Insight On A Destructive Species, Joshua E. Balan 2024 Purdue University


The Journal of Purdue Undergraduate Research

Since their discovery in the region in 2009, invasive Indonesian-native lionfish have been taking over the Belize Barrier Reef. As a result, populations of local species have dwindled as they are either eaten or outcompeted by the invaders. This has led to devastating losses ecologically and economically; massive industries in the local nations, such as fisheries and tourism, have suffered greatly. Attempting to combat this, local organizations, from nonprofits to ecotourism companies, have been manually spear-hunting them on scuba dives to cull the population. One such company, Reef Conservation Institute (ReefCI), operating out of Tom Owens Caye outside of Placencia, …


Henderson Named One Of The Most Influential People In Legal Education, James Owsley Boyd 2024 Maurer School of Law: Indiana University


Keep Up With the Latest News from the Law School (blog)

Indiana University Maurer School of Law Professor Bill Henderson has once again been recognized as one of the most influential people in legal education, but he’s not the only one with ties to the Law School on this year’s list.

The National Jurist ranked Henderson #18 on its list. Kellye Testy, a 1991 alumna of the Law School and president and CEO of the Law School Admission Council, is ranked second.


Molecular Understanding And Design Of Deep Eutectic Solvents And Proteins Using Computer Simulations And Machine Learning, Usman Lame Abbas 2024 University of Kentucky


Theses and Dissertations--Chemical and Materials Engineering

Hydrophobic deep eutectic solvents (DESs) have emerged as excellent extractants. A major challenge is the lack of an efficient tool to discover DES candidates. Currently, the search relies heavily on the researchers’ intuition or a trial-and-error process, which leads to a low success rate or bypassing of promising candidates. DES performance depends on the heterogeneous hydrogen bond environment formed by multiple hydrogen bond donors and acceptors. Understanding this heterogeneous hydrogen bond environment can help develop principles for designing high performance DESs for extraction and other separation applications. This work investigates the structure and dynamics of hydrogen bonds in hydrophobic DESs …


In Pursuit Of Consumption-Based Forecasting, Charles Chase, Kenneth B. Kahn 2024 SAS


Marketing Faculty Publications

[Introduction] Today's most mature, most sophisticated, best-in-class forecasting is what we call consumption-based forecasting (CBF). In contrast, the least sophisticated companies typically do not forecast at all, but rather set financial targets based on management expectations. Companies beginning to use statistical forecasting techniques usually take a supply-centric orientation, relying on time series techniques applied to shipment and/or order history. The next stage of progression is to incorporate promotions data, economic data, and market data alongside supply-centric data so that regression and other advanced analytics can be used. Companies pursuing CBF utilize even more advanced capabilities to capture, examine, and understand …


Machine Learning Approaches For Cyberbullying Detection, Roland Fiagbe 2024 University of Central Florida


Data Science and Data Mining

Cyberbullying refers to the act of bullying using electronic means and the internet. In recent years, this act has been identified as a major problem among young people and even adults. It can negatively impact one’s emotions and lead to adverse outcomes like depression, anxiety, harassment, and suicide, among others. This has led to the need to employ machine learning techniques to automatically detect cyberbullying and prevent it on various social media platforms. In this study, we want to analyze the combination of some Natural Language Processing (NLP) algorithms (such as Bag-of-Words and TFIDF) with some popular machine learning …


Predicting Superconducting Critical Temperature Using Regression Analysis, Roland Fiagbe 2024 University of Central Florida


Data Science and Data Mining

This project estimates a regression model to predict the superconducting critical temperature based on variables extracted from the superconductor’s chemical formula. The regression model, combined with stepwise variable selection, gives a reasonable predictive model with a lower prediction error (MSE). Variables derived from atomic radius, valence, atomic mass, and thermal conductivity appeared to contribute most to the predictive model.
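
To make the idea of stepwise variable selection guided by MSE concrete, here is a minimal forward-selection sketch in plain Python. It is an illustration under stated assumptions, not the project's code: the helper names (`solve`, `fit_mse`, `forward_select`), the `min_gain` stopping threshold, and the toy data are all hypothetical, and training MSE is used as the criterion only for brevity.

```python
def solve(a, b):
    """Solve the square linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        # Partial pivoting: pick the largest remaining entry in this column.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_mse(rows, y, cols):
    """Fit least squares with an intercept on `cols`; return training MSE."""
    X = [[1.0] + [row[c] for c in cols] for row in rows]
    k = len(X[0])
    xtx = [[sum(x[i] * x[j] for x in X) for j in range(k)] for i in range(k)]
    xty = [sum(x[i] * t for x, t in zip(X, y)) for i in range(k)]
    beta = solve(xtx, xty)  # normal equations (X^T X) beta = X^T y
    preds = [sum(b * v for b, v in zip(beta, x)) for x in X]
    return sum((p - t) ** 2 for p, t in zip(preds, y)) / len(y)


def forward_select(rows, y, min_gain=1e-6):
    """Greedily add the variable that lowers MSE most; stop when gain is small."""
    chosen, remaining = [], list(range(len(rows[0])))
    best = fit_mse(rows, y, chosen)  # intercept-only baseline
    while remaining:
        mse, col = min((fit_mse(rows, y, chosen + [c]), c) for c in remaining)
        if best - mse < min_gain:
            break
        chosen.append(col)
        remaining.remove(col)
        best = mse
    return chosen, best


# Hypothetical toy data: y depends only on column 0 (y = 2 * x0 + 3).
toy_rows = [[0, 5], [1, 3], [2, 8], [3, 1], [4, 7]]
toy_y = [3, 5, 7, 9, 11]
selected, final_mse = forward_select(toy_rows, toy_y)
```

A more careful version would score each candidate on held-out data or by cross-validation rather than on the training set, since training MSE never increases when a variable is added.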


A Bayesian Inversion For Emissions And Export Productivity Across The End-Cretaceous Boundary, Alexander A. Cox 2024 Dartmouth College


Dartmouth College Master’s Theses

The end-Cretaceous mass extinction was marked by both the Chicxulub impact and the ongoing emplacement of the Deccan Traps flood basalt province. Both of these events perturbed the environment by the emission of climate-active volatiles, primarily CO2 and SO2. To understand the mechanism of extinction, we must disentangle the timing, duration, and intensity of volcanic and meteoritic environmental forcings. In this thesis, we used a parallel Markov chain Monte Carlo approach to invert for the aforementioned volatile emissions, export productivity, and remineralization from 67 to 65 million years ago using the LOSCAR (Long-term Ocean-atmosphere-Sediment CArbon cycle Reservoir) model. The parallel …


Towards Algorithmic Justice: Human Centered Approaches To Artificial Intelligence Design To Support Fairness And Mitigate Bias In The Financial Services Sector, Jihyun Kim 2024 Claremont Colleges


CMC Senior Theses

Artificial Intelligence (AI) has positively transformed the financial services sector but has also introduced AI biases against protected groups, amplifying existing prejudices against marginalized communities. The financial decisions made by biased algorithms could cause life-changing ramifications in applications such as lending and credit scoring. Human Centered AI (HCAI) is an emerging concept where AI systems seek to augment, not replace, human abilities while preserving human control to ensure transparency, equity and privacy. The evolving field of HCAI shares a common ground with and can be enhanced by the Human Centered Design principles in that they both put humans, the user, at …


A Holistic Approach To Performance Prediction In Collegiate Athletics: Player, Team, And Conference Perspectives, Christopher Taber, S. Sharma, Mehul S. Raval, Samah Senbel, Allison Keefe, Jui Shah, Emma Patterson, Julie K. Nolan, N.S. Artan, Tolga Kaya 2024 Sacred Heart University


Exercise Science Faculty Publications

Predictive sports data analytics can be revolutionary for sports performance. Existing literature discusses players' or teams' performance, independently or in tandem. Using Machine Learning (ML), this paper aims to holistically evaluate player-, team-, and conference (season)-level performances in Division-1 Women's basketball. The players were monitored and tested through a full competitive year. The performance was quantified at the player level using the reactive strength index modified (RSImod), at the team level by the game score (GS) metric, and finally at the conference level through Player Efficiency Rating (PER). The data includes parameters from training, subjective stress, sleep, and recovery (WHOOP …


Combating Cyberbullying On Social Media: A Machine Learning Approach With Text Analysis On Twitter, Amir Alipour Yengejeh 2024 University of Central Florida


Data Science and Data Mining

The popularity of mobile electronic devices, social media, and networking websites has increased tremendously in recent years. Most people around the world engage daily in a variety of cyberspace activities. Although users can benefit from these systems by exchanging ideas and information, socializing, and finding entertainment, they may also encounter adverse behaviors such as toxicity, bullying, extremism, and cruelty. Recent statistics report that such behaviors have grown noticeably in cyberspace, to the point that they can threaten individuals and even entire communities. Thus, …


Advancing Cancer Classification Through Machine Learning Analysis Of Rna-Seq Gene Expression Data, Emil Agbemade, Amina Issoufou Anaroua, Dimitri Bamba 2024 University of Central Florida


Data Science and Data Mining

This study delves into the classification of various cancer types using the RNA-Seq (HiSeq) PANCAN dataset from the UCI Machine Learning Repository, which encompasses a rich collection of gene expression data across multiple tumor samples. To improve cancer diagnosis and treatment, our methodology confronts the challenges inherent in high-dimensional datasets, such as the Hughes Effect and the Curse of Dimensionality, through innovative feature selection methods and machine learning approaches. A key component of our strategy includes the use of tree-based algorithms, particularly Random Forest, to refine the dataset to seventy genes of utmost relevance for tumor classification, and the application …


Adaptive Multi-Label Classification On Drifting Data Streams, Martha Roseberry 2024 Virginia Commonwealth University


Theses and Dissertations

Drifting data streams and multi-label data are both challenging problems. When multi-label data arrives as a stream, the challenges of both problems must be addressed along with additional challenges unique to the combined problem. Algorithms must be fast and flexible, able to match both the speed and evolving nature of the stream. We propose four methods for learning from multi-label drifting data streams. First, a multi-label k Nearest Neighbors with Self Adjusting Memory (ML-SAM-kNN) exploits short- and long-term memories to predict the current and evolving states of the data stream. Second, a punitive k nearest neighbors algorithm with a self-adjusting …


Xgboost Hyperberd Model Using Steam Platform, Yuh-Haur Chen 2024 University of Central Florida


Data Science and Data Mining

This project investigates game pricing strategies in the Steam market using an XGBoost model, drawing motivation from Professor Xie's lecture, and presenting findings through a density plot that delineates two primary pricing strategies. A free-to-play approach, indicated by a significant hot spot, is adopted by developers focusing on post-purchase revenues through DLC, aesthetic purchases, and in-game transactions. This sailing strategy includes community-centric developers aiming to distribute their games for player engagement rather than profit.

The project illustrates the effectiveness of advanced modeling techniques in handling complex datasets, with significant predictive accuracy reflected by a reduced MSE from 0.3472 to 0.1397. …


Learning Optimal Inter-Class Margin Adaptively For Few-Shot Class-Incremental Learning Via Neural Collapse-Based Meta-Learning, Hang Ran, Weijun Li, Lusi Li, Songsong Tian, Xin Ning, Prayag Tiwari 2024 Chinese Academy of Sciences


Computer Science Faculty Publications

Few-Shot Class-Incremental Learning (FSCIL) aims to learn new classes incrementally with a limited number of samples per class. It faces issues of forgetting previously learned classes and overfitting on few-shot classes. An efficient strategy is to learn features that are discriminative in both base and incremental sessions. Current methods improve discriminability by manually designing inter-class margins based on empirical observations, which can be suboptimal. The emerging Neural Collapse (NC) theory provides a theoretically optimal inter-class margin for classification, serving as a basis for adaptively computing the margin. Yet, it is designed for closed, balanced data, not for sequential or few-shot …


Osfs-Vague: Online Streaming Feature Selection Algorithm Based On A Vague Set, Jie Yang, Zhijun Wang, Guoyin Wang, Yanmin Liu, Yi He, Di Wu 2024 Zunyi Normal University


Computer Science Faculty Publications

Online streaming feature selection (OSFS), an online learning approach for handling streaming features, is critical in addressing high-dimensional data. In real big data-related applications, the patterns and distributions of streaming features constantly change over time due to dynamic data generation environments. However, existing OSFS methods rely on preset, fixed hyperparameters, which undoubtedly leads to poor selection performance when encountering dynamic features. To make up for these shortcomings, the authors propose a novel OSFS algorithm based on vague sets, named OSFS-Vague. Its main idea is to combine uncertainty and three-way decision theories to improve feature selection from the …


Eluquant: Event-Level Uncertainty Quantification In Deep Inelastic Scattering, Cristiano Fanelli, James Giroux 2024 William & Mary


Arts & Sciences Articles

We introduce a physics-informed Bayesian neural network with flow-approximated posteriors using multiplicative normalizing flows for detailed uncertainty quantification (UQ) at the physics event-level. Our method is capable of identifying both heteroskedastic aleatoric and epistemic uncertainties, providing granular physical insights. Applied to deep inelastic scattering (DIS) events, our model effectively extracts the kinematic variables x, Q², and y, matching the performance of recent deep learning regression techniques but with the critical enhancement of event-level UQ. This detailed description of the underlying uncertainty proves invaluable for decision-making, especially in tasks like event filtering. It also allows for the reduction of true inaccuracies …

