Open Access. Powered by Scholars. Published by Universities.®

Data Science Commons

Open Access. Powered by Scholars. Published by Universities.®

1,479 Full-Text Articles 2,956 Authors 435,013 Downloads 189 Institutions

All Articles in Data Science

Faceted Search

1,479 full-text articles. Page 3 of 73.

Principal Component Analysis With Application To Credit Card Data, Eleanor Cain, Semhar Michael, Gary Hatfield 2024 South Dakota State University

Principal Component Analysis With Application To Credit Card Data, Eleanor Cain, Semhar Michael, Gary Hatfield

SDSU Data Science Symposium

Principal Component Analysis (PCA) is a type of dimension reduction technique used in data analysis to process the data before making a model. In general, dimension reduction allows analysts to make conclusions about large data sets by reducing the number of variables while retaining as much information as possible. Using the numerical variables from a data set, PCA aims to compute a smaller set of uncorrelated variables, called principal components, that account for a majority of the variability from the data. The purpose of this poster is to understand PCA as well as perform PCA on a large sample credit …


Session 6: Model-Based Clustering Analysis On The Spatial-Temporal And Intensity Patterns Of Tornadoes, Yana Melnykov, Yingying Zhang, Rong Zheng 2024 University of Alabama - Tuscaloosa

Session 6: Model-Based Clustering Analysis On The Spatial-Temporal And Intensity Patterns Of Tornadoes, Yana Melnykov, Yingying Zhang, Rong Zheng

SDSU Data Science Symposium

Tornadoes are one of the nature’s most violent windstorms that can occur all over the world except Antarctica. Previous scientific efforts were spent on studying this nature hazard from facets such as: genesis, dynamics, detection, forecasting, warning, measuring, and assessing. While we want to model the tornado datasets by using modern sophisticated statistical and computational techniques. The goal of the paper is developing novel finite mixture models and performing clustering analysis on the spatial-temporal and intensity patterns of the tornadoes. To analyze the tornado dataset, we firstly try a Gaussian distribution with the mean vector and variance-covariance matrix represented as …


Making Sense Of Making Parole In New York, Alexandra McGlinchy 2024 The Graduate Center, City University of New York

Making Sense Of Making Parole In New York, Alexandra Mcglinchy

Dissertations, Theses, and Capstone Projects

For many individuals incarcerated in New York, the initial step toward freedom begins with an interview with the Board of Parole. This process, however, is frequently a complex and challenging one, characterized by repeated denials and extended incarcerations. The disparity in outcomes – where one individual may receive over 20 denials and another is granted parole on their first attempt – highlights the ambiguity and inconsistency in the parole decision-making process. This project aims to clarify the factors that influence parole decisions by concentrating on measurable variables. These include age, race, duration of sentence served, proportion of sentence served, type …


What Does One Billion Dollars Look Like?: Visualizing Extreme Wealth, William Mahoney Luckman 2024 The Graduate Center, City University of New York

What Does One Billion Dollars Look Like?: Visualizing Extreme Wealth, William Mahoney Luckman

Dissertations, Theses, and Capstone Projects

The word “billion” is a mathematical abstraction related to “big,” but it is difficult to understand the vast difference in value between one million and one billion; even harder to understand the vast difference in purchasing power between one billion dollars, and the average U.S. yearly income. Perhaps most difficult to conceive of is what that purchasing power and huge mass of capital translates to in terms of power. This project blends design, text, facts, and figures into an interactive narrative website that helps the user better understand their position in relation to extreme wealth: https://whatdoesonebilliondollarslooklike.website/

The site incorporates …


Artificial Intelligence For The Electron Ion Collider (Ai4eic), C. Allaire, ..., Cristiano Fanelli, James Giroux, Joey Niestroy, Justin R. Stevens, Patrick Stone, L. Suarez, K. Suresh, Eric Walter, et al. 2024 William & Mary

Artificial Intelligence For The Electron Ion Collider (Ai4eic), C. Allaire, ..., Cristiano Fanelli, James Giroux, Joey Niestroy, Justin R. Stevens, Patrick Stone, L. Suarez, K. Suresh, Eric Walter, Et Al.

Arts & Sciences Articles

The Electron-Ion Collider (EIC), a state-of-the-art facility for studying the strong force, is expected to begin commissioning its first experiments in 2028. This is an opportune time for artificial intelligence (AI) to be included from the start at this facility and in all phases that lead up to the experiments. The second annual workshop organized by the AI4EIC working group, which recently took place, centered on exploring all current and prospective application areas of AI for the EIC. This workshop is not only beneficial for the EIC, but also provides valuable insights for the newly established ePIC collaboration at EIC. …


Modeling Of Covid-19 Clinical Outcomes In Mexico: An Analysis Of Demographic, Clinical, And Chronic Disease Factors, Livia Clarete 2024 The Graduate Center, City University of New York

Modeling Of Covid-19 Clinical Outcomes In Mexico: An Analysis Of Demographic, Clinical, And Chronic Disease Factors, Livia Clarete

Dissertations, Theses, and Capstone Projects

This study explores COVID-19 clinical outcomes in Mexico, focusing on demographic, clinical, and chronic disease variables to develop predictive models. In the binary classification task, the Ada Boost Classifier distinguishes survivors from non-survivors, with age, sex, ethnicity, and chronic medical conditions influencing outcomes. In multiclass classification, the Gradient Boosting Classifier categorizes patients into outcome groups.

Demographic variables, especially age, are crucial for predicting COVID-19 outcomes for both the binary and multiclass classification tasks. Clinical information about previous conditions, including chronic diseases, also holds relevance, especially diabetes, immunocompromise, and cardiovascular diseases. These insights inform public health measures and healthcare strategies, emphasizing …


Clustering Of Patients With Heart Disease, Mukadder Cinar 2024 The Graduate Center, City University of New York

Clustering Of Patients With Heart Disease, Mukadder Cinar

Dissertations, Theses, and Capstone Projects

Heart disease, a leading cause of mortality worldwide, presents complex challenges in public health due to its varied manifestations. Accurate diagnosis and patient stratification are essential for effective management and improved outcomes. In response, this study employed machine learning techniques to analyze heart disease data obtained from UCI Machine Learning Repository, aiming to enhance patient care through advanced data analysis.

The study began with the application of K-Nearest Neighbors (KNN) classification, which categorized patients into 'Disease' and 'No Disease' groups. This preliminary step provided initial insights into the structure of the dataset. Subsequently, K-means clustering was applied in two rounds, …


The Impact Of Accessible Data On Cyberstalking, Elise Kwan 2024 Purdue University

The Impact Of Accessible Data On Cyberstalking, Elise Kwan

The Journal of Purdue Undergraduate Research

No abstract provided.


Model Selection Through Cross-Validation For Supervised Learning Tasks With Manifold Data, Derek Brown 2024 Purdue University Fort Wayne

Model Selection Through Cross-Validation For Supervised Learning Tasks With Manifold Data, Derek Brown

The Journal of Purdue Undergraduate Research

No abstract provided.


Machine Learning Of Big Data: A Gaussian Regression Model To Predict The Spatiotemporal Distribution Of Ground Ozone, Jerry Gu 2024 Purdue University

Machine Learning Of Big Data: A Gaussian Regression Model To Predict The Spatiotemporal Distribution Of Ground Ozone, Jerry Gu

The Journal of Purdue Undergraduate Research

Tracking pollution levels on the ground is important to the environment and public health. One of the pollutants of concern is ozone, which, at high concentrations, can cause respiratory and cardiovascular problems. The National Center for Atmospheric Research (NCAR) has published valuable ozone data obtained from ground-based sensors installed at selected locations. Because it is unfeasible to measure the exact ozone levels everywhere at any time, it would be valuable to predict the temporal-spatial distributions of ozone concentration based on existing data. This would help us better understand the patterns and trends in the data and make better decisions to …


A Computational Profile Of Invasive Lionfish In Belize: A New Insight On A Destructive Species, Joshua E. Balan 2024 Purdue University

A Computational Profile Of Invasive Lionfish In Belize: A New Insight On A Destructive Species, Joshua E. Balan

The Journal of Purdue Undergraduate Research

Since their discovery in the region in 2009, invasive Indonesian-native lionfish have been taking over the Belize Barrier Reef. As a result, populations of local species have dwindled as they are either eaten or outcompeted by the invaders. This has led to devastating losses ecologically and economically; massive industries in the local nations, such as fisheries and tourism, have suffered greatly. Attempting to combat this, local organizations, from nonprofits to ecotourism companies, have been manually spear-hunting them on scuba dives to cull the population. One such company, Reef Conservation Institute (ReefCI), operating out of Tom Owens Caye outside of Placencia, …


Henderson Named One Of The Most Influential People In Legal Education, James Owsley Boyd 2024 Maurer School of Law: Indiana University

Henderson Named One Of The Most Influential People In Legal Education, James Owsley Boyd

Keep Up With the Latest News from the Law School (blog)

Indiana University Maurer School of Law Professor Bill Henderson has once again been recognized as one of the most influential people in legal education, but he’s not the only one with ties to the Law School on this year’s list.

The National Jurist ranked Henderson #18 on its list. Kellye Testy, a 1991 alumna of the Law School and president and CEO of the Law School Admission Council, is ranked second.


Molecular Understanding And Design Of Deep Eutectic Solvents And Proteins Using Computer Simulations And Machine Learning, Usman Lame Abbas 2024 University of Kentucky

Molecular Understanding And Design Of Deep Eutectic Solvents And Proteins Using Computer Simulations And Machine Learning, Usman Lame Abbas

Theses and Dissertations--Chemical and Materials Engineering

Hydrophobic deep eutectic solvents (DESs) have emerged as excellent extractants. A major challenge is the lack of an efficient tool to discover DES candidates. Currently, the search relies heavily on the researchers’ intuition or a trial-and-error process, which leads to a low success rate or bypassing of promising candidates. DES performance depends on the heterogeneous hydrogen bond environment formed by multiple hydrogen bond donors and acceptors. Understanding this heterogeneous hydrogen bond environment can help develop principles for designing high performance DESs for extraction and other separation applications. This work investigates the structure and dynamics of hydrogen bonds in hydrophobic DESs …


In Pursuit Of Consumption-Based Forecasting, Charles Chase, Kenneth B. Kahn 2024 SAS

In Pursuit Of Consumption-Based Forecasting, Charles Chase, Kenneth B. Kahn

Marketing Faculty Publications

[Introduction] Today's most mature, most sophisticated, best-in-class forecasting is what we call consumption-based forecasting (CBF). In contrast, the least sophisticated companies typically do not forecast at all, but rather set financial targets based on management expectations. Companies beginning to use statistical forecasting techniques usually take a supply-centric orientation, relying on time series techniques applied to shipment and/or order history. The next stage of progression is to incorporate promotions data, economic data, and market data alongside supply-centric data so that regression and other advanced analytics can be used. Companies pursing CBF utilize even more advanced capabilities to capture, examine, and understand …


Machine Learning Approaches For Cyberbullying Detection, Roland Fiagbe 2024 University of Central Florida

Machine Learning Approaches For Cyberbullying Detection, Roland Fiagbe

Data Science and Data Mining

Cyberbullying refers to the act of bullying using electronic means and the internet. In recent years, this act has been identifed to be a major problem among young people and even adults. It can negatively impact one’s emotions and lead to adverse outcomes like depression, anxiety, harassment, and suicide, among others. This has led to the need to employ machine learning techniques to automatically detect cyberbullying and prevent them on various social media platforms. In this study, we want to analyze the combination of some Natural Language Processing (NLP) algorithms (such as Bag-of-Words and TFIDF) with some popular machine learning …


Predicting Superconducting Critical Temperature Using Regression Analysis, Roland Fiagbe 2024 University of Central Florida

Predicting Superconducting Critical Temperature Using Regression Analysis, Roland Fiagbe

Data Science and Data Mining

This project estimates a regression model to predict the superconducting critical temperature based on variables extracted from the superconductor’s chemical formula. The regression model along with the stepwise variable selection gives a reasonable and good predictive model with a lower prediction error (MSE). Variables extracted based on atomic radius, valence, atomic mass and thermal conductivity appeared to have the most contribution to the predictive model.


A Bayesian Inversion For Emissions And Export Productivity Across The End-Cretaceous Boundary, Alexander A. Cox 2024 Dartmouth College

A Bayesian Inversion For Emissions And Export Productivity Across The End-Cretaceous Boundary, Alexander A. Cox

Dartmouth College Master’s Theses

The end-Cretaceous mass extinction was marked by both the Chicxulub impact and the ongoing emplacement of the Deccan Traps flood basalt province. Both of these events perturbed the environment by the emission of climate-active volatiles, primarily CO2 and SO2. To understand the mechanism of extinction, we must disentangle the timing, duration, and intensity of volcanic and meteoritic environmental forcings. In this thesis, we used a parallel Markov chain Monte Carlo approach to invert for the aforementioned volatile emissions, export productivity, and remineralization from 67 to 65 million years ago using the LOSCAR (Long-term Ocean-atmosphere-Sediment CArbon cycle Reservoir) model. The parallel …


Towards Algorithmic Justice: Human Centered Approaches To Artificial Intelligence Design To Support Fairness And Mitigate Bias In The Financial Services Sector, Jihyun Kim 2024 Claremont Colleges

Towards Algorithmic Justice: Human Centered Approaches To Artificial Intelligence Design To Support Fairness And Mitigate Bias In The Financial Services Sector, Jihyun Kim

CMC Senior Theses

Artificial Intelligence (AI) has positively transformed the Financial services sector but also introduced AI biases against protected groups, amplifying existing prejudices against marginalized communities. The financial decisions made by biased algorithms could cause life-changing ramifications in applications such as lending and credit scoring. Human Centered AI (HCAI) is an emerging concept where AI systems seek to augment, not replace human abilities while preserving human control to ensure transparency, equity and privacy. The evolving field of HCAI shares a common ground with and can be enhanced by the Human Centered Design principles in that they both put humans, the user, at …


Methods That Support The Validation Of Agent-Based Models: An Overview And Discussion, Andrew Collins, Matthew Koehler, Christopher Lynch 2024 Old Dominion University

Methods That Support The Validation Of Agent-Based Models: An Overview And Discussion, Andrew Collins, Matthew Koehler, Christopher Lynch

Engineering Management & Systems Engineering Faculty Publications

Validation is the process of determining if a model adequately represents the system under study for the model’s intended purpose. Validation is a critical component in building the credibility of a simulation model with its end-users. Effectively conducting validation can be a daunting task for both novice and experienced simulation developers. Further compounding the difficult task of conducting validation is that there is no universally accepted approach for assessing a simulation. These challenges are particularly relevant to the paradigm of Agent-Based Modeling and Simulation (ABMS) because of the complexity found in these models’ mechanisms and in the real-world situations they …


A Holistic Approach To Performance Prediction In Collegiate Athletics: Player, Team, And Conference Perspectives, Christopher Taber, S. Sharma, Mehul S. Raval, Samah Senbel, Allison Keefe, Jui Shah, Emma Patterson, Julie K. Nolan, N.S. Artan, Tolga Kaya 2024 Sacred Heart University

A Holistic Approach To Performance Prediction In Collegiate Athletics: Player, Team, And Conference Perspectives, Christopher Taber, S. Sharma, Mehul S. Raval, Samah Senbel, Allison Keefe, Jui Shah, Emma Patterson, Julie K. Nolan, N.S. Artan, Tolga Kaya

Exercise Science Faculty Publications

Predictive sports data analytics can be revolutionary for sports performance. Existing literature discusses players' or teams' performance, independently or in tandem. Using Machine Learning (ML), this paper aims to holistically evaluate player-, team-, and conference (season)-level performances in Division-1 Women's basketball. The players were monitored and tested through a full competitive year. The performance was quantified at the player level using the reactive strength index modified (RSImod), at the team level by the game score (GS) metric, and finally at the conference level through Player Efficiency Rating (PER). The data includes parameters from training, subjective stress, sleep, and recovery (WHOOP …


Digital Commons powered by bepress