Open Access. Powered by Scholars. Published by Universities.®

Physical Sciences and Mathematics Commons

Data Science

University of Central Florida

Articles 1 - 25 of 25

Full-Text Articles in Physical Sciences and Mathematics

Cyberbullying Detection On Twitter Data Using Machine Learning Classifiers, Pradip Dhakal May 2024

Data Science and Data Mining

This study compares several popular machine learning techniques, namely Logistic Regression, Multinomial Naive Bayes, K-Nearest Neighbor, and Extreme Gradient Boosting, for classifying tweets into three categories: cyberbullying based on religion, cyberbullying based on ethnicity, or no cyberbullying. First, various data-cleaning approaches are used to clean the tweet data. Once the data is clean, word embedding techniques, such as bag of words and term frequency-inverse document frequency (TF-IDF), are used to convert the words into numerical vectors. Finally, the model will be fitted using the combination of the above-mentioned word embedding techniques and machine …
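As a rough illustration of the vectorization step described in this abstract, the sketch below computes bag-of-words counts and TF-IDF weights in plain Python. The function name `tfidf`, the smoothed IDF variant, and the example texts are assumptions made for illustration; they are not taken from the study.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return (vocab, rows); rows[i][j] is the TF-IDF weight of vocab[j] in docs[i]."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed inverse document frequency (one common variant, assumed here).
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1.0 for t in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)  # raw term counts are the bag-of-words vector
        rows.append([counts[t] * idf[t] for t in vocab])
    return vocab, rows


# Hypothetical, made-up example texts (not from the study's data).
tweets = ["you are great", "you are awful", "great game today"]
vocab, X = tfidf(tweets)
```

Terms that occur in fewer documents receive a larger weight, so a rare word such as "awful" counts for more than a common word such as "you" within the same text.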


Predicting Superconducting Critical Temperature Using Regression Analysis, Roland Fiagbe Jan 2024

Data Science and Data Mining

This project estimates a regression model to predict the superconducting critical temperature based on variables extracted from the superconductor’s chemical formula. The regression model, combined with stepwise variable selection, gives a reasonable predictive model with a lower prediction error (MSE). Variables extracted from atomic radius, valence, atomic mass, and thermal conductivity appeared to contribute most to the predictive model.


Advancing Cancer Classification Through Machine Learning Analysis Of RNA-Seq Gene Expression Data, Emil Agbemade, Amina Issoufou Anaroua, Dimitri Bamba Jan 2024

Data Science and Data Mining

This study delves into the classification of various cancer types using the RNA-Seq (HiSeq) PANCAN dataset from the UCI Machine Learning Repository, which encompasses a rich collection of gene expression data across multiple tumor samples. To improve cancer diagnosis and treatment, our methodology confronts the challenges inherent in high-dimensional datasets, such as the Hughes Effect and the Curse of Dimensionality, through innovative feature selection methods and machine learning approaches. A key component of our strategy includes the use of tree-based algorithms, particularly Random Forest, to refine the dataset to seventy genes of utmost relevance for tumor classification, and the application …


Combating Cyberbullying On Social Media: A Machine Learning Approach With Text Analysis On Twitter, Amir Alipour Yengejeh Jan 2024

Data Science and Data Mining

The popularity of electronic mobile devices, along with social media and networking websites, has increased tremendously in recent years. Most people around the world engage daily in a variety of cyberspace activities. Even though users can take advantage of these systems to exchange ideas and information, socialize, and be entertained, they may also face adverse behaviors such as toxicity, bullying, extremism, and cruelty. Recent statistics report that such behaviors have grown noticeably in cyberspace, to the point that they can threaten individuals and even whole communities. Thus, …


Diagnostic In Neuroimaging: A Comparative Study Of Deep Learning And Traditional Approaches, Amina Issoufou Anaroua Jan 2024

Data Science and Data Mining

In the realm of medical diagnostics, precise classification of brain tumors is pivotal. This study conducts a comprehensive comparative analysis of a Convolutional Neural Network (CNN) against traditional machine learning models, Logistic Regression (LR) and Support Vector Machines (SVM) on a dataset of MRI scans for multi-class brain tumor classification. The CNN, tailored for image recognition, is evaluated alongside LR and SVM, which have established benchmarks in classification tasks. The investigation reveals that the traditional models hold their ground in terms of precision and interpretability, with the SVM, in particular, achieving remarkable accuracy. However, the CNN distinguishes itself by demonstrating …


Optimizing AI With Advanced Data Structuring: A Comparative Analysis Of K-Means And GMM Clustering Techniques, Amir Alipour Yengejeh Jan 2024

Data Science and Data Mining

This study presents a detailed comparison of K-means and Gaussian Mixture Model (GMM) clustering algorithms, illustrating their unique capabilities and limitations across various synthetic datasets. By utilizing metrics such as the Adjusted Rand Index (ARI) and Normalized Mutual Information (NMI), the research provides nuanced insights into how these algorithms handle datasets with varying structures and complexities. For instance, while both K-means and GMM show robust performance on well-separated clusters, GMM demonstrates a distinct advantage in scenarios with overlapping clusters or unbalanced data distributions. Conversely, K-means excels in identifying clear, distinct groupings, highlighting its utility in simpler clustering contexts. This study …
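To make the Adjusted Rand Index mentioned in this abstract concrete, the following is a minimal sketch of the standard pair-counting formula in plain Python. The function name `adjusted_rand_index` and the toy label lists are illustrative assumptions, and the sketch expects at least two samples; it is not the study's evaluation code.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Pair-counting ARI: 1.0 for identical partitions, about 0 for random ones."""
    n = len(labels_true)  # assumed to be at least 2
    # Contingency counts: how many samples share each (true, predicted) pair.
    joint = Counter(zip(labels_true, labels_pred))
    sum_ij = sum(comb(c, 2) for c in joint.values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_a * sum_b / comb(n, 2)  # agreement expected by chance
    max_index = 0.5 * (sum_a + sum_b)
    if max_index == expected:
        return 1.0  # degenerate case, e.g. a single cluster in both labelings
    return (sum_ij - expected) / (max_index - expected)


# Toy labelings (hypothetical): the score ignores how clusters are named.
same_partition = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
crossed_partition = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

Because the index is corrected for chance, a clustering that merely relabels the true groups scores 1.0, while one that cuts across them can score below zero.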


Machine Learning Approaches For Cyberbullying Detection, Roland Fiagbe Jan 2024

Data Science and Data Mining

Cyberbullying refers to the act of bullying using electronic means and the internet. In recent years, it has been identified as a major problem among young people and even adults. It can negatively impact one’s emotions and lead to adverse outcomes like depression, anxiety, harassment, and suicide, among others. This has led to the need to employ machine learning techniques to automatically detect cyberbullying and prevent it on various social media platforms. In this study, we want to analyze the combination of some Natural Language Processing (NLP) algorithms (such as Bag-of-Words and TF-IDF) with some popular machine learning …


XGBoost Hyperberd Model Using Steam Platform, Yuh-Haur Chen Jan 2024

Data Science and Data Mining

This project investigates game pricing strategies in the Steam market using an XGBoost model, drawing motivation from Professor Xie's lecture, and presenting findings through a density plot that delineates two primary pricing strategies. A free-to-play approach, indicated by a significant hot spot, is adopted by developers focusing on post-purchase revenues through DLC, aesthetic purchases, and in-game transactions. This sailing strategy includes community-centric developers aiming to distribute their games for player engagement rather than profit.

The project illustrates the effectiveness of advanced modeling techniques in handling complex datasets, with significant predictive accuracy reflected by a reduced MSE from 0.3472 to 0.1397. …


Predicting Road Accident Injury Severity For Drivers In Automobile Crashes In United States Using Machine Learning Models And AI, Emil Agbemade, Benedict Kongyir Jan 2024

Data Science and Data Mining

This study analyzes data from the National Highway Traffic Safety Administration’s 2021 Crash Report Sampling System to identify key factors contributing to the severity of injuries in car accidents. By utilizing various machine learning algorithms and cross-validation techniques, we assessed metrics such as accuracy, sensitivity, precision, specificity, and the area under the curve (AUC) to evaluate the effectiveness of predictive models. All data preprocessing and model building were done using KNIME Analytical software [9]. Our findings reveal significant correlations between certain variables such as airbag injection, weather conditions, intoxication, vehicle state, driver distractions, and injury severity. These insights underscore the …


Bootstrap Regression For Investigating Macroeconomics Factors Affecting USA Home Prices, Benedict Kongyir, Emil Agbemade Jan 2024

Data Science and Data Mining

This study investigates the impact of macroeconomic indicators on US home prices, underscoring the importance of understanding these dynamics due to their significant socioeconomic consequences. Utilizing a dataset from Kaggle, originally collected by FRED, the research examines variables like the Consumer Price Index, Population, Unemployment, GDP, Stock Prices, Income, and Mortgage Rate to discern their effect on housing market fluctuations. The analysis identifies multicollinearity among predictors, necessitating a shift from traditional multiple linear regression to a more robust bootstrap regression method due to violations of parametric assumptions. Key findings reveal that Real Disposable Income is a significant predictor of home …
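The idea behind bootstrap regression can be sketched as follows, assuming a simple one-predictor model and a case-resampling (pairs) bootstrap with a percentile interval. The function names, the synthetic data, and the choice of 1,000 resamples are illustrative assumptions; the study's actual model and data are not reproduced here.

```python
import random


def ols_slope(xs, ys):
    """Least-squares slope of y on x for a single predictor."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx


def bootstrap_slope_ci(xs, ys, n_boot=1000, level=0.95, seed=0):
    """Percentile confidence interval for the slope via case resampling."""
    rng = random.Random(seed)
    n = len(xs)
    slopes = []
    while len(slopes) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]  # sample rows with replacement
        bx = [xs[i] for i in idx]
        if len(set(bx)) < 2:
            continue  # slope is undefined when all resampled x values coincide
        slopes.append(ols_slope(bx, [ys[i] for i in idx]))
    slopes.sort()
    lo = slopes[int((1 - level) / 2 * n_boot)]
    hi = slopes[int((1 + level) / 2 * n_boot) - 1]
    return lo, hi


# Synthetic data (hypothetical): true slope 2.0 plus Gaussian noise.
data_rng = random.Random(1)
xs = [float(i) for i in range(20)]
ys = [1.0 + 2.0 * x + data_rng.gauss(0, 1) for x in xs]
slope_hat = ols_slope(xs, ys)
ci_low, ci_high = bootstrap_slope_ci(xs, ys)
```

The interval comes from the empirical distribution of refitted slopes rather than from normal-theory standard errors, which is why the approach is attractive when parametric assumptions are in doubt.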


Modeling Health Insurance Premium Using Bayesian Hierarchical Models, Bennedict Kongyir, Emil Agbemade Jan 2024

Data Science and Data Mining

Insurance pricing requires pragmatism and creativity due to the unpredictable nature of risk [3]. This paper explores Bayesian hierarchical models to model health insurance premiums using individual and group predictors like demographics, health status, and geography. Data from Kaggle on health insurance policyholders was utilized, with prior distributions enhancing model interpretability and credibility. Bayesian models improve predictive accuracy and provide valuable insights for actuaries and policymakers, highlighting the significant impact of factors such as age and BMI on premium pricing.


Linear Regression With Regularization On The Genetic Architecture Of Maize Flowering Time, Roland Fiagbe May 2023

Data Science and Data Mining

For over a century, maize has been one of the most important crop species targeted for genetic investigations and experiments. One of the major experiments of interest is crossing inbred lines to produce better offspring through a process called heterosis. Crossing the inbred lines creates numerous SNP markers that determine the time to male flowering. This project seeks to explore the SNP markers and select the most relevant ones for predicting time to male flowering, using linear regression with regularization methods because p > n in our dataset. Various …


A Recommender System For Movie Ratings With Matrix Factorization Algorithm, Amir Alipour Yengejeh May 2023

Data Science and Data Mining

Nowadays, a Recommender System is a technology that aims to predict preferences based on the user’s selections. These systems are applied in numerous fields, such as movies, music, news, books, research articles, search queries, social tags, and various products. In this study, we use this tool to predict users’ ratings in the MovieLens datasets. To do so, we applied the matrix factorization algorithm and calculated the RMSE as our evaluation metric. The results show that the RMSE values estimated for the training and test sets are 0.83 and 0.93, which are very close to one another. This result indicates that …
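A minimal sketch of matrix factorization trained by stochastic gradient descent, with RMSE as the evaluation metric, might look like the following. The function names, hyperparameters, and the tiny ratings list are hypothetical and chosen only to illustrate the technique; they do not reproduce the study's model or the MovieLens data.

```python
import math
import random


def train_mf(ratings, n_users, n_items, k=2, lr=0.05, reg=0.02, epochs=200, seed=0):
    """Fit r_ui ~ mu + p_u . q_i by SGD on (user, item, rating) triples."""
    rng = random.Random(seed)
    P = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_users)]
    Q = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_items)]
    mu = sum(r for _, _, r in ratings) / len(ratings)  # global mean rating
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - (mu + sum(P[u][f] * Q[i][f] for f in range(k)))
            for f in range(k):
                pu, qi = P[u][f], Q[i][f]
                # Gradient step with L2 regularization on both factors.
                P[u][f] += lr * (err * qi - reg * pu)
                Q[i][f] += lr * (err * pu - reg * qi)
    return mu, P, Q


def rmse(ratings, mu, P, Q):
    """Root mean squared error of the model on a list of ratings."""
    sq = 0.0
    for u, i, r in ratings:
        pred = mu + sum(pf * qf for pf, qf in zip(P[u], Q[i]))
        sq += (r - pred) ** 2
    return math.sqrt(sq / len(ratings))


# Hypothetical toy ratings: users 0-1 prefer items 0-1, users 2-3 prefer items 2-3.
ratings = [
    (0, 0, 5), (0, 1, 4), (0, 2, 1), (1, 0, 4), (1, 1, 5), (1, 3, 1),
    (2, 2, 5), (2, 3, 4), (2, 0, 1), (3, 2, 4), (3, 3, 5), (3, 1, 1),
]
mu, P, Q = train_mf(ratings, n_users=4, n_items=4)
train_rmse = rmse(ratings, mu, P, Q)
```

In practice the same `rmse` function would also be applied to held-out ratings, since comparing training and test error is how overfitting is usually judged.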


Genome-Wide Association Study Of The Maize Crop By The Lasso Regression Analysis, Amir Alipour Yengejeh May 2023

Data Science and Data Mining

The accurate estimation of the male flowering period in maize crops is key for the prediction of crop fertility. Recent scientific investigations have shown that genetic single nucleotide polymorphisms (SNPs) can contribute in this regard. A genome-wide association study (GWAS) is employed to generate these SNP attributes. However, this produced a high-dimensional dataset with 4,981 observations and 7,389 SNP attributes. Hence, in this study, we used the penalized regression approach with the least absolute shrinkage and selection operator (Lasso) to reduce the dataset. In this regard, we set the regularization parameter to 0.21. It resulted in a set …
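The way Lasso shrinks some coefficients exactly to zero can be sketched with cyclic coordinate descent and soft-thresholding. The objective scaling used here, (1/(2n))·||y − Xb||² + alpha·||b||₁, the function names, and the small hand-made design matrix are assumptions for illustration; the value 0.21 only echoes the abstract and need not correspond to the same parameterization.

```python
def soft_threshold(rho, alpha):
    """Shrink rho toward zero by alpha; values within [-alpha, alpha] become 0."""
    if rho > alpha:
        return rho - alpha
    if rho < -alpha:
        return rho + alpha
    return 0.0


def lasso_cd(X, y, alpha, n_iter=100):
    """Cyclic coordinate descent for (1/(2n))*||y - Xb||^2 + alpha*||b||_1."""
    n, p = len(X), len(X[0])
    b = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            # Correlation of feature j with the residual that excludes feature j.
            rho = 0.0
            for i in range(n):
                fitted_others = sum(X[i][m] * b[m] for m in range(p) if m != j)
                rho += X[i][j] * (y[i] - fitted_others)
            rho /= n
            z = sum(X[i][j] ** 2 for i in range(n)) / n
            b[j] = soft_threshold(rho, alpha) / z
    return b


# Hypothetical toy data with orthogonal columns: y = 2*x0 + 0.1*x1.
X = [[1, 1, 1], [-1, 1, -1], [1, -1, -1], [-1, -1, 1]]
y = [2.1, -1.9, 1.9, -2.1]
coef = lasso_cd(X, y, alpha=0.21)
```

With this toy design the weak second feature and the irrelevant third feature are dropped entirely, while the strong first coefficient is shrunk from 2.0 to about 1.79, which is the variable-selection behavior that makes Lasso useful for high-dimensional data.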


Analysis Of Credit Approval By Decision Tree, Amir Alipour Yengejeh May 2023

Data Science and Data Mining

Nowadays, machine learning algorithms are commonly used by financial institutions and bankers to evaluate applicants’ requests for credit cards. In this study, we used the decision tree algorithm to predict credit card approval based on associated applicant features such as age, employment status, and education level. Our results show that the applicants’ Prior Default, Debt, and Employed features contribute most to credit card approval.


Movie Recommender System Using Matrix Factorization, Roland Fiagbe May 2023

Data Science and Data Mining

Recommendation systems are a popular and beneficial field that can help people make informed decisions automatically. This technique assists users in selecting relevant information from an overwhelming amount of available data. When it comes to movie recommendations, two common methods are collaborative filtering, which compares similarities between users, and content-based filtering, which takes a user’s specific preferences into account. However, our study focuses on the collaborative filtering approach, specifically matrix factorization. Various similarity metrics are used to identify user similarities for recommendation purposes. Our project aims to predict movie ratings for unwatched movies using the MovieLens rating dataset. We developed …


Classification Of Adult Income Using Decision Tree, Roland Fiagbe Jan 2023

Data Science and Data Mining

The decision tree is a commonly used data mining methodology for performing classification tasks. It is a tree-based supervised machine learning algorithm that classifies or makes predictions by following a path determined by how previous questions are answered. Generally, the decision tree algorithm categorizes data into branch-like segments that develop into a tree that contains a root, nodes, and leaves. This project seeks to explore the decision tree methodology and apply it to the Adult Income dataset from the UCI Machine Learning Repository, to determine whether a person makes over 50K per year and determine the necessary factors that improve …


A Linear Regression Model To Predict The Critical Temperature Of A Superconductor, Amir Alipour Yengejeh Jan 2023

Data Science and Data Mining

Since superconductivity was first introduced, almost all studies in this area have striven to predict the critical temperature ($T_{c}$) through features extracted from the superconductor's chemical formula. In this study, therefore, we are interested in exploring the linear association between $T_{c}$ and the related features.


Variable Selection Using Lasso And Elastic Net Regression On High Dimensional Genetic Architecture Data Of Maize Flowering Time, Pradip Dhakal Jan 2023

Data Science and Data Mining

Variable selection is one of the key components of machine learning. It removes unwanted and redundant predictors from the model, which helps prevent overfitting. Since the model contains only a few significant predictors, it is less likely to learn the trend from the noise. Further, the time needed to train the model decreases when only a few valuable variables remain.


Variable Selection And Regression Analysis, Emil Agbemade Jan 2023

Data Science and Data Mining

One of the most valuable crop species, maize, has been the subject of genetic study and experimentation for more than a century. However, species that share similarities and differences across a wide spectrum have developed astonishing adaptations as a result of small changes throughout time. Because it is usual practice to determine the genotypes of thousands of single nucleotide polymorphism (SNP) markers for thousands of patients, the data set we are dealing with has an issue with small n and large p. The result of this is that there are noticeably more predictor factors than responder variables. The original data …


Developing A Data-Driven Statistical Model For Accurately Predicting The Superconducting Critical Temperature Of Materials Using Multiple Regression And Gradient-Boosted Methods, Emil Agbemade Jan 2023

Data Science and Data Mining

This study focuses on developing a statistical model for estimating the superconducting critical temperature (Tc) of materials using a data-driven strategy. The study analyzed 21,263 superconductors and used a combination of multiple regression and gradient-boosted models to make predictions. The analysis included a descriptive analysis of the distribution of Tc, feature selection using the Backwards selection method, and model diagnostics. The results showed that the gradient-boosted method outperformed the multiple linear regression method with an RMSE of 12.01 and an R2 value of 88.23 after fine-tuning its hyperparameters. The study concludes that the gradient-boosted method is an effective approach …


Machine Learning-Based Approaches For Predicting The Critical Temperature Of Superconductor, Pradip Dhakal Jan 2023

Data Science and Data Mining

This paper focuses on utilizing multiple linear regression, lasso regression, and extreme gradient boosting algorithms to predict the critical temperature of the superconductor. The model will be evaluated using the mean square error and adjusted R-squared values, and the best model will be recommended for future work related to this study.


Predicting Heart Disease Using Tree-Based Model, Emil Agbemade Jan 2023

Data Science and Data Mining

The paper presents a study on the use of machine learning algorithms for the prediction of heart disease, which is the leading cause of death worldwide. The study focuses on the use of decision tree algorithms, which have the advantage of considering a large number of risk factors. The heart disease data set was obtained from the UCI Machine Learning Repository and was analyzed using a decision tree classifier. The data set had 6 missing data points, which were deleted, leaving 279 instances for analysis. One-hot-encoding was performed on categorical variables with more than two responses. The decision tree classifier …


Silent Agony: Automated Detection Of Ethnic And Religious Cyberbullying Using Machine Learning, Emil Agbemade Jan 2023

Data Science and Data Mining

The use of electronic mobile devices, social media, and networking websites has increased tremendously in recent years. Despite the advantages of these systems, such as exchanging ideas and information, being sociable, and providing entertainment, users may encounter adverse behaviors like toxicity, bullying, extremism, and cruelty. The prevalence of such behaviors has grown significantly in cyberspace, posing a threat to individuals and communities. To address this issue, there is a high demand for automated cyberbullying detection systems. Machine learning algorithms have been widely used to build such systems by classifying and detecting cyberbullying. In this study, we employed popular machine learning …


Analyzing The Impact Of Health, Economic, And Demographic Factors On Life Expectancy: A Comparative Study Of Developed And Developing Countries, Mahyar Alinejad Jan 2023

Data Science and Data Mining

This study presents a comprehensive analysis of three prominent machine learning regression models—Random Forest, XGBoost, and Support Vector Machine (SVM)—in the context of predictive analysis. Leveraging a carefully curated dataset, we explore the impact of various hyperparameters on model performance through an exhaustive tuning process. The Random Forest and XGBoost models exhibit robust predictive capabilities, with the former revealing notable insights through feature importance visualization. Additionally, SVM, optimized via GridSearchCV, demonstrates competitive performance. Evaluation metrics, including Mean Squared Error and R-squared, facilitate a thorough comparison of model efficacy. Results highlight nuanced strengths and weaknesses, informing practitioners on the suitability of …