Open Access. Powered by Scholars. Published by Universities.®

Data Science Commons

2021

Articles 1 - 30 of 367

Full-Text Articles in Data Science

On Performance Optimization And Prediction Of Parallel Computing Frameworks In Big Data Systems, Haifa Alquwaiee Dec 2021

Dissertations

A wide spectrum of big data applications in science, engineering, and industry generate large datasets, which must be managed and processed in a timely and reliable manner for knowledge discovery. These tasks are now commonly executed in big data computing systems exemplified by Hadoop based on parallel processing and distributed storage and management. For example, many companies and research institutions have developed and deployed big data systems on top of NoSQL databases such as HBase and MongoDB, and parallel computing frameworks such as MapReduce and Spark, to ensure timely data analyses and efficient result delivery for decision making and business …


Private And Federated Deep Learning: System, Theory, And Applications For Social Good, Han Hu Dec 2021

Dissertations

During the past decade, drug abuse has continued to accelerate towards becoming the most severe public health problem in the United States. The ability to detect drug-abuse risk behavior at a population scale, such as among the population of Twitter users, can help to monitor the trend of drug-abuse incidents. However, traditional methods do not effectively detect drug-abuse risk behavior in tweets, mainly due to the sparsity of such tweets and the noisy nature of tweets. In the first part of this dissertation work, the task of classifying tweets as containing drug-abuse risk behavior or not is studied. Millions of public …


A Comparison Of K-Means And Agglomerative Clustering For Users Segmentation Based On Question Answerer Reputation In Brainly Platform, Puji Winar Cahyo, Landung Sudarmana Dec 2021

Elinvo (Electronics, Informatics, and Vocational Education)

Brainly is a question and answer (Q&A) site that students can use as a medium for asking and answering questions. Students can also use Brainly to find and share educational information that helps them solve their homework problems. In Brainly, users can answer questions according to their interests. However, a user's interests do not necessarily match their competencies. As a result, many answers do not receive a high rating because they are of too low quality to be prioritized as the main answer. This study aims to apply the K-Means and Agglomerative Clustering …


A Novel Arabic Corpus For Text Classification Using Deep Learning And Word Embedding, Roua A. Abou Khachfeh, Islam El Kabani, Ziad Osman Dec 2021

BAU Journal - Science and Technology

In recent years, Natural Language Processing (NLP) for the Arabic language has gained increasing importance due to the massive amount of textual information available online in an unstructured text format, and its ability to make information retrieval easier. One of the widely used NLP tasks is “Text Classification”. Its goal is to employ machine learning techniques to automatically classify text documents into one or more predefined categories. An important step in machine learning is to find suitable and large data for training and testing an algorithm. Moreover, Deep Learning (DL), the trending machine learning research, requires a lot of …


Prediction Of Body Fat Percentage Based On Anthropometric Measurements Using Data Mining Approach, Hamsa Amro, Prof. Mohammed Awad Dec 2021

Journal of the Arab American University مجلة الجامعة العربية الامريكية للبحوث

In recent years, heart disease, diabetes, and some types of cancer have been reported as some of the main causes of death in most countries of the world, and obesity, which is often attributed to excess body fat, is one of the most common risk factors for these diseases. To make the vast amounts of data produced by health care information systems useful to their full potential, the researchers applied knowledge discovery through predictive modeling. This study used anthropometric measurements as input data to different data mining techniques to predict body fat percentage. Fisher’s Method of Scoring was used to select the most …


Application Of Competitive Intelligence For Insular Territories: Automatic Analysis Of Scientific And Technology Trends To Fight The Negative Effects Of Climate Change, Henri Dou, Pierre Fournie Dec 2021

International Journal of Islands Research

Islands are fragile territories because of their geographical position. As a result, climate impacts can have serious consequences, some of which are irreversible. Therefore, it is necessary to allow insular territories to benefit from the latest scientific and technological advances in combating climate effects. The current article shows how to deal with the automatic analysis of scientific information on the one hand, and with its applications via patents on the other. We will analyse the latest scientific results as well as their possible applications using patent analysis. We will also focus on experts, laboratories, and leading companies that are active in the field. The …


Physics-Informed Machine Learning To Predict Extreme Weather Events, Rthvik Raviprakash, Jonathan Buchanan, Mahdi Bu Ali Dec 2021

Discovery Undergraduate Interdisciplinary Research Internship

Extreme weather events refer to unexpected, severe, or unseasonal weather events, which are dynamically related to specific large-scale atmospheric patterns. These extreme weather events have a significant impact on human society and also natural ecosystems. For example, natural disasters due to extreme weather events caused more than $90 billion global direct losses in 2015. These extreme weather events are challenging to predict due to the chaotic nature of the atmosphere and are highly correlated with the occurrence of atmospheric blocking. A key aspect for preparedness and response to extreme climate events is accurate medium-range forecasting of atmospheric blocking events.

Unlike …


A Distance-Based Clustering Framework For Categorical Time Series: A Case Study In Episodes Of Care Healthcare Delivery System, Lauren Staples Dec 2021

Doctor of Data Science and Analytics Dissertations

Understanding how compensation structures influence overall healthcare costs is a central issue in health economics. Episodes of Care (EoC) is a compensation structure that bundles payments for healthcare interventions that belong to a well-defined health event. Since the variation of clinical pathways can drive the cost of healthcare, this research uses sequences of medical billing codes in Perinatal Episodes of Care claims data to study the extent of that variation by equating it to the number of reproducible clusters found. This research proposes a methodological framework to detect reproducible clusters in an unsupervised problem where the true number of clusters …


“Transitioning Organisations From A Data Quagmire To Knowledge Nirvana Through The Digital Thread”, David Twohig, Barry Heavey Dec 2021

Level 3

Historically, organisations have managed product data in a combination of Microsoft Office, Sharepoint and Document Management Systems. In this paper, we explore how different technologies can be leveraged to create digital product profiles, and in doing so structure data to enable effective knowledge management.


Introduction To Using Python In The Digital Humanities, Elisabeth Shook Dec 2021

Library Faculty Publications and Presentations

The materials here are from the Python for Digital Humanities Workshop taught on December 13, 2021 for the Boise State University Digital Humanities Group. This 3-hour workshop was created to provide both a very brief introduction to the various capabilities of Python and a small lesson in using Python to pull meaningful insight out of text files.


Aspect-Based Sentiment Analysis Of Movie Reviews, Samuel Onalaja, Eric Romero, Bosang Yun Dec 2021

SMU Data Science Review

This study investigates a comparison of classification models used to determine aspect based separated text sentiment and predict binary sentiments of movie reviews with genre and aspect specific driving factors. To gain a broader classification analysis, five machine and deep learning algorithms were compared: Logistic Regression (LR), Naive Bayes (NB), Support Vector Machine (SVM), and Recurrent Neural Network Long-Short-Term Memory (RNN LSTM). The various movie aspects that are utilized to separate the sentences are determined through aggregating aspect words from lexicon-base, supervised and unsupervised learning. The driving factors are randomly assigned to various movie aspects and their impact tied to …


Reading Level Identification Using Natural Language Processing Techniques, William Arnost, Ellen Lull, Joseph Schueder, Joseph Engler Dec 2021

SMU Data Science Review

This paper investigates using the Bidirectional Encoder Representations from Transformers (BERT) algorithm and lexical-syntactic features to measure readability. Readability is important in many disciplines, for functions such as selecting passages for school children, assessing the complexity of publications, and writing documentation. Text at an appropriate reading level will help make communication clear and effective. Readability is primarily measured using well-established statistical methods. Recent advances in Natural Language Processing (NLP) have had mixed success incorporating higher-level text features in a way that consistently beats established metrics. This paper contributes a readability method using a modern transformer technique and compares the results …


Predicting Power Using Time Series Analysis Of Power Generation And Consumption In Texas, Joshua Eysenbach, Bodie Franklin, Andrew J. Larsen, Joel Lindsey Dec 2021

SMU Data Science Review

Due to the recent power events in Texas, power forecasting has been brought to national attention. Accurate demand forecasting is necessary to be sure that there is adequate power supply to meet consumers' needs. While Texas has a forecasting model created by the Electric Reliability Council of Texas (ERCOT), constant efforts are required to ensure that the model stays state-of-the-art and produces the most reliable forecasts possible. This research seeks to provide improved short- and medium-term forecasting models, bringing in state-of-the-art deep learning models to compare to ERCOT’s forecasts. A model that is more accurate than ERCOT’s own …


Emotion Integrated Music Recommendation System Using Generative Adversarial Networks, Mrinmoy Bhaumik, Patrica U. Attah, Faizan Javed Dec 2021

SMU Data Science Review

Music can stimulate emotions within us; hence it is often called the “language of emotion.” This study explores emotion as an additional feature in generating a playlist with a deep learning model to improve the current music recommendation system. This study will sample emotions from certain subjects for each song in a sample of the data. Since the effect of music on emotion is subjective and differs from person to person, this study would need a considerable number of subjects to reduce subjectivity. Due to the limited resources, a portion of the data will be labeled with emotion from subjects and …


Alternative Methods For Deriving Emotion Metrics In The Spotify® Recommendation Algorithm, Ronald M. Sherga Jr., David Wei, Neil Benson, Faizan Javed Dec 2021

SMU Data Science Review

Spotify's® recommendation algorithm tailors music offerings to create a unique listening experience for each user. Though what this recommender does is highly impressive, there is always room for improvement given that these techniques are not fully prescient. This study posits that in addition to creating certain features based on audio analysis, incorporating new features derived from album art color as well as lyrical sentiment analysis may provide additional value to the end user. This team did not find that a significant difference existed between color valence and Spotify® valence; however, all other comparisons resulted in statistically significant difference of means …


Clinical Diagnosis Support With Convolutional Neural Network By Transfer Learning, Spencer Fogleman, Jeremy Otsap, Sangrae Cho Dec 2021

SMU Data Science Review

Breast cancer is prevalent among women in the United States. Breast cancer screening is standard but requires a radiologist to review screening images to make a diagnosis. Diagnosis through the traditional screening method of mammography currently has an accuracy of about 78% for women of all ages and demographics. A more recent and precise technique called Digital Breast Tomosynthesis (DBT) has been shown to be more promising but is less well studied. A machine learning model trained on DBT images has the potential to increase the success of identifying breast cancer and reduce the time it takes to diagnose a patient, …


Covid-19 - A Graph Network Approach, Nibhrat Lohia, Rajesh Satluri, Suchismita Moharana, Venkat Kasarla Dec 2021

SMU Data Science Review

The effects of COVID-19 and its spread are attributed to various factors. This study uses CDC open-source data on the COVID-19-affected population, with features ranging from location to ethnicity, to create a Knowledge Graph to measure the similarity between COVID-19 cases and estimate the risk for people likely affected by COVID-19. This data could be used to find correlations between distinct factors, like ethnicity and pre-existing health conditions, to find the vulnerability of a given COVID-19 patient. Using the Jaccard similarity coefficient in the knowledge graph, we are able to identify and explore relationships between COVID-19 cases as well as …
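
For readers unfamiliar with the Jaccard similarity coefficient mentioned in the abstract above, a minimal sketch follows. It assumes each case is represented as a set of (feature, value) pairs; the function name `jaccard` and the sample records are hypothetical and are not taken from the study or from CDC data.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B|.

    By convention here, two empty sets have similarity 0.0.
    """
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


# Hypothetical case records: each case is a set of (feature, value) pairs.
case_a = {("state", "TX"), ("age_group", "50-59"), ("condition", "diabetes")}
case_b = {("state", "TX"), ("age_group", "50-59"), ("condition", "asthma")}

# Two of the four distinct pairs are shared by both cases.
similarity = jaccard(case_a, case_b)
print(similarity)
```

In a graph setting, a score like this could serve as an edge weight between two case nodes, although how the study builds its graph is not stated in the truncated abstract.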


Rocket Learn, Daanesh Ibrahim, Jules Stacy, David Stroud, Yusi Zhang Dec 2021

SMU Data Science Review

This paper covers the development, testing, and implementation of Reinforcement Learning methods designed to autonomously learn and optimize Rocket League play. This study aims to analyze and benchmark model frameworks commonly used in Reinforcement Learning applications. These models can be applied to tasks ranging in difficulty from simple to superhumanly complex, and this study will begin with and build upon simple models performing simple tasks. It will result in complex models performing difficult tasks. Models will be allowed to train autonomously on the game using mass parallelization to expedite training times with the goal of maximizing reward function scores. …


PokéGAN: P2P (Pet To Pokémon) Stylizer, Michael B. Hedge, Morgan Nelson, Thomas Pengilly, Michael Weatherford Dec 2021

SMU Data Science Review

This paper covers the development, testing, and implementation of an automatic framework for converting common images of pets into a Pokémon cartoon with the style of a Pokémon trading card. The technique will first implement object detection for common animals to facilitate image segmentation and apply the appropriate style transfer model to ensure the most aesthetic stylization. It explores various methods to address artifacts in the results of common neural style transfer techniques using Generative Adversarial Networks (GANs). This research sets up a framework to create an app that converts user-submitted pet pictures to Pokémon styled images using the most …


Machine Learning Approach To Distinguish Ulcerative Colitis And Crohn’s Disease Using SMOTE (Synthetic Minority Oversampling Technique) Methods, Kris Ghimire, Walter Lai, Yasser Omar, Thad Schwebke, Jamie Vo Dec 2021

SMU Data Science Review

Inflammatory Bowel Disease (IBD) affects a sizable portion of the US population, causing symptoms such as vomiting, abdominal pain, and diarrhea. Despite the disease’s prevalence, the precise cause is not fully understood. This study consists of endoscopic and histological data from patients diagnosed with IBD and a control population for reference. The machine learning models' focus is to classify patients into IBD types. Several models were analyzed, including decision trees, logistic regression, and k-nearest neighbors. In addition, various methods of SMOTE were applied to determine the most effective transformation and to ensure that the dataset is balanced. The best model with …
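
As an illustration of the oversampling idea named in the abstract above, the sketch below shows only the core interpolation step of SMOTE: a synthetic minority sample is placed at a random point on the line segment between a minority sample and one of its minority-class neighbors. The function name `smote_sample` and the example points are hypothetical, neighbor selection (normally by k-nearest neighbors) is omitted, and this is not the authors' implementation.

```python
import random


def smote_sample(x, neighbor, rng=random):
    """Return one synthetic point on the segment between x and neighbor.

    x, neighbor: equal-length sequences of floats (two minority-class samples).
    rng: any object with a .random() method returning a float in [0, 1).
    """
    lam = rng.random()  # interpolation weight in [0, 1)
    return [xi + lam * (ni - xi) for xi, ni in zip(x, neighbor)]


# Hypothetical minority-class samples with two features each.
rng = random.Random(0)  # seeded so the sketch is reproducible
synthetic = smote_sample([0.0, 0.0], [2.0, 4.0], rng)
print(synthetic)
```

Repeating this step for randomly chosen minority samples and neighbors adds synthetic minority points until the classes are closer to balanced.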


Urban Traffic Simulation: Network And Demand Representation Impacts On Congestion Metrics, Aaron Faltesek, Balasubramaniam Dakshinamoorthi, Sreeni Prabhala, Akbar Thobani, Anu Kuncheria, Jane Macfarlane Dec 2021

SMU Data Science Review

Traffic simulations are often used by city planners as a basis for predicting the impact of policies, plans, and operations. The complexities underpinning traffic simulations are often not described in detail yet can significantly impact the simulation outcome. Conflating underlying data for simulations is complex and hinders the interest in this type of exploration. This paper aims to elucidate critical features of traffic simulations that drive the generated metrics of the modeled urban environment. Specifically, this paper examines differences in two road graph networks for the metropolitan region of Houston, TX: a reduced network composed of 45,675 road links and …


Intelligent Investment Portfolio Management Using Time-Series Analytics And Deep Reinforcement Learning, Sachin Chavan, Pradeep Kumar, Tom Gianelle Dec 2021

SMU Data Science Review

With globalization, the capital markets have exploded in size and value, making them exceedingly difficult to predict. These days the public has access to real-time market data, leading to more participation. As a positive step, this might lead to better wealth distribution in society, but it also adds to the random nature of the market, making it more unpredictable. The portfolio accounts consisting of stocks and bonds are considered serious investment assets. They can make or break a person’s future. It is also a way of shielding one against market risk or rising inflation. These accounts, …


Identifying Vacant Lots To Reduce Violent Crime In Dallas, Texas, Laura Lazarescou, Andrew Mejia, Tina Pai, Sabrina Purvis, Robert Slater, Owen Wilson-Chavez Dec 2021

SMU Data Science Review

Vacant lots have been associated with community violence for many years. Researchers have confirmed a positive correlation between vacant lots and vacant buildings with increased violence in urban and rural geographies. However, identifying vacant lots has been a challenge, and modeling methods were largely manual and time-intensive. This prevented cities and non-profit organizations from acting on the information since it was expensive and high-risk to develop remediation programs without clearly understanding where or how many vacant lots existed.

The primary objective of this study was to provide a predictive model that accelerates and improves the accuracy of prior land classification …


Identification And Characterization Of Forest Fire Risk Zones Leveraging Machine Learning Methods, Joshua Balson, Matt Chinchilla, Cam Lu, Jeff Washburn, Nibhrat Lohia Dec 2021

SMU Data Science Review

Across the United States, record numbers of wildfires are observed costing billions of dollars in property damage, polluting the environment, and putting lives at risk. The ability of emergency management professionals, city planners, and private entities such as insurance companies to determine if an area is at higher risk of a fire breaking out has never been greater. This paper proposes a novel methodology for identifying and characterizing zones with increased risks of forest fires. Methods involving machine learning techniques use the widely available and recorded data, thus making it possible to implement the tool quickly.


Qualitative Leveraging Natural Language Processing To Establish Judge Incrimination Statistics To Educate Voters In Re-Elections, Aurian Ghaemmaghami, Paul Huggins, Grace Lang, Julia Layne, Robert Slater Dec 2021

SMU Data Science Review

The prevalence of data has given consumers the power to make informed choices based on reviews, ratings, and descriptive statistics. However, when a local judge is coming up for re-election, there is no available data that aids voters in making a data-driven decision on their vote. Currently, court docket data is stored in text or PDFs with very little uniformity. Scaling the collection of this information could prove to be complicated and tiresome. There is a demand for an automated, intelligent system that can extract and organize useful information from the datasets. This paper covers the process of web scraping …


Uncertainty-Aware Deep Learning For Prediction Of Remaining Useful Life Of Mechanical Systems, Samuel J. Cornelius Dec 2021

Theses and Dissertations

Remaining useful life (RUL) prediction is a problem that researchers in the prognostics and health management (PHM) community have been studying for decades. Both physics-based and data-driven methods have been investigated, and in recent years, deep learning has gained significant attention. When sufficiently large and diverse datasets are available, deep neural networks can achieve state-of-the-art performance in RUL prediction for a variety of systems. However, for end users to trust the results of these models, especially as they are integrated into safety-critical systems, RUL prediction uncertainty must be captured. This work explores an approach for estimating both epistemic and heteroscedastic …


Exploring The Impact Of Social Influence Mechanisms And Network Density On Societal Polarization, Justin Mittereder Dec 2021

Student Research Submissions

I present an agent-based model, inspired by the opinion dynamics (OD) literature, to explore the underlying behaviors that may induce societal polarization. My agents interact on a social network, in which adjacent nodes can influence each other, and each agent holds an array of continuous opinion values (on a 0-1 scale) on a number of separate issues. I use three measures as a proxy for the virtual society’s “polarization:” the average assortativity of the graph with respect to the agents’ opinions, the number of non-uniform issues, and the number of distinct opinion buckets in which agents have the same …


Analyzing And Detecting Android Malware And Deepfake, Md Shohel Rana Dec 2021

Dissertations

Rapid advances in artificial intelligence (AI), machine learning (ML), and deep learning (DL) over the past several decades have produced a variety of technologies and tools that, among numerous cybersecurity issues, have enticed cybercriminals and hackers to design malware for the Android operating systems and/or manipulate multimedia. For example, high-quality and realistic fake videos, images, or audios have been created to spread misinformation and propaganda, foment political discord and hate, or even harass and blackmail people; these manipulated, high-quality and realistic videos became known recently as Deepfake. There has been much work done in recent years on malware analysis and …


Prediction Of Iraqi Stock Exchange Using Optimized Based-Neural Network, Ameer Al-Haq Al-Shamery, Prof. Dr. Eman Salih Al-Shamery Dec 2021

Karbala International Journal of Modern Science

Stock market prediction is an interesting financial topic that has attracted the attention of researchers in recent years. This paper aims at improving the prediction of the Iraq-Stock-Exchange (ISX) using a developed method of feedforward Neural-Networks based on the Quasi-Newton optimization approach. The proposed method reduces the error factor depending on the Jacobian vector and Lagrange multiplier. This improvement has led to accelerating convergence during the learning process. A sample of companies listed on ISX was selected. This includes twenty-six banks for the years from 2010 to 2020. To evaluate the proposed model, the research findings are compared with …


The Detection Of Sexual Harassment And Chat Predators Using Artificial Neural Network, Noor Amer Hamzah, Ban N. Dhannoon Dec 2021

Karbala International Journal of Modern Science

The vast increase in the use of social media sites like Twitter and Facebook has led to frequent sexual harassment on the Internet, which is considered a major societal problem. This paper aims to detect sexual harassment and cyber predators at an early phase. We used deep learning, specifically bidirectional long short-term memory. Word representations that map text to real-number vectors are carefully reviewed. For the chat sexual-predator detection approach, the proposed model achieved its best result of 0.927 measured by F0.5-score, with an accuracy of 97.27%. For the comment sexual-harassment detection approach, the result was an F0.5-score of 0.925 and an accuracy of 99.12%.
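
The F0.5-score reported above is an instance of the general F-beta measure, which weights precision more heavily than recall when beta is below 1. A minimal sketch of the standard formula follows; the function name `f_beta` and the sample precision and recall values are illustrative assumptions and are not figures from the paper.

```python
def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    """F-beta score: (1 + beta^2) * P * R / (beta^2 * P + R).

    Returns 0.0 when both precision and recall are zero.
    """
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Illustrative values only: with beta = 0.5, high precision with lower recall
# scores better than the reverse.
high_precision = f_beta(precision=1.0, recall=0.5)
high_recall = f_beta(precision=0.5, recall=1.0)
print(high_precision, high_recall)
```

Favoring precision fits tasks where a false accusation is costly, which may be why an F0.5-score is commonly reported for predator detection.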