Open Access. Powered by Scholars. Published by Universities.®

Physical Sciences and Mathematics Commons

Articles 1 - 19 of 19

Full-Text Articles in Physical Sciences and Mathematics

Phishing Detection Using Natural Language Processing And Machine Learning, Apurv Mittal, Dr Daniel Engels, Harsha Kommanapalli, Ravi Sivaraman, Taifur Chowdhury Sep 2022

SMU Data Science Review

Phishing emails are a primary mode of entry for attackers into an organization. A successful phishing attempt leads to unauthorized access to sensitive information and systems. However, automatically identifying phishing emails is often difficult since many phishing emails have composite features such as body text and metadata that are nearly indistinguishable from valid emails. This paper presents a novel machine learning-based framework, the DARTH framework, that characterizes and combines multiple models, with one model for each composite feature, that enables the accurate identification of phishing emails. The framework analyses each composite feature independently utilizing a multi-faceted approach using Natural Language …


Classification Of Breast Cancer Histopathological Images Using Semi-Supervised GANs, Balaji Avvaru, Nibhrat Lohia, Sowmya Mani, Vijayasrikanth Kaniti Sep 2022

SMU Data Science Review

Breast cancer is diagnosed more frequently than skin cancer in women in the United States. Most breast cancer cases are diagnosed in women, while children and men are less likely to develop the disease. Various tissues in the breast grow uncontrollably, resulting in breast cancer. Different treatments analyze microscopic histopathology images for diagnosis that help accurately detect cancer cells. Deep learning is one of the evolving techniques to classify images where accuracy depends on the volume and quality of labeled images. This study used various pre-trained models to train the histopathological images and analyze these models to create a new …


Short Term Forecasting Of Solar Radiation, Ashwin Thota, Bradley Blanchard, Lijju Mathew, Paritosh Rai, Sid Swarupananda Sep 2022

SMU Data Science Review

This paper details how to predict solar radiation at a location for the next few hours using machine learning techniques such as Facebook’s Prophet and Amazon’s DeepAR+. Techniques such as Autoregressive Integrated Moving Average (ARIMA) and Exponential Smoothing (ES) have been used to forecast solar radiation, but they lack accuracy and are not scalable. Prophet and DeepAR+, by contrast, are scalable, accurate, and easily integrated with other machine learning techniques. This will be the first time that the combination of these techniques, along with Linear Regression, Random Forest, XGBoost, and Decision Tree, will be leveraged to forecast solar radiation for the short term. Predicting …


Using Natural Language Processing To Increase Modularity And Interpretability Of Automated Essay Evaluation And Student Feedback, Chris Roche, Nathan Deinlein, Darryl Dawkins, Faizan Javed Sep 2022

SMU Data Science Review

For English teachers and students who are dissatisfied with the one-size-fits-all approach of current Automated Essay Scoring (AES) systems, this research uses Natural Language Processing (NLP) techniques that provide a focus on configurability and interpretability. Unlike traditional AES models which are designed to provide an overall score based on pre-trained criteria, this tool allows teachers to tailor feedback based upon specific focus areas. The tool implements a user-interface that serves as a customizable rubric. Students’ essays are inputted into the tool either by the student or by the teacher via the application’s user-interface. Based on the rubric settings, the tool …


Stock Forecasts With LSTM And Web Sentiment, Michael Burgess, Faizan Javed, Nnenna Okpara, Chance Robinson Sep 2022

SMU Data Science Review

Traditional time-series techniques, such as auto-regressive and moving average models, can have difficulties when applied to stock data due to the randomness inherent to the markets. In this study, Long Short-Term Memory Recurrent Neural Networks, or LSTMs, have been applied to pricing data along with sentiment scores derived from web sources such as Twitter and other financial media outlets. The project team utilized this approach to complement the technical indicators observed at the end of each trading day for three stocks from the NASDAQ stock exchange over a 12-year span. A common benchmark to assess model performance on time series …


Predicting Insulin Pump Therapy Settings, Riccardo L. Ferraro, David Grijalva, Alex Trahan Sep 2022

SMU Data Science Review

Millions of people live with diabetes worldwide [7]. To mitigate some of the many symptoms associated with diabetes, an estimated 350,000 people in the United States rely on insulin pumps [17]. For many of these people, how effectively their insulin pump performs is the difference between sleeping through the night and life-threatening emergency treatment at a hospital. Three programmed insulin pump therapy settings govern effective insulin pump function: Basal Rate (BR), Insulin Sensitivity Factor (ISF), and Insulin-to-Carbohydrate Ratio (ICR). For many people using insulin pumps, these therapy settings are often not correct, given their physiological needs. While …


Classification Of Pixel Tracks To Improve Track Reconstruction From Proton-Proton Collisions, Kebur Fantahun, Jobin Joseph, Halle Purdom, Nibhrat Lohia Sep 2022

SMU Data Science Review

In this paper, machine learning techniques are used to reconstruct particle collision pathways. CERN (Conseil européen pour la recherche nucléaire) uses a massive underground particle collider, called the Large Hadron Collider or LHC, to produce particle collisions at extremely high speeds. There are several layers of detectors in the collider that track the pathways of particles as they collide. The data produced from collisions contains an extraneous amount of background noise, i.e., decays from known particle collisions produce fake signal. Particularly, in the first layer of the detector, the pixel tracker, there is an overwhelming amount of background noise that …


Cov-Inception: COVID-19 Detection Tool Using Chest X-Ray, Aswini Thota, Ololade Awodipe, Rashmi Patel Sep 2022

SMU Data Science Review

Since the pandemic started, researchers have been trying to find a cost-effective, fast, and reliable way to detect COVID-19 in order to keep the economy viable and running. This research details how chest X-ray radiography can be utilized to detect the infection, with potential for implementation in airports, schools, and places of business. Currently, chest imaging is not a first-line test for COVID-19 due to low diagnostic accuracy and confounding with other viral pneumonia. Different pre-trained algorithms were fine-tuned and applied to the images to train the model, and the best model obtained was a fine-tuned InceptionV3 model …


Predicting Twitch.tv Donations Using Sentiment Analysis, Alexander J. Gilbert, Jason Herbaugh, Feby Cheruvathoor, Ben Williams, Alex Tozzo Sep 2022

SMU Data Science Review

Twitch.tv streamers have a rare opportunity to receive immediate feedback from their audience through a real-time chat log that is rife with sentiment information. Tools that can help a streamer understand how they need to influence their audience can be useful in increasing the donations and subscriptions they earn. Although millions around the world stream on Twitch, only a minuscule fraction of these streamers earn a living from streaming alone. This paper aimed to provide much-needed guidance to enable more streamers to succeed. We used stream logs, known as VODs (video on demand), which can be easily accessed through Twitch’s API …


Hierarchical Neural Networks (HNN): Using TensorFlow To Build HNN, Rick Fontenot, Joseph Lazarus, Puri Rudick, Anthony Sgambellone Sep 2022

SMU Data Science Review

This research demonstrates the use of TensorFlow to build a Hierarchical Neural Network (HNN). Constructing and engineering neural networks to maximize accuracy and efficiency is an active field of research in machine learning. HNNs, along with several other applications of split networks, have been developed as recently as 2017. However, implementations thus far have required custom-built and custom-coded HNNs. The research conducted here uses TensorFlow to validate this structure by building entirely separate neural nets with logical relations between the output of one net and the inputs of the nets that are downstream. Research has shown that Hierarchical Neural Networks …


Examining Bias In Jury Selection For Criminal Trials In Dallas County, Megan Ball, Brandon Birmingham, Matt Farrow, Katherine Mitchell, Bivin Sadler, Lynne Stokes Sep 2022

SMU Data Science Review

One of the hallmarks of the American judicial system is the concept of trial by jury, and for said trial to consist of an impartial jury of one's peers. Several landmark legal cases in the history of the United States have challenged this notion of equal representation by jury, most notably Batson v. Kentucky, 476 U.S. 79 (1986). Most of the previous research, focus, and legal precedent has centered on peremptory challenges and attempts to prove whether bias was involved in excluding certain jurors from serving. Few studies, however, focus on examining challenges for cause based on self-reported biases from the …


Application Of Probabilistic Ranking Systems On Women’s Junior Division Beach Volleyball, Cameron Stewart, Michael Mazel, Bivin Sadler Sep 2022

SMU Data Science Review

Women’s beach volleyball is one of the fastest growing collegiate sports today. The increase in popularity has come with an increase in valuable scholarship opportunities across the country. With thousands of athletes to sort through, college scouts depend on websites that aggregate tournament results and rank players nationally. This project partnered with the company Volleyball Life, which is the current market leader in ranking junior beach volleyball players. Utilizing the tournament information provided by Volleyball Life, this study explored replacements for the current ranking systems, which are designed to aggregate player points from recent tournament placements. Three …


Market Segmentation And Recency Frequency Monetary Value Analysis For A Freemium Mobile Game, Satvik Ajmera, Taylor Bonar, Dylan Scott, Carol Miu, Alana Manuel Sep 2022

SMU Data Science Review

Bricks ‘N Balls is a freemium game that relies on in-app purchases and ad monetization from users to be profitable at no upfront cost to the players. This study explores how in-game data analytics and purchase data can be used to segment players. Features taken into consideration for segmentation include past purchasing habits along with the players' interactions within the missions. This study uses the Recency Frequency Monetary Value (RFM) framework to extract insights on player purchasing behavior, segment players into clusters, and predict how much users will spend in the future.
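
As a rough illustration of the RFM framework named in this abstract (not the authors' implementation), the sketch below derives recency, frequency, and monetary values per player from a purchase log. The function name `rfm_table`, the record layout, and the sample purchases are all hypothetical.

```python
from collections import defaultdict
from datetime import date


def rfm_table(purchases, as_of):
    """Compute recency/frequency/monetary values per player.

    purchases: iterable of (player_id, purchase_date, amount) tuples
               (hypothetical layout, chosen for illustration only).
    as_of:     reference date used to measure recency.
    """
    last_seen = {}
    frequency = defaultdict(int)
    monetary = defaultdict(float)
    for player_id, day, amount in purchases:
        # Track the most recent purchase date for each player.
        if player_id not in last_seen or day > last_seen[player_id]:
            last_seen[player_id] = day
        frequency[player_id] += 1
        monetary[player_id] += amount
    return {
        pid: {
            "recency_days": (as_of - last_seen[pid]).days,
            "frequency": frequency[pid],
            "monetary": round(monetary[pid], 2),
        }
        for pid in last_seen
    }


# Made-up sample data, not taken from the study.
purchases = [
    ("p1", date(2022, 1, 5), 4.99),
    ("p1", date(2022, 1, 20), 9.99),
    ("p2", date(2021, 12, 1), 0.99),
]
table = rfm_table(purchases, as_of=date(2022, 1, 31))
```

The three values per player could then serve as input features to a clustering step; which clustering method the study used is not stated in the abstract.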


Using Hospital Bed Capacity Prediction During COVID-19 To Determine Feature Importance, Helene Barrera, Justin Ehly, Blake Freeman, Chris Papesh, Brad Blanchard Jun 2022

SMU Data Science Review

The COVID-19 pandemic has exacerbated existing hospital capacity limitations in the United States, causing hospitals in certain regions to hit maximum capacity. The purpose of this study is to investigate key features of COVID-19 related admissions to help create a higher level of public understanding and help guide healthcare management professionals and governments when considering preventive measures. The introduction of preventative measures and new regulations during the pandemic have led to the generation of multiple types of models and feature selection methods in the field of Machine Learning that are increasingly complicated. This study focuses on the exploration of feature …


A Machine Learning Approach To Revenue Generation Within The Professional Hair Care Industry, Alexander K. Sepenu, Linda Eliasen Jun 2022

SMU Data Science Review

The cosmetic and beauty industry continues to grow and evolve to satisfy its patrons. In the United States, the industry is heavily science-driven, innovative, and fast-paced, suggesting that to remain productive and profitable, companies must seek smart alternatives to their current modus operandi or risk losing out on this multi-billion-dollar industry to fierce competition. In this paper, the authors seek to utilize machine learning models such as clustering and regression to improve the efficiency of current sales and customer segmentation models to help HairCo (pseudonym for confidentiality), a professional hair products manufacturer, strategize their marketing and sales efforts for revenue …


Analysis Of The Electric Power Outage Data And Prediction Of Electric Power Outage For Major Metropolitan Areas In Texas Using Machine Learning And Time Series Methods, Renfeng Wang, Venkata Leela 'Mg' Vanga, Zachary B. Zaiken, Jonathan Bennett Jun 2022

SMU Data Science Review

With growing energy usage, power outages affect millions of households. This case study focuses on gathering power outage historical data, modifying the data to attach weather attributes, and gathering ERCOT energy market conditions for Dallas-Fort Worth and Houston metropolitan areas of Texas. The transformed data is then analyzed using machine learning algorithms including, but not limited to, Regression, Random Forests and XGBoost to consider current weather and ERCOT features and predict power outage percentage for locations. The transformed data is also trained using time series models and serially correlated models including Autoregression and Vector Autoregression. This study also focuses on …


Web Page Multiclass Classification, Brian Gaither, Antonio Debouse, Catherine Huang Jun 2022

SMU Data Science Review

As the internet age evolves, the volume of content hosted on the Web is rapidly expanding. With this ever-expanding content, accurately categorizing web pages remains a challenge across many use cases. This paper proposes a variation on the text preprocessing pipeline whereby noun phrase extraction is performed first, followed by lemmatization, contraction expansion, removal of special characters, removal of extra white space, lower casing, and removal of stop words. The first step of noun phrase extraction is aimed at reducing the set of terms to those that best describe what the web pages are about …
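
The later stages of the pipeline described in this abstract can be sketched with the Python standard library alone. This is a minimal illustration under stated assumptions, not the paper's code: noun phrase extraction and lemmatization are omitted because they need an NLP library, and the `CONTRACTIONS` and `STOP_WORDS` tables are tiny hypothetical stand-ins for real lexicons.

```python
import re

# Hypothetical, deliberately tiny lookup tables for illustration only.
CONTRACTIONS = {"can't": "cannot", "won't": "will not", "it's": "it is"}
STOP_WORDS = {"a", "an", "the", "is", "it", "of", "to", "and"}


def preprocess(text):
    """Apply the later pipeline steps in the order listed in the abstract.

    Noun phrase extraction and lemmatization are NOT performed here.
    """
    # Contraction expansion (whole-token lookup, case-insensitive).
    text = text.replace("\u2019", "'")
    text = " ".join(CONTRACTIONS.get(w.lower(), w) for w in text.split())
    # Remove special characters.
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)
    # Remove extra white space.
    text = re.sub(r"\s+", " ", text).strip()
    # Lower casing.
    text = text.lower()
    # Stop word removal.
    return [token for token in text.split() if token not in STOP_WORDS]


tokens = preprocess("It's   the BEST page; can't miss!")
```

Because the contraction lookup matches whole whitespace-separated tokens, a contraction followed directly by punctuation (for example "can't,") would not be expanded in this sketch; a fuller version would tokenize before the lookup.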


Anomaly Detection Methods To Improve Supply Chain Data Quality And Operations, Ana E. Glaser, Jake P. Harrison, David Josephs Jun 2022

SMU Data Science Review

Supply chain operations drive the planning, manufacture, and distribution of billions of semiconductors a year, spanning thousands of products across many supply chain configurations. The customizations span from wafer technology to die stacking and chip feature enablement. Data quality drives efficiency in these processes and anomalies in data can be very disruptive, and at times, consequential. Developing preventative measures that automate the detection of anomalies before they reach downstream execution systems would result in significant efficiency gain for the organization. The purpose of this research is to identify an effective, actionable, and computationally efficient approach to highlight anomalies in a …


Adjusting Community Survey Data Benchmarks For External Factors, Allen Miller, Nicole M. Norelli, Robert Slater, Mingyang N. Yu Jun 2022

SMU Data Science Review

Using U.S. resident survey data from the National Community Survey in combination with public data from the U.S. Census and additional sources, a Voting Regressor Model was developed to establish fair benchmark values for city performance. These benchmarks were adjusted for characteristics the city cannot easily influence that contribute to confidence in local government, such as population size, demographics, and income. This adjustment allows for a more meaningful comparison and interpretation of survey results among individual cities. Methods explored for the benchmark adjustment included cluster analysis, anomaly detection, and a variety of regression techniques, including random forest, ridge, decision …