Open Access. Powered by Scholars. Published by Universities.®

Physical Sciences and Mathematics Commons

Full-Text Articles in Physical Sciences and Mathematics

Overcoming Small Data Limitations In Heart Disease Prediction By Using Surrogate Data, Alfeo Sabay, Laurie Harris, Vivek Bejugama, Karen Jaceldo-Siegl Aug 2018

SMU Data Science Review

In this paper, we present a heart disease prediction use case showing how synthetic data can be used to address privacy concerns and overcome constraints inherent in small medical research data sets. While advanced machine learning algorithms, such as neural network models, can be implemented to improve prediction accuracy, these require very large data sets, which are often not available in medical or clinical research. We examine the use of surrogate data sets comprised of synthetic observations for modeling heart disease prediction. We generate surrogate data, based on the characteristics of original observations, and compare prediction accuracy results achieved from …


Fake News Detection: A Deep Learning Approach, Aswini Thota, Priyanka Tilak, Simrat Ahluwalia, Nibrat Lohia Aug 2018

SMU Data Science Review

Fake news is defined as a made-up story with an intention to deceive or to mislead. In this paper we present the solution to the task of fake news detection by using Deep Learning architectures. Gartner research [1] predicts that “By 2022, most people in mature economies will consume more false information than true information”. The exponential increase in production and distribution of inaccurate news presents an immediate need for automatically tagging and detecting such twisted news articles. However, automated detection of fake news is a hard task to accomplish as it requires the model to understand nuances in natural …


Random Forest Vs Logistic Regression: Binary Classification For Heterogeneous Datasets, Kaitlin Kirasich, Trace Smith, Bivin Sadler Aug 2018

SMU Data Science Review

Selecting a learning algorithm to implement for a particular application on the basis of performance remains an ad-hoc process using fundamental benchmarks such as evaluating a classifier’s overall loss function and misclassification metrics. In this paper we address the difficulty of model selection by comparing the overall classification performance of random forest and logistic regression on datasets comprised of various underlying structures: (1) increasing the variance in the explanatory and noise variables, (2) increasing the number of noise variables, (3) increasing the number of explanatory variables, (4) increasing the number of observations. We developed a model evaluation tool capable …


Predicting National Basketball Association Success: A Machine Learning Approach, Adarsh Kannan, Brian Kolovich, Brandon Lawrence, Sohail Rafiqi Aug 2018

SMU Data Science Review

In this paper, we present a machine learning based approach to projecting the success of National Basketball Association (NBA) draft prospects. With the proliferation of data, analytics have increasingly become a critical component in the assessment of professional and collegiate basketball players. We leverage player biometric data, college statistics, draft selection order, and positional breakdown as modelling features in our prediction algorithms. We found that a player's draft pick and their college statistics are the best predictors of their longevity in the National Basketball Association.


Minimizing The Perceived Financial Burden Due To Cancer, Hassan Azhar, Zoheb Allam, Gino Varghese, Daniel W. Engels, Sajiny John Aug 2018

SMU Data Science Review

In this paper, we present a regression model that predicts the perceived financial burden that a cancer patient experiences in the treatment and management of the disease. Cancer patients do not fully understand the burden associated with the cost of cancer, and their lack of understanding can increase the difficulties associated with living with the disease, in particular coping with the cost. The relationship between demographic characteristics and financial burden was examined in order to better understand the characteristics of a cancer patient and their burden, while all subsets regression was used to determine the best predictors of financial burden. Age, …


Yelp’s Review Filtering Algorithm, Yao Yao, Ivelin Angelov, Jack Rasmus-Vorrath, Mooyoung Lee, Daniel W. Engels Aug 2018

SMU Data Science Review

In this paper, we present an analysis of features influencing Yelp's proprietary review filtering algorithm. Classifying or misclassifying reviews as recommended or non-recommended affects average ratings, consumer decisions, and ultimately, business revenue. Our analysis involves systematically sampling and scraping Yelp restaurant reviews. Features are extracted from review metadata and engineered from metrics and scores generated using text classifiers and sentiment analysis. The coefficients of a multivariate logistic regression model were interpreted as quantifications of the relative importance of features in classifying reviews as recommended or non-recommended. The model classified review recommendations with an accuracy of 78%. We found that reviews …
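
The idea of reading logistic regression coefficients as relative feature importance can be sketched as follows. This is a minimal illustration and not the authors' model: the two feature names (`review_length_z`, `noise_z`), the toy data, and the plain gradient-descent fit are all assumptions made for the example. The features are assumed to be roughly standardized so that coefficient magnitudes are comparable.

```python
import math

def sigmoid(z):
    """Logistic function mapping a real score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the mean log-loss."""
    n, n_features = len(X), len(X[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for xi, yi in zip(X, y):
            p = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi)))
            err = p - yi  # derivative of the log-loss w.r.t. the score
            for j in range(n_features):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

# Hypothetical, roughly standardized features for eight reviews:
# column 0 is informative, column 1 is unrelated to the label.
feature_names = ["review_length_z", "noise_z"]
X = [[-1.5, 1.0], [-1.0, -1.0], [-0.5, -1.0], [-0.2, 1.0],
     [0.2, 1.0], [0.5, -1.0], [1.0, -1.0], [1.5, 1.0]]
y = [0, 0, 0, 1, 0, 1, 1, 1]  # 1 = recommended, 0 = non-recommended

weights, bias = fit_logistic(X, y)

# Rank features by absolute coefficient; exp(coefficient) is the odds ratio
# for a one-unit increase in that feature.
ranked = sorted(zip(feature_names, weights), key=lambda kv: abs(kv[1]), reverse=True)
for name, coef in ranked:
    print(f"{name}: coefficient={coef:.3f}, odds ratio={math.exp(coef):.3f}")
```

Ranking by absolute coefficient is only meaningful when the features share a common scale, which is why the sketch assumes standardized inputs.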


Cryptocurrency Price Prediction Using Tweet Volumes And Sentiment Analysis, Jethin Abraham, Daniel Higdon, John Nelson, Juan Ibarra Aug 2018

SMU Data Science Review

In this paper, we present a method for predicting changes in Bitcoin and Ethereum prices utilizing Twitter data and Google Trends data. Bitcoin and Ethereum, the two largest cryptocurrencies in terms of market capitalization, represent over $160 billion in combined value. However, both Bitcoin and Ethereum have experienced significant price swings on both daily and long-term valuations. Twitter is increasingly used as a news source influencing purchase decisions by informing users of the currency and its increasing popularity. As a result, quickly understanding the impact of tweets on price direction can provide a purchasing and selling advantage to …


Goalie Analytics: Statistical Evaluation Of Context-Specific Goalie Performance Measures In The National Hockey League, Marc Naples, Logan Gage, Amy Nussbaum Jul 2018

SMU Data Science Review

In this paper, we attempt to improve upon the classic formulation of save percentage in the NHL by controlling for the context of the shots and by using measures other than save percentage. In particular, we find save percentage to be both a weakly repeatable skill and a weak predictor of future performance, and we seek other goalie performance calculations that are more robust. To do so, we use three primary tests of intra-season consistency, intra-season predictability, and inter-season consistency, and extend the analysis to disentangle team effects on goalie statistics. We find that there are multiple ways to improve upon classic save …
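
For readers unfamiliar with the metric, classic save percentage is saves divided by shots faced. One plausible way to probe how repeatable it is within a season is a split-half correlation across goalies; the sketch below assumes that approach, which is not necessarily one of the paper's three tests, and the goalie totals are made-up illustrative numbers.

```python
from math import sqrt

def save_percentage(saves, shots_against):
    """Classic save percentage: fraction of shots faced that were stopped."""
    if shots_against == 0:
        raise ValueError("goalie faced no shots")
    return saves / shots_against

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical (saves, shots) totals for four goalies in each half of a season.
first_half = [(450, 500), (460, 500), (455, 500), (440, 500)]
second_half = [(455, 500), (450, 500), (460, 500), (445, 500)]

first_pct = [save_percentage(s, n) for s, n in first_half]
second_pct = [save_percentage(s, n) for s, n in second_half]

# A correlation near 1 would suggest a repeatable skill; lower values suggest noise.
r = pearson(first_pct, second_pct)
print(f"split-half correlation of save percentage: {r:.3f}")
```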


Fuel Flow Reduction Impact Analysis Of Drag Reducing Film Applied To Aircraft Wings, Damon Resnick, Chris Donlan, Nimish Sakalle, Cody Pinkerman Jul 2018

SMU Data Science Review

In this paper, we present an analysis of flight data in order to determine whether the application of the Edge Aerodynamix Conformal Vortex Generator (CVG), applied to the wings of aircraft, reduces fuel flow during cruising conditions of flight. The CVG is a special treatment and film applied to the wings of an aircraft to protect the wings and reduce the non-laminar flow of air around the wings during flight. It is thought that, by reducing the non-laminar flow or vortices around and directly behind the wings, an aircraft will move more smoothly through the air and provide a …


Data Center Application Security: Lateral Movement Detection Of Malware Using Behavioral Models, Harinder Pal Singh Bhasin, Elizabeth Ramsdell, Albert Alva, Rajiv Sreedhar, Medha Bhadkamkar Jul 2018

SMU Data Science Review

Data center security traditionally is implemented at the external network access points, i.e., the perimeter of the data center network, and focuses on preventing malicious software from entering the data center. However, these defenses do not cover all possible entry points for malicious software, and they are not 100% effective at preventing infiltration through the connection points. Therefore, security is required within the data center to detect malicious software activity including its lateral movement within the data center. In this paper, we present a machine learning-based network traffic analysis approach to detect the lateral movement of malicious software within the …


Predictions Generated From A Simulation Engine For Gene Expression Micro-Arrays For Use In Research Laboratories, Gopinath R. Mavankal, John Blevins, Dominique Edwards, Monnie McGee, Andrew Hardin Jul 2018

SMU Data Science Review

In this paper we introduce the technical components, the biology and data science involved in the use of microarray technology in biological and clinical research. We discuss how laborious experimental protocols involved in obtaining this data used in laboratories could benefit from using simulations of the data. We discuss the approach used in the simulation engine from [7]. We use this simulation engine to generate a prediction tool in Power BI, a Microsoft business intelligence tool for analytics and data visualization [22]. This tool could be used in any laboratory using micro-arrays to improve experimental design by comparing how predicted …


Data Scientist’s Analysis Toolbox: Comparison Of Python, R, And SAS Performance, Jim Brittain, Mariana Cendon, Jennifer Nizzi, John Pleis Jul 2018

SMU Data Science Review

A quantitative analysis will be performed on experiments utilizing three different tools used for Data Science. The analysis will include replication of analysis along with comparisons of code length, output, and results. Qualitative data will supplement the quantitative findings. The conclusion will provide data-supported guidance on the correct tool to use for common situations in the field of Data Science.


Predicting Game Day Outcomes In National Football League Games, Josh Klein, Anna Frowein, Chris Irwin Jul 2018

SMU Data Science Review

In this paper, we present a model for predicting the game day outcomes of National Football League games. Three of the most popular sources for game day predictions are analyzed for comparison. Player data and outcomes from previous games are used, but we also incorporate several weather factors into our models. Over 1,700 games were incorporated, and three separate models were created using simple regression, principal component analysis, and a recursive model. We also discuss the ethics of individuals with this knowledge using data science techniques to gain an advantage over a population lacking this specialized training.


Supervised Machine Learning Bot Detection Techniques To Identify Social Twitter Bots, Phillip George Efthimion, Scott Payne, Nicholas Proferes Jul 2018

SMU Data Science Review

In this paper, we present novel bot detection algorithms to identify Twitter bot accounts and to determine their prevalence in current online discourse. On social media, bots are ubiquitous. Bot accounts are problematic because they can manipulate information, spread misinformation, and promote unverified information, which can adversely affect public opinion on various topics, such as product sales and political campaigns. Detecting bot activity is complex because many bots are actively trying to avoid detection. We present a novel, complex machine learning algorithm utilizing a range of features including: length of user names, reposting rate, temporal patterns, sentiment expression, followers-to-friends ratio, …


Cryptovisor: A Cryptocurrency Advisor Tool, Matthew Baldree, Paul Widhalm, Brandon Hill, Matteo Ortisi Jul 2018

SMU Data Science Review

In this paper, we present a tool that provides trading recommendations for cryptocurrency using a stochastic gradient boost classifier trained from a model labeled by technical indicators. The cryptocurrency market is volatile due to its infancy and limited size, making it difficult for investors to know when to enter, exit, or stay in the market. Therefore, a tool is needed to provide investment recommendations for investors. We developed such a tool to support one cryptocurrency, Bitcoin, based on its historical price and volume data to recommend a trading decision for today or past days. This tool is 95.50% accurate with …


Case Study: Using Crime Data And Open Source Data To Design A Police Patrol Area, Brent Allen Jul 2018

SMU Data Science Review

This case study examines how to use existing crime data augmented with open source data to design a patrol area. We used a demand signal of "calls for service" rather than reports that summarize calls for service. Additionally, we augmented our existing data with traffic data from Google Maps. Traffic delays did not correspond to traffic incidents reported in the area examined. These data were plotted geographically to aid in the determination of the new patrol area. The new patrol area was created around natural geographic boundaries, the density of calls for service, and police operational experience.


Machine Learning To Predict College Course Success, Anthony R.Y. Dalton, Justin Beer, Sriharshasai Kommanapalli, James S. Lanich Ph.D. Jul 2018

SMU Data Science Review

In this paper, we present an analysis of the predictive ability of machine learning on the success of students in college courses at a California community college. The California Legislature passed Assembly Bill 705 in order to place students in non-remedial coursework, based on high school transcripts, to increase college completion. We utilize machine learning methods on de-identified student high school transcript data to create predictive algorithms on whether or not the student will be successful in college-level English and Mathematics coursework. To satisfy the bill’s requirements, we first use exploratory data analysis on applicable transcript variables. Then we use …


Seismology And Volcanology: Exploration Of Volcanoes, Long-Periods, And Machines - Predicting Volcano Eruption Using Signature Seismic Data, Kyle Killion, Rajeev Kumar, Celia J. Taylor, Gabriele Morra Apr 2018

SMU Data Science Review

Seismo-volcanologists manually isolate and verify long-period waves and Strombolian events using seismic and acoustic waves. This is a very detailed and time-consuming process. This project employs machine learning algorithms to find models which locate long-period and Strombolian signatures automatically. By comparing the timing of seismic and acoustic waves, clustering techniques effectively isolated big volcanic events and aided in the further refinement of techniques to capture the hundreds of typical daily Strombolian events at Villarrica volcano. Within the research, we utilized the unsupervised machine learning environment to locate a group of signatures for customizing machine learned long-period signature …


Comparative Study Of Deep Learning Models For Network Intrusion Detection, Brian Lee, Sandhya Amaresh, Clifford Green, Daniel Engels Apr 2018

SMU Data Science Review

In this paper, we present a comparative evaluation of deep learning approaches to network intrusion detection. A Network Intrusion Detection System (NIDS) is a critical component of every Internet-connected system due to likely attacks from both external and internal sources. A NIDS is used to detect network-borne attacks such as Denial of Service (DoS) attacks, malware replication, and intruders that are operating within the system. Multiple deep learning approaches have been proposed for intrusion detection systems. We evaluate three models: a vanilla deep neural net (DNN), a self-taught learning (STL) approach, and a Recurrent Neural Network (RNN) based Long Short …


Walknet: A Deep Learning Approach To Improving Sidewalk Quality And Accessibility, Andrew Abbott, Alex Deshowitz, Dennis Murray, Eric C. Larson Apr 2018

SMU Data Science Review

This paper proposes a framework for optimizing allocation of infrastructure spending on sidewalk improvement and allowing planners to focus their budgets on the areas in the most need. In this research, we identify curb ramps from Google Street View images using traditional machine learning and deep learning methods. Our convolutional neural network approach achieved 83% accuracy and a high level of precision when classifying curb cuts. We found that as the model received more data, the accuracy increased, so the continued collection of crowdsourced labeling of curb cuts should increase the model’s classification power. We further investigated a model …


Cognitive Virtual Admissions Counselor, Kumar Raja Guvindan Raju, Cory Adams, Raghuram Srinivas Apr 2018

SMU Data Science Review

In this paper, we present a cognitive virtual admissions counselor for the Master of Science in Data Science program at Southern Methodist University. The virtual admissions counselor is a system capable of providing potential students accurate information at the time that they want to know it. After the evaluation of multiple technologies, Amazon’s LEX was selected to serve as the core technology for the virtual counselor chatbot. Student surveys were leveraged to collect and generate training data to deploy the natural language capability. The cognitive virtual admissions counselor platform is currently capable of providing an end-to-end conversational dialog to …


Comparative Study: Reducing Cost To Manage Accessibility With Existing Data, Claire Chu, Bill Kerneckel, Eric C. Larson, Nathan Mowat, Christopher Woodard Apr 2018

SMU Data Science Review

“Project Sidewalk” is an existing research effort that focuses on mapping accessibility issues for handicapped persons to efficiently plan wheelchair and mobile scooter friendly routes around Washington D.C. As supporters of this project, we utilized the data “Project Sidewalk” collected and used it to confirm predictions about where problem sidewalks exist based on real estate and crime data. We present a study that identifies correlations found between accessibility data and crime and housing statistics in the Washington D.C. metropolitan area. We identify the key reasons for increased accessibility and the issues with the current infrastructure management system. After a thorough …


Blockchain In Payment Card Systems, Darlene Godfrey-Welch, Remy Lagrois, Jared Law, Russell Scott Anderwald, Daniel W. Engels Apr 2018

SMU Data Science Review

Payment cards (e.g., credit and debit cards) are the most frequent form of payment in use today. A payment card transaction entails many verification information exchanges between the cardholder, merchant, issuing bank, a merchant bank, and third-party payment card processors. Today, a payment transaction is often recorded to multiple ledgers. Merchants incur fees for both accepting and processing payment cards. The payment card industry is in dire need of technology which removes the need for third-party verification and records transaction details to a single tamper-resistant digital ledger. The private blockchain is that technology. Private blockchain provides a linked …


A Dynamic Hierarchical Network Topology To Reduce Interference In User-Rich LANs, Ian Johnson, Erik Gabrielsen, Danh Nguyen, Gavin Pham, Alex Saladna, Travis Siems Jan 2018

SMU Journal of Undergraduate Research

In this paper we present a greedy approach to create hierarchical network topologies for throughput optimization in single access point networks. By minimizing electromagnetic interference, we optimize throughput by creating topologies where the probability of a collision occurring is low. We evaluate a series of greedy topology algorithms based on the average throughput of the resulting network. We conclude that hierarchical network topologies generated with greedy algorithms significantly outperform networks with simple star topologies by up to 75%.