Open Access. Powered by Scholars. Published by Universities.®

Physical Sciences and Mathematics Commons

Articles 1 - 13 of 13

Full-Text Articles in Physical Sciences and Mathematics

Reading PDFs Using Adversarially Trained Convolutional Neural Network Based Optical Character Recognition, Michael B. Brewer, Michael Catalano, Yat Leung, David Stroud Dec 2020

SMU Data Science Review

A common problem that has plagued companies for years is digitizing documents and making use of the data contained within. Optical Character Recognition (OCR) technology has flooded the market, but companies still face challenges productionizing these solutions at scale. Although these technologies can identify and recognize the text on the page, they fail to classify the data to the appropriate datatype in an automated system that uses OCR technology as its data mining process. The research contained in this paper presents a novel framework for the identification of datapoints on check stub images by utilizing generative adversarial networks (GANs) to …


Topic Modeling To Understand Technology Talent, Chad Madding, Allen Ansari, Chris Ballenger, Aswini Thota Sep 2020

SMU Data Science Review

Attracting technology talent in today’s hiring climate is more complicated than ever. Recruiting for technology talent in non-technology industries is even more challenging. This intense hiring landscape is motivating companies not only to attract the right talent but also to create a culture that can retain and grow that talent. In this paper, we develop algorithms and present insights that use data provided in reviews to glean information employers can use to address or even change their priorities to meet the demands of an ever-changing job market. The core of our research is to investigate and attribute the role of …


Cover Song Identification - A Novel Stem-Based Approach To Improve Song-To-Song Similarity Measurements, Lavonnia Newman, Dhyan Shah, Chandler Vaughn, Faizan Javed Sep 2020

SMU Data Science Review

Music is incorporated into our daily lives, whether intentionally or unintentionally. It evokes responses and behavior so much so that there is an entire field of study dedicated to the psychology of music. Music creates the mood for dancing, exercising, creative thought, or even relaxation. It is a powerful tool that can be used in various venues and through advertisements to influence and guide human reactions. Music is also often "borrowed" in the industry today. The practices of sampling and remixing music in the digital age have made cover song identification an active area of research. While most of this research is focused …


Time Series Analysis Of Offshore Buoy Light Detection And Ranging (LIDAR) Windspeed Data, Aditya Garapati, Charles J. Henderson, Carl Walenciak, Brian T. Waite Sep 2020

SMU Data Science Review

In this paper, modeling techniques for the forecasting of wind speed using historical values observed by Light Detection and Ranging (LIDAR) sensors in an offshore context are described. Both univariate time series and multivariate time series modeling techniques leveraging meteorological data collected simultaneously with the LIDAR data are evaluated for potential contributions to predictive ability. Accurate and timely ability to predict wind values is essential to the effective integration of wind power into existing power grid systems. It allows for both the management of rapid ramp-up / down of base production capacity due to highly variable wind power inputs and …


Toxic Language Detection Using Robust Filters, Deepti Kunupudi, Shantanu Godbole, Pankaj Kumar, Suhas Pai Sep 2020

SMU Data Science Review

Social networks sometimes become a medium for threats, insults, and other types of cyberbullying. A large number of people are involved in online social networks. Hence, the protection of network users from anti-social behavior is a critical activity [19]. One of the significant tasks of such activity is the detection of toxic language. Abusive or toxic language in user-generated online content has become an issue of increasing importance in recent years. Most current commercial methods use blacklists and regular expressions; however, these measures fall short when contending with more subtle, lesser-known examples of hate speech, profanity, or swearing [6]. Abusive language classification has …


Reducing Age Bias In Machine Learning: An Algorithmic Approach, Adriana Solange Garcia De Alford, Steven K. Hayden, Nicole Wittlin, Amy Atwood Sep 2020

SMU Data Science Review

In this paper, we study the prevalence of bias in machine learning; we explore the life cycle phases where bias is potentially introduced into a machine learning model; and lastly, we present how adversarial learning can be leveraged to measure unwanted bias and unfair behavior from a machine learning algorithm. This study focuses particularly on the topics of age bias in predicting employee attrition and presents a practical approach for how adversarial learning can be successful in mitigating age bias. To measure bias, we calculate group fairness metrics across five-year age groups and evaluate fairness between a baseline predictive model …
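As a rough illustration of what a group fairness metric over five-year age groups can look like, the sketch below computes the positive-prediction rate per age bin and the gap between the highest and lowest rate (often called a demographic parity difference). The abstract does not say which metrics the authors used, so the metric choice, the function names, and the toy data are all assumptions made for illustration.

```python
from collections import defaultdict


def selection_rate_by_age_group(ages, predictions, width=5):
    """Return the share of positive predictions within each age bin.

    Bins are keyed by their lower bound, e.g. 20 covers ages 20-24.
    The bin width of 5 mirrors the five-year groups in the abstract.
    """
    totals = defaultdict(int)
    positives = defaultdict(int)
    for age, pred in zip(ages, predictions):
        start = (age // width) * width
        totals[start] += 1
        positives[start] += pred
    return {group: positives[group] / totals[group] for group in sorted(totals)}


def parity_gap(rates):
    """Largest difference in selection rate between any two groups."""
    return max(rates.values()) - min(rates.values())


# Hypothetical data: 1 means the model predicts attrition for that employee.
ages = [22, 24, 31, 33, 34, 58]
predictions = [1, 0, 1, 1, 0, 1]

rates = selection_rate_by_age_group(ages, predictions)
gap = parity_gap(rates)
```

A gap near zero would suggest that the model flags each age group at a similar rate; a baseline model and a bias-mitigated model could be compared on this number.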


Forecasting Spare Parts Sporadic Demand Using Traditional Methods And Machine Learning - A Comparative Study, Bhuvana Adur Kannan, Ganesh Kodi, Oscar Padilla, Dough Gray, Barry C. Smith Sep 2020

SMU Data Science Review

Sporadic demand presents a particular challenge to traditional time series forecasting methods. In the past 50 years, there have been developments, such as the Croston Model [3], which have improved forecast performance. With the rise of Machine Learning (ML), there is abundant research on applying ML algorithms to predict sporadic demand [8][12][9]. However, most existing research has analyzed this problem from the demand side [17]. In this paper, we tackle this predictive analytics challenge from the supply side. We perform a comparative analysis utilizing a spare parts demand dataset from an Original Equipment Manufacturer (OEM). Since traditional measurements …
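For readers unfamiliar with the Croston Model mentioned above, a minimal sketch of the classic method follows. It smooths the nonzero demand sizes and the intervals between them separately, then forecasts demand per period as their ratio. The initialization (first nonzero demand and the periods elapsed up to it), the smoothing constant, and the toy series are assumptions for illustration and are not taken from the paper.

```python
def croston(demand, alpha=0.1):
    """Classic Croston forecast of mean demand per period.

    Returns None if the series contains no nonzero demand.
    """
    size = None       # smoothed nonzero demand size
    interval = None   # smoothed number of periods between nonzero demands
    periods = 1       # periods elapsed since the last nonzero demand
    for d in demand:
        if d > 0:
            if size is None:
                # Assumed initialization: first observed size and interval.
                size, interval = float(d), float(periods)
            else:
                size = alpha * d + (1 - alpha) * size
                interval = alpha * periods + (1 - alpha) * interval
            periods = 1
        else:
            periods += 1
    return None if size is None else size / interval


# Hypothetical sporadic series: a demand of 4 units every third period.
forecast = croston([0, 0, 4, 0, 0, 4], alpha=0.5)
```

Estimates are only updated in periods with nonzero demand, which is why the method copes with long runs of zeros better than plain exponential smoothing.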


Floor Regularization And Investigation Of Transfer Learning Through Sharing Of Probability Distribution Parameters, Daniel Byrne, Stacey Smith, Joanna Duran, John Santerre Sep 2020

SMU Data Science Review

In this work, we introduce a simple new regularization technique, aptly named Floor, which drops low-weight connections on every forward pass whenever they fall below a specified event horizon threshold. We compare regular Dropout and Floor side by side on identical network architectures. We report similar or improved regularization with the Floor algorithm, both versus regular Dropout and in concert with it.

In this paper we also describe our research into transfer learning by sharing of probability distribution parameters in which we investigated methods of transferring Gaussian prior parameters derived from the …
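The Floor idea described in this abstract can be sketched for a single dense layer as follows. This is one reading of the abstract, not the authors' implementation: it assumes "low weight" means small absolute value, that dropped connections simply contribute nothing to the output during that forward pass, and that the stored weights are left unchanged. The function name and the example numbers are hypothetical.

```python
def floor_forward(x, weights, bias, threshold):
    """Dense-layer forward pass that skips connections with |weight| < threshold.

    x:       list of input activations, length n_in
    weights: n_in x n_out nested list
    bias:    list of length n_out
    """
    outputs = []
    for j in range(len(bias)):
        total = bias[j]
        for i, x_i in enumerate(x):
            w = weights[i][j]
            # Assumed Floor rule: a connection below the threshold is dropped.
            if abs(w) >= threshold:
                total += x_i * w
        outputs.append(total)
    return outputs


# Hypothetical layer with two inputs and two outputs.
x = [1.0, 2.0]
weights = [[0.5, 0.01],
           [-0.02, -0.8]]
bias = [0.0, 0.0]

out = floor_forward(x, weights, bias, threshold=0.05)
```

Unlike Dropout, which removes units at random, this rule is deterministic given the current weights, so the set of dropped connections changes only as training moves weights across the threshold.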


The Transcript Profile Changes With Developmental Maturation Of Fetal Lung Type 2 Cells: An Analysis Of RNAseq Data, Heber C. Nielsen, Volodymyr Orlov, Rebecca Holsapple, Monnie Mcgee Aug 2020

SMU Data Science Review

In this paper, we utilize next-generation sequencing (NGS) data from the LungMap project to identify and characterize the developmental RNA transcriptome in alveolar epithelial type II cells of embryonic mouse lungs of gestational ages embryonic days 16 (E16) and 18 (E18). Late gestation lung cellular maturation is necessary for survival at birth. Using R and the BioConductor packages for RNAseq analysis, we analyze changes in the mouse lung RNA transcriptome as this maturation process takes place. We particularly identify the cluster of genes whose expression changes markedly between immature (E16) and mature (E18) lungs which can be used to define …


Forecasting Power Consumption In Pennsylvania During The COVID-19 Pandemic: A SARIMAX Model With External COVID-19 And Unemployment Variables, Jackson Au, Javier Saldaña Jr., Ben Spanswick, John Santerre Aug 2020

SMU Data Science Review

In this paper, we present how electrical consumption can reveal insight into the spread of the novel COVID-19 pandemic. We analyze electrical power consumption provided by PPL Electric Utilities, the Department of Labor’s unemployment claims, and the COVID-19 cases/deaths for the State of Pennsylvania to study the impact of the pandemic on the infrastructure. Using a SARIMA model as our benchmark, we analyze the use of a SARIMAX model to forecast the power consumption in Pennsylvania 14 days ahead. Our work quantifies and illuminates the effect that the strict legislation passed to minimize the spread of COVID-19 had on power consumption. …


Compressed DNA Representation For Efficient AMR Classification, John Partee, Robert Hazell, Anjli Solsi, John Santerre Aug 2020

SMU Data Science Review

In this paper, we explore a representation methodology for the compression of DNA isolates. Using lossless string compression via tokenization of frequently repeated segments of DNA, we reduce the length of the isolates to be counted as k-mers for classification. With this new representation, we apply a previously established feature sampling method to dramatically reduce the feature space. In understanding the genetic diversity, we also look at conserving biological function across these spaces. Using a random forest model we were able to predict the resistance or susceptibility of bacteria with 85-90% accuracy, with a 30-50% reduction in overall isolate length, …
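To make the pipeline in this abstract concrete, the toy sketch below replaces a repeated DNA segment with a single token, checks that the substitution is reversible, and then counts k-mers on the shortened string. The token table, the token symbol, the sequence, and the value of k are all hypothetical; the abstract does not describe how the authors chose segments or tokens.

```python
from collections import Counter


def tokenize(seq, table):
    """Replace each listed segment with its token, longest segments first.

    Tokens must be symbols outside the DNA alphabet so decoding is unambiguous.
    """
    for segment, token in sorted(table.items(), key=lambda kv: -len(kv[0])):
        seq = seq.replace(segment, token)
    return seq


def detokenize(seq, table):
    """Invert tokenize by expanding every token back to its segment."""
    for segment, token in table.items():
        seq = seq.replace(token, segment)
    return seq


def kmer_counts(seq, k):
    """Count every overlapping substring of length k."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


# Hypothetical isolate fragment and token table.
sequence = "ACGTACGTTTACGT"
table = {"ACGT": "X"}

compressed = tokenize(sequence, table)
counts = kmer_counts(compressed, 2)
```

In this toy case the string shrinks from 14 to 5 symbols, so far fewer k-mer windows need to be counted, which is the effect the abstract reports at a larger scale.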


Spoken Language Recognition On Open-Source Datasets, Brady Arendale, Samira Zarandioon, Ryan Goodwin, Douglas Reynolds Aug 2020

SMU Data Science Review

The field of speaker and language recognition is constantly being researched and developed, but much of this research is done on private or expensive datasets, making the field more inaccessible than many other areas of machine learning. In addition, many papers make performance claims without comparing their models to other recent research. With the recent development of public multilingual speech corpora such as Mozilla's Common Voice as well as several single-language corpora, we now have the resources to attempt to address both of these problems. We construct an eight-language dataset from Common Voice and a Google Bengali corpus as well …


Predicting Attrition - A Driver For Creating Value, Realizing Strategy, And Refining Key HR Processes, Kevin Mendonsa, Maureen Stolberg, Vivek Viswanathan, Scott Crum Aug 2020

SMU Data Science Review

Talent is the most important asset for every organization's success. While attrition (or churn) and turnover can refer to both employees and customers, this paper will focus on employee attrition only. Many organizations accept attrition as an inevitable cost of doing business and do nothing to adopt or implement mitigating strategies to combat it. World-class companies, on the other hand, take deliberate measures to understand, control, and mitigate attrition (turnover) at every stage. Unmitigated attrition can have a devastating effect on an organization's bottom line and market value. In addition, the “invisible” costs of low employee morale, reduced employee …