Open Access. Powered by Scholars. Published by Universities.®

Data Science Commons

400 Full-Text Articles 964 Authors 35,895 Downloads 103 Institutions

All Articles in Data Science

Count Data Regression Analysis: Concepts, Overdispersion Detection, Zero-Inflation Identification, And Applications With R, Luiz Paulo Fávero, Rafael de Freitas Souza, Patrícia Belfiore, Hamilton Luiz Corrêa, Michel F. C. Haddad 2021 University of São Paulo

Practical Assessment, Research, and Evaluation

This paper proposes a straightforward model selection approach that indicates the most suitable count regression model based on relevant data characteristics. The proposed selection approach includes four of the most popular count regression models (i.e., Poisson, negative binomial, and their respective zero-inflated frameworks). Moreover, it addresses two of the most relevant problems commonly found in real-world count datasets, namely overdispersion and zero-inflation. The entire selection approach may be performed using the programming language R, and all commands used throughout the paper are available for practical purposes. It is worth mentioning that count regression models are still not widespread within ...
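As a rough illustration of the two data characteristics named above, the following sketch computes descriptive diagnostics for a sample of counts. It is written in Python rather than R, and the function name `count_diagnostics`, the variance-to-mean ratio, and the comparison against the Poisson zero probability are assumptions chosen for illustration; they are not the commands or tests used in the paper.

```python
import math
from statistics import mean, variance


def count_diagnostics(counts):
    """Return simple descriptive diagnostics for non-negative counts.

    Hypothetical helper for illustration only; the paper's own R
    commands and formal tests are not reproduced here.
    """
    m = mean(counts)
    v = variance(counts)  # sample variance (n - 1 denominator)
    observed_zero_share = sum(1 for c in counts if c == 0) / len(counts)
    # Under a Poisson model with mean m, P(Y = 0) = exp(-m).
    poisson_zero_share = math.exp(-m)
    return {
        "mean": m,
        "variance": v,
        # A ratio well above 1 hints at overdispersion.
        "dispersion_ratio": v / m if m > 0 else float("nan"),
        "observed_zero_share": observed_zero_share,
        "poisson_zero_share": poisson_zero_share,
    }


# Made-up sample, not data from the paper.
sample = [0, 0, 0, 0, 0, 1, 1, 2, 3, 8]
diag = count_diagnostics(sample)
print(diag)
```

A variance much larger than the mean, or far more zeros than a Poisson model with the same mean would predict, are only informal hints; a formal choice between the four models would still rest on proper statistical tests.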


Reimagining The Archive For Computational Analysis At Scale, Jamie Rogers 2021 Florida International University

Works of the FIU Libraries

This presentation was part of a three-segment panel discussion sponsored by IS&T, the Society for Imaging Science and Technology, titled "OCR and Text Recognition: Workflows, Trends, and New Applications." This segment covers ways in which we have re-conceptualized archive materials as computationally useful data as well as the value of utilizing data at scale to impact research possibilities. We have been able to accomplish this through an ongoing project "dLOC as Data: A Thematic Approach to Caribbean Newspapers," a collaborative initiative between the Digital Library of the Caribbean, University of Florida, and Florida International University.


A Configurable Social Network For Running IRB-Approved Experiments, Mihovil Mandic 2021 Dartmouth College

Dartmouth College Undergraduate Theses

Our world has never been more connected, and the size of the social media landscape draws a great deal of attention from academia. However, social networks are also a growing challenge for the Institutional Review Boards concerned with the subjects’ privacy. These networks contain a monumental variety of personal information of almost 4 billion people, allow for precise social profiling, and serve as a primary news source for many users. They are perfect environments for influence operations that are becoming difficult to defend against. Motivated to study online social influence via IRB-approved experiments, we designed and implemented a flexible, scalable ...


Learn Biologically Meaningful Representation With Transfer Learning, Di He 2021 City University of New York (CUNY)

Dissertations, Theses, and Capstone Projects

Machine learning has made significant contributions to bioinformatics and computational biology. In particular, supervised learning approaches have been widely used in solving problems such as biomarker identification, drug response prediction, and so on. However, because of the limited availability of comprehensively labeled and clean data, constructing predictive models in supervised settings is not always desirable or possible, especially when using data-hungry, red-hot learning paradigms such as deep learning methods. Hence, there is an urgent need to develop new approaches that could leverage more readily available unlabeled data in driving successful machine learning applications in this ...


Fine-Grained Detection Of Hate Speech Using BERToxic, Yakoob Khan 2021 Dartmouth College

Dartmouth College Undergraduate Theses

This thesis describes our approach towards the fine-grained detection of hate speech using deep learning. We leverage the transformer encoder architecture to propose BERToxic, a system that fine-tunes a pre-trained BERT model to locate toxic text spans in a given text and utilizes additional post-processing steps to refine the prediction boundaries. The post-processing steps involve (1) labeling character offsets between consecutive toxic tokens as toxic and (2) assigning a toxic label to words that have at least one token labeled as toxic. Through experiments, we show that these two post-processing steps improve the performance of our model by 4.16 ...
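The two post-processing steps described above can be sketched as follows. This is a minimal illustration, not the thesis code: the token format `(start, end, is_toxic)`, the function names, and the use of whitespace to delimit words are all assumptions.

```python
import re


def fill_between_tokens(tokens):
    """Step 1: mark offsets of toxic tokens, plus any gap between
    two consecutive tokens that are both toxic.

    tokens: list of (start, end, is_toxic) tuples in reading order
    (a hypothetical format chosen for this sketch).
    """
    toxic = set()
    for (start, end, is_toxic), nxt in zip(tokens, tokens[1:] + [None]):
        if is_toxic:
            toxic.update(range(start, end))
            if nxt is not None and nxt[2]:
                toxic.update(range(end, nxt[0]))
    return toxic


def expand_to_words(text, tokens):
    """Step 2: mark a whole word toxic if any token inside it is toxic.

    Words are assumed to be whitespace-delimited.
    """
    toxic = set()
    for match in re.finditer(r"\S+", text):
        w_start, w_end = match.start(), match.end()
        if any(t and w_start <= s < w_end for s, _, t in tokens):
            toxic.update(range(w_start, w_end))
    return toxic


def postprocess(text, tokens):
    """Combine both steps and return sorted toxic character offsets."""
    return sorted(fill_between_tokens(tokens) | expand_to_words(text, tokens))


# Made-up example: only the sub-word "pid" and the word "idiot" are flagged.
text = "you are a stupid idiot ok"
tokens = [
    (0, 3, False), (4, 7, False), (8, 9, False),
    (10, 13, False), (13, 16, True),  # "stu" + "pid"
    (17, 22, True), (23, 25, False),
]
offsets = postprocess(text, tokens)
print("".join(text[i] for i in offsets))
```

In this example the space between the two flagged tokens is filled in by the first step, and the unflagged sub-word "stu" is pulled in by the second, so the refined span covers both words in full.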


Lexical Complexity Prediction With Assembly Models, Aadil Islam 2021 Dartmouth College

Dartmouth College Undergraduate Theses

Tuning the complexity of one's writing is essential to presenting ideas in a logical, intuitive manner to audiences. This paper describes a system submitted by team BigGreen to LCP 2021 for predicting the lexical complexity of English words in a given context. We assemble a feature engineering-based model and a deep neural network model with an underlying Transformer architecture based on BERT. While BERT itself performs competitively, our feature engineering-based model helps in extreme cases, e.g., separating instances of easy and neutral difficulty. Our handcrafted features comprise a breadth of lexical, semantic, syntactic, and novel phonetic measures. Visualizations of ...


Advancing The Ability To Predict Cognitive Decline And Alzheimer’s Disease Based On Genetic Variants Beyond Amyloid-Beta And Tau, Naveen Rawat 2021 San Jose State University

Master's Projects

A growing amount of neurodegenerative R&D is focused on identifying genomic-based explanations of AD that are beyond Amyloid-β and Tau. The proposed effort involves identifying some of the genomic variations, such as single nucleotide polymorphisms (SNPs), allele, chromosome, and epigenetic contributors to MCI and AD that are beyond Aβ and Tau.

The project involves building a prediction model based on a support vector machine (SVM) classifier that takes into account the genomic variations and epigenetic factors to predict the early stage of mild cognitive impairment (MCI) and Alzheimer’s disease (AD). To achieve this, picking up important feature sets which ...


Prediction Of Financial Capacity Using Diffusion Compartment Imaging, Lok Yi Tai 2021 San Jose State University

Master's Projects

Financial Capacity (FC) is the ability to manage one’s financial affairs, which is essential for autonomy and independence, particularly for aging adults. Since dementia develops gradually, it is often difficult to detect the early signs that this cognitive dysfunction is developing. This project aims to use Neurite orientation dispersion and density imaging (NODDI) to identify the white matter tracts that are associated with FC. Diffusion Tensor Images (DTI) and T1 Magnetic Resonance Images (MRI) of 18 Alzheimer’s Disease (AD) subjects, 47 Mild Cognitive Impaired (MCI) subjects, and 193 healthy control (CN) subjects are compared to neuropsychological tests. Orientation Dispersion ...


Spaceflight And The Differential Gene Expression Of Human Stem Cell-Derived Cardiomyocytes, Eugenie Zhu 2021 San Jose State University

Master's Projects

The National Aeronautics and Space Administration (NASA) has performed many experiments on the International Space Station (ISS) to further understand how conditions in space can affect life on Earth. This project analyzed GLDS-258, a gene set from NASA’s GeneLab repository which examines the impact of microgravity on human induced pluripotent stem-cell-derived cardiomyocytes (hiPSC-CMs). While many datasets have been run through NASA’s RNA-Seq Consensus Pipeline (RCP) to study differential gene expression in space, a Homo sapiens dataset has yet to be analyzed using the RCP. The aim of this project was to run the first Homo sapiens dataset, GLDS-258 ...


Wildfire Risk Prediction For A Smart City, Rekha Rani 2021 San Jose State University

Master's Projects

Wildfires are uncontrolled fires that may lead to the destruction of biodiversity, soil fertility, and human resources. There is a need for timely detection and prediction of wildfires to minimize their disastrous effects. In this research, we propose a wildfire prediction model that relies on multi-criteria decision making (MCDM) to explicitly evaluate multiple conflicting criteria in decision making and weave the wildfire risks into the city’s resiliency plan. We incorporate fuzzy set theory to handle imprecision and uncertainties. In the process, we create a new data set that includes California cities’ weather, vegetation, topography, and population density records. The ...


An Empirical Study Of Refactorings And Technical Debt In Machine Learning Systems, Yiming Tang, Raffi T. Khatchadourian, Mehdi Bagherzadeh, Rhia Singh, Ajani Stewart, Anita Raja 2021 CUNY Graduate Center

Publications and Research

Machine Learning (ML), including Deep Learning (DL), systems, i.e., those with ML capabilities, are pervasive in today’s data-driven society. Such systems are complex; they are comprised of ML models and many subsystems that support learning processes. As with other complex systems, ML systems are prone to classic technical debt issues, especially when such systems are long-lived, but they also exhibit debt specific to these systems. Unfortunately, there is a gap in knowledge of how ML systems actually evolve and are maintained. In this paper, we fill this gap by studying refactorings, i.e., source-to-source semantics-preserving program transformations, performed ...


Federated Learning In Gaze Recognition (FLIGR), Arun Gopal Govindaswamy 2021 DePaul University

College of Computing and Digital Media Dissertations

The efficiency and generalizability of a deep learning model are based on the amount and diversity of training data. Although huge amounts of data are being collected, these data are not stored in centralized servers for further data processing. It is often infeasible to collect and share data in centralized servers due to various medical data regulations. This need for diversely distributed data and infeasible storage solutions calls for Federated Learning (FL). FL is a clever way of utilizing privately stored data in model building without the need for data sharing. The idea is to train several different models locally ...


Reporting Of Eating Disorder Deaths, Katherine Mobley, Amy Hord 2021 Kennesaw State University

Symposium of Student Scholars

Those affected by eating disorders experience disturbances in eating behaviors which are often related to underlying psychiatric disorders such as anxiety, depression, or obsessive-compulsive disorder (Parekh, 2017; Drieberg et al., 1998, p. 53). The duplicitous nature of the disorder makes it difficult to diagnose, and the toll it takes on an individual’s physical health makes its mortality rate the second highest among psychiatric disorders (Guinhut et al., 2021, p. 130). Even if the correct education and resources are accessible to certain individuals, negative stigmatization about the disorder can make sufferers unlikely to seek help (Becker et al., 2010). Findings ...


Using Machine Learning Methods To Predict The Movement Trajectories Of The Louisiana Black Bear, Daniel Clark, David Shaw, Armando Vela, Shane Weinstock, John Santerre, Joseph D. Clark 2021 Southern Methodist University

SMU Data Science Review

In 1992, the Louisiana black bear (Ursus americanus luteolus) was placed on the U.S. Endangered Species List. This was because bear populations in Louisiana were so small and isolated that they could not intersect with other populations to grow. Interchange of individuals between subpopulations of bears in Louisiana is critical to maintain genetic diversity and avoid inbreeding effects. Utilizing GPS (Global Positioning System) data gathered from 31 radio-collared bears from 2010 through 2012, this research will investigate how bears traverse the landscape, which has implications for gene exchange. This paper will leverage machine learning tools to improve ...


Analyzing Empirical Quality Metrics Of Deep Learning Models For Antimicrobial Resistance, Huy H. Nguyen, Sanjay Pillay, Allison Roderick, Hao Wang, John Santerre 2021 Southern Methodist University

SMU Data Science Review

Antimicrobial Resistance (AMR) is a growing concern in the medical field. Over-prescription of antibiotics as well as bacterial mutations have caused some once lifesaving drugs to become ineffective against bacteria. However, the problem of AMR might be addressed using Machine Learning (ML) thanks to increased availability of genomic data and large computing resources. The Pathosystems Resource Integration Center (PATRIC) has genomic data of various bacterial genera with sample isolates that are either resistant or susceptible to certain antibiotics. Past research has applied ML algorithms to this database to model AMR with successful results, including accuracies over 80%. To better ...


Introducing Reproducibility To Citation Analysis: A Case Study In The Earth Sciences, Samantha Teplitzky, Wynn Tranfield, Mea Warren, Philip White 2021 University of California, Berkeley

Journal of eScience Librarianship

Objectives:

  • Replicate methods from a 2019 study of Earth Science researcher citation practices.
  • Calculate programmatically whether researchers in Earth Science rely on a smaller subset of literature than estimated by the 80/20 rule.
  • Determine whether these reproducible citation analysis methods can be used to analyze open access uptake.

Methods: Replicated methods of a prior citation study provide an updated transparent, reproducible citation analysis protocol that can be replicated with Jupyter Notebooks.
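The second objective, checking programmatically how concentrated citations are relative to the 80/20 rule, could be approached along these lines. This is a sketch under stated assumptions: the function name `share_of_sources_for_coverage`, the input format (one source name per citation), and the sample data are all hypothetical and do not come from the study's notebooks.

```python
from collections import Counter


def share_of_sources_for_coverage(cited_sources, coverage=0.8):
    """Return the fraction of distinct sources needed, taking the most-cited
    first, to account for `coverage` of all citations.

    cited_sources: iterable with one entry per citation (hypothetical format).
    """
    counts = sorted(Counter(cited_sources).values(), reverse=True)
    total = sum(counts)
    running = 0
    for k, count in enumerate(counts, start=1):
        running += count
        if running >= coverage * total:
            return k / len(counts)
    return 1.0


# Made-up citation list: 20 citations spread over 5 journals.
citations = ["A"] * 13 + ["B"] * 4 + ["C", "D", "E"]
share = share_of_sources_for_coverage(citations)
print(share)
```

Under the 80/20 rule roughly 20% of sources would supply 80% of citations, so a returned share below 0.2 would indicate reliance on an even smaller subset of the literature.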

Results: This study replicated the prior citation study’s conclusions, and also adapted the author’s methods to analyze the citation practices of Earth Scientists at four institutions ...


Analysis Of Individual Player Performances And Their Effect On Winning In College Soccer, Angelo Bravo, Thomas Karba, Sean McWhirter, Billy Nayden 2021 Southern Methodist University

SMU Data Science Review

This study describes the process of modernizing the approach of the Southern Methodist University (SMU) Men's Soccer coaching staff through the use of location and tracking data from their matches in the 2019 season. This study utilizes a variety of modeling and analysis techniques to explore and categorize the data and use it to evaluate the types of plays that are most often correlated with victories. This study's contribution to college soccer analytics includes the implementation of a model to determine individual players' performance, the production of team-level metrics, and visualizations to increase the efficiency of the coaching ...


Machine Learning In The Health Industry: Predicting Congestive Heart Failure And Impactors, Alexandra Norman, James Harding, Daria Zhukova 2021 Southern Methodist University

SMU Data Science Review

Cardiovascular diseases, Congestive Heart Failure in particular, are a leading cause of death worldwide. Congestive Heart Failure has high mortality and morbidity rates. The key to decreasing the morbidity and mortality rates associated with Congestive Heart Failure is determining a method to detect high-risk individuals prior to the development of this often-fatal disease. Providing high-risk individuals with advance knowledge of risk factors that could potentially lead to Congestive Heart Failure enhances the likelihood of preventing the disease through implementation of lifestyle changes for healthy living. When dealing with healthcare and patient data, there are restrictions that led to difficulties accessing ...


Generating And Smoothing Handwriting With Long Short-Term Memory Networks, Muchigi Kimari, Edward Fry, Ikenna Nwaogu, YuMei Bennett, John Santerre 2021 Southern Methodist University

SMU Data Science Review

This project explores different neural network methods for generating synthetic handwritten text. The goal is to offer people with dysgraphia an AI tool that generates handwriting while maintaining an individual’s style. As part of this project, an application development framework is set up on GitHub in such a way that others can continue to explore and improve the AI tool.


A Machine Learning Method Of Determining Causal Inference Applied To Shifts In Voting Preferences Between 2012-2016, Jaclyn A. Coate, Reagan Meagher, Megan Riley, John Santerre 2021 Southern Methodist University

SMU Data Science Review

This research investigates the application of machine learning techniques to assist in the execution of a synthetic control model. This model was applied to analyze counties within the United States that showed a voter shift from a majority Democratic voter share to a Republican one between the 2012 and 2016 election cycles. The following study applies two steps of machine learning analysis. The first, which is the treatment discovery process, leverages a Random Forest to evaluate feature importance. The second step was the execution of the synthetic control model with two predictor variable lists. The first was the parametric method: a ...


Digital Commons powered by bepress