Open Access. Powered by Scholars. Published by Universities.®

Data Science Commons

Articles 1 - 6 of 6

Full-Text Articles in Data Science

Fine-Grained Detection Of Hate Speech Using BERToxic, Yakoob Khan Jun 2021

Dartmouth College Undergraduate Theses

This thesis describes our approach towards the fine-grained detection of hate speech using deep learning. We leverage the transformer encoder architecture to propose BERToxic, a system that fine-tunes a pre-trained BERT model to locate toxic text spans in a given text and utilizes additional post-processing steps to refine the prediction boundaries. The post-processing steps involve (1) labeling character offsets between consecutive toxic tokens as toxic and (2) assigning a toxic label to words that have at least one token labeled as toxic. Through experiments, we show that these two post-processing steps improve the performance of our model by 4.16% on …
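The two post-processing steps named in this abstract can be illustrated with a minimal sketch. It is not the thesis code: the function name `toxic_offsets`, the `(start, end)` span representation with exclusive ends, and the assumption that word boundaries are supplied separately are all hypothetical choices made for illustration.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive (assumed convention)


def toxic_offsets(tokens: List[Span], toxic: List[bool], words: List[Span]) -> List[int]:
    """Return sorted character offsets labeled toxic after both refinement steps."""
    offsets: Set[int] = set()

    # Base prediction: every character of a token flagged toxic by the model.
    for (start, end), flag in zip(tokens, toxic):
        if flag:
            offsets.update(range(start, end))

    # Step 1: label the characters between two consecutive toxic tokens as toxic.
    for i in range(len(tokens) - 1):
        if toxic[i] and toxic[i + 1]:
            offsets.update(range(tokens[i][1], tokens[i + 1][0]))

    # Step 2: a word with at least one toxic token becomes toxic as a whole.
    for w_start, w_end in words:
        for (t_start, t_end), flag in zip(tokens, toxic):
            overlaps = t_start < w_end and w_start < t_end
            if flag and overlaps:
                offsets.update(range(w_start, w_end))
                break

    return sorted(offsets)
```

For the text "so dumb idiots", split into the sub-word tokens "so", "dumb", "idiot", "s" with only "dumb" and "idiot" flagged, step 1 fills the space between the two flagged tokens and step 2 extends the label to the trailing "s", so the whole span "dumb idiots" is returned.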


Lexical Complexity Prediction With Assembly Models, Aadil Islam Jun 2021

Dartmouth College Undergraduate Theses

Tuning the complexity of one's writing is essential to presenting ideas in a logical, intuitive manner to audiences. This paper describes a system submitted by team BigGreen to LCP 2021 for predicting the lexical complexity of English words in a given context. We assemble a feature engineering-based model and a deep neural network model with an underlying Transformer architecture based on BERT. While BERT itself performs competitively, our feature engineering-based model helps in extreme cases, e.g., separating instances of easy and neutral difficulty. Our handcrafted features comprise a breadth of lexical, semantic, syntactic, and novel phonetic measures. Visualizations of BERT …


Automated Analysis Of RFPs Using Natural Language Processing (NLP) For The Technology Domain, Sterling Beason, William Hinton, Yousri A. Salamah, Jordan Salsman May 2021

SMU Data Science Review

Much progress has been made in text analysis, specifically within the statistical domain of Term Frequency (TF) and Inverse Document Frequency (IDF). However, there is much room for improvement, especially in discovering emerging trends. Emerging Trend Detection Systems (ETDS) ingest a collection of textual data and rely on TF/IDF to identify new or up-trending topics within the corpus. However, the tremendous rate of change and the amount of digital information make it almost impossible for a human expert to spot emerging trends without relying on an automated ETD system. Since the U.S. Government …
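As a rough illustration of the TF/IDF weighting this abstract refers to, the sketch below scores terms in a small tokenized corpus. It assumes one common variant (term frequency as a relative count, IDF as the natural log of the number of documents over document frequency); the function name `tf_idf` is hypothetical, and the paper may use a different weighting.

```python
import math
from collections import Counter
from typing import Dict, List


def tf_idf(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Score each term of each tokenized document with tf * idf (assumed variant)."""
    n_docs = len(docs)

    # Document frequency: in how many documents each term appears.
    df: Counter = Counter()
    for doc in docs:
        df.update(set(doc))

    scores = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        scores.append({
            # tf = relative frequency in this document; idf = log(N / df).
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores
```

Under this variant a term that appears in every document scores zero, while a term concentrated in a few documents scores higher, which is the property a trend detector would build on.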


Optimal Analytical Methods For High Accuracy Cardiac Disease Classification And Treatment Based On ECG Data, Jianwei Zheng May 2021

Computational and Data Sciences (PhD) Dissertations

This work comprises six projects. In the first project, a new research database for 12-lead electrocardiogram signals was created under the auspices of Chapman University and Shaoxing People's Hospital (Shaoxing Hospital Zhejiang University School of Medicine). This database aims to enable the scientific community to conduct new studies on arrhythmia and other cardiovascular conditions. In the second project, we created a new 12-lead ECG database under the auspices of Chapman University and Ningbo First Hospital of Zhejiang University that aims to provide high-quality data enabling detection of the distinctions between idiopathic ventricular arrhythmia from right ventricular outflow tract …


GOES-R Supervised Machine Learning, Ronald Adomako Jan 2021

Dissertations and Theses

The GOES-R series is a product line of four satellites, two of which are currently in orbit (GOES-16 “East” and GOES-17 “West”). GOES-17 is susceptible to a Loop-Heat-Pipe (LHP) phenomenon in which, during the fall and spring seasons, there are times of day when some of the infrared bands record inaccurate readings from the Advanced Baseline Imager (ABI). This results from a combination of astronomical behavior and the position of GOES-17. The calibration issue occurs when the LHP instrument fails to radiate the heat of the sun out of the ABI. Predictive Calibration (pCal) is an algorithm developed by instrument vendors for the National Oceanic and Atmospheric Administration (NOAA) …


Improving Space Efficiency Of Deep Neural Networks, Aliakbar Panahi Jan 2021

Theses and Dissertations

Language models employ a very large number of trainable parameters. Despite being highly overparameterized, these networks often achieve good out-of-sample test performance on the original task and easily fine-tune to related tasks. Recent observations involving, for example, the intrinsic dimension of the objective landscape and the lottery ticket hypothesis indicate that training often actively involves only a small fraction of the parameter space. Thus, the question remains of how large a parameter space needs to be in the first place. Evidence from recent work on model compression, parameter sharing, factorized representations, and knowledge distillation increasingly shows that models can be …