Open Access. Powered by Scholars. Published by Universities.®

Physical Sciences and Mathematics Commons

Articles 1 - 20 of 20

Full-Text Articles in Physical Sciences and Mathematics

A Data Science Approach To Defining A Data Scientist, Andy Ho, An Nguyen, Jodi L. Pafford, Robert Slater Dec 2019

SMU Data Science Review

In this paper, we present a common definition and list of skills for a Data Scientist using online job postings. The overlap and ambiguity of various roles such as data scientist, data engineer, data analyst, software engineer, database administrator, and statistician motivate the problem. To arrive at a single Data Scientist definition, we collect over 8,000 job postings from Indeed.com for the six job titles. Each corpus contains text on job qualifications, skills, responsibilities, educational preferences, and requirements. Our data science methodology and analysis rendered the single definition of a data scientist: A data scientist codes, collaborates, and communicates – …


Aspect And Opinion Aware Abstractive Review Summarization With Reinforced Hard Typed Decoder, Yufei Tian, Jianfei Yu, Jing Jiang Nov 2019

Research Collection School Of Computing and Information Systems

In this paper, we study abstractive review summarization. Observing that review summaries often consist of aspect words, opinion words and context words, we propose a two-stage reinforcement learning approach, which first predicts the output word type from the three types, and then leverages the predicted word type to generate the final word distribution. Experimental results on two Amazon product review datasets demonstrate that our method can consistently outperform several strong baseline approaches based on ROUGE scores.


Classifying Fiction And Non-Fiction Works Using Machine Learning, Rachna Gupta '21 Oct 2019

Student Publications & Research

The objective of this project was to create a program that can determine whether an unknown text is a work of fiction or non-fiction using machine learning. Various datasets of speeches, ebooks, poems, scientific papers, and texts from Project Gutenberg and the Wolfram Example Data were utilized to train and test a Markov Chain machine learning model. A microsite was deployed with the final product that returns a probability of fictionality based on input from the user with 95% accuracy.
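One plausible way to build a classifier from Markov chains is sketched below in Python. The abstract does not describe the project's actual model, features, or tooling, so everything here is an assumption made for illustration: the class name `MarkovTextModel`, the word-level first-order chain, the add-one smoothing, and the toy training sentences are all hypothetical. The idea is to fit one chain per class and assign a new text to the class whose chain gives it the higher log-likelihood.

```python
import math
from collections import Counter, defaultdict


class MarkovTextModel:
    """First-order, word-level Markov chain (hypothetical illustration)."""

    def __init__(self):
        # transitions[prev][nxt] = number of times `nxt` followed `prev`
        self.transitions = defaultdict(Counter)
        self.vocab = set()

    def train(self, text):
        words = text.lower().split()
        self.vocab.update(words)
        for prev, nxt in zip(words, words[1:]):
            self.transitions[prev][nxt] += 1

    def log_likelihood(self, text, vocab_size):
        """Sum of log transition probabilities with add-one smoothing."""
        words = text.lower().split()
        total = 0.0
        for prev, nxt in zip(words, words[1:]):
            row = self.transitions.get(prev, Counter())
            count = row[nxt]  # a Counter returns 0 for unseen words
            total += math.log((count + 1) / (sum(row.values()) + vocab_size))
        return total


def classify(text, fiction_model, nonfiction_model):
    """Pick the class whose chain assigns the text the higher likelihood."""
    # Shared vocabulary size (+1 for unseen words) keeps the smoothing comparable.
    vocab_size = len(fiction_model.vocab | nonfiction_model.vocab) + 1
    fiction_score = fiction_model.log_likelihood(text, vocab_size)
    nonfiction_score = nonfiction_model.log_likelihood(text, vocab_size)
    return "fiction" if fiction_score > nonfiction_score else "non-fiction"


# Toy training data, invented for this sketch only.
fiction = MarkovTextModel()
fiction.train("the dragon flew over the castle and the dragon roared")
nonfiction = MarkovTextModel()
nonfiction.train("the experiment measured the voltage and the experiment confirmed the result")

print(classify("the dragon roared", fiction, nonfiction))
print(classify("the experiment measured", fiction, nonfiction))
```

A real system would train on far larger corpora and would likely convert the two scores into a probability rather than a hard label.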


Automatic Inference Of Causal Reasoning Chains From Student Essays, Simon Mark Hughes Oct 2019

College of Computing and Digital Media Dissertations

While there has been an increasing focus on higher-level thinking skills arising from the Common Core Standards, many high-school and middle-school students struggle to combine and integrate information from multiple sources when writing essays. Writing is an important learning skill, and there is increasing evidence that writing about a topic develops a deeper understanding in the student. However, grading essays is time consuming for teachers, resulting in an increasing focus on shallower forms of assessment that are easier to automate, such as multiple-choice tests. Existing essay grading software has attempted to ease this burden but relies on shallow lexico-syntactic features …


Knowledge Base Question Answering With Topic Units, Yunshi Lan, Shuohang Wang, Jing Jiang Aug 2019

Research Collection School Of Computing and Information Systems

Knowledge base question answering (KBQA) is an important task in natural language processing. Existing methods for KBQA usually start with entity linking, which considers mostly named entities found in a question as the starting points in the KB to search for answers to the question. However, relying only on entity linking to look for answer candidates may not be sufficient. In this paper, we propose to perform topic unit linking where topic units cover a wider range of units of a KB. We use a generation-and-scoring approach to gradually refine the set of topic units. Furthermore, we use reinforcement learning …


Cold-Start Aware Deep Memory Networks For Multi-Entity Aspect-Based Sentiment Analysis, Kaisong Song, Wei Gao, Lujun Zhao, Changlong Sun, Xiaozhong Liu Aug 2019

Research Collection School Of Computing and Information Systems

Various types of target information have been considered in aspect-based sentiment analysis, such as entities and aspects. Existing research has realized the importance of targets and developed methods with the goal of precisely modeling their contexts via generating target-specific representations. However, all these methods ignore that these representations cannot be learned well due to the lack of sufficient human-annotated target-related reviews, which leads to the data sparsity challenge, a.k.a. cold-start problem here. In this paper, we focus on a more general multiple entity aspect-based sentiment analysis (ME-ABSA) task which aims at identifying the sentiment polarity of different aspects of multiple …


Adapting BERT For Target-Oriented Multimodal Sentiment Classification, Jianfei Yu, Jing Jiang Aug 2019

Research Collection School Of Computing and Information Systems

As an important task in Sentiment Analysis, Target-oriented Sentiment Classification (TSC) aims to identify sentiment polarities over each opinion target in a sentence. However, existing approaches to this task primarily rely on the textual content while ignoring other increasingly popular multimodal data sources (e.g., images), which can enhance the robustness of these text-based models. Motivated by this observation and inspired by the recently proposed BERT architecture, we study Target-oriented Multimodal Sentiment Classification (TMSC) and propose a multimodal BERT architecture. To model intra-modality dynamics, we first apply BERT to obtain target-sensitive textual representations. We then borrow the idea from self-attention …


An Intelligent Platform With Automatic Assessment And Engagement Features For Active Online Discussions, Michelle L. F. Cheong, Yun-Chen Chen, Bing Tian Dai Jul 2019

Research Collection School Of Computing and Information Systems

In a university context, discussion forums are mostly available in Learning and Management Systems (LMS) but are often ineffective in encouraging participation due to poorly designed user interface and the lack of motivating factors to participate. Our integrated platform with the Telegram mobile app and a web-based forum, is capable of automatic thoughtfulness assessment of questions and answers posted, using text mining and Natural Language Processing (NLP) methodologies. We trained and applied the Random Forest algorithm to provide instant thoughtfulness score prediction for the new posts contributed by the students, and prompted the students to improve on their posts, thereby invoking deeper thinking resulting in better quality contributions. In addition, …


Use Of Text Data In Identifying And Prioritizing Potential Drug Repositioning Candidates, Majid Rastegar-Mojarad May 2019

Theses and Dissertations

New drug development costs between 500 million and 2 billion dollars and takes 10-15 years, with a success rate of less than 10%. Drug repurposing (defined as discovering new indications for existing drugs) could play a significant role in drug development, especially considering the declining success rates of developing novel drugs. In the period 2007-2009, drug repurposing led to the launching of 30-40% of new drugs. Typically, new indications for existing medications are identified by accident. However, new technologies and a large number of available resources enable the development of systematic approaches to identify and validate drug-repurposing candidates with significantly …


Commonsense Knowledge In Sentiment Analysis Of Ordinance Reactions For Smart Governance, Manish Puri May 2019

Theses, Dissertations and Culminating Projects

Smart Governance is an emerging research area which has attracted scientific as well as policy interests, and aims to improve collaboration between government and citizens, as well as other stakeholders. Our project aims to enable lawmakers to incorporate data driven decision making in enacting ordinances. Our first objective is to create a mechanism for mapping ordinances (local laws) and tweets to Smart City Characteristics (SCC). The use of SCC has allowed us to create a mapping between a huge number of ordinances and tweets, and the use of Commonsense Knowledge (CSK) has allowed us to utilize human judgment in mapping. …


An Instruction Embedding Model For Binary Code Analysis, Kimberly Michelle Redmond Apr 2019

Theses and Dissertations

Binary code analysis is important for understanding programs without access to the original source code, which is common with proprietary software. Analyzing binaries can be challenging given their high variability: due to growth in tech manufacturers, source code is now frequently compiled for multiple instruction set architectures (ISAs); however, there is no formal dictionary that translates between their assembly languages. The difficulty of analysis is further compounded by different compiler optimizations and obfuscated malware signatures. Such minutiae means that some vulnerabilities may only be detectable on a fine-grained level. Recent strides in machine learning—particularly in Natural Language …


CS04ALL: Natural Language Processing Project, Hunter R. Johnson Feb 2019

Open Educational Resources

In this archive there are two activities/assignments suitable for use in a CS0 or Intro course which uses Python.

In the first activity, students are asked to "fill in the code" in a series of short programs that compute a similarity metric (cosine similarity) for text documents. This involves string tokenization, and frequency counting using Python string methods and datatypes.

https://cocalc.com/share/bde99afd-76c8-493d-9608-db9019bcd346/171/Proj1?viewer=share/
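As a rough illustration of what the first activity covers, the following is a minimal sketch of cosine similarity over word counts. It is not the assignment's own code: the function names `tokenize` and `cosine_similarity` and the punctuation-stripping rule are assumptions chosen for this example. For two count vectors a and b, the metric is the dot product of a and b divided by the product of their lengths.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase the text, split on whitespace, and strip surrounding punctuation."""
    tokens = []
    for raw in text.lower().split():
        word = raw.strip(".,;:!?\"'()[]")
        if word:
            tokens.append(word)
    return tokens


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between the word-count vectors of two texts."""
    counts_a = Counter(tokenize(text_a))
    counts_b = Counter(tokenize(text_b))
    # Only words present in both texts contribute to the dot product.
    dot = sum(counts_a[w] * counts_b[w] for w in counts_a if w in counts_b)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


print(cosine_similarity("The cat sat.", "the cat sat"))
print(cosine_similarity("the cat", "the dog"))
```

Texts with identical word counts score 1 (up to floating-point rounding), and texts with no words in common score 0.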

In the second activity (taken directly from Think Python 2e) students use a pronunciation dictionary to solve a riddle involving homophones.

https://cocalc.com/share/bde99afd-76c8-493d-9608-db9019bcd346/171/Dicts2?viewer=share/

This OER material was produced as a result of the CS04ALL CUNY OER project.


Culture Clubs: Processing Speech By Deriving And Exploiting Linguistic Subcultures, David Guy Brizan Feb 2019

Dissertations, Theses, and Capstone Projects

Spoken language understanding systems are error-prone for several reasons, including individual speech variability. This is manifested in many ways, among which are differences in pronunciation, lexical inventory, grammar and disfluencies. There is, however, a lot of evidence pointing to stable language usage within subgroups of a language population. We call these subgroups linguistic subcultures.

The two broad problems are defined and a survey of the work in this space is performed. The two broad problems are: linguistic subculture detection, commonly performed via Language Identification, Accent Identification or Dialect Identification approaches; and speech and language processing tasks taken which may see …


Opioid Misuse Detection In Hospitalized Patients Using Convolutional Neural Networks, Brihat Sharma Jan 2019

Master's Theses

Opioid misuse is a major public health problem worldwide. In 2016, 11.3 million people were reported to misuse opioids in the US alone. Opioid-related inpatient and emergency department visits have increased by 64 percent and the rate of opioid-related visits has nearly doubled between 2009 and 2014. It is thus critical for healthcare systems to detect opioid misuse cases. Patients hospitalized for consequences of their opioid misuse present an opportunity for intervention but better screening and surveillance methods are needed to guide providers. The current screening methods with self-report questionnaire data are time-consuming and difficult to perform in …


Assessing The Quality Of Software Development Tutorials Available On The Web, Manziba A. Nishi Jan 2019

Theses and Dissertations

Both expert and novice software developers frequently access software development resources available on the Web in order to look up or learn new APIs, tools and techniques. Software quality is affected negatively when developers fail to find high-quality information relevant to their problem. While there is a substantial amount of freely available resources that can be accessed online, some of the available resources contain information that suffers from error proneness, copyright infringement, security concerns, and incompatible versions. Use of such toxic information can have a strong negative effect on developers' efficacy. This dissertation focuses specifically on software tutorials, aiming to automatically …


A Tree-Based Approach For English-To-Turkish Translation, Özge Bakay, Begüm Avar, Olcay Taner Yildiz Jan 2019

Turkish Journal of Electrical Engineering and Computer Sciences

In this paper, we present our English-to-Turkish translation methodology, which adopts a tree-based approach. Our approach relies on tree analysis and the application of structural modification rules to get the target side (Turkish) trees from source side (English) ones. We also use morphological analysis to get candidate root words and apply tree-based rules to obtain the agglutinated target words. Compared to earlier work on English-to-Turkish translation using phrase-based models, we have been able to obtain higher BLEU scores in our current study. Our syntactic subtree permutation strategy, combined with a word replacement algorithm, provides a 67% relative improvement from …


Building An Automated Q-A System Using Online Forums As Knowledge Bases, Kyle Moore Jan 2019

Electronic Theses and Dissertations

Question-Answer systems traditionally use expensive and difficult to produce structured knowledge bases. Recent systems have used unstructured natural language sources as their datasets, but most of those sources have been overly broad or difficult to extend. Online forums are a largely untapped source of information that can provide both depth and breadth when limited to a specific domain, as well as being adaptive to the introduction of new information. In this paper, I conjecture that online forums can be similarly and effectively used as an unstructured knowledge base for Question-Answer systems. I use a relatively simple summarization-based approach to analyze …


Curtus: An NLP Tool To Map Job Skills To Academic Courses, Daniel Rockwell Jan 2019

Theses and Dissertations

Many businesses are burdened with the need to train students for the job instead of finding them prepared for it. Few business leaders feel that colleges prepare students for future jobs from day one. It can be a challenge for colleges to determine if their curricula meet the industry needs. Mapping industry needs to academic courses can be advantageous to both parties as it will allow colleges to be aligned with the industry needs and accordingly satisfy those needs and will allow the industry to hire better prepared graduates. In an attempt to address this, a system prototype that uses …


Indirect Relatedness, Evaluation, And Visualization For Literature Based Discovery, Sam Henry Jan 2019

Theses and Dissertations

The exponential growth of scientific literature is creating an increased need for systems to process and assimilate knowledge contained within text. Literature Based Discovery (LBD) is a well established field that seeks to synthesize new knowledge from existing literature, but it has remained primarily in the theoretical realm rather than in real-world application. This lack of real-world adoption is due in part to the difficulty of LBD, but also due to several solvable problems present in LBD today. Of these problems, the ones in most critical need of improvement are: (1) the over-generation of knowledge by LBD systems, (2) a …


An Evaluation Of Learning Employing Natural Language Processing And Cognitive Load Assessment, Mrunal Tipari Jan 2019

Dissertations

One of the key goals of Pedagogy is to assess learning. Various paradigms exist and one of these is Cognitivism. It essentially sees a human learner as an information processor and the mind as a black box with limited capacity that should be understood and studied. With respect to this, an approach is to employ the construct of cognitive load to assess a learner's experience and in turn design instructions better aligned to the human mind. However, cognitive load assessment is not an easy activity, especially in a traditional classroom setting. This research proposes a novel method for evaluating learning …