Open Access. Powered by Scholars. Published by Universities.®

Computational Linguistics Commons


233 Full-Text Articles | 347 Authors | 192,439 Downloads | 63 Institutions

All Articles in Computational Linguistics



Inferring Research Fields In Administrative Records Using Text Data, Ekaterina Levitskaya 2020 The Graduate Center, City University of New York


Dissertations, Theses, and Capstone Projects

The UMETRICS database (Universities: Measuring the Effects of Research on Innovation, Competitiveness, and Science) contains rich information on grants from sponsored federal and non-federal research for 32 universities over a 15-year period. It is hosted at IRIS (Institute for Research on Innovation and Science, University of Michigan) and serves as a valuable source of university administrative data; however, it does not contain information on research fields. Categorizing grant data by research field can help measure the results of investment in research and science and provide evidence for data-driven policy-making, yet administrative data often lack this type of categorization. In …


Genderlects In Social Media, Alina Korovatskaya 2020 The Graduate Center, City University of New York


Dissertations, Theses, and Capstone Projects

Many studies have found significant differences in the ways men and women use language; some argue that these differences result from cultural differences, and others suggest that they are influenced by differences in social status and power between the genders. However, some of the major studies were concluded decades ago and do not reflect changes in gender relations in recent years. In this study, we analyze modern conversations on two social media platforms, Twitter and Reddit, to determine whether substantial differences between men's and women's use of language have persisted.


English Wordnet Taxonomic Random Walk Pseudo-Corpora, Filip Klubicka, Alfredo Maldonado, Abhijit Mahalunkar, John D. Kelleher 2020 Technological University Dublin


Conference papers

This is a resource description paper that describes the creation and properties of a set of pseudo-corpora generated artificially from a random walk over the English WordNet taxonomy. Our WordNet taxonomic random walk implementation allows the exploration of different random walk hyperparameters and the generation of a variety of different pseudo-corpora. We find that different combinations of the walk’s hyperparameters result in varying statistical properties of the generated pseudo-corpora. We have published a total of 81 pseudo-corpora that we have used in our previous research, but have not exhausted all possible combinations of hyperparameters, which is why we have also …


Investigation Of The Consonant Endings Of The Chaoshan Dialect: A Result Of Language Contact And Horizontal Transmission, Jin Chen 2020 University of Massachusetts Amherst


Masters Theses

This thesis studies the inter-group variation of the consonant endings among five principal subgroups of the Chaoshan dialect, a branch of the South Min dialect in Eastern Guangdong Province, from the perspective of language contact and horizontal transmission. I conduct a quantitative study to present the synchronic variance of the consonant endings among five Chaoshan subgroups and the diachronic variance from Middle Chinese to modern Chaoshan dialect on a numerical scale.

The current literature tends to take the change of the consonant endings as a process of weakening governed by regular rules. My research findings challenge this conventional view. First, …


Chaprates, Brinly Xavier, Micole Amanda Marietta, Nidhi Vedantam 2020 Chapman University


Student Scholar Symposium Abstracts and Posters

On the Chapman campus, students taking and choosing classes have a significant need for communication and feedback with peers, professors, tutors, and study groups. We therefore wanted to create an application that enables users from various majors not only to communicate easily and effectively with people in their field, but also to give and receive feedback on classes through a rating system. We believe that the application will aid students in a myriad of specific ways, including being involved in study groups and getting tutoring help, determining which classes …


Ghost Peppers: Using Ensemble Models To Detect Professor Attractiveness Commentary On Ratemyprofessors.Com, Angie Waller 2020 The Graduate Center, City University of New York


Dissertations, Theses, and Capstone Projects

In June 2018, RateMyProfessors.com (RMP), a popular website for students to leave professor reviews, removed a controversial feature known as the “chili pepper” which allowed students to rate their professors as “hot” or “not hot.” Though past research has rigorously analyzed the correlation of the chili pepper with higher ratings in other categories (Felton, Mitchell, and Stinson, 2004; Felton et al., 2008), none has measured the effect of the removal of the chili pepper on the text content submitted by students. While it is a positive step that the chili pepper has been removed, text commentary on teacher attractiveness persists …


Phonologically-Informed Speech Coding For Automatic Speech Recognition-Based Foreign Language Pronunciation Training, Anthony J. Vicario 2020 The Graduate Center, City University of New York


Dissertations, Theses, and Capstone Projects

Automatic speech recognition (ASR) and computer-assisted pronunciation training (CAPT) systems used in foreign-language educational contexts are often not developed with the specific task of second-language acquisition in mind. Systems that are built for this task are often excessively targeted to one native language (L1) or a single phonemic contrast and are therefore burdensome to train. Current algorithms have been shown to provide erroneous feedback to learners and show inconsistencies between human and computer perception. These discrepancies have thus far hindered more extensive application of ASR in educational systems.

This thesis reviews the computational models of the human perception of American …


Computational Approaches To The Syntax–Prosody Interface: Using Prosody To Improve Parsing, Hussein M. Ghaly 2020 The Graduate Center, City University of New York


Dissertations, Theses, and Capstone Projects

Prosody has strong ties with syntax, since prosody can be used to resolve some syntactic ambiguities. Syntactic ambiguities have been shown to negatively impact automatic syntactic parsing; hence, there is reason to believe that prosodic information can help improve parsing. This dissertation considers a number of approaches that aim to computationally examine the relationship between prosody and syntax of natural languages, while also addressing the role of syntactic phrase length, with the ultimate goal of using prosody to improve parsing.

Chapter 2 examines the effect of syntactic phrase length on prosody in double center embedded sentences in French. Data collected …


Determining Tone Of A Body Of Text, Cole G. Hollant 2020 Bard College


Senior Projects Spring 2020

We will be looking into emotion detection and manipulation within a body of text based on Robert Plutchik’s basic emotions. This project encompasses building probabilistic and lexical models, full-stack web development, and dataset creation and application. We will build our models on Latent Dirichlet Allocation, a grouping model common in natural language processing (NLP), and on lexicons compiled through crowdsourcing. User testing is conducted as a means of measuring the effectiveness of our models. We discuss the application of concepts and technologies including MongoDB, REST APIs, containerization, IaaS, and web frontends.


The Stained Glass Of Knowledge: On Understanding Novice Mental Models Of Computing, Briana Christina Bettin 2020 Michigan Technological University


Dissertations, Master's Theses and Master's Reports

Learning to program can be a novel experience. The rigidity of programming can be at odds with a beginning programmer's existing perceptions, and the concepts can feel entirely unfamiliar. These observations motivated this research, which explores two major questions: What factors influence how novices learn programming? and How can analogy be more appropriately leveraged in programming education?

This dissertation investigates the factors influencing novice programming through multiple methods. The CS1 classroom is observed as a "whole system", with consideration to the factors present in it that can influence the learning process. Learning's cognitive processes are elaborated to ground exploration into specifically …


Pmkns For Pie: Parsed Morphological Katr Networks Of Sanskrit For Proto-Indo-European, Ryan Mark McDonald 2020 University of Kentucky


Theses and Dissertations--Linguistics

In this thesis, I construct two computational networks for Sanskrit to test theories of nominal accentuation as a way of examining the simplicity of each theory. I will be examining the Paradigmatic Approach and the Compositional Approach to nominal accentuation. For the Paradigmatic Approach, nominals are categorized into mobile and static categories based on how the accent appears in the paradigm (Fortson 2010). For the Compositional Approach, accent mobility is a result of the combination of morphemes and their inherent accent states (Kiparsky 2010). To construct these networks, I use the KATR extension to the DATR language for lexical knowledge …


Scholarly Communication And Documentary Fragmentations In The Public Space: A Functional Citation Study, Fidelia Ibekwe, Lucie Loubère 2019 Aix Marseille Univ, Université de Toulon, IMSIC, Marseille, France


Proceedings from the Document Academy

This paper studies how academic content published on OpenEdition.org, an online publication platform in the Social Sciences and Humanities, is re-appropriated by members of the public. Our research is therefore concerned with the public appropriation of science and Open science. After extracting the citation contexts of this content and mapping them, we propose a typology of citation functions as well as of citers (their origins and types). Our preliminary results indicate that academic literature is repurposed and cited by members of the public mainly as scientific warrant (support for their argumentation). We also found that academic content is …


Phonologically Informed Edit Distance Algorithms For Word Alignment With Low-Resource Languages, Richard T. McCoy, Robert Frank 2019 Johns Hopkins University


Robert Frank

We present three methods for weighting edit distance algorithms based on linguistic information. These methods base their penalties on (i) phonological features, (ii) distributional character embeddings, or (iii) differences between cognate words. We also introduce a novel method for evaluating edit distance through the task of low-resource word alignment by using edit-distance neighbors in a high-resource pivot language to inform alignments from the low-resource language. At this task, the cognate-based scheme outperforms our other methods and the Levenshtein edit distance baseline, showing that NLP applications can benefit from information about cross-linguistic phonological patterns.
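As a rough illustration of the first weighting scheme mentioned in this abstract (penalties based on phonological features), the sketch below runs the standard dynamic-programming edit distance with a pluggable substitution cost. The feature table `TOY_FEATURES`, the function names, and the cost formula (share of features not held in common) are all assumptions made for this example; they are not taken from the paper, whose actual feature set and penalties are not given here.

```python
from typing import Callable

# Hypothetical toy feature sets, invented for illustration only.
TOY_FEATURES = {
    "p": {"labial", "stop"},
    "b": {"labial", "stop", "voiced"},
    "t": {"coronal", "stop"},
    "d": {"coronal", "stop", "voiced"},
    "a": {"vowel", "low"},
    "i": {"vowel", "high"},
}


def uniform_sub_cost(a: str, b: str) -> float:
    """Plain Levenshtein substitution cost: 0 for a match, 1 otherwise."""
    return 0.0 if a == b else 1.0


def feature_sub_cost(a: str, b: str) -> float:
    """Assumed cost: fraction of features that the two symbols do not share."""
    if a == b:
        return 0.0
    fa, fb = TOY_FEATURES.get(a), TOY_FEATURES.get(b)
    if fa is None or fb is None:
        return 1.0  # fall back to the uniform cost for unknown symbols
    return len(fa ^ fb) / len(fa | fb)


def weighted_edit_distance(
    source: str,
    target: str,
    sub_cost: Callable[[str, str], float],
    indel_cost: float = 1.0,
) -> float:
    """Edit distance with a custom substitution cost and fixed insert/delete cost."""
    rows, cols = len(source) + 1, len(target) + 1
    dist = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        dist[i][0] = i * indel_cost
    for j in range(1, cols):
        dist[0][j] = j * indel_cost
    for i in range(1, rows):
        for j in range(1, cols):
            dist[i][j] = min(
                dist[i - 1][j] + indel_cost,  # deletion
                dist[i][j - 1] + indel_cost,  # insertion
                dist[i - 1][j - 1] + sub_cost(source[i - 1], target[j - 1]),
            )
    return dist[-1][-1]
```

Under these toy costs, "pat" ends up closer to "bat" (one voicing difference) than to "tat" (a place difference), whereas the uniform baseline scores both pairs identically.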


Jabberwocky Parsing: Dependency Parsing With Lexical Noise, Jungo Kasai, Robert Frank 2019 University of Washington


Robert Frank

Parsing models have long benefited from the use of lexical information, and indeed current state-of-the-art neural network models for dependency parsing achieve substantial improvements by drawing on distributed representations of lexical information. At the same time, humans can easily parse sentences with unknown or even novel words, as in Lewis Carroll’s poem Jabberwocky. In this paper, we carry out jabberwocky parsing experiments, exploring how robust a state-of-the-art neural network parser is to the absence of lexical information. We find that current parsing models, at least under usual training regimens, are in fact overly dependent on lexical information, and perform …


Size Matters: The Impact Of Training Size In Taxonomically-Enriched Word Embeddings, Alfredo Maldonado, Filip Klubicka, John D. Kelleher 2019 Trinity College Dublin, Ireland


Articles

Word embeddings trained on natural corpora (e.g., newspaper collections, Wikipedia or the Web) excel in capturing thematic similarity (“topical relatedness”) on word pairs such as ‘coffee’ and ‘cup’ or ‘bus’ and ‘road’. However, they are less successful on pairs showing taxonomic similarity, like ‘cup’ and ‘mug’ (near synonyms) or ‘bus’ and ‘train’ (types of public transport). Moreover, purely taxonomy-based embeddings (e.g., those trained on a random walk of WordNet’s structure) outperform natural-corpus embeddings in taxonomic similarity but underperform them in thematic similarity. Previous work suggests that performance gains in both types of similarity can be achieved by enriching natural-corpus embeddings with …


Do It Like A Syntactician: Using Binary Gramaticality Judgements To Train Sentence Encoders And Assess Their Sensitivity To Syntactic Structure, Pablo Gonzalez Martinez 2019 The Graduate Center, City University of New York


Dissertations, Theses, and Capstone Projects

The binary nature of grammaticality judgments and their use to access the structure of syntax are a staple of modern linguistics. However, computational models of natural language rarely make use of grammaticality in their training or application. Furthermore, developments in modern neural NLP have produced a myriad of methods that push the baselines in many complex tasks, but those methods are typically not evaluated from a linguistic perspective. In this dissertation I use grammaticality judgements with artificially generated ungrammatical sentences to assess the performance of several neural encoders and propose them as a suitable training target to make models learn …


Demographic Factors As Domains For Adaptation In Linguistic Preprocessing, Sara Morini 2019 The Graduate Center, City University of New York


Dissertations, Theses, and Capstone Projects

Classic natural language processing resources such as the Penn Treebank (Marcus et al. 1993) have long been used both as evaluation data for many linguistic tasks and as training data for a variety of off-the-shelf language processing tools. Recent work has highlighted a gender imbalance in the authors of this text data (Garimella et al. 2019) and hypothesized that tools created with such resources will privilege users from particular demographic groups (Hovy and Søgaard 2015). Domain adaptation is typically employed as a strategy in machine learning to adjust models trained and evaluated with data from different genres. However, the present …


Synthetic, Yet Natural: Properties Of Wordnet Random Walk Corpora And The Impact Of Rare Words On Embedding Performance, Filip Klubicka, Alfredo Maldonado, Abhijit Mahalunkar, John D. Kelleher 2019 Technological University Dublin


Conference papers

Creating word embeddings that reflect semantic relationships encoded in lexical knowledge resources is an open challenge. One approach is to use a random walk over a knowledge graph to generate a pseudo-corpus and use this corpus to train embeddings. However, the effect of the shape of the knowledge graph on the generated pseudo-corpora, and on the resulting word embeddings, has not been studied. To explore this, we use English WordNet, constrained to the taxonomic (tree-like) portion of the graph, as a case study. We investigate the properties of the generated pseudo-corpora, and their impact on the resulting embeddings. We find …
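The general idea described in this abstract, walking a taxonomy at random and writing out the visited nodes as "sentences", can be sketched as follows. This is a minimal sketch under stated assumptions: `TOY_TAXONOMY` is an invented five-node tree standing in for WordNet, and the hyperparameters `min_length`, `stop_prob`, and `max_length` are hypothetical names chosen for this example, not the hyperparameters used by the authors.

```python
import random

# Hypothetical toy taxonomy (parent -> children); not WordNet data.
TOY_TAXONOMY = {
    "entity": ["animal", "vehicle"],
    "animal": ["dog", "cat"],
    "vehicle": ["bus", "train"],
}


def build_neighbors(taxonomy):
    """Turn the parent -> children mapping into an undirected adjacency list."""
    neighbors = {}
    for parent, children in taxonomy.items():
        for child in children:
            neighbors.setdefault(parent, []).append(child)
            neighbors.setdefault(child, []).append(parent)
    return neighbors


def random_walk_sentence(neighbors, rng, min_length=2, stop_prob=0.15, max_length=20):
    """Walk from a random node; each visited node becomes one token."""
    node = rng.choice(sorted(neighbors))
    sentence = [node]
    while len(sentence) < max_length:
        # Once the minimum length is reached, stop with probability stop_prob.
        if len(sentence) >= min_length and rng.random() < stop_prob:
            break
        node = rng.choice(neighbors[node])  # step to a parent or a child
        sentence.append(node)
    return sentence


def generate_pseudo_corpus(taxonomy, num_sentences, seed=0, **walk_kwargs):
    """Generate a reproducible list of random-walk sentences."""
    rng = random.Random(seed)
    neighbors = build_neighbors(taxonomy)
    return [
        random_walk_sentence(neighbors, rng, **walk_kwargs)
        for _ in range(num_sentences)
    ]
```

Calling `generate_pseudo_corpus(TOY_TAXONOMY, 1000, seed=1)` yields token lists that could be passed to an embedding trainer; varying the assumed `stop_prob` or `min_length` changes the sentence-length distribution, which is one way walk settings can shape a pseudo-corpus's statistics.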


Beneath The Surface Of Talking About Physicians: A Statistical Model Of Language For Patient Experience Comments, Taylor Turpen, Lea Matthews MD, Senem Guney PhD, CPXP 2019 NarrativeDx


Patient Experience Journal

This study applies natural language processing (NLP) techniques to patient experience comments. Our goal was to examine the language describing care experiences with two groups of physicians: those with scores in the top 100 and those with scores in the bottom 100 among all physicians (n=498) who received scores from patient satisfaction surveys. Our analysis showed a statistically significant difference in the language used to describe care experiences with these two distinct groups of physicians. This analysis illustrates how to apply NLP techniques in categorizing and building a statistical model for language use in order to identify meaningful language and …


The Design And Implementation Of Aida: Ancient Inscription Database And Analytics System, M Parvez Rashid 2019 University of Nebraska - Lincoln


Department of Computer Science and Engineering: Dissertations, Theses, and Student Research

AIDA, the Ancient Inscription Database and Analytics system, can be used to translate and analyze the ancient Minoan language. The AIDA system currently stores three types of ancient Minoan inscriptions: Linear A, Cretan Hieroglyph and Phaistos Disk inscriptions. In addition, AIDA provides candidate syllabic values and translations of Minoan words and inscriptions into English. The AIDA system allows users to change these candidate phonetic assignments to the Linear A, Cretan Hieroglyph and Phaistos symbols. Hence the AIDA system provides scholars not only with a convenient online resource to browse Minoan inscriptions but also with an analysis tool to explore …

