Open Access. Powered by Scholars. Published by Universities.®

Computational Linguistics Commons

2019

Articles 1 - 15 of 15

Full-Text Articles in Computational Linguistics

Scholarly Communication And Documentary Fragmentations In The Public Space: A Functional Citation Study, Fidelia Ibekwe, Lucie Loubère Dec 2019

Proceedings from the Document Academy

This paper studies how academic content published on OpenEdition.org, an online publication platform in the social sciences and humanities, is re-appropriated by members of the public. Our research is therefore concerned with the public appropriation of science and Open science. After extracting the citation contexts of this content and mapping them, we propose a typology of citation functions as well as of citers (their origins and types). Our preliminary results indicate that academic literature is repurposed and cited by members of the public mainly as scientific warrant (support for their argumentation). We also found that academic content is …


Phonologically Informed Edit Distance Algorithms For Word Alignment With Low-Resource Languages, Richard T. McCoy, Robert Frank Oct 2019

Robert Frank

We present three methods for weighting edit distance algorithms based on linguistic information. These methods base their penalties on (i) phonological features, (ii) distributional character embeddings, or (iii) differences between cognate words. We also introduce a novel method for evaluating edit distance through the task of low-resource word alignment by using edit-distance neighbors in a high-resource pivot language to inform alignments from the low-resource language. At this task, the cognate-based scheme outperforms our other methods and the Levenshtein edit distance baseline, showing that NLP applications can benefit from information about cross-linguistic phonological patterns.
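
As a rough illustration of what weighting an edit distance means, the sketch below plugs a substitution-cost function into the standard Levenshtein recurrence. It is not the authors' implementation: the function names, the toy voicing table, and the 0.5 penalty are assumptions made up for this example, standing in for penalties derived from phonological features, embeddings, or cognates.

```python
from typing import Callable

def weighted_edit_distance(source: str, target: str,
                           sub_cost: Callable[[str, str], float],
                           indel_cost: float = 1.0) -> float:
    """Levenshtein-style dynamic program with a pluggable substitution cost."""
    rows, cols = len(source) + 1, len(target) + 1
    dist = [[0.0] * cols for _ in range(rows)]
    # Turning a prefix into the empty string (or back) costs one indel per symbol.
    for i in range(1, rows):
        dist[i][0] = i * indel_cost
    for j in range(1, cols):
        dist[0][j] = j * indel_cost
    for i in range(1, rows):
        for j in range(1, cols):
            dist[i][j] = min(
                dist[i - 1][j] + indel_cost,   # delete source[i-1]
                dist[i][j - 1] + indel_cost,   # insert target[j-1]
                dist[i - 1][j - 1] + sub_cost(source[i - 1], target[j - 1]),
            )
    return dist[-1][-1]

def uniform_cost(a: str, b: str) -> float:
    """Plain Levenshtein baseline: every mismatch costs 1."""
    return 0.0 if a == b else 1.0

# Hypothetical toy table: pairs that differ only in voicing are "close".
VOICING_PAIRS = {frozenset("pb"), frozenset("td"), frozenset("kg")}

def toy_feature_cost(a: str, b: str) -> float:
    """Assumed penalty scheme: 0.5 for a voicing-only mismatch, 1 otherwise."""
    if a == b:
        return 0.0
    if frozenset((a, b)) in VOICING_PAIRS:
        return 0.5
    return 1.0
```

With these toy costs, "pat" and "bad" are at distance 1.0 rather than the baseline 2.0, because both mismatches differ only in voicing. Any of the three weighting schemes described in the abstract could in principle be supplied through the same `sub_cost` hook.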


Jabberwocky Parsing: Dependency Parsing With Lexical Noise, Jungo Kasai, Robert Frank Oct 2019

Robert Frank

Parsing models have long benefited from the use of lexical information, and indeed current state-of-the-art neural network models for dependency parsing achieve substantial improvements by benefiting from distributed representations of lexical information. At the same time, humans can easily parse sentences with unknown or even novel words, as in Lewis Carroll’s poem Jabberwocky. In this paper, we carry out jabberwocky parsing experiments, exploring how robust a state-of-the-art neural network parser is to the absence of lexical information. We find that current parsing models, at least under usual training regimens, are in fact overly dependent on lexical information, and perform …


Size Matters: The Impact Of Training Size In Taxonomically-Enriched Word Embeddings, Alfredo Maldonado, Filip Klubicka, John D. Kelleher Oct 2019

Articles

Word embeddings trained on natural corpora (e.g., newspaper collections, Wikipedia or the Web) excel in capturing thematic similarity (“topical relatedness”) on word pairs such as ‘coffee’ and ‘cup’ or ‘bus’ and ‘road’. However, they are less successful on pairs showing taxonomic similarity, like ‘cup’ and ‘mug’ (near synonyms) or ‘bus’ and ‘train’ (types of public transport). Moreover, purely taxonomy-based embeddings (e.g., those trained on a random walk over WordNet’s structure) outperform natural-corpus embeddings in taxonomic similarity but underperform them in thematic similarity. Previous work suggests that performance gains in both types of similarity can be achieved by enriching natural-corpus embeddings with …


Do It Like A Syntactician: Using Binary Gramaticality Judgements To Train Sentence Encoders And Assess Their Sensitivity To Syntactic Structure, Pablo Gonzalez Martinez Sep 2019

Dissertations, Theses, and Capstone Projects

The binary nature of grammaticality judgments and their use to access the structure of syntax are a staple of modern linguistics. However, computational models of natural language rarely make use of grammaticality in their training or application. Furthermore, developments in modern neural NLP have produced a myriad of methods that push the baselines in many complex tasks, but those methods are typically not evaluated from a linguistic perspective. In this dissertation I use grammaticality judgements with artificially generated ungrammatical sentences to assess the performance of several neural encoders and propose them as a suitable training target to make models learn …


Demographic Factors As Domains For Adaptation In Linguistic Preprocessing, Sara Morini Sep 2019

Dissertations, Theses, and Capstone Projects

Classic natural language processing resources such as the Penn Treebank (Marcus et al. 1993) have long been used both as evaluation data for many linguistic tasks and as training data for a variety of off-the-shelf language processing tools. Recent work has highlighted a gender imbalance in the authors of this text data (Garimella et al. 2019) and hypothesized that tools created with such resources will privilege users from particular demographic groups (Hovy and Søgaard 2015). Domain adaptation is typically employed as a strategy in machine learning to adjust models trained and evaluated with data from different genres. However, the present …


Synthetic, Yet Natural: Properties Of Wordnet Random Walk Corpora And The Impact Of Rare Words On Embedding Performance, Filip Klubicka, Alfredo Maldonado, Abhijit Mahalunkar, John D. Kelleher Jul 2019

Conference papers

Creating word embeddings that reflect semantic relationships encoded in lexical knowledge resources is an open challenge. One approach is to use a random walk over a knowledge graph to generate a pseudo-corpus and use this corpus to train embeddings. However, the effect of the shape of the knowledge graph on the generated pseudo-corpora, and on the resulting word embeddings, has not been studied. To explore this, we use English WordNet, constrained to the taxonomic (tree-like) portion of the graph, as a case study. We investigate the properties of the generated pseudo-corpora, and their impact on the resulting embeddings. We find …
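
The pseudo-corpus idea described above can be sketched in a few lines. This is a minimal illustration, not the authors' procedure: the toy taxonomy, the stopping probability, the maximum walk length, and all function names are assumptions chosen for the example, and a real run would walk over WordNet rather than a hand-written dictionary.

```python
import random

# Hypothetical toy taxonomy (parent -> children), standing in for WordNet.
TOY_TAXONOMY = {
    "entity": ["animal", "vehicle"],
    "animal": ["dog", "cat"],
    "vehicle": ["bus", "train"],
}

def build_undirected(taxonomy):
    """Turn parent->children edges into a symmetric neighbor map."""
    neighbors = {}
    for parent, children in taxonomy.items():
        for child in children:
            neighbors.setdefault(parent, []).append(child)
            neighbors.setdefault(child, []).append(parent)
    return neighbors

def random_walk_sentence(neighbors, rng, stop_prob=0.3, max_len=10):
    """One pseudo-sentence: the node labels visited by a single random walk."""
    node = rng.choice(sorted(neighbors))  # sorted for reproducible seeding
    sentence = [node]
    # After each step, stop with probability stop_prob (an assumed setting).
    while len(sentence) < max_len and rng.random() > stop_prob:
        node = rng.choice(neighbors[node])
        sentence.append(node)
    return sentence

def generate_pseudo_corpus(taxonomy, n_sentences, seed=0):
    """Generate a list of pseudo-sentences by repeated random walks."""
    rng = random.Random(seed)
    neighbors = build_undirected(taxonomy)
    return [random_walk_sentence(neighbors, rng) for _ in range(n_sentences)]
```

Each pseudo-sentence can then be fed to any standard embedding trainer as if it were a line of natural text. Because every step follows a taxonomy edge, words that are close in the tree tend to co-occur, which is what lets the resulting embeddings pick up taxonomic structure.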


Beneath The Surface Of Talking About Physicians: A Statistical Model Of Language For Patient Experience Comments, Taylor Turpen, Lea Matthews MD, Senem Guney PhD, CPXP Jul 2019

Patient Experience Journal

This study applies natural language processing (NLP) techniques to patient experience comments. Our goal was to examine the language describing care experiences with two groups of physicians: those with scores in the top 100 and those with scores in the bottom 100 among all physicians (n=498) who received scores from patient satisfaction surveys. Our analysis showed a statistically significant difference in the language used to describe care experiences with these two distinct groups of physicians. This analysis illustrates how to apply NLP techniques in categorizing and building a statistical model for language use in order to identify meaningful language and …


The Design And Implementation Of Aida: Ancient Inscription Database And Analytics System, M Parvez Rashid Jul 2019

Department of Computer Science and Engineering: Dissertations, Theses, and Student Research

AIDA, the Ancient Inscription Database and Analytics system, can be used to translate and analyze the ancient Minoan language. The AIDA system currently stores three types of ancient Minoan inscriptions: Linear A, Cretan Hieroglyph and Phaistos Disk inscriptions. In addition, AIDA provides candidate syllabic values and translations of Minoan words and inscriptions into English. The AIDA system allows users to change these candidate phonetic assignments for the Linear A, Cretan Hieroglyph and Phaistos symbols. Hence the AIDA system provides scholars not only with a convenient online resource to browse Minoan inscriptions but also with an analysis tool to explore …


Analyzing Prosody With Legendre Polynomial Coefficients, Rachel Rakov May 2019

Dissertations, Theses, and Capstone Projects

This investigation demonstrates the effectiveness of Legendre polynomial coefficients representing prosodic contours within the context of two different tasks: nativeness classification and sarcasm detection. By making use of accurate representations of prosodic contours to answer fundamental linguistic questions, we contribute significantly to the body of research focused on analyzing prosody in linguistics as well as modeling prosody for machine learning tasks. Using Legendre polynomial coefficient representations of prosodic contours, we answer prosodic questions about differences in prosody between native English speakers and non-native English speakers whose first language is Mandarin. We also learn more about prosodic qualities of sarcastic speech. …


The Perception Of Mandarin Tones In "Bubble" Noise By Native And L2 Listeners, Mengxuan Zhao May 2019

Dissertations, Theses, and Capstone Projects

Previous studies have revealed the complexity of Mandarin tones. For example, similarities in the pitch contours of tones 2 and 3 and tones 3 and 4 cause confusion for listeners. The realization of a tone's contour is highly dependent on its context, especially the preceding pitch. This is known as the coarticulation effect. Researchers have demonstrated the robustness of tone perception by both native and non-native listeners, even with incomplete acoustic information or in noisy environments. However, non-native listeners were observed to behave differently from native listeners in their use of contextual information. For example, the disagreement between the end …


Quantifying Coherence In A Transdiagnostic Sample: A Methodological Investigation Of Computationally-Derived Coherence Using Ambulatory Assessment, Taylor L. Fedechko Mar 2019

LSU Master's Theses

Schizophrenia is a clinical diagnosis assigned to individuals who experience positive (e.g., hallucinations and delusions), negative (e.g., blunted affect), and disorganized (e.g., incoherent speech) symptoms. One particularly disabling symptom is incoherence, a disruption of coherence, which is defined as the meaning-based relationship between ideas. This symptom can drastically affect an individual’s quality of life by affecting areas such as social and occupational functioning. Currently, the mechanism behind this symptom is unknown and requires further study. One way to examine incoherence is to understand its level of expression in other clinical populations. With the advent of computationally-derived natural language processing (NLP), coherence can be quantified …


Obfuscating Authorship: Results Of A User Study On Nondescript, A Digital Privacy Tool, Robin Camille Davis Feb 2019

Publications and Research

For those who write anonymously, particularly for safety reasons, authorship attribution poses a threat. Nondescript, my web app, guides writers in achieving stylometric obfuscation in order to preserve anonymity. The app runs simulations of authorship attribution scenarios by analyzing the user’s linguistic features. In this paper, I will describe the conception of the Nondescript app; discuss related work; and present the results of a user study. Most users in the study were able to anonymize their writing in at least 5 out of 10 authorship attribution scenarios. Users rated the anonymization process an average of 3.6 out of 5 in …


Generative Adversarial Networks And Word Embeddings For Natural Language Generation, Robert D. Schultz Jr Feb 2019

Dissertations, Theses, and Capstone Projects

We explore using image generation techniques to generate natural language. Generative Adversarial Networks (GANs), normally used for image generation, were used for this task. To avoid using discrete data such as one-hot encoded vectors, with dimensions corresponding to vocabulary size, we instead use word embeddings as training data. The main motivation for this is the fact that a sentence translated into a sequence of word embeddings (a “word matrix”) is an analogue to a matrix of pixel values in an image. These word matrices can then be used to train a generative adversarial model. The output of the model’s generator …


Application Of Boolean Logic To Natural Language Complexity In Political Discourse, Austin Taing Jan 2019

Theses and Dissertations--Computer Science

Press releases serve as a major influence on public opinion of a politician, since they are a primary means of communicating with the public and directing discussion. Thus, the public’s ability to digest them is an important factor for politicians to consider. This study employs several well-studied measures of linguistic complexity and proposes a new one to examine whether politicians change their language to become more or less difficult to parse in different situations. This study uses 27,500 press releases from the US Senate from 2004 to 2008 and examines election cycles and natural disasters, namely hurricanes, as situations where politicians’ language …