Open Access. Powered by Scholars. Published by Universities.®

Computational Linguistics Commons

Selected Works

Articles 1 - 22 of 22

Full-Text Articles in Computational Linguistics

Phonologically Informed Edit Distance Algorithms For Word Alignment With Low-Resource Languages, Richard T. McCoy, Robert Frank Oct 2019

Robert Frank

We present three methods for weighting edit distance algorithms based on linguistic information. These methods base their penalties on (i) phonological features, (ii) distributional character embeddings, or (iii) differences between cognate words. We also introduce a novel method for evaluating edit distance through the task of low-resource word alignment by using edit-distance neighbors in a high-resource pivot language to inform alignments from the low-resource language. At this task, the cognate-based scheme outperforms our other methods and the Levenshtein edit distance baseline, showing that NLP applications can benefit from information about cross-linguistic phonological patterns.
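The general idea of weighting an edit distance can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names and the toy vowel-based cost are hypothetical, whereas the paper's penalties are derived from phonological features, character embeddings, or cognate differences.

```python
from typing import Callable


def weighted_edit_distance(
    source: str,
    target: str,
    sub_cost: Callable[[str, str], float],
    indel_cost: float = 1.0,
) -> float:
    """Dynamic-programming edit distance with a pluggable substitution cost."""
    rows, cols = len(source) + 1, len(target) + 1
    dist = [[0.0] * cols for _ in range(rows)]
    # Turning a prefix into the empty string (or back) costs one indel per symbol.
    for i in range(1, rows):
        dist[i][0] = i * indel_cost
    for j in range(1, cols):
        dist[0][j] = j * indel_cost
    for i in range(1, rows):
        for j in range(1, cols):
            dist[i][j] = min(
                dist[i - 1][j] + indel_cost,  # deletion
                dist[i][j - 1] + indel_cost,  # insertion
                dist[i - 1][j - 1] + sub_cost(source[i - 1], target[j - 1]),
            )
    return dist[-1][-1]


def levenshtein_cost(a: str, b: str) -> float:
    """Unweighted baseline: every mismatch costs 1."""
    return 0.0 if a == b else 1.0


VOWELS = set("aeiou")


def toy_vowel_cost(a: str, b: str) -> float:
    """Hypothetical weighting: swapping one vowel for another is cheaper."""
    if a == b:
        return 0.0
    if a in VOWELS and b in VOWELS:
        return 0.5
    return 1.0


print(weighted_edit_distance("kitten", "sitting", levenshtein_cost))
print(weighted_edit_distance("pat", "pit", toy_vowel_cost))
```

Passing `levenshtein_cost` recovers the plain Levenshtein baseline mentioned in the abstract; any linguistically informed scheme would be supplied as a different `sub_cost` function.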


Jabberwocky Parsing: Dependency Parsing With Lexical Noise, Jungo Kasai, Robert Frank Oct 2019

Robert Frank

Parsing models have long benefited from the use of lexical information, and indeed current state-of-the-art neural network models for dependency parsing achieve substantial improvements by benefiting from distributed representations of lexical information. At the same time, humans can easily parse sentences with unknown or even novel words, as in Lewis Carroll’s poem Jabberwocky. In this paper, we carry out jabberwocky parsing experiments, exploring how robust a state-of-the-art neural network parser is to the absence of lexical information. We find that current parsing models, at least under usual training regimens, are in fact overly dependent on lexical information, and perform …


Acoustic Classification Of Focus: On The Web And In The Lab, Jonathan Howell, Mats Rooth, Michael Wagner Dec 2016

Jonathan Howell

We present a new methodological approach which combines both naturally-occurring speech harvested on the web and speech data elicited in the laboratory. This proof-of-concept study examines the phenomenon of focus sensitivity in English, in which the interpretation of particular grammatical constructions (e.g., the comparative) is sensitive to the location of prosodic prominence. Machine learning algorithms (support vector machines and linear discriminant analysis) and human perception experiments are used to cross-validate the web-harvested and lab-elicited speech. Results confirm the theoretical predictions for location of prominence in comparative clauses and the advantages of using both web-harvested and lab-elicited speech. The most robust …


General Analysis Of An Online Language Corpus, Kerwin A. Livingstone May 2015

Kerwin A. Livingstone

Corpus-based research is rapidly gaining ground in the field of Applied Linguistics. Of particular interest are the many online language corpora that can be accessed with a single mouse click. A quick search of the Web turns up different kinds of corpora in a vast number of language areas. Given the need to find new and exciting ways to improve the language learning and teaching process, corpus linguistics has the potential to generate significant learner experiences. With this in mind, this paper presents a general analysis of an online language corpus. The specific corpus …


Linguistics As Structure In Computer Animation: Toward A More Effective Synthesis Of Brow Motion In American Sign Language, Rosalee Wolfe, Peter Cook, John C. McDonald, Jerry Schnepp Feb 2015

Jerry C Schnepp

Computer-generated three-dimensional animation holds great promise for synthesizing utterances in American Sign Language (ASL) that are not only grammatical, but well tolerated by members of the Deaf community. Unfortunately, animation poses several challenges stemming from the necessity of grappling with massive amounts of data. However, the linguistics of ASL can aid in surmounting the challenge by providing structure and rules for organizing animation data. An exploration of the linguistic and extralinguistic behavior of the brows from an animator’s viewpoint yields a new approach for synthesizing nonmanuals that differs from the conventional animation of anatomy and instead offers a different …


Towards News Verification: Deception Detection Methods For News Discourse, Victoria Rubin, Niall Conroy, Yimin Chen Jan 2015

Victoria Rubin

News verification is a process of determining whether a particular news report is truthful or deceptive. Deliberately deceptive (fabricated) news creates false conclusions in the readers’ minds. Truthful (authentic) news matches the writer’s knowledge. How do you tell the difference between the two in an automated way? To investigate this question, we analyzed rhetorical structures, discourse constituent parts and their coherence relations in a sample of deceptive and truthful news from NPR’s “Bluff the Listener”. Subsequently, we applied a vector space model to cluster the news by discourse feature similarity, achieving 63% accuracy. Our predictive model is not significantly better than chance …


Predicting Survey Responses: How And Why Semantics Shape Survey Statistics On Organizational Behaviour, Ketil Arnulf, Kai R. Larsen, Øyvind Martinsen, Chih How Bong Sep 2014

Kai R.T. Larsen

Some disciplines in the social sciences rely heavily on collecting survey responses to detect empirical relationships among variables. We explored whether these relationships were a priori predictable from the semantic properties of the survey items, using language processing algorithms which are now available as new research methods. Language processing algorithms were used to calculate the semantic similarity among all items in state-of-the-art surveys from Organisational Behaviour research. These surveys covered areas such as transformational leadership, work motivation and work outcomes. This information was used to explain and predict the response patterns from real subjects. Semantic algorithms explained 60–86% of the …


Alternative Translation Approach – Part I: "Labor Division", Ludvig Glavati Mar 2014

Ludvig Glavati

No abstract provided.


CECL: A New Baseline And A Non-Compositional Approach For The SICK Benchmark, Yves Bestgen Jan 2014

Yves Bestgen

This paper describes the two procedures for determining the semantic similarities between sentences submitted for the SemEval 2014 Task 1. MeanMaxSim, an unsupervised procedure, is proposed as a new baseline to assess the efficiency gain provided by compositional models. It outperforms a number of other baselines by a wide margin. Compared to the word-overlap baseline, it has the advantage of taking into account the distributional similarity between words that are also involved in compositional models. The second procedure aims at building a predictive model using as predictors MeanMaxSim and (transformed) lexical features describing the differences between each sentence of a …


Quantifying The Development Of Phraseological Competence In L2 English Writing: An Automated Approach, Yves Bestgen, Sylviane Granger Jan 2014

Yves Bestgen

Based on the large body of research that shows phraseology to be pervasive in language, this study aims to assess the role played by phraseological competence in the development of L2 writing proficiency and text quality assessment. We propose to use CollGram, a technique that assigns to each pair of contiguous words (bigrams) in a learner text two association scores (mutual information and t-score) computed on the basis of a large reference corpus, the Corpus of Contemporary American English. Applied to the Michigan State University Corpus of second language writing, CollGram shows a longitudinal decrease in the use of collocations …
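The two association scores named above can be illustrated with their common textbook definitions: mutual information as log2(O/E) and t-score as (O − E)/√O, where O is the observed bigram frequency and E the frequency expected if the two words were independent. The sketch below assumes these formulations and a hypothetical function name; the exact normalization used by CollGram may differ.

```python
import math
from collections import Counter


def bigram_association_scores(tokens):
    """Return {(w1, w2): (mutual_information, t_score)} for contiguous bigrams.

    `tokens` stands in for the reference corpus as a flat list of words.
    """
    n = len(tokens)
    unigram_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    scores = {}
    for (w1, w2), observed in bigram_counts.items():
        # Frequency expected if w1 and w2 co-occurred only by chance.
        expected = unigram_counts[w1] * unigram_counts[w2] / n
        mi = math.log2(observed / expected)
        t_score = (observed - expected) / math.sqrt(observed)
        scores[(w1, w2)] = (mi, t_score)
    return scores


# Toy reference corpus, for illustration only.
toy_corpus = "strong tea strong tea weak tea".split()
print(bigram_association_scores(toy_corpus)[("strong", "tea")])
```

In a pipeline of the kind the abstract describes, the scores would be computed once on the reference corpus and then looked up for each bigram of a learner text.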


Relation Between Harappan And Brahmi Scripts, Subhajit Kumar Ganguly Jan 2013

Subhajit Kumar Ganguly

Around 45 signs out of the total number of Harappan signs found make up almost 100 percent of the inscriptions, in some form or other. Out of these 45 signs, around 40 are readily distinguishable. These form an almost exclusive and unique set. The primary signs are seen to have many variants, as in Brahmi. Many of these provide us with quite a vivid picture of their evolution, depending upon the factors of time, place and usefulness. Even minor adjustments in such signs, depending upon these factors, are noteworthy. Many of the signs in this list …


Maximizing Classification Accuracy In Native Language Identification, Scott Jarvis, Yves Bestgen, Steve Pepper Jan 2013

Yves Bestgen

This paper reports our contribution to the 2013 NLI Shared Task. The purpose of the task was to train a machine-learning system to identify the native-language affiliations of 1,100 texts written in English by nonnative speakers as part of a high-stakes test of general academic English proficiency. We trained our system on the new TOEFL11 corpus, which includes 11,000 essays written by nonnative speakers from 11 native-language backgrounds. Our final system used an SVM classifier with over 400,000 unique features consisting of lexical and POS n-grams occurring in at least two texts in the training set. Our system …
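The feature-selection criterion mentioned above (n-grams occurring in at least two training texts) can be sketched as a document-frequency filter. This is only an illustration under stated assumptions: whitespace tokenization, lowercasing, word n-grams up to length 2, and the function names are all hypothetical, and POS n-grams and the SVM itself are omitted.

```python
from collections import Counter


def word_ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def select_features(texts, max_n=2, min_texts=2):
    """Keep n-grams (n = 1..max_n) that occur in at least `min_texts` texts."""
    doc_freq = Counter()
    for text in texts:
        tokens = text.lower().split()
        seen = set()  # count each n-gram at most once per text
        for n in range(1, max_n + 1):
            seen.update(word_ngrams(tokens, n))
        doc_freq.update(seen)
    return {gram for gram, df in doc_freq.items() if df >= min_texts}


# Toy training texts, for illustration only.
toy_texts = ["I am agree with this", "I am agree", "this is fine"]
print(sorted(select_features(toy_texts)))
```

Filtering by the number of texts rather than by raw frequency discards n-grams that are frequent in a single essay but carry no information across writers.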


Evaluation Automatique De Textes Et Cohésion Lexicale, Yves Bestgen Jan 2012

Yves Bestgen

(Article in French). Automatic essay grading is currently experiencing a growing popularity because of its importance in the field of education and, particularly, in foreign language learning. While several efficient systems have been developed over the last fifteen years, almost none of them take the discourse level into account. Recently, a few studies proposed to fill this gap by means of automatic indexes of lexical cohesion obtained from Latent Semantic Analysis, but the results were disappointing. Based on a well-known model of writing expertise, the present study proposes a new index of cohesion derived from work on the thematic segmentation …


What's In A Letter?, Aaron J. Schein Dec 2011

Aaron J Schein

Sentiment analysis is a burgeoning field in natural language processing used to extract and categorize opinion in evaluative documents. We look at recommendation letters, which pose unique challenges to standard sentiment analysis systems. Our dataset is eighteen letters from applications to UMass Worcester Memorial Medical Center’s residency program in Obstetrics and Gynecology. Given a small dataset, we develop a method intended for use by domain experts to systematically explore their intuitions about the topical make-up of documents on which they make critical decisions. By leveraging WordNet and the WordNet Propagation algorithm, the method allows a user to develop topic seed …


Using Textual Features To Predict Popular Content On Digg, Paul H. Miller May 2011

Paul H Miller

Over the past few years, collaborative rating sites, such as Netflix, Digg and Stumble, have become increasingly prevalent sites for users to find trending content. I used various data mining techniques to study Digg, a social news site, to examine the influence of content on popularity. What influence does content have on popularity, and what influence does content have on users’ decisions? Overwhelmingly, prior studies have consistently shown that predicting popularity based on content is difficult and maybe even inherently impossible. The same submission can have multiple outcomes and content neither determines popularity, nor individual user decisions. My results show …


The Low Entropy Conjecture: The Challenges Of Modern Irish Nominal Declension, Robert Malouf, Farrell Ackerman Jan 2011

Robert Malouf

No abstract provided.


Computational Style Processing, Foaad Khosmood Dec 2010

Foaad Khosmood

Our main thesis is that computational processing of natural language styles can be accomplished using corpus analysis methods and language transformation rules. We demonstrate this first by statistically modeling natural language styles, and second by developing tools that carry out style processing, and finally by running experiments using the tools and evaluating the results. Specifically, we present a model for style in natural languages, and demonstrate style processing in three ways: Our system analyzes styles in quantifiable terms according to our model (analysis), associates documents based on stylistic similarity to known corpora (classification) and manipulates texts to match a desired …


Prosodylab-Aligner: A Tool For Forced Alignment Of Laboratory Speech, Kyle Gorman, Jonathan Howell, Michael Wagner Dec 2010

Jonathan Howell

The Penn Forced Aligner automates the alignment process using the Hidden Markov Model Toolkit (HTK). The core of Prosodylab-Aligner is align.py, a script which performs acoustic model training and alignment. This script automates calls to HTK and SoX, an open-source command-line tool which is capable of resampling audio. The included README file provides instructions for installing HTK and SoX on Linux and Mac OS X; the aligner can also be run on Windows. During training, the model is initialized with flat-start monophones, which are then submitted to a single round of model estimation. Then, a tied-state 'small pause' model is inserted …


Distribution Of Complexities In The Vai Script, Andrij Rovenchak, Ján Mačutek Dec 2008

Charles L. Riley

In the paper, we analyze the distribution of complexities in the Vai script, an indigenous syllabic writing system from Liberia. It is found that the uniformity hypothesis for complexities fails for this script. The models using Poisson distribution for the number of components and hyper-Poisson distribution for connections provide good fits in the case of the Vai script.


Automated Diagnostic Writing Tests: Why? How?, Elena Cotos, Nick Pendar Jan 2008

Elena Cotos

Diagnostic language assessment can greatly benefit from a collaborative union of computer-assisted language testing (CALT) and natural language processing (NLP). Currently, most CALT applications mainly allow for inferences about L2 proficiency based on learners’ recognition and comprehension of linguistic input and hardly concern language production (Holland, Maisano, Alderks, & Martin, 1993). NLP is now at a stage where it can be used or adapted for diagnostic testing of learner production skills. This paper explores the viability of NLP techniques for the diagnosis of L2 writing by analyzing the state of the art in current diagnostic language testing, reviewing the existing …


Automatic Identification Of Discourse Moves In Scientific Article Introductions, Elena Cotos, Nick Pendar Jan 2008

Elena Cotos

This paper reports on the first stage of building an educational tool for international graduate students to improve their academic writing skills. Taking a text-categorization approach, we experimented with several models to automatically classify sentences in research article introductions into one of three rhetorical moves. The paper begins by situating the project within the larger framework of intelligent computer-assisted language learning. It then presents the details of the study with very encouraging results. The paper then concludes by commenting on how the system may be improved and how the project is intended to be pursued and evaluated.


The Variable Elision Of Unstressed Vowels In European Portuguese: A Case Study, David James Silva Dec 1993

David Silva

European varieties of Portuguese exhibit a process whereby unstressed vowels, particularly schwa, optionally undergo elision: an item such as idade ‘age’ can be realized as [ida'd] and para Maria ‘for Maria’ may surface as [prɐmɐrí'ɐ]. While previous research in the study of phonological variation of this sort has typically focused on syntactic, morphological, functional, and segmental factors as the primary linguistic conditions for accurately characterizing variable processes (Guy 1980; Poplack & Walter 1986, among many others), less work has been done investigating the role of prosodic factors in this respect. Yet if one believes (along with Nespor and Vogel 1986, …