Open Access. Powered by Scholars. Published by Universities.®

Social and Behavioral Sciences Commons

Full-Text Articles in Social and Behavioral Sciences

Designing A Russian Idiom-Annotated Corpus, Katsiaryna Aharodnik, Anna Feldman, Jing Peng Jan 2019

Department of Linguistics Faculty Scholarship and Creative Works

This paper describes the development of an idiom-annotated corpus of Russian. The corpus is compiled from freely available online resources and contains texts of different genres. The paper outlines the idiom extraction, the annotation procedure, and a pilot experiment using the new corpus. Given the scarcity of publicly available annotated Russian corpora, the corpus is a much-needed resource that can be used for literary and linguistic studies, pedagogy, and various Natural Language Processing tasks.


Detecting Censorable Content On Sina Weibo: A Pilot Study, Kei Yin Ng, Anna Feldman, Chris Leberknight Jul 2018

Department of Linguistics Faculty Scholarship and Creative Works

This study provides preliminary insights into the linguistic features that contribute to Internet censorship in mainland China. We collected a corpus of 344 censored and uncensored microblog posts published on Sina Weibo and built a Naive Bayes classifier based on linguistic, topic-independent features. The classifier achieves 79.34% accuracy in predicting whether a blog post would be censored on Sina Weibo.
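
For readers unfamiliar with the technique named in this abstract, a Naive Bayes classifier can be sketched in a few lines. This is a generic illustration, not the authors' implementation: the function names (`train_naive_bayes`, `predict`), the placeholder feature names, and the four-item training set are all hypothetical, and the add-one smoothing is an assumption rather than a detail taken from the study.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(examples):
    """Collect counts from (feature_list, label) pairs."""
    label_counts = Counter()
    feature_counts = defaultdict(Counter)
    vocabulary = set()
    for features, label in examples:
        label_counts[label] += 1
        feature_counts[label].update(features)
        vocabulary.update(features)
    return label_counts, feature_counts, vocabulary


def predict(model, features):
    """Return the label with the highest log-posterior (add-one smoothing)."""
    label_counts, feature_counts, vocabulary = model
    total = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in sorted(label_counts):
        # Log prior of the class.
        score = math.log(label_counts[label] / total)
        denom = sum(feature_counts[label].values()) + len(vocabulary)
        for feature in features:
            if feature in vocabulary:  # ignore features unseen in training
                score += math.log((feature_counts[label][feature] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy data with placeholder feature names; these are NOT the
# linguistic features or the posts used in the study.
TOY_TRAINING = [
    (["feat_a", "feat_b"], "censored"),
    (["feat_a", "feat_c"], "censored"),
    (["feat_d", "feat_e"], "uncensored"),
    (["feat_d", "feat_b"], "uncensored"),
]

model = train_naive_bayes(TOY_TRAINING)
print(predict(model, ["feat_a"]))
```

In this sketch each post is reduced to a list of feature names, and the classifier picks the class whose prior and per-feature likelihoods give the largest product, computed in log space to avoid underflow.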


Acoustic Classification Of Focus: On The Web And In The Lab, Jonathan Howell, Mats Rooth, Michael Wagner Jan 2017

Department of Linguistics Faculty Scholarship and Creative Works

We present a new methodological approach which combines naturally occurring speech harvested on the web with speech data elicited in the laboratory. This proof-of-concept study examines the phenomenon of focus sensitivity in English, in which the interpretation of particular grammatical constructions (e.g., the comparative) is sensitive to the location of prosodic prominence. Machine learning algorithms (support vector machines and linear discriminant analysis) and human perception experiments are used to cross-validate the web-harvested and lab-elicited speech. Results confirm the theoretical predictions for location of prominence in comparative clauses and the advantages of using both web-harvested and lab-elicited speech. The most robust …


T. S. Eliot’s ‘Obscurity’ In The Love Song Of J. Alfred Prufrock, Longxing Wei Jan 2016

Department of Linguistics Faculty Scholarship and Creative Works

T. S. Eliot’s earliest verse is composed of observations, detached, ironic, and alternatively disillusioned and nostalgic in tone. Eliot’s mingling of subtle observation with unexpected cliché represents a difficulty that is often magnified because too much ‘obscurity’ is assumed. This paper aims at clarifying the ‘obscurity’ by means of a stylistic analysis of the linguistic devices that the poet used to create "The Love Song of J. Alfred Prufrock" and its intended meaning. Adopting the concept of style as ‘foregrounding’, the idea that style is constituted by departures from linguistic norms, it analyzes the poem in terms of its lexical …


Automatic Detection Of Idiomatic Clauses, Anna Feldman, Jing Peng Mar 2013

Department of Linguistics Faculty Scholarship and Creative Works

We describe several experiments whose goal is to automatically identify idiomatic expressions in written text. We explore two approaches to the task: 1) idiom recognition as outlier detection; and 2) supervised classification of sentences. We apply principal component analysis for outlier detection. Detecting idioms as lexical outliers does not exploit class label information, so in the following experiments we use linear discriminant analysis to obtain a discriminant subspace and then use the three-nearest-neighbor classifier to obtain accuracy. We discuss the pros and cons of each approach. All the approaches are more general than the previous algorithms for idiom detection …


Prosodylab-Aligner: A Tool For Forced Alignment Of Laboratory Speech, Kyle Gorman, Jonathan Howell, Michael Wagner Jan 2011

Department of Linguistics Faculty Scholarship and Creative Works

The Penn Forced Aligner automates the alignment process using the Hidden Markov Model Toolkit (HTK). The core of Prosodylab-Aligner is align.py, a script which performs acoustic model training and alignment. This script automates calls to HTK and SoX, an open-source command-line tool which is capable of resampling audio. The included README file provides instructions for installing HTK and SoX on Linux and Mac OS X; the aligner can also be run on Windows. During training, the model is initialized with flat-start monophones, which are then submitted to a single round of model estimation. Then, a tied-state 'small pause' model is inserted …


Challenges Of Cheap Resource Creation For Morphological Tagging, Jirka Hana, Anna Feldman Jul 2010

Department of Linguistics Faculty Scholarship and Creative Works

We describe the challenges of resource creation for a resource-light system for morphological tagging of fusional languages (Feldman and Hana, 2010). The constraints on resources (time, expertise, and money) introduce challenges that are not present when morphological tools and corpora are developed in the usual, resource-intensive way.


Second Occurrence Focus And The Acoustics Of Prominence, Jonathan Howell Jan 2009

Department of Linguistics Faculty Scholarship and Creative Works

Partee (1991) challenged the significance of the observation that certain adverbs (e.g., only) reliably associate with phonologically prominent words to truth‐conditional effect, noting second occurrence (i.e., repeated or given) focus (SOF) appears to lack a phonological realization. Rooth (1996), Bartels (2004), Beaver et al. (2004), Jaeger (2004), and Fry and Ishihara (2005) argued that, while not intonationally prominent, an SOF word can be marked by increased duration and/or increased rms intensity. An acoustic study of verb‐noun homophone pairs is reported. Three sophisticated speakers uttered five repetitions of the targets, embedded in discourses, in first occurrence (FOF), SOF, and unfocused (NF) …


Arida: An Arabic Interlanguage Database And Its Applications: A Pilot Study, Anna Feldman, Ghazi Abuhakema, Eileen Fitzpatrick Nov 2008

Department of Linguistics Faculty Scholarship and Creative Works

This paper describes a pilot study in which we collected a small learner corpus of Arabic, developed a tagset for error annotation of Arabic learner data, tagged the data for error, and performed simple Computer-aided Error Analysis (CEA).


Verification And Implementation Of Language-Based Deception Indicators In Civil And Criminal Narratives, Joan Bachenko, Eileen Fitzpatrick, Michael Schonwetter Aug 2008

Department of Linguistics Faculty Scholarship and Creative Works

Our goal is to use natural language processing to identify deceptive and non-deceptive passages in transcribed narratives. We begin by motivating an analysis of language-based deception that relies on specific linguistic indicators to discover deceptive statements. The indicator tags are assigned to a document using a mix of automated and manual methods. Once the tags are assigned, an interpreter automatically discriminates between deceptive and truthful statements based on tag densities. The texts used in our study come entirely from "real world" sources: criminal statements, police interrogations and legal testimony. The corpus was hand-tagged for the truth value of all propositions that …


Annotating An Arabic Learner Corpus For Error, Ghazi Abuhakema, Reem Faraj, Anna Feldman, Eileen Fitzpatrick May 2008

Department of Linguistics Faculty Scholarship and Creative Works

This paper describes an ongoing project in which we are collecting a learner corpus of Arabic, developing a tagset for error annotation and performing Computer-aided Error Analysis (CEA) on the data. We adapted the French Interlanguage Database FRIDA tagset (Granger, 2003a) to the data. We chose FRIDA in order to follow a known standard and to see whether the changes needed to move from a French to an Arabic tagset would give us a measure of the distance between the two languages with respect to learner difficulty. The current collection of texts, which is constantly growing, contains intermediate and advanced-level …


Designing And Evaluating A Russian Tagset, Serge Sharoff, Mikhail Kopotev, Tomaž Erjavec, Anna Feldman, Dagmar Divjak Jan 2008

Department of Linguistics Faculty Scholarship and Creative Works

This paper reports the principles behind designing a tagset to cover Russian morphosyntactic phenomena, modifications of the core tagset, and its evaluation. The tagset and associated morphosyntactic specifications are based on the MULTEXT-East framework, while the decisions in designing it were aimed at achieving a balance between parameters important for linguists and the possibility to detect and disambiguate them automatically. The final tagset contains about 600 tags and achieves about 95% accuracy on the disambiguated portion of the Russian National Corpus. We have also produced a test set of tagging models and corpora that can be shared with other researchers.


A Cross-Language Approach To Rapid Creation Of New Morpho-Syntactically Annotated Resources, Anna Feldman, Jirka Hana, Chris Brew Jan 2006

Department of Linguistics Faculty Scholarship and Creative Works

We take a novel approach to rapid, low-cost development of morpho-syntactically annotated resources without using parallel corpora or bilingual lexicons. The overall research question is how to exploit language resources and properties to facilitate and automate the creation of morphologically annotated corpora for new languages. This portability issue is especially relevant to minority languages, for which such resources are likely to remain unavailable in the foreseeable future. We compare the performance of our system on languages that belong to different language families (Romance vs. Slavic), as well as different language pairs within the same language family (Portuguese via Spanish vs. …


The Multilingual Mental Lexicon And Lemma Transfer In Third Language Learning, Longxing Wei Jan 2006

Department of Linguistics Faculty Scholarship and Creative Works

From some psycholinguistic perspectives, this study examines language transfer by exploring the nature of the multilingual mental lexicon in relation to sources of language transfer. It assumes that the multilingual mental lexicon contains not only lexemes but also language-specific lemmas; language-specific lemmas may activate language-specific morphosyntactic procedures in speech production, and third language learners' activation of lemmas for target language items may be influenced by the lemmas already stored in their mental lexicon through their previous language acquisition, especially second language acquisition. The interlanguage data for the study are from adult learners with Chinese as their first language, English as …


The Bilingual Mental Lexicon And Speech Production Process, Longxing Wei Jan 2002

Department of Linguistics Faculty Scholarship and Creative Works

The Chinese/English intrasentential code-switching data provide evidence that the bilingual mental lexicon involves language contact between language-specific semantic/pragmatic feature bundles. Lemmas in the mental lexicon are tagged for specific languages and contain semantic, syntactic, and morphological information about lexemes. In a bilingual mode, the speaker makes choices at the preverbal level of lexical-conceptual structure, and these choices activate the lemmas in the mental lexicon for the speaker's preverbal message to be morpho-syntactically realized at the functional level of predicate-argument structure. The result will be language-specific surface forms at the positional level of morphological realization patterns. The languages involved in the …


Types Of Morphemes And Their Implications For Second Language Morpheme Acquisition, Longxing Wei Mar 2000

Department of Linguistics Faculty Scholarship and Creative Works

This paper explains observed morpheme accuracy orders on the basis of a model of morpheme classification, the 4-M model proposed by Myers-Scotton and Jake (2000). It argues that the adult second language morpheme acquisition order is determined by how morphemes are projected from the mental lexicon. Four types of morphemes are identified: content morphemes, early system morphemes, and two types of late system morphemes. Early system morphemes are indirectly elected at the same time that content morphemes are directly elected by the speaker's intentions. Late system morphemes are activated later in the production process as required by the grammatical frame of the target language. …


Parsing For Prosody: What A Text-To-Speech System Needs From Syntax, Eileen Fitzpatrick, Joan Bachenko Dec 1989

Department of Linguistics Faculty Scholarship and Creative Works

The authors describe an experimental text-to-speech system that uses a syntactic parser and prosody rules to determine prosodic phrasing for synthesized speech. It is shown that many aspects of sentence analysis that are required for other parsing applications, e.g., machine translation and question answering, become unnecessary in parsing for text-to-speech. It is possible to generate natural-sounding prosodic phrasing by relying on information about syntactic category type, partial constituency, and length; information about clausal and verb phrase constituency, predicate-argument relations, and prepositional phrase attachment can be bypassed.