Open Access. Powered by Scholars. Published by Universities.®

Computational Linguistics Commons


Arts and Humanities

Theses/Dissertations

Articles 1 - 16 of 16

Full-Text Articles in Computational Linguistics

Expanding The Corpus Of Vocalized Hebrew Text: Compiling An Unvocalized Text Corpus And Building An Online Interface For Vocalization Annotation, Rachel Shanblatt Bloch Jun 2024


Dissertations, Theses, and Capstone Projects

Written modern Hebrew presents a unique challenge for training computational models for language processing because modern Hebrew text often lacks vocalization. The lack of available vocalized Hebrew data can lead to ambiguity in training these models and generally hinders work on natural language processing problems. The goal of this project is to contribute to the collection of vocalized Hebrew text by collecting and preprocessing a large corpus of unvocalized Hebrew text and building an online annotation tool. The annotation tool allows people to upload unvocalized Hebrew text, to annotate by adding Hebrew vocalization, and to download comma-separated values files of …


Destined Failure, Chengjun Pan Jun 2023


Masters Theses

I attempt to examine the complex structure of human communication, explaining why it is bound to fail. By reproducing experienceable phenomena, I demonstrate how they can expose communication structure and reveal the limitations of our perception and symbolization. I divide the process of communication into six stages: input, detection, symbolization, dictionary, interpretation, and output. In this thesis, I examine the flaws and challenges that arise in the first five stages. I argue that reception acts as a filter and that understanding relies on a symbolic system that is full of redundancies. Therefore, every interpretation is destined to be a deviation.


Covert Determiners In Appalachian English Narrative Declarative Sentences, William Oliver Jun 2022


Dissertations, Theses, and Capstone Projects

In this thesis, I explore the syntax and semantics of covert determiners (Ds) in matrix subject determiner phrases (DPs) with definite specific interpretations. To conduct my investigation, I used the Audio-Aligned and Parsed Corpus of Appalachian English (AAPCAppE), a million-word Penn Treebank corpus, and the software CorpusSearch, a Java program that searches Penn Treebank corpora. My research shows that Appalachian English contains a linguistic phenomenon where speakers drop the D, replacing overt Ds with covert Ds, in definite specific DPs. For example, where Standard English speakers say The doctor came by horseback, Appalachian speakers may use a covert D …


Detection And Morphological Analysis Of Novel Russian Loanwords, Yulia Spektor Sep 2021


Dissertations, Theses, and Capstone Projects

This paper investigates recent English loanwords in Russian and explores ways in which computational methods can help further theoretical research. The goal of the study is two-fold: to find new, previously unattested loanwords borrowed over the last decade and to examine the rate of adaptation of the new borrowings, attested by the degree to which they conform to the constraints of the Russian language. First, we train a finite-state pipeline that combines character n-gram language models, which encode phonotactic and lexical properties of loanwords, with a binary classifier to detect loanwords. The model achieves state-of-the-art performance results during evaluation, surpassing …
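The idea of pairing character n-gram language models with a binary decision can be illustrated with a minimal sketch. This is not the thesis's finite-state pipeline: it assumes add-one smoothed character bigram models and a simple likelihood comparison as the classifier, and the word lists, function names, and constants are hypothetical toy choices made for illustration only.

```python
import math
from collections import Counter

def train_char_bigrams(words):
    """Count character bigrams and left-context totals, using ^ and $ as word boundaries."""
    bigrams, contexts = Counter(), Counter()
    for word in words:
        padded = "^" + word + "$"
        for left, right in zip(padded, padded[1:]):
            bigrams[(left, right)] += 1
            contexts[left] += 1
    return bigrams, contexts

def log_prob(word, model, vocab_size):
    """Add-one smoothed log-probability of a word under a character bigram model."""
    bigrams, contexts = model
    padded = "^" + word + "$"
    return sum(
        math.log((bigrams[(left, right)] + 1) / (contexts[left] + vocab_size))
        for left, right in zip(padded, padded[1:])
    )

def is_loanword(word, loan_model, native_model, vocab_size):
    """Binary decision: True when the loanword model assigns the higher likelihood."""
    return log_prob(word, loan_model, vocab_size) > log_prob(word, native_model, vocab_size)

# Hypothetical toy word lists, for illustration only (not the thesis data).
LOAN_WORDS = ["лайк", "хайп", "фейк", "байк"]
NATIVE_WORDS = ["молоко", "город", "берег", "голова"]

# Distinct characters in the toy data, plus the two boundary symbols.
VOCAB_SIZE = len(set("".join(LOAN_WORDS + NATIVE_WORDS))) + 2
LOAN_MODEL = train_char_bigrams(LOAN_WORDS)
NATIVE_MODEL = train_char_bigrams(NATIVE_WORDS)

print(is_loanword("лайк", LOAN_MODEL, NATIVE_MODEL, VOCAB_SIZE))
print(is_loanword("город", LOAN_MODEL, NATIVE_MODEL, VOCAB_SIZE))
```

With lists this small the models only memorize the training words; a realistic detector would need large lexicons, higher-order n-grams, and a trained classifier over the model scores.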


The Public Innovations Explorer: A Geo-Spatial & Linked-Data Visualization Platform For Publicly Funded Innovation Research In The United States, Seth Schimmel Jun 2021


Dissertations, Theses, and Capstone Projects

The Public Innovations Explorer (https://sethsch.github.io/innovations-explorer/app/index.html) is a web-based tool created using Node.js, D3.js and Leaflet.js that can be used for investigating awards made by Federal agencies and departments participating in the Small Business Innovation Research (SBIR) and Small Business Technology Transfer (STTR) grant-making programs between 2008 and 2018. By geocoding the publicly available grants data from SBIR.gov, the Public Innovations Explorer allows users to identify companies performing publicly-funded innovative research in each congressional district and obtain dynamic district-level summaries of funding activity by agency and year. Applying spatial clustering techniques on districts' employment levels across major economic sectors provides users …


PLPrepare: A Grammar Checker For Challenging Cases, Jacob Hoyos May 2021


Electronic Theses and Dissertations

This study investigates one of the Polish language’s most arbitrary cases: the genitive masculine inanimate singular. It collects and ranks several guidelines to help language learners discern its proper usage and also introduces a framework to provide detailed feedback regarding arbitrary cases. The study tests this framework by implementing and evaluating a hybrid grammar checker called PLPrepare. PLPrepare performs similarly to other grammar checkers and is able to detect genitive case usages and provide feedback based on a number of error classifications.


When Misclassification Is Misgendering: Gender Prediction In The Context Of Trans Identities, Sean Miller Feb 2021


Dissertations, Theses, and Capstone Projects

As a subdomain of author profiling, gender prediction (sometimes called gender inference) has received a substantial amount of attention—both as a task in itself, and for other downstream analyses. Throughout the existing literature various statistical and machine learning methods have been applied to extract features in order to either characterize and differentiate female and male writing styles, or simply to achieve maximum accuracy on gender prediction as a binary classification task. However, researchers often do not disclose how they conceptualize gender nor do they consider the implications that gender prediction has for non-binary and trans individuals. Along with an overview …


A Computational Study In The Detection Of English–Spanish Code-Switches, Yohamy C. Polanco Feb 2021


Dissertations, Theses, and Capstone Projects

Code-switching is the linguistic phenomenon where a multilingual person alternates between two or more languages in a conversation, whether that be spoken or written. This thesis studies the automatic detection of code-switching occurring specifically between English and Spanish in two corpora.

Twitter and other social media sites have provided an abundance of linguistic data that is available to researchers to perform countless experiments. Collecting the data is fairly easy if a study is on monolingual text, but if a study requires code-switched data, this becomes a complication as APIs only accept one language as a parameter. This thesis focuses on …


On Polysemy: A Philosophical, Psycholinguistic, And Computational Study, Jiangtian Li Aug 2020


Electronic Thesis and Dissertation Repository

Most words in natural languages are polysemous, that is, they have related but different meanings in different contexts. These polysemous meanings (senses) are marked by their structuredness, flexibility, productivity, and regularity. Previous theories have focused on some of these features but not all of them together. Thus, I propose a new theory of polysemy, which has two components. First, word meaning is actively modulated by broad contexts in a continuous fashion. Second, clustering arises from contextual modulations of a word and is then entrenched in our long-term memory to facilitate future production and processing. Hence, polysemous senses are entrenched …


Automatic Keyphrase Extraction From Russian-Language Scholarly Papers In Computational Linguistics, Yves Wienecke Jul 2020


University Honors Theses

The automatic extraction of keyphrases from scholarly papers is a necessary step for many Natural Language Processing (NLP) tasks, including text retrieval, machine translation, and text summarization. However, due to the different grammatical and semantic intricacies of languages, this is a highly language-dependent task. Many free and open source implementations of state-of-the-art keyphrase extraction techniques exist, but they are not adapted for processing Russian text. Furthermore, the multi-linguistic character of scholarly papers in the field of Russian computational linguistics and NLP introduces additional complexity to keyphrase extraction. This paper describes a free and open source program as a proof of …


Losing Shahrazad: A Distant Reading Of 1001 Nights, Taysa Mohler Jan 2018


Senior Projects Spring 2018

This project is a distant reading analysis of seven 19th- and 20th-century English translations of One Thousand and One Nights or The Arabian Nights. Through the use of computer programming and distant reading, it becomes clear that the Nights' frame tale is the carrier of the internal logic and generative power of the story cycle. Further, the frame tale expresses the Nights' self-representation, which serves to undermine the historical use of the Nights as synecdoche for the Orient. Therefore, the translators that remove the frame story from their versions further the Nights' use as an Orientalist object, …


Generating Amharic Present Tense Verbs: A Network Morphology & DATR Account, T. Michael W. Halcomb Jan 2017


Theses and Dissertations--Linguistics

In this thesis I attempt to model, that is, computationally reproduce, the natural transmission (i.e. inflectional regularities) of twenty present tense Amharic verbs (i.e. triradicals beginning with consonants) as used by the language’s speakers. I root my approach in the linguistic theory of network morphology (NM) and model it using the DATR evaluator. In Chapter 1, I provide an overview of Amharic and discuss the fidel as an abugida, the verb system’s root-and-pattern morphology, and how radicals of each lexeme interact with prefixes and suffixes. I offer an overview of NM in Chapter 2 and DATR in Chapter 3. In …


Misheard Me Oronyminator: Using Oronyms To Validate The Correctness Of Frequency Dictionaries, Jennifer G. Hughes Jun 2013


Master's Theses

In the field of speech recognition, an algorithm must learn to tell the difference between "a nice rock" and "a gneiss rock". These identical-sounding phrases are called oronyms. Word frequency dictionaries are often used by speech recognition systems to help resolve phonetic sequences with more than one possible orthographic phrase interpretation, by looking up which oronym of the root phonetic sequence contains the most-common words.

Our paper demonstrates a technique used to validate word frequency dictionary values. We chose to use frequency values from the UNISYN dictionary, which tallies each word on a per-occurrence basis, using a proprietary text corpus, …
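The frequency-dictionary lookup described in this abstract can be sketched roughly as follows. The scoring rule (summing per-word counts) and the toy counts are assumptions made for illustration; they are not taken from the thesis or from the UNISYN dictionary, and the function names are hypothetical.

```python
def score_phrase(phrase, freq):
    """Sum the frequency counts of the words in a phrase; unknown words count as 0."""
    return sum(freq.get(word, 0) for word in phrase.lower().split())

def pick_oronym(candidates, freq):
    """Return the candidate phrase whose words are most common overall."""
    return max(candidates, key=lambda phrase: score_phrase(phrase, freq))

# Hypothetical toy counts, for illustration only (not UNISYN values).
TOY_FREQ = {"a": 1000, "nice": 120, "gneiss": 1, "rock": 60}

best = pick_oronym(["a nice rock", "a gneiss rock"], TOY_FREQ)
print(best)
```

A summed score favors longer phrases, so a practical system might instead compare averaged or log-scaled frequencies; the sketch only shows the lookup step.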


Statistical Machine Translation Of Japanese, Erik A. Chapla Mar 2007


Theses and Dissertations

The purpose of this research was to find ways to improve the performance of a statistical machine translation system that translates text from Japanese to English. Methods included altering the training and test data by adding a priori linguistic knowledge, altering sentence structures, and looking for better ways to statistically alter the way words align between the two languages. In addition, methods for properly segmenting words in Japanese text through statistical methods were examined. Finally, experiments were conducted on Japanese speech to produce the best text transcription of the speech. The best statistical machine translation methods implemented resulted in improvements …


Multilingual Phoneme Models For Rapid Speech Processing System Development, Eric G. Hansen Sep 2006


Theses and Dissertations

Current speech recognition systems tend to be developed only for commercially viable languages. The resources needed for a typical speech recognition system include hundreds of hours of transcribed speech for acoustic models and 10 to 100 million words of text for language models; both of these requirements can be costly in time and money. The goal of this research is to facilitate rapid development of speech systems to new languages by using multilingual phoneme models to alleviate requirements for large amounts of transcribed speech. The Global Phone database, which contains transcribed speech from 15 languages, is used as source data …


Speech Recognition Using The Mellin Transform, Jesse R. Hornback Mar 2006


Theses and Dissertations

The purpose of this research was to improve performance in speech recognition. Specifically, a new approach was investigated by applying an integral transform known as the Mellin transform (MT) on the output of an auditory model to improve the recognition rate of phonemes through the scale-invariance property of the Mellin transform. Scale-invariance means that as a time-domain signal is subjected to dilations, the distribution of the signal in the MT domain remains unaffected. An auditory model was used to transform speech waveforms into images representing how the brain "sees" a sound. The MT was applied and features were extracted. The …
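The scale-invariance property mentioned in this abstract follows from the standard definition of the Mellin transform. The short derivation below is a general textbook identity, not a formula taken from the thesis; the symbols f, a, c, and ω are notation assumed here for illustration.

```latex
\begin{align*}
\mathcal{M}\{f\}(s) &= \int_0^\infty f(t)\, t^{s-1}\, dt, \\
\mathcal{M}\{f(at)\}(s) &= \int_0^\infty f(at)\, t^{s-1}\, dt
  = a^{-s} \int_0^\infty f(u)\, u^{s-1}\, du
  = a^{-s}\, \mathcal{M}\{f\}(s), \qquad a > 0, \\
\bigl|\mathcal{M}\{f(at)\}(c + j\omega)\bigr| &= a^{-c}\, \bigl|\mathcal{M}\{f\}(c + j\omega)\bigr|.
\end{align*}
```

The second line uses the substitution u = at. A dilation by a therefore changes the magnitude of the transform only by the constant factor a^{-c}, and leaves it exactly unchanged on the line c = 0, which is the sense in which the shape of the distribution in the MT domain is unaffected.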