Open Access. Powered by Scholars. Published by Universities.®

Computational Linguistics Commons


Theses/Dissertations


Articles 1 - 30 of 109

Full-Text Articles in Computational Linguistics

Computational Approaches To Linguistic Challenges In Arabic Speech Recognition, Enas Albasiri Jun 2024


Dissertations, Theses, and Capstone Projects

This dissertation aims to document the linguistic features of Arabic that pose challenges to speech and language technologies and to advance these technologies by developing state-of-the-art computational tools focusing on automatic speech recognition (ASR), text normalization (TN), and corpus development. TN converts expressions such as numbers, dates, and times, known as semiotic classes, from their written to their spoken domain, such as converting ‘$84.00’ to ‘eighty-four dollars’, while inverse text normalization (ITN) converts verbalized text to its written form. This conversion is an essential preprocessing step for text-to-speech (TTS) and a post-processing step for ASR. Arabic presents a challenge for TN and ITN because one …
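As a rough illustration of the written-to-spoken conversion described in this abstract, the following sketch verbalizes small English dollar amounts. It is a toy under stated assumptions, not the dissertation's method: the function names `number_to_words` and `normalize_dollars` are hypothetical, only amounts below $1,000 are handled, and none of the Arabic-specific difficulties the abstract refers to are modeled.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def number_to_words(n: int) -> str:
    """Spell out an integer in the range 0..999 (hypothetical helper)."""
    if not 0 <= n <= 999:
        raise ValueError("only 0..999 is supported in this sketch")
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    hundreds, rest = divmod(n, 100)
    words = ONES[hundreds] + " hundred"
    return words + (" " + number_to_words(rest) if rest else "")


def normalize_dollars(text: str) -> str:
    """Convert a written amount such as '$84.00' to its spoken form."""
    match = re.fullmatch(r"\$(\d{1,3})(?:\.(\d{2}))?", text)
    if match is None:
        raise ValueError(f"unsupported money expression: {text!r}")
    dollars = int(match.group(1))
    cents = int(match.group(2) or 0)
    # Pick singular or plural unit names depending on the quantity.
    parts = [number_to_words(dollars) + (" dollar" if dollars == 1 else " dollars")]
    if cents:
        parts.append(number_to_words(cents) + (" cent" if cents == 1 else " cents"))
    return " and ".join(parts)


print(normalize_dollars("$84.00"))  # eighty-four dollars
```

Inverse text normalization would run the mapping in the other direction, from the spoken string back to the written form.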


Uncovering The Mimicry Of Online Review Breadth And Depth And Its Subsequent Effect On Consumer Responses, Andrea Pelaez Martinez Jun 2024


Dissertations, Theses, and Capstone Projects

Word-of-mouth (WOM) in marketing occurs when consumers discuss a company's product or service or any consumption experience with their friends, family, and others with whom they have any relationship. With the advent of social media, this phenomenon has expanded rapidly into virtual environments where consumer conversation is enabled through chats, forums, social media posts, and online reviews. In response to this rapid growth of online WOM, academics and practitioners have focused their interest on this phenomenon and its implications on consumers, firms, and society. So far, the evidence of the critical role that online WOM plays in helping consumers make …


Expanding The Corpus Of Vocalized Hebrew Text: Compiling An Unvocalized Text Corpus And Building An Online Interface For Vocalization Annotation, Rachel Shanblatt Bloch Jun 2024


Dissertations, Theses, and Capstone Projects

Written modern Hebrew presents a unique challenge for training computational models for language processing because modern Hebrew text often lacks vocalization. The lack of available vocalized Hebrew data can lead to ambiguity in training these models and generally hinders work on natural language processing problems. The goal of this project is to contribute to the collection of vocalized Hebrew text by collecting and preprocessing a large corpus of unvocalized Hebrew text and building an online annotation tool. The annotation tool allows people to upload unvocalized Hebrew text, to annotate by adding Hebrew vocalization, and to download comma-separated values files of …


Consonant (De)Gradation In Ingrian?, Andrea M. Harrison Feb 2024


Dissertations, Theses, and Capstone Projects

This paper will present a dual method toward data enrichment for low-resource languages. Using Yoyodyne -- a Fairseq-inspired neural library for small-vocabulary sequence-to-sequence generation -- a morphological generation task was tested across labeled data encompassing multiple stages of enrichment for the low-resource language Ingrian. Due to limitations in the available data for Ingrian, weighted finite-state transducers (WFSTs) were used to generate an expanded vocabulary via HFST's toolkit for Uralic languages, and GiellaLT, a source for FST-driven lexica for low-resource languages. Further stages of experimentation used labeled data from related, higher-resource languages (Finnish, Estonian) to encourage cross-lingual transfer in the interest …


How Do We Learn What We Cannot Say?, Daniel Yakubov Feb 2024


Dissertations, Theses, and Capstone Projects

The contributions of this thesis are two-fold. First, this thesis presents UDTube, an easily usable software package developed to perform morphological analysis in a multi-task fashion. This work shows the strong performance of UDTube versus the current state-of-the-art, UDPipe, across eight languages, primarily in the annotation of morphological features. The second contribution of this thesis is an exploration into the study of defectivity. UDTube is used to annotate a large amount of data in Greek and Russian which is ultimately used to investigate the plausibility of Indirect Negative Evidence (INE), a popular approach to the acquisition of morphological defectivity. The reported …


A Computer-Assisted Approach To Lexical Borrowing In Northeast Caucasian Languages, Bonnie Eleanor Wren-Hardin Jan 2024


Theses and Dissertations--Linguistics

The disambiguation of loanwords and cognates can be a challenge, especially in areas where there has been intense language contact over an extended period of time, when the contact is between genetically related languages, and when the number of languages involved is large. Over the past several decades, more and more computational approaches to automatic cognate and borrowing detection have been created in an attempt to ease the load of examining hundreds to thousands of individual lexemes, as well as determine language family relationships with allegedly greater accuracy. While these methods are not perfect and cannot replace the knowledge or …


The Near-Synonymous Classifiers In Mandarin Chinese: Etymology, Modern Usage, And Possible Problems In L2 Classroom, Irina Kavokina Nov 2023


Masters Theses

Many Chinese classifiers are nearly synonymic: they can be used with the same head nouns without changing the meaning of the sentence. In other words, such classifiers can be used interchangeably or almost interchangeably. This poses a challenge for Chinese language learners, especially those who lack such a grammatical category in their own native language. Another complication arises from the ambiguous English translations of many classifiers.

In this paper we investigate the collocation behavior of near-synonymous Chinese classifiers, focusing on their semantic nuances and interchangeability. Analyzing 6 pairs of classifiers — 栋 and 幢, 匹 and 头, 批 and …


Towards Interpretable Machine Reading Comprehension With Mixed Effects Regression And Exploratory Prompt Analysis, Luca Del Signore Sep 2023


Dissertations, Theses, and Capstone Projects

We investigate the properties of natural language prompts that determine their difficulty in machine reading comprehension tasks. While much work has been done benchmarking language model performance at the task level, there is considerably less literature focused on how individual task items can contribute to interpretable evaluations of natural language understanding. Such work is essential to deepening our understanding of language models and ensuring their responsible use as a key tool in human-machine communication. We perform an in-depth mixed effects analysis on the behavior of three major generative language models, comparing their performance on a large reading comprehension …


Destined Failure, Chengjun Pan Jun 2023


Masters Theses

I attempt to examine the complex structure of human communication, explaining why it is bound to fail. By reproducing experienceable phenomena, I demonstrate how they can expose communication structure and reveal the limitations of our perception and symbolization. I divide the process of communication into six stages: input, detection, symbolization, dictionary, interpretation, and output. In this thesis, I examine the flaws and challenges that arise in the first five stages. I argue that reception acts as a filter and that understanding relies on a symbolic system that is full of redundancies. Therefore, every interpretation is destined to be a deviation.


Neural Network Vs. Rule-Based G2P: A Hybrid Approach To Stress Prediction And Related Vowel Reduction In Bulgarian, Maria Karamihaylova Jun 2023


Dissertations, Theses, and Capstone Projects

An effective grapheme-to-phoneme (G2P) conversion system is a critical element of speech synthesis. Rule-based systems were an early method for G2P conversion. In recent years, machine learning tools have been shown to outperform rule-based approaches in G2P tasks. We investigate neural network sequence-to-sequence modeling for the prediction of syllable stress and resulting vowel reductions in the Bulgarian language. We then develop a hybrid G2P approach which combines manually written grapheme-to-phoneme mapping rules with neural network-enabled syllable stress predictions by inserting stress markers in the predicted stress position of the transcription produced by the rule-based finite-state transducer. Finally, we apply vowel …


Evaluating Neural Networks As Cognitive Models For Learning Quasi-Regularities In Language, Xiaomeng Ma Jun 2023


Dissertations, Theses, and Capstone Projects

Many aspects of language can be categorized as quasi-regular: the relationship between the inputs and outputs is systematic but allows many exceptions. Common domains that contain quasi-regularity include morphological inflection and grapheme-phoneme mapping. How humans process quasi-regularity has been debated for decades. This thesis implemented modern neural network models, transformer models, on two tasks: English past tense inflection and Chinese character naming, to investigate how transformer models perform quasi-regularity tasks. This thesis focuses on investigating to what extent the models' performances can represent human behavior. The results show that the transformers' performance is very similar to human behavior in many …


Topics For He But Not For She: Quantifying And Classifying Gender Bias In The Media, Tyler J. Lanni Jun 2023


Dissertations, Theses, and Capstone Projects

In this study, we used computational techniques to analyze the language used in news articles to describe female and male politicians. Our corpus included 370 subtexts for male candidates and 374 subtexts for female candidates, gathered through the New York Times API. We conducted two experiments: an LDA topic analysis to explore the data, and a logistic regression to classify the subtexts as either male or female. Our analysis revealed some noteworthy findings that suggest the possibility of developing a gender bias classifier in the future. However, to create a more robust understanding of bias, additional research and data are …


AI Approaches To Understand Human Deceptions, Perceptions, And Perspectives In Social Media, Chih-Yuan Li May 2023


Dissertations

Social media platforms have created a virtual space for sharing user-generated information and for connecting and interacting among users. However, there are research and societal challenges: 1) users generate and share disinformation; 2) it is difficult to understand citizens' perceptions or opinions expressed on a wide variety of topics; and 3) there are information overload and echo chamber problems, without an overall understanding of the different perspectives taken by different people or groups.

This dissertation addresses these three research challenges with advanced AI and Machine Learning approaches. To address fake news, understood as deception about facts, this dissertation presents Machine …


Predicting High-Cap Tech Stock Polarity: A Combined Approach Using Support Vector Machines And Bidirectional Encoders From Transformers, Ian L. Grisham May 2023


Electronic Theses and Dissertations

The abundance, accessibility, and scale of data have engendered an era where machine learning can quickly and accurately solve complex problems, identify complicated patterns, and uncover intricate trends. One research area where many have applied these techniques is the stock market. Yet, financial domains are influenced by many factors and are notoriously difficult to predict due to their volatile and multivariate behavior. However, the literature indicates that public sentiment data may exhibit significant predictive qualities and improve a model’s ability to predict intricate trends. In this study, momentum SVM classification accuracy was compared between datasets that did and did not …


Single-Case Pilot Study For Longitudinal Analysis Of Referential Failures And Sentiment In Schizophrenic Speech From Client-Centered Psychotherapy Recordings, Travis A. Musich Apr 2023


Dissertations

Though computational linguistic analyses have revealed the presence of distinctly characteristic language features in schizophrenic disordered speech, the relative stability of these language features in longitudinal samples is still unknown. This longitudinal pilot study analyzed schizophrenic disordered speech data from the archival therapy audio recordings of one patient spanning 23 years. End-to-end Neural Coreference Resolution software was used to analyze transcribed speech data from three therapy sessions to identify ambiguous pronouns, referred to as referential failures, which were reviewed and confirmed by multiple raters. Speech samples were analyzed using Google Cloud Natural Language API software for sentiment variables (i.e., score, …


A Sentiment Analysis Of "Filipinx" On Twitter Using A Multinomial Naïve Bayes Classification Model, Clarisse Taboy Feb 2023


Dissertations, Theses, and Capstone Projects

On social media, the use of “Filipinx” as a gender neutral, inclusive term for “Filipino” tends to generate high user engagement, at times without regard for the original context in which the word appears. This project applies computational methods to collect a large dataset in English/Filipino from Twitter containing “Filipinx”, and to train a Naïve Bayes model to classify tweets into three sentiments: positive, neutral, and negative. My methodology takes inspiration from that of four related studies that similarly conducted sentiment analysis on English/Filipino tweets involving various topics, and whose resulting accuracy scores were compared side-by-side. Conducting sentiment analysis on …
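For readers unfamiliar with the classifier named in this abstract, here is a minimal sketch of a multinomial Naïve Bayes model with add-one (Laplace) smoothing over three sentiment classes. Everything in it is an assumption made for illustration: the class name `MultinomialNB`, the smoothing parameter `alpha`, and the six tiny training examples are invented and are not drawn from the project's dataset or code.

```python
import math
from collections import Counter, defaultdict


class MultinomialNB:
    """Bag-of-words multinomial Naive Bayes with additive smoothing."""

    def __init__(self, alpha: float = 1.0):
        self.alpha = alpha  # smoothing constant (an assumed default)

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)      # documents per class
        self.word_counts = defaultdict(Counter)  # token counts per class
        self.vocab = set()
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, tokens):
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + self.alpha * vocab_size
            # log prior plus the sum of smoothed log likelihoods
            score = math.log(n_docs / total_docs)
            for tok in tokens:
                if tok in self.vocab:  # tokens unseen in training are skipped
                    score += math.log((counts[tok] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented toy data, two tokenized "tweets" per sentiment class.
train_docs = [
    ["love", "this", "term"], ["great", "inclusive", "word"],
    ["hate", "this", "term"], ["awful", "word"],
    ["the", "term", "appears", "online"], ["word", "used", "online"],
]
train_labels = ["positive", "positive", "negative", "negative",
                "neutral", "neutral"]

model = MultinomialNB().fit(train_docs, train_labels)
print(model.predict(["love", "inclusive"]))  # positive
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, which matters once the inputs are full-length tweets rather than two-token examples.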


Evaluation Of Different Machine Learning, Deep Learning And Text Processing Techniques For Hate Speech Detection, Nabil Shawkat Jan 2023


MSU Graduate Theses

Social media has become a domain that involves a lot of hate speech. Some users feel entitled to engage in abusive conversations by sending abusive messages, tweets, or photos to other users. It is critical to detect hate speech and prevent innocent users from becoming victims. In this study, I explore the effectiveness and performance of various machine learning methods employing text processing techniques to create a robust system for hate speech identification. I assess the performance of Naïve Bayes, Support Vector Machines, Decision Trees, Random Forests, Logistic Regression, and K Nearest Neighbors using three distinct datasets sourced from social …


Automatic Transcription Of Northern Prinmi Oral Art: Approaches And Challenges To Automatic Speech Recognition For Language Documentation, Connor Bechler Jan 2023


Theses and Dissertations--Linguistics

One significant issue facing language documentation efforts is the transcription bottleneck: each documented recording must be transcribed and annotated, and these tasks are extremely labor intensive (Ćavar et al., 2016). Researchers have sought to accelerate these tasks with partial automation via forced alignment, natural language processing, and automatic speech recognition (ASR) (Neubig et al., 2020). Neural network—especially transformer-based—approaches have enabled large advances in ASR over the last decade. Models like XLSR-53 promise improved performance on under-resourced languages by leveraging massive data sets from many different languages (Conneau et al., 2020). This project extends these efforts to a novel context, applying …


‘A Category Of Their Own’: Quantitative Methods In The Use Of Pile-Sort Data In Perceptual Dialectology, Zachary Ty Gill Jan 2023


Theses and Dissertations--Linguistics

The purpose of this study is to investigate how Mississippi Gulf Coast Creoles perceive language differences in their home area. A pile-sort task was carried out in which respondents were given stacks of cards with local communities written on them and instructed to stack together the regions where people “talk the same.” Once the piles were made, the fieldworker discussed their sortings with the respondents. The stacks were analyzed by means of a hierarchical agglomerative cluster analysis and non-parametric multidimensional scaling with k-means cluster analysis overlays to extract the perceived dialect areas. The groupings reveal that respondent strategies are based …
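The general idea of turning pile-sort data into clusters can be sketched as follows. This is only an illustration under assumptions: the abstract does not state the distance measure or linkage criterion used, so co-occurrence proportions and average linkage are stand-ins chosen here, and the community labels "A" to "D" and the four respondents are invented.

```python
from itertools import combinations


def cooccurrence_distance(sorts, items):
    """Distance for each pair = 1 - share of respondents who piled the pair together."""
    together = {frozenset(pair): 0 for pair in combinations(items, 2)}
    for piles in sorts:              # one list of piles per respondent
        for pile in piles:
            for pair in combinations(sorted(pile), 2):
                together[frozenset(pair)] += 1
    n = len(sorts)
    return {pair: 1 - count / n for pair, count in together.items()}


def average_linkage(dist, items, k):
    """Agglomerative clustering: merge the closest two clusters until k remain."""
    clusters = [frozenset([item]) for item in items]

    def cluster_distance(a, b):
        # Mean pairwise distance between members of the two clusters.
        total = sum(dist[frozenset((x, y))] for x in a for y in b)
        return total / (len(a) * len(b))

    while len(clusters) > k:
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: cluster_distance(*pair))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return set(clusters)


# Invented data: four respondents sorting four hypothetical communities.
items = ["A", "B", "C", "D"]
sorts = [
    [{"A", "B"}, {"C", "D"}],
    [{"A", "B"}, {"C"}, {"D"}],
    [{"A", "B", "C"}, {"D"}],
    [{"A", "B"}, {"C", "D"}],
]

dist = cooccurrence_distance(sorts, items)
areas = average_linkage(dist, items, k=2)
```

With this toy input, "A" and "B" are piled together by every respondent and so merge first, and the two-cluster solution groups "C" with "D".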


Data-Driven Neuroanatomical Subtypes In Various Stages Of Schizophrenia: Linking Cortical Thickness, Glutamate, And Language Functioning, Liangbing Liang Dec 2022


Electronic Thesis and Dissertation Repository

The considerable variation in the spatial distribution of cortical thickness changes has been used to parse heterogeneity in schizophrenia. We aimed to recover a ‘cortical impoverishment’ subgroup with widespread cortical thinning. We applied hierarchical cluster analysis to cortical thickness data of three datasets in different stages of psychosis and studied the cognitive, functional, neurochemical, language and symptom profiles of the observed subgroups. Our consensus-based clustering procedure consistently produced a subgroup characterized by significantly lower cortical thickness. This ‘cortical impoverishment’ subgroup was associated with a higher symptom burden in a clinically stable sample and higher glutamate levels with language impairments in …


Phonotactic Learning With Distributional Representations, Max A. Nelson Oct 2022


Doctoral Dissertations

This dissertation explores the possibility that the phonological grammar manipulates phone representations based on learned distributional class memberships rather than those based on substantive linguistic features. In doing so, this work makes three primary contributions. First, I propose three novel algorithms for learning a phonological class system from the distributional statistics of a language, all of which are based on partitioning graph representations of phone distributions. Second, I propose a new method for fitting Maximum Entropy phonotactic grammars, MaxEntGrams, which offers theoretical complexity improvements over the widely-adopted approach taken by Hayes and Wilson [2008]. Third, I present a series of …


Restrictive Tier Induction, Seoyoung Kim Oct 2022


Doctoral Dissertations

This dissertation proposes the Restrictive Tier Learner, which automatically induces only the tiers that are absolutely necessary in capturing phonological long-distance dependencies. The core of my learner is the addition of an extra evaluation step to the existing Inductive Projection Learner (Gouskova and Gallagher 2020), where the necessity and accuracy of the candidate tiers are determined. An important building block of my learner is a typological observation, namely the dichotomy between trigram-bound and unbounded patterns. The fact that this dichotomy is attested in both consonant interactions and vowel interactions allows for a unified approach to be used. Another important piece …


From Sesame Street To Beyond: Multi-Domain Discourse Relation Classification With Pretrained BERT, Isaac R. Raff Sep 2022


Dissertations, Theses, and Capstone Projects

Research efforts in transfer learning have gained massive popularity in recent years. Pretrained language models have demonstrated the most successful results in producing high quality neural networks capable of quality inference after training across domains via transfer learning. This study expands on the domain transfer introduced in Ferracane et al. (2019), exploring neural methods for transfer learning of discourse parsing between a news source domain and a medical target domain. Ferracane et al. (2019) specifically discuss transfer learning from news articles to PubMed medical journal articles. Experiments in transfer learning in the current work expand to include three domains: Wall Street Journal articles previously annotated with …


Linguistic Abstractions In Children’s Very Early Utterances, Qihui Xu Sep 2022


Dissertations, Theses, and Capstone Projects

How early do children produce multiword utterances? Do children's early utterances reflect abstract syntactic knowledge or are they the result of data-driven learning? We examine this issue through corpus analysis, computational modeling, and adult simulation experiments. Chapter 1 investigates when children start producing multiword utterances; we use corpora to establish the development of multiword utterances and a probabilistic computational model to account for the quantitative change of early multiword utterances. We find that multiword utterances of different lengths appear early in acquisition and increase together, and the length growth pattern can be viewed as a probabilistic and dynamic process.

Chapter …


Towards Explaining Variation In Entrainment, Andreas Weise Sep 2022


Dissertations, Theses, and Capstone Projects

Entrainment refers to the tendency of human speakers to adapt to their interlocutors to become more similar to them. This affects various dimensions and occurs in many contexts, allowing for rich applications in human-computer interaction. However, it is not exhibited by every speaker in every conversation but varies widely across features, speakers, and contexts, hindering broad application. This variation, whose guiding principles are poorly understood even after decades of entrainment research, is the subject of this thesis. We begin with a comprehensive literature review that serves as the foundation of our own work and provides a reference to guide future …


Predicting Stress In Russian Using Modern Machine-Learning Tools, John Schriner Sep 2022


Dissertations, Theses, and Capstone Projects

In the Russian language, stress on a word is determined via often complex patterns and rules. In this paper, after examining nearly a century of research in stress rules and methods in Russian, we turn to see if modern machine learning tools can aid in predicting stress. Using A.A. Zaliznyak’s dictionary grammar and over 300,000 word forms, we derived stress codes to aid in predicting which syllable primary stress falls on. We trained an LSTM neural network on the data and conducted eight experiments with added features such as lemma, part of speech, and morphology. While the model performed better …


Corrective Feedback Timing In Kanji Writing Instruction Apps, Phoenix Mulgrew Jun 2022


Honors Theses

The focus of this research paper is to determine the correct time to provide corrective feedback to people who are learning how to write Japanese kanji. To do this, we developed a system that is able to recognize Japanese kanji that is handwritten onto an iPad screen and check for errors such as wrong stroke order. Previous research has achieved success in developing similar systems, but this project is unique because the research question involves the timing of corrective feedback. In particular, we are looking at whether immediate or delayed corrective feedback results in better learning.


A Machine Learning Approach To Text-Based Sarcasm Detection, Lara I. Novic Jun 2022


Dissertations, Theses, and Capstone Projects

Sarcasm and indirect language are commonplace for humans to produce and recognize but difficult for machines to detect. While artificial intelligence can accurately analyze sentiment and emotion in speech and text, it may struggle with insincere and sardonic content, although it is possible to train a machine to identify uttered and written sarcasm. This paper aims to detect sarcasm using logistic regression and a support vector machine (SVM) and compare their results to a baseline.

The models are trained on headlines from a Kaggle dataset containing headlines from the satirical news website The Onion and serious news website Huffpost (formerly …


Covert Determiners In Appalachian English Narrative Declarative Sentences, William Oliver Jun 2022


Dissertations, Theses, and Capstone Projects

In this thesis, I explore the syntax and semantics of covert determiners (Ds) in matrix subject determiner phrases (DPs) with definite specific interpretations. To conduct my investigation, I used the Audio-Aligned and Parsed Corpus of Appalachian English (AAPCAppE), a million-word Penn Treebank corpus, and the software CorpusSearch, a Java program that searches Penn Treebank corpora. My research shows that Appalachian English contains a linguistic phenomenon where speakers drop the D, replacing overt Ds with covert Ds, in definite specific DPs. For example, where Standard English speakers say The doctor came by horseback, Appalachian speakers may use a covert D …


Metaphor Detection In Poems In Misurata Arabic Sub-Dialect: An LSTM Model, Azza Abugharsa May 2022


Theses, Dissertations and Culminating Projects

Natural Language Processing (NLP) in Arabic is witnessing an increasing interest in investigating different topics in the field. One of the topics that have drawn attention is the automatic processing of Arabic figurative language. The focus in previous projects is on detecting and interpreting metaphors in comments from social media as well as phrases and/or headlines from news articles. The current project focuses on metaphor detection in poems written in the Misurata Arabic sub-dialect spoken in Misurata, located in the North African region. The dataset is initially annotated by a group of linguists, and their annotation is treated as the …