Open Access. Powered by Scholars. Published by Universities.®

Computational Linguistics Commons

Natural language processing

Articles 1 - 12 of 12

Full-Text Articles in Computational Linguistics

Expanding The Corpus Of Vocalized Hebrew Text: Compiling An Unvocalized Text Corpus And Building An Online Interface For Vocalization Annotation, Rachel Shanblatt Bloch Jun 2024

Dissertations, Theses, and Capstone Projects

Written modern Hebrew presents a unique challenge for training computational models for language processing because modern Hebrew text often lacks vocalization. The lack of available vocalized Hebrew data can lead to ambiguity in training these models and generally hinders work on natural language processing problems. The goal of this project is to contribute to the collection of vocalized Hebrew text by collecting and preprocessing a large corpus of unvocalized Hebrew text and building an online annotation tool. The annotation tool allows people to upload unvocalized Hebrew text, to annotate by adding Hebrew vocalization, and to download comma-separated values files of …


Content-Based Unsupervised Fake News Detection On Ukraine-Russia War, Yucheol Shin, Yvan Sojdehei, Limin Zheng, Brad Blanchard Apr 2023

SMU Data Science Review

The Ukrainian-Russian war has garnered significant attention worldwide, with fake news obstructing the formation of public opinion and disseminating false information. This scholarly paper explores the use of unsupervised learning methods and the Bidirectional Encoder Representations from Transformers (BERT) to detect fake news in news articles from various sources. BERT topic modeling is applied to cluster news articles by their respective topics, followed by summarization to measure the similarity scores. The hypothesis posits that topics with larger variances are more likely to contain fake news. The proposed method was evaluated using a dataset of approximately 1000 labeled news articles related …
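The variance hypothesis in this abstract can be illustrated with a minimal sketch. It assumes that each topic already has a list of similarity scores; the topic names, score values, and the function `rank_topics_by_variance` are hypothetical and are not taken from the paper, whose exact scoring procedure is not given in the excerpt.

```python
from statistics import pvariance

# Hypothetical similarity scores per topic (illustrative values only).
topic_scores = {
    "topic_a": [0.82, 0.80, 0.85, 0.81],  # articles agree closely
    "topic_b": [0.90, 0.35, 0.70, 0.15],  # articles diverge widely
}

def rank_topics_by_variance(scores_by_topic):
    """Order topics from highest to lowest variance of their similarity scores."""
    variances = {topic: pvariance(scores) for topic, scores in scores_by_topic.items()}
    return sorted(variances, key=variances.get, reverse=True)

# Topics at the front of the ranking would be the first candidates for review.
ranking = rank_topics_by_variance(topic_scores)
print(ranking)
```

Population variance is used here only for simplicity; any dispersion measure could be substituted.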


Creating Data From Unstructured Text With Context Rule Assisted Machine Learning (Craml), Stephen Meisenbacher, Peter Norlander Dec 2022

School of Business: Faculty Publications and Other Works

Popular approaches to building data from unstructured text come with limitations, such as scalability, interpretability, replicability, and real-world applicability. These can be overcome with Context Rule Assisted Machine Learning (CRAML), a method and no-code suite of software tools that builds structured, labeled datasets which are accurate and reproducible. CRAML enables domain experts to access uncommon constructs within a document corpus in a low-resource, transparent, and flexible manner. CRAML produces document-level datasets for quantitative research and makes qualitative classification schemes scalable over large volumes of text. We demonstrate that the method is useful for bibliographic analysis, transparent analysis of proprietary data, …


Integrating Cultural Knowledge Into Artificially Intelligent Systems: Human Experiments And Computational Implementations, Anurag Acharya May 2022

FIU Electronic Theses and Dissertations

With the advancement of Artificial Intelligence, it seems as if every aspect of our lives is impacted by AI in one way or the other. As AI is used for everything from driving vehicles to criminal justice, it becomes crucial that it overcome any biases that might hinder its fair application. We are constantly trying to make AI be more like humans. But most AI systems so far fail to address one of the main aspects of humanity: our culture and the differences between cultures. We cannot truly consider AI to have understood human reasoning without understanding culture. So it …


Toward Suicidal Ideation Detection With Lexical Network Features And Machine Learning, Ulya Bayram, William Lee, Daniel Santel, Ali Minai, Peggy Clark, Tracy Glauser, John Pestian Apr 2022

Northeast Journal of Complex Systems (NEJCS)

In this study, we introduce a new network feature for detecting suicidal ideation from clinical texts and conduct various additional experiments to enrich the state of knowledge. We evaluate statistical features with and without stopwords, use lexical networks for feature extraction and classification, and compare the results with standard machine learning methods using a logistic classifier, a neural network, and a deep learning method. We utilize three text collections. The first two contain transcriptions of interviews conducted by experts with suicidal (n=161 patients that experienced severe ideation) and control subjects (n=153). The third collection consists of interviews conducted by experts …


Label Imputation For Homograph Disambiguation: Theoretical And Practical Approaches, Jennifer M. Seale Sep 2021

Dissertations, Theses, and Capstone Projects

This dissertation presents the first implementation of label imputation for the task of homograph disambiguation using 1) transcribed audio, and 2) parallel, or translated, corpora. For label imputation from parallel corpora, a hypothesis of interlingual alignment between homograph pronunciations and text word forms is developed and formalized. Both audio and parallel corpora label imputation techniques are tested empirically in experiments that compare homograph disambiguation model performance using: 1) hand-labeled training data, and 2) hand-labeled training data augmented with label-imputed data. Regularized, multinomial logistic regression and pre-trained ALBERT, BERT, and XLNet language models fine-tuned as token classifiers are developed for homograph …


Mitigating Gender Bias In Neural Machine Translation Using Counterfactual Data, Alan Wong Sep 2020

Dissertations, Theses, and Capstone Projects

Recent advances in deep learning have greatly improved the ability of researchers to develop effective machine translation systems. In particular, the application of modern neural architectures, such as the Transformer, has achieved state-of-the-art BLEU scores in many translation tasks. However, it has been found that even state-of-the-art neural machine translation models can suffer from certain implicit biases, such as gender bias (Lu et al., 2019). In response to this issue, researchers have proposed various potential solutions: some have proposed approaches that inject missing gender information into models, while others have attempted modifying the training data itself. We focus on mitigating …


Does The Word "Chien" Bark? Representation Learning In Neural Machine Translation Encoders, Emily Campbell Sep 2020

Dissertations, Theses, and Capstone Projects

This thesis presents experiments with using representation learning to explore how neural networks learn. Neural networks which take text as input create internal representations of the text during their training. Recent work has found that these representations can be used to perform other downstream linguistic tasks, such as part-of-speech (POS) tagging. This demonstrates that the neural networks are learning linguistic information and storing this information in the representations. We focus on the representations created by neural machine translation (NMT) models and whether they can be used in POS tagging. We train 5 NMT models including an auto-encoder. We extract the …


Application Of Boolean Logic To Natural Language Complexity In Political Discourse, Austin Taing Jan 2019

Theses and Dissertations--Computer Science

Press releases serve as a major influence on public opinion of a politician, since they are a primary means of communicating with the public and directing discussion. Thus, the public’s ability to digest them is an important factor for politicians to consider. This study employs several well-studied measures of linguistic complexity and proposes a new one to examine whether politicians change their language to become more or less difficult to parse in different situations. This study uses 27,500 press releases from the US Senate from 2004 to 2008 and examines election cycles and natural disasters, namely hurricanes, as situations where politicians’ language …


Cest: City Event Summarization Using Twitter, Deepa Mallela May 2016

Computer Science Graduate Projects and Theses

Twitter, with 288 million active users, has become the most popular platform for continuous real-time discussions. This leads to huge amounts of information related to the real world, which has attracted researchers from both academia and industry. Event detection on Twitter has gained attention as one of the most popular domains of interest within the research community. Unfortunately, existing event detection methodologies have yet to fully explore Twitter metadata and instead rely solely on identifying events based on prior information or focus on events that belong to specific categories. Given the heavy volume of tweets that discuss events, summarization techniques can …


Misheard Me Oronyminator: Using Oronyms To Validate The Correctness Of Frequency Dictionaries, Jennifer G. Hughes Jun 2013

Master's Theses

In the field of speech recognition, an algorithm must learn to tell the difference between "a nice rock" and "a gneiss rock". These identical-sounding phrases are called oronyms. Word frequency dictionaries are often used by speech recognition systems to help resolve phonetic sequences with more than one possible orthographic phrase interpretation, by looking up which oronym of the root phonetic sequence contains the most-common words.

Our paper demonstrates a technique used to validate word frequency dictionary values. We chose to use frequency values from the UNISYN dictionary, which tallies each word on a per-occurrence basis, using a proprietary text corpus, …
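The frequency-based oronym lookup described in this abstract can be sketched as follows. This is a minimal illustration, not the thesis's implementation: the `WORD_FREQ` table holds made-up counts (not UNISYN values), and the scoring rule, a sum of smoothed log frequencies, is an assumption chosen for simplicity.

```python
from math import log

# Hypothetical per-word frequency counts (illustrative, not from UNISYN).
WORD_FREQ = {"a": 1000, "nice": 120, "rock": 80, "gneiss": 1}

def phrase_score(phrase, freq=WORD_FREQ):
    """Sum of log frequencies; add-one smoothing keeps unknown words finite."""
    return sum(log(freq.get(word, 0) + 1) for word in phrase.lower().split())

def pick_oronym(candidates, freq=WORD_FREQ):
    """Return the candidate phrase whose words are most common overall."""
    return max(candidates, key=lambda phrase: phrase_score(phrase, freq))

# Both phrases share one phonetic sequence; the more common wording wins.
best = pick_oronym(["a nice rock", "a gneiss rock"])
print(best)
```

With these illustrative counts the lookup prefers "a nice rock", because "nice" is far more frequent than "gneiss" while the remaining words are shared.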


Computational Style Processing, Foaad Khosmood Dec 2010

Foaad Khosmood

Our main thesis is that computational processing of natural language styles can be accomplished using corpus analysis methods and language transformation rules. We demonstrate this first by statistically modeling natural language styles, and second by developing tools that carry out style processing, and finally by running experiments using the tools and evaluating the results. Specifically, we present a model for style in natural languages, and demonstrate style processing in three ways: Our system analyzes styles in quantifiable terms according to our model (analysis), associates documents based on stylistic similarity to known corpora (classification) and manipulates texts to match a desired …