Digital Commons Network
Open Access. Powered by Scholars. Published by Universities.®

Linguistics

City University of New York (CUNY)

Theses/Dissertations

Natural language processing

Articles 1 - 5 of 5

Full-Text Articles in Entire DC Network

Expanding The Corpus Of Vocalized Hebrew Text: Compiling An Unvocalized Text Corpus And Building An Online Interface For Vocalization Annotation, Rachel Shanblatt Bloch Jun 2024

Dissertations, Theses, and Capstone Projects

Written modern Hebrew presents a unique challenge for training computational language-processing models because modern Hebrew text often lacks vocalization. The scarcity of vocalized Hebrew data can lead to ambiguity when training these models and generally hinders work on natural language processing problems. The goal of this project is to expand the collection of vocalized Hebrew text by compiling and preprocessing a large corpus of unvocalized Hebrew text and building an online annotation tool. The annotation tool allows people to upload unvocalized Hebrew text, annotate it by adding Hebrew vocalization, and download comma-separated values files of …
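
The distinction between vocalized and unvocalized text that drives this project can be illustrated with a minimal Python sketch (not code from the dissertation): Hebrew vowel points and cantillation marks are Unicode combining characters, so stripping them yields the unvocalized form.

import unicodedata

def strip_vocalization(text: str) -> str:
    """Remove Hebrew vowel points and cantillation marks, keeping base letters."""
    # Niqqud and cantillation are nonspacing combining marks (Unicode category 'Mn').
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

print(strip_vocalization("שָׁלוֹם"))  # -> שלום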


Label Imputation For Homograph Disambiguation: Theoretical And Practical Approaches, Jennifer M. Seale Sep 2021

Dissertations, Theses, and Capstone Projects

This dissertation presents the first implementation of label imputation for the task of homograph disambiguation using 1) transcribed audio, and 2) parallel, or translated, corpora. For label imputation from parallel corpora, a hypothesis of interlingual alignment between homograph pronunciations and text word forms is developed and formalized. Both audio and parallel corpora label imputation techniques are tested empirically in experiments that compare homograph disambiguation model performance using: 1) hand-labeled training data, and 2) hand-labeled training data augmented with label-imputed data. Regularized, multinomial logistic regression and pre-trained ALBERT, BERT, and XLNet language models fine-tuned as token classifiers are developed for homograph …
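
To make the token-classification setup concrete, here is a minimal sketch under stated assumptions (the checkpoint and pronunciation labels are illustrative, not the dissertation's): a pre-trained BERT model is loaded as a token classifier; a real experiment would fine-tune it on labeled data before prediction.

import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

labels = ["bass_fish", "bass_music", "O"]  # hypothetical pronunciation labels
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(labels)
)

sentence = "She plays bass in a jazz trio."
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, sequence_length, num_labels)

# Take the highest-scoring label per subword token (meaningful only after fine-tuning).
predictions = logits.argmax(dim=-1)[0].tolist()
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
for token, label_id in zip(tokens, predictions):
    print(token, labels[label_id])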


Mitigating Gender Bias In Neural Machine Translation Using Counterfactual Data, Alan Wong Sep 2020

Dissertations, Theses, and Capstone Projects

Recent advances in deep learning have greatly improved researchers' ability to develop effective machine translation systems. In particular, the application of modern neural architectures, such as the Transformer, has achieved state-of-the-art BLEU scores in many translation tasks. However, it has been found that even state-of-the-art neural machine translation models can suffer from certain implicit biases, such as gender bias (Lu et al., 2019). In response to this issue, researchers have proposed various solutions: some inject missing gender information into models, while others modify the training data itself. We focus on mitigating …
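
One common form of counterfactual data augmentation is sketched below under simplifying assumptions (a small hand-written swap list, whitespace tokenization, lowercasing); the thesis's actual method may differ. It creates additional training sentences by swapping gendered words in the source text.

# Swap list and tokenization are deliberately simplistic for illustration.
GENDER_SWAP = {
    "he": "she", "she": "he",
    "him": "her",
    "his": "hers", "hers": "his",
    "man": "woman", "woman": "man",
}

def counterfactual(sentence: str) -> str:
    """Return a copy of the sentence with gendered tokens swapped."""
    tokens = sentence.lower().split()
    return " ".join(GENDER_SWAP.get(tok, tok) for tok in tokens)

print(counterfactual("She thanked him for the review"))
# -> "he thanked her for the review"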


Does The Word "Chien" Bark? Representation Learning In Neural Machine Translation Encoders, Emily Campbell Sep 2020

Does The Word "Chien" Bark? Representation Learning In Neural Machine Translation Encoders, Emily Campbell

Dissertations, Theses, and Capstone Projects

This thesis presents experiments that use representation learning to explore how neural networks learn. Neural networks that take text as input create internal representations of the text during training. Recent work has found that these representations can be used to perform other downstream linguistic tasks, such as part-of-speech (POS) tagging, which demonstrates that the networks are learning linguistic information and storing it in the representations. We focus on the representations created by neural machine translation (NMT) models and whether they can be used for POS tagging. We train five NMT models, including an auto-encoder. We extract the …
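
The probing setup can be illustrated with a minimal sketch (an assumption about the general workflow, not the thesis's code): token-level vectors from an NMT encoder are fed to a simple classifier that predicts POS tags. encode_tokens below is a hypothetical stand-in for the real encoder.

import numpy as np
from sklearn.linear_model import LogisticRegression

def encode_tokens(tokens):
    """Placeholder for a trained NMT encoder: returns one 512-dim vector per token."""
    vecs = []
    for tok in tokens:
        rng = np.random.default_rng(abs(hash(tok)) % (2**32))  # deterministic per token
        vecs.append(rng.normal(size=512))
    return np.vstack(vecs)

train_tokens = ["the", "dog", "barks", "a", "cat", "sleeps"]
train_tags   = ["DET", "NOUN", "VERB", "DET", "NOUN", "VERB"]

# Train a linear probe on the encoder representations, then tag a new sentence.
probe = LogisticRegression(max_iter=1000).fit(encode_tokens(train_tokens), train_tags)
print(probe.predict(encode_tokens(["the", "cat", "barks"])))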


Systematic Comparison Of Cross-Lingual Projection Techniques For Low-Density Nlp Under Strict Resource Constraints, Joshua Waxman Oct 2014

Dissertations, Theses, and Capstone Projects

The field of low-density NLP is often approached from an engineering perspective, and evaluations are typically haphazard, considering different architectures, different languages, and different available resources, without a systematic comparison. The resulting architectures are then tested on the unique corpus and language for which each approach was designed. This makes it difficult to evaluate which approach is truly the "best," or which approaches are best for a given language.

In this dissertation, several state-of-the-art architectures and approaches to part-of-speech (POS) tagging for low-density languages are reimplemented; all of these techniques exploit a relationship between a high-density (HD) …
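
The core idea behind cross-lingual projection for POS tagging can be sketched as follows (an illustration of the general technique, not the dissertation's implementation): tags predicted for a high-density-language sentence are copied to its low-density-language translation through word alignments.

def project_tags(hd_tags, alignments, ld_length, default="X"):
    """Copy POS tags from an HD sentence to its LD translation via word alignments.

    alignments: list of (hd_index, ld_index) word-alignment pairs.
    Unaligned LD words keep the default tag.
    """
    ld_tags = [default] * ld_length
    for hd_i, ld_i in alignments:
        ld_tags[ld_i] = hd_tags[hd_i]
    return ld_tags

hd_tags = ["DET", "NOUN", "VERB"]            # tags for an HD sentence, e.g. "the dog barks"
alignments = [(0, 0), (1, 1), (2, 2)]        # toy 1:1 alignment
print(project_tags(hd_tags, alignments, ld_length=3))  # ['DET', 'NOUN', 'VERB']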