Open Access. Powered by Scholars. Published by Universities.®

Computational Linguistics Commons

Articles 1 - 30 of 229

Full-Text Articles in Computational Linguistics

Skyler's Lunch, Noah Sherman, Autumn Boone, Hilaria Cruz Apr 2024

LING 590/Internet Language

Our class was studying the use of emojis across different platforms and wanted to explore how stories using emojis could impact young readers. Here, we try to translate the story of Skyler into emoji, providing translations along the way. We replace words completely with emoji, represent phrases with a few emoji, and use additional emoji to make sense of the content, including punctuation. In this book, we explore the character of Skyler, who is a picky eater. But they learn to eat the nutritious food that is good for them. In the end, they even get a reward!


Retórica Intercultural En El Discurso Académico Universitario: Las Funciones Retóricas De La Citación En Los Trabajos De Fin De Máster Escritos En Español Y En Inglés Por Hablantes Nativos Y No Nativos, David Sanchez-Jimenez Feb 2024

Publications and Research

This research stems from an interest in the cultural differences in citation practices in the academic genre of the master's thesis among native Spanish writers (Ee), non-native Filipino writers of Spanish (Fe), native Filipino writers of English (Fi), and American writers of English. A total of thirty-two (32) master's theses – eight (8) for each group – were analyzed. A quantitative and qualitative methodology was used to study this phenomenon, based on computerized textual analysis of the rhetorical functions of citations arranged in a typological classification that modified the outline proposed by Petrić in a 2007 article. The results obtained from …


How Do We Learn What We Cannot Say?, Daniel Yakubov Feb 2024

Dissertations, Theses, and Capstone Projects

The contributions of this thesis are two-fold. First, this thesis presents UDTube, an easily usable software developed to perform morphological analysis in a multi-task fashion. This work shows the strong performance of UDTube versus the current state-of-the-art, UDPipe, across eight languages, primarily in the annotation of morphological features. The second contribution of this thesis is an exploration into the study of defectivity. UDTube is used to annotate a large amount of data in Greek and Russian which is ultimately used to investigate the plausibility of Indirect Negative Evidence (INE), a popular approach to the acquisition of morphological defectivity. The reported …


Consonant (De)Gradation In Ingrian?, Andrea M. Harrison Feb 2024

Dissertations, Theses, and Capstone Projects

This paper will present a dual method toward data enrichment for low-resource languages. Using Yoyodyne -- a Fairseq-inspired neural library for small-vocabulary sequence-to-sequence generation -- a morphological generation task was tested across labeled data encompassing multiple stages of enrichment for the low-resource language Ingrian. Due to limitations in the available data for Ingrian, weighted finite-state transducers (WFSTs) were used to generate an expanded vocabulary via HFST's toolkit for Uralic languages, and GiellaLT, a source for FST-driven lexica for low-resource languages. Further stages of experimentation used labeled data from related, higher-resource languages (Finnish, Estonian) to encourage cross-lingual transfer in the interest …


The Ring Cycle: Journeying Through The Language Of Tolkien’s Third Age With Corpus Linguistics, Michael Livesey Jan 2024

Journal of Tolkien Research

This article explores the journey taken by the One Ring across J.R.R. Tolkien’s Third Age writings. It employs a digital humanities approach to analyse linguistic patterns in Tolkien’s use of the word ring, across The Hobbit and The Lord of the Rings. Specifically, the article employs corpus linguistic methods to track shifts in the quantities and qualities of the Ring’s appearance across these texts. It uses techniques of keyness and collocation analysis to trace transformations in these quantities/qualities, including: a) the Ring’s transition from a central to a peripheral place in the Third Age’s narrative arc; and b) …


Guilty Machines: On Ab-Sens In The Age Of AI, Dylan Lackey, Katherine Weinschenk Dec 2023

Critical Humanities

For Lacan, guilt arises in the sublimation of ab-sens (non-sense) into the symbolic comprehension of sen-absexe (sense without sex, sense in the deficiency of sexual relation), or in the maturation of language to sensibility through the effacement of sex. Though, as Slavoj Žižek himself points out in a recent article regarding ChatGPT, the split subject always misapprehends the true reason for guilt’s manifestation, such guilt at best provides a sort of evidence for the inclusion of the subject in the order of language, acting as a necessary, even enjoyable mark of the subject’s coherence (or, more importantly, the subject’s separation …


The Near-Synonymous Classifiers In Mandarin Chinese: Etymology, Modern Usage, And Possible Problems In L2 Classroom, Irina Kavokina Nov 2023

Masters Theses

Many Chinese classifiers are nearly synonymous: they can be used with the same head nouns without changing the meaning of the sentence. In other words, such classifiers can be used interchangeably or almost interchangeably. This poses a challenge for Chinese language learners, especially those who lack such a grammatical category in their own native language. Another complication arises from the ambiguous English translations of many classifiers.

In this paper we investigate the collocation behavior of near-synonymous Chinese classifiers, focusing on their semantic nuances and interchangeability. Analyzing 6 pairs of classifiers — 栋 and 幢, 匹 and 头, 批 and …


Executive Order On The Safe, Secure, And Trustworthy Development And Use Of Artificial Intelligence, Joseph R. Biden Oct 2023

Copyright, Fair Use, Scholarly Communication, etc.

Section 1. Purpose. Artificial intelligence (AI) holds extraordinary potential for both promise and peril. Responsible AI use has the potential to help solve urgent challenges while making our world more prosperous, productive, innovative, and secure. At the same time, irresponsible use could exacerbate societal harms such as fraud, discrimination, bias, and disinformation; displace and disempower workers; stifle competition; and pose risks to national security. Harnessing AI for good and realizing its myriad benefits requires mitigating its substantial risks. This endeavor demands a society-wide effort that includes government, the private sector, academia, and civil society.

My Administration places the highest urgency …


Towards Interpretable Machine Reading Comprehension With Mixed Effects Regression And Exploratory Prompt Analysis, Luca Del Signore Sep 2023

Dissertations, Theses, and Capstone Projects

We investigate the properties of natural language prompts that determine their difficulty in machine reading comprehension tasks. While much work has been done benchmarking language model performance at the task level, there is considerably less literature focused on how individual task items can contribute to interpretable evaluations of natural language understanding. Such work is essential to deepening our understanding of language models and ensuring their responsible use as a key tool in human-machine communication. We perform an in-depth mixed effects analysis on the behavior of three major generative language models, comparing their performance on a large reading comprehension …


A Computational Analysis Of Volodymyr Zelenskyy's Public Diplomacy Discourse In Times Of Crisis, Amber Brittain-Hale Jul 2023

Education Division Scholarship

In this study, we delve into the public diplomacy discourse of Ukrainian President Volodymyr Zelenskyy during the ongoing crisis of the Russo-Ukrainian War. We aim to conduct a computational analysis of Zelenskyy's English, Russian, and Ukrainian speeches, exploring the linguistic patterns and code-switching employed in his discourse. The study period encompasses Russia’s build-up to and full-scale invasion of Ukraine from May 2019 to May 30, 2023. This time frame is crucial as it captures the dynamic development of the crisis and the expansion of Zelenskyy's presidency, providing a unique context for analyzing his public diplomacy efforts. By utilizing Linguistic Inquiry …


Ideology Prediction From Scarce And Biased Supervision: Learn To Disregard The “What” And Focus On The “How”!, Chen Chen, Dylan Walker, Venkatesh Saligrama Jul 2023

Business Faculty Articles and Research

We propose a novel supervised learning approach for political ideology prediction (PIP) that is capable of predicting out-of-distribution inputs. This problem is motivated by the fact that manual data-labeling is expensive, while self-reported labels are often scarce and exhibit significant selection bias. We propose a novel statistical model that decomposes the document embeddings into a linear superposition of two vectors: a latent neutral context vector independent of ideology, and a latent position vector aligned with ideology. We train an end-to-end model that has intermediate contextual and positional vectors as outputs. At deployment time, our model predicts labels for input documents …


Destined Failure, Chengjun Pan Jun 2023

Masters Theses

I attempt to examine the complex structure of human communication, explaining why it is bound to fail. By reproducing experienceable phenomena, I demonstrate how they can expose communication structure and reveal the limitations of our perception and symbolization. I divide the process of communication into six stages: input, detection, symbolization, dictionary, interpretation, and output. In this thesis, I examine the flaws and challenges that arise in the first five stages. I argue that reception acts as a filter and that understanding relies on a symbolic system that is full of redundancies. Therefore, every interpretation is destined to be a deviation.


The Sociolinguistics Of Code-Switching In Hong Kong’s Digital Landscape: A Mixed-Methods Exploration Of Cantonese-English Alternation Patterns On WhatsApp, Wilkinson Daniel Wong Gonzales, Yuen Man Tsang Jun 2023

Journal of English and Applied Linguistics

This paper examines the prevalence of Cantonese-English code-mixing in Hong Kong through an under-researched digital medium. Prior research on this code-alternation practice has often been limited to exploring either the social or linguistic constraints of code-switching in spoken or written communication. Our study takes a holistic approach to analyzing code-switching in a hybrid medium that exhibits features of both spoken and written discourse. We specifically analyze the code-switching patterns of 24 undergraduates from a Hong Kong university on WhatsApp and examine how both social and linguistic factors potentially constrain these patterns. Utilizing a self-compiled sociolinguistic corpus as well as survey …


Topics For He But Not For She: Quantifying And Classifying Gender Bias In The Media, Tyler J. Lanni Jun 2023

Dissertations, Theses, and Capstone Projects

In this study, we used computational techniques to analyze the language used in news articles to describe female and male politicians. Our corpus included 370 subtexts for male candidates and 374 subtexts for female candidates, gathered through the New York Times API. We conducted two experiments: an LDA topic analysis to explore the data, and a logistic regression to classify the subtexts as either male or female. Our analysis revealed some noteworthy findings that suggest the possibility of developing a gender bias classifier in the future. However, to create a more robust understanding of bias, additional research and data are …


Evaluating Neural Networks As Cognitive Models For Learning Quasi-Regularities In Language, Xiaomeng Ma Jun 2023

Dissertations, Theses, and Capstone Projects

Many aspects of language can be categorized as quasi-regular: the relationship between the inputs and outputs is systematic but allows many exceptions. Common domains that contain quasi-regularity include morphological inflection and grapheme-phoneme mapping. How humans process quasi-regularity has been debated for decades. This thesis implemented modern neural network models, transformer models, on two tasks: English past tense inflection and Chinese character naming, to investigate how transformer models perform quasi-regularity tasks. This thesis focuses on investigating to what extent the models' performances can represent human behavior. The results show that the transformers' performance is very similar to human behavior in many …


Neural Network Vs. Rule-Based G2P: A Hybrid Approach To Stress Prediction And Related Vowel Reduction In Bulgarian, Maria Karamihaylova Jun 2023

Dissertations, Theses, and Capstone Projects

An effective grapheme-to-phoneme (G2P) conversion system is a critical element of speech synthesis. Rule-based systems were an early method for G2P conversion. In recent years, machine learning tools have been shown to outperform rule-based approaches in G2P tasks. We investigate neural network sequence-to-sequence modeling for the prediction of syllable stress and resulting vowel reductions in the Bulgarian language. We then develop a hybrid G2P approach which combines manually written grapheme-to-phoneme mapping rules with neural network-enabled syllable stress predictions by inserting stress markers in the predicted stress position of the transcription produced by the rule-based finite-state transducer. Finally, we apply vowel …


AI Approaches To Understand Human Deceptions, Perceptions, And Perspectives In Social Media, Chih-Yuan Li May 2023

Dissertations

Social media platforms have created a virtual space where users share user-generated information, connect, and interact. However, there are research and societal challenges: 1) users generate and share disinformation; 2) it is difficult to understand citizens' perceptions or opinions expressed on a wide variety of topics; and 3) there are information overload and echo chamber problems, with no overall understanding of the different perspectives taken by different people or groups.

This dissertation addresses these three research challenges with advanced AI and Machine Learning approaches. To address fake news, understood as deception about facts, this dissertation presents Machine …


Predicting High-Cap Tech Stock Polarity: A Combined Approach Using Support Vector Machines And Bidirectional Encoders From Transformers, Ian L. Grisham May 2023

Electronic Theses and Dissertations

The abundance, accessibility, and scale of data have engendered an era where machine learning can quickly and accurately solve complex problems, identify complicated patterns, and uncover intricate trends. One research area where many have applied these techniques is the stock market. Yet, financial domains are influenced by many factors and are notoriously difficult to predict due to their volatile and multivariate behavior. However, the literature indicates that public sentiment data may exhibit significant predictive qualities and improve a model’s ability to predict intricate trends. In this study, momentum SVM classification accuracy was compared between datasets that did and did not …


Improving Sign Recognition With Phonology, Lee Kezar, Jesse Thomason, Zed Sevcikova Sehyr May 2023

Communication Sciences and Disorders Faculty Articles and Research

We use insights from research on American Sign Language (ASL) phonology to train models for isolated sign language recognition (ISLR), a step towards automatic sign language understanding. Our key insight is to explicitly recognize the role of phonology in sign production to achieve more accurate ISLR than existing work which does not consider sign language phonology. We train ISLR models that take in pose estimations of a signer producing a single sign to predict not only the sign but additionally its phonological characteristics, such as the handshape. These auxiliary predictions lead to a nearly 9% absolute gain in sign recognition …


Content-Based Unsupervised Fake News Detection On Ukraine-Russia War, Yucheol Shin, Yvan Sojdehei, Limin Zheng, Brad Blanchard Apr 2023

SMU Data Science Review

The Ukrainian-Russian war has garnered significant attention worldwide, with fake news obstructing the formation of public opinion and disseminating false information. This scholarly paper explores the use of unsupervised learning methods and the Bidirectional Encoder Representations from Transformers (BERT) to detect fake news in news articles from various sources. BERT topic modeling is applied to cluster news articles by their respective topics, followed by summarization to measure the similarity scores. The hypothesis posits that topics with larger variances are more likely to contain fake news. The proposed method was evaluated using a dataset of approximately 1000 labeled news articles related …


Single-Case Pilot Study For Longitudinal Analysis Of Referential Failures And Sentiment In Schizophrenic Speech From Client-Centered Psychotherapy Recordings, Travis A. Musich Apr 2023

Dissertations

Though computational linguistic analyses have revealed the presence of distinctly characteristic language features in schizophrenic disordered speech, the relative stability of these language features in longitudinal samples is still unknown. This longitudinal pilot study analyzed schizophrenic disordered speech data from the archival therapy audio recordings of one patient spanning 23 years. End-to-end Neural Coreference Resolution software was used to analyze transcribed speech data from three therapy sessions to identify ambiguous pronouns, referred to as referential failures, which were reviewed and confirmed by multiple raters. Speech samples were analyzed using Google Cloud Natural Language API software for sentiment variables (i.e., score, …


ChatGPT As Metamorphosis Designer For The Future Of Artificial Intelligence (AI): A Conceptual Investigation, Amarjit Kumar Singh (Library Assistant), Dr. Pankaj Mathur (Deputy Librarian) Mar 2023

Library Philosophy and Practice (e-journal)

Purpose: The purpose of this research paper is to explore ChatGPT’s potential as an innovative designer tool for the future development of artificial intelligence. Specifically, this conceptual investigation aims to analyze ChatGPT’s capabilities as a tool for designing and developing near-human intelligent systems for future use in the field of Artificial Intelligence (AI). The paper also analyzes the strengths and weaknesses of ChatGPT as a tool and identifies possible areas for improvement in its development and implementation. This investigation focused on the various features and functions of ChatGPT that …


A Sentiment Analysis Of "Filipinx" On Twitter Using A Multinomial Naïve Bayes Classification Model, Clarisse Taboy Feb 2023

Dissertations, Theses, and Capstone Projects

On social media, the use of “Filipinx” as a gender neutral, inclusive term for “Filipino” tends to generate high user engagement, at times without regard for the original context in which the word appears. This project applies computational methods to collect a large dataset in English/Filipino from Twitter containing “Filipinx”, and to train a Naïve Bayes model to classify tweets into three sentiments: positive, neutral, and negative. My methodology takes inspiration from that of four related studies that similarly conducted sentiment analysis on English/Filipino tweets involving various topics, and whose resulting accuracy scores were compared side-by-side. Conducting sentiment analysis on …


Simulating The Machine Translation Of Low-Resource Languages By Designing A Translator Between English And An Artificially Constructed Language, Michaela Snyder Jan 2023

Mahurin Honors College Capstone Experience/Thesis Projects

Natural language processing (NLP), or the use of computers to analyze natural language, is a field that relies heavily on syntax. It would seem intuitive that computers would thrive in this area due to their strict syntax requirements, but the syntax of natural languages leaves them unable to properly parse and generate sentences that seem normal to the average speaker. A subfield of NLP, machine translation, works mainly to computerize translation between different languages. Unfortunately, such translation is not without its weaknesses; language documentation is not created equal, and many low-resource languages—languages with relatively few kinds of documentation, most often …


‘A Category Of Their Own’: Quantitative Methods In The Use Of Pile-Sort Data In Perceptual Dialectology, Zachary Ty Gill Jan 2023

Theses and Dissertations--Linguistics

The purpose of this study is to investigate how Mississippi Gulf Coast Creoles perceive language differences in their home area. A pile-sort task was carried out in which respondents were given stacks of cards with local communities written on them and instructed to stack together the regions where people “talk the same.” Once the piles were made, the fieldworker discussed their sortings with the respondents. The stacks were analyzed by means of a hierarchical agglomerative cluster analysis and non-parametric multidimensional scaling with k-means cluster analysis overlays to extract the perceived dialect areas. The groupings reveal that respondent strategies are based …


Automatic Transcription Of Northern Prinmi Oral Art: Approaches And Challenges To Automatic Speech Recognition For Language Documentation, Connor Bechler Jan 2023

Theses and Dissertations--Linguistics

One significant issue facing language documentation efforts is the transcription bottleneck: each documented recording must be transcribed and annotated, and these tasks are extremely labor intensive (Ćavar et al., 2016). Researchers have sought to accelerate these tasks with partial automation via forced alignment, natural language processing, and automatic speech recognition (ASR) (Neubig et al., 2020). Neural network—especially transformer-based—approaches have enabled large advances in ASR over the last decade. Models like XLSR-53 promise improved performance on under-resourced languages by leveraging massive data sets from many different languages (Conneau et al., 2020). This project extends these efforts to a novel context, applying …


Evaluation Of Different Machine Learning, Deep Learning And Text Processing Techniques For Hate Speech Detection, Nabil Shawkat Jan 2023

MSU Graduate Theses

Social media has become a domain that involves a lot of hate speech. Some users feel entitled to engage in abusive conversations by sending abusive messages, tweets, or photos to other users. It is critical to detect hate speech and prevent innocent users from becoming victims. In this study, I explore the effectiveness and performance of various machine learning methods employing text processing techniques to create a robust system for hate speech identification. I assess the performance of Naïve Bayes, Support Vector Machines, Decision Trees, Random Forests, Logistic Regression, and K Nearest Neighbors using three distinct datasets sourced from social …


Brazilian Portuguese-Russian (BraPoRus) Corpus: Automatic Transcription And Acoustic Quality Of Elderly Speech During COVID-19 Pandemic, Irina A. Sekerina, Anna Smirnova Henriques, Aleksandra Skorobogatova, Natalia Tyulina, Tatiana V. Kachkovskaia, Svetlana Ruseishvili, Sandra Madureira Jan 2023

Publications and Research

This article presents the Brazilian Portuguese-Russian (BraPoRus) corpus, whose goal is to collect, analyze, and preserve for posterity the spoken heritage Russian still used today in Brazil by approximately 1,500 elderly bilingual heritage Russian–Brazilian Portuguese speakers. Their unique 100-year-old variety of moribund Russian is disappearing because it has not been passed to their descendants born in Brazil. During the COVID-19 pandemic, we remotely collected 170 h of speech samples in heritage Russian from 26 participants (mean age = 75.7 years) in naturalistic settings using Zoom or a phone call. To estimate the quality of collected data, we focus on two methodological …


Technology In The Classroom: The Features Language Teachers Should Consider, Sophie Cuocci, Padideh Fattahi Marnani Dec 2022

Journal of English Learner Education

The fast development of technology and the new generation of highly computer-literate students have led many to consider the integration of technology in schools essential. Throughout the last two decades, research has identified multiple factors leading to the successful and unsuccessful integration of technology in the classroom. Educators must consider these factors when deciding which technology tools to use and how to integrate them into their lessons. Simultaneously, the increasing number of English learners in the United States calls for the identification of teaching strategies that will best support their needs. Many language teachers now rely on teaching techniques …


Creating Data From Unstructured Text With Context Rule Assisted Machine Learning (CRAML), Stephen Meisenbacher, Peter Norlander Dec 2022

School of Business: Faculty Publications and Other Works

Popular approaches to building data from unstructured text come with limitations, such as scalability, interpretability, replicability, and real-world applicability. These can be overcome with Context Rule Assisted Machine Learning (CRAML), a method and no-code suite of software tools that builds structured, labeled datasets which are accurate and reproducible. CRAML enables domain experts to access uncommon constructs within a document corpus in a low-resource, transparent, and flexible manner. CRAML produces document-level datasets for quantitative research and makes qualitative classification schemes scalable over large volumes of text. We demonstrate that the method is useful for bibliographic analysis, transparent analysis of proprietary data, …