Open Access. Powered by Scholars. Published by Universities.®

Computer Sciences Commons

Articles 1 - 17 of 17

Full-Text Articles in Computer Sciences

IITD At The WANLP 2022 Shared Task: Multilingual Multi-Granularity Network For Propaganda Detection, Shubham Mittal, Preslav Nakov Dec 2022

Natural Language Processing Faculty Publications

We present our system for the two subtasks of the shared task on propaganda detection in Arabic, part of WANLP'2022. Subtask 1 is a multi-label classification problem to find the propaganda techniques used in a given tweet. Our system for this task uses XLM-R to predict probabilities for the target tweet to use each of the techniques. In addition to finding the techniques, Subtask 2 further asks to identify the textual span for each instance of each technique that is present in the tweet; the task can be modeled as a sequence tagging problem. We use a multi-granularity network with …
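The Subtask 1 setup described above can be sketched as an independent per-technique decision. The following is a hypothetical illustration, not the authors' code: an encoder (XLM-R in the paper) is assumed to yield one logit per propaganda technique, and each technique is predicted independently through a sigmoid. The technique names and the 0.5 threshold are illustrative assumptions.

```python
import math

# Illustrative technique inventory (an assumption, not the shared task's full list).
TECHNIQUES = ["Loaded Language", "Exaggeration", "Name Calling"]

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def predict_techniques(logits, threshold=0.5):
    """Return techniques whose independent probability clears the threshold."""
    return [t for t, logit in zip(TECHNIQUES, logits)
            if sigmoid(logit) >= threshold]

# Strong evidence for the first technique, weak for the other two.
print(predict_techniques([2.0, -1.0, -0.2]))  # → ['Loaded Language']
```

Because each label gets its own sigmoid, any subset of techniques (including none) can be predicted for a tweet, which is what distinguishes multi-label from multi-class classification.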


Predicting Publication Of Clinical Trials Using Structured And Unstructured Data: Model Development And Validation Study, Siyang Wang, Simon Šuster, Timothy Baldwin, Karin Verspoor Dec 2022

Natural Language Processing Faculty Publications

Background: Publication of registered clinical trials is a critical step in the timely dissemination of trial findings. However, a significant proportion of completed clinical trials are never published, motivating the need to analyze the factors behind success or failure to publish. This could inform study design, help regulatory decision-making, and improve resource allocation. It could also enhance our understanding of bias in the publication of trials and publication trends based on the research direction or strength of the findings. Although the publication of clinical trials has been addressed in several descriptive studies at an aggregate level, there is a lack …


Assisting The Human Fact-Checkers: Detecting All Previously Fact-Checked Claims In A Document, Shaden Shaar, Nikola Georgiev, Firoj Alam, Giovanni Da San Martino, Aisha Mohamed, Preslav Nakov Dec 2022

Natural Language Processing Faculty Publications

Given the recent proliferation of false claims online, there has been a lot of manual fact-checking effort. As this is very time-consuming, human fact-checkers can benefit from tools that can support them and make them more efficient. Here, we focus on building a system that could provide such support. Given an input document, it aims to detect all sentences that contain a claim that can be verified by some previously fact-checked claims (from a given database). The output is a re-ranked list of the document sentences, so that those that can be verified are ranked as high as possible, together …
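The re-ranking idea can be sketched as follows. This is an illustrative stand-in, not the paper's trained model: each document sentence is scored by its best match against a database of previously fact-checked claims, and sentences are then sorted so verifiable ones rank highest. Cosine similarity over word counts substitutes for the learned matcher, and the claims and sentences are made up.

```python
from collections import Counter
import math

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def rerank(sentences, claim_db):
    """Sort sentences by their best similarity to any claim in the database."""
    def best_score(sentence):
        vec = Counter(sentence.lower().split())
        return max(cosine(vec, Counter(c.lower().split())) for c in claim_db)
    return sorted(sentences, key=best_score, reverse=True)

claims = ["the earth is flat"]
doc = ["The weather was nice today.", "He claimed the earth is flat."]
print(rerank(doc, claims))
```

A fact-checker reading the re-ranked list top-down then sees the sentences most likely to be already debunked first.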


PASTA: Table-Operations Aware Fact Verification Via Sentence-Table Cloze Pre-Training, Zihui Gu, Ju Fan, Nan Tang, Preslav Nakov, Xiaoman Zhao, Xiaoyong Du Dec 2022

Natural Language Processing Faculty Publications

Fact verification has attracted a lot of research attention recently, e.g., in journalism, marketing, and policymaking, as misinformation and disinformation online can sway one's opinion and affect one's actions. While fact-checking is a hard task in general, in many cases, false statements can be easily debunked based on analytics over tables with reliable information. Hence, table-based fact verification has recently emerged as an important and growing research area. Yet, progress has been limited due to the lack of datasets that can be used to pre-train language models (LMs) to be aware of common table operations, such as aggregating a column …
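The sentence-table cloze idea can be sketched as follows. This is a hypothetical illustration of the general recipe, not the paper's implementation: a value obtained by a table operation (here, summing a column) is blanked out of a sentence aligned with the table, and the pre-training task is to recover it from the table. The table contents and the sentence are illustrative assumptions.

```python
# A tiny table; the model is meant to learn operations over columns like this.
table = {"team": ["A", "B", "C"], "points": [3, 5, 2]}

def make_cloze(sentence: str, answer, blank: str = "[MASK]"):
    """Replace the answer span in the sentence with a mask token."""
    return sentence.replace(str(answer), blank), str(answer)

total = sum(table["points"])  # the table operation (column aggregation)
cloze, answer = make_cloze(f"The three teams scored {total} points in total.", total)
print(cloze)   # The three teams scored [MASK] points in total.
print(answer)  # 10
```

Training on many such (masked sentence, table, answer) triples is what would make an LM "aware" of operations like aggregation before it is fine-tuned for verification.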


Overview Of The WANLP 2022 Shared Task On Propaganda Detection In Arabic, Firoj Alam, Hamdy Mubarak, Wajdi Zaghouani, Giovanni Da San Martino, Preslav Nakov Dec 2022

Natural Language Processing Faculty Publications

Propaganda is the expression of an opinion or an action by an individual or a group deliberately designed to influence the opinions or the actions of other individuals or groups with reference to predetermined ends, which is achieved by means of well-defined rhetorical and psychological devices. Propaganda techniques are commonly used in social media to manipulate or to mislead users. Thus, there has been a lot of recent research on automatic detection of propaganda techniques in text as well as in memes. However, so far the focus has been primarily on English. With the aim to bridge this language gap, …


Supervised Acoustic Embeddings And Their Transferability Across Languages, Sreepratha Ram, Hanan Aldarmaki Dec 2022

Natural Language Processing Faculty Publications

In speech recognition, it is essential to model the phonetic content of the input signal while discarding irrelevant factors such as speaker variations and noise, which is challenging in low-resource settings. Self-supervised pretraining has been proposed as a way to improve both supervised and unsupervised speech recognition, including frame-level feature representations and Acoustic Word Embeddings (AWE) for variable-length segments. However, self-supervised models alone cannot learn perfect separation of the linguistic content as they are trained to optimize indirect objectives. In this work, we experiment with different pre-trained self-supervised features as input to AWE models and show that they work best …
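What an acoustic word embedding does can be sketched minimally, under simplifying assumptions: a variable-length sequence of frame-level feature vectors (self-supervised features in the paper) is mapped to one fixed-dimensional vector. Mean pooling stands in here for the learned AWE model, and the toy frame vectors are made up.

```python
def mean_pool(frames):
    """Average equal-length frame vectors into a single fixed-size embedding."""
    dim, n = len(frames[0]), len(frames)
    return [sum(f[d] for f in frames) / n for d in range(dim)]

# Segments of different durations yield embeddings of the same size,
# which is what makes variable-length speech segments comparable.
short_word = [[1.0, 2.0], [3.0, 4.0]]
long_word = [[0.0, 0.0], [2.0, 2.0], [4.0, 4.0]]
print(mean_pool(short_word))  # [2.0, 3.0]
print(mean_pool(long_word))   # [2.0, 2.0]
```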


GREENER: Graph Neural Networks For News Media Profiling, Panayot Panayotov, Utsav Shukla, Husrev T. Sencar, Mohamed Nabeel, Preslav Nakov Dec 2022

Natural Language Processing Faculty Publications

We study the problem of profiling news media on the Web with respect to their factuality of reporting and bias. This is an important but under-studied problem related to disinformation and “fake news” detection, but it addresses the issue at a coarser granularity compared to looking at an individual article or an individual claim. This is useful as it allows us to profile entire media outlets in advance. Unlike previous work, which has focused primarily on text (e.g., on the articles published by the target website, or on the textual description in their social media profiles or in Wikipedia), here we …


A Survey On Multimodal Disinformation Detection, Firoj Alam, Stefano Cresci, Tanmoy Chakraborty, Fabrizio Silvestri, Dimitar Dimitrov, Giovanni Da San Martino, Shaden Shaar, Hamed Firooz, Preslav Nakov Oct 2022

Natural Language Processing Faculty Publications

Recent years have witnessed the proliferation of offensive content online such as fake news, propaganda, misinformation, and disinformation. While initially this was mostly about textual content, over time images and videos gained popularity, as they are much easier to consume, attract more attention, and spread further than text. As a result, researchers started leveraging different modalities and combinations thereof to tackle online multimodal offensive content. In this study, we offer a survey on the state-of-the-art on multimodal disinformation detection covering various combinations of modalities: text, images, speech, video, social media network structure, and temporal information. Moreover, while some studies focused …


Noisy Label Regularisation For Textual Regression, Yuxia Wang, Timothy Baldwin, Karin Verspoor Oct 2022

Natural Language Processing Faculty Publications

Training with noisy labelled data is known to be detrimental to model performance, especially for high-capacity neural network models in low-resource domains. Our experiments suggest that standard regularisation strategies, such as weight decay and dropout, are ineffective in the face of noisy labels. We propose a simple noisy label detection method that prevents error propagation from the input layer. The approach is based on the observation that the projection of noisy labels is learned through memorisation at advanced stages of learning, and that the Pearson correlation is sensitive to outliers. Extensive experiments over real-world human-disagreement annotations as well as randomly-corrupted …
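The observation the abstract relies on, that the Pearson correlation is highly sensitive to outliers, can be demonstrated directly: a single corrupted label sharply degrades the correlation between gold scores and predictions. The toy data below are assumptions for illustration, not the paper's experiments.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

gold = [1.0, 2.0, 3.0, 4.0, 5.0]
clean_preds = [1.1, 2.0, 2.9, 4.2, 5.0]
noisy_preds = [1.1, 2.0, 2.9, 4.2, -5.0]  # one corrupted (noisy) label

print(round(pearson(gold, clean_preds), 3))  # near 1.0
print(round(pearson(gold, noisy_preds), 3))  # drops below zero
```

One bad point out of five is enough to flip the sign of the correlation, which is why an evaluation (or training signal) based on Pearson correlation benefits from detecting and removing noisy labels.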


LED Down The Rabbit Hole: Exploring The Potential Of Global Attention For Biomedical Multi-Document Summarisation, Yulia Otmakhova, Hung Thinh Truong, Timothy Baldwin, Trevor Cohn, Karin Verspoor, Jey Han Lau Sep 2022

Natural Language Processing Faculty Publications

In this paper we report on our submission to the Multidocument Summarisation for Literature Review (MSLR) shared task. Specifically, we adapt PRIMERA (Xiao et al., 2022) to the biomedical domain by placing global attention on important biomedical entities in several ways. We analyse the outputs of the 23 resulting models, and report patterns in the results related to the presence of additional global attention, number of training steps, and the input configuration. © 2022, CC BY-SA.


Unsupervised Lexical Substitution With Decontextualised Embeddings, Takashi Wada, Timothy Baldwin, Yuji Matsumoto, Jey Han Lau Sep 2022

Natural Language Processing Faculty Publications

We propose a new unsupervised method for lexical substitution using pre-trained language models. Compared to previous approaches that use the generative capability of language models to predict substitutes, our method retrieves substitutes based on the similarity of contextualised and decontextualised word embeddings, i.e. the average contextual representation of a word in multiple contexts. We conduct experiments in English and Italian, and show that our method substantially outperforms strong baselines and establishes a new state-of-the-art without any explicit supervision or fine-tuning. We further show that our method performs particularly well at predicting low-frequency substitutes, and also generates a diverse list of …
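The core retrieval idea can be sketched with toy numbers. This is a hedged illustration, not the paper's code: a word's decontextualised embedding is the average of its contextualised vectors across multiple contexts, and substitutes are ranked by the similarity between the target's contextualised vector and each candidate's decontextualised vector. The tiny 2-d vectors and candidate words are made-up stand-ins for real model outputs.

```python
import math

def decontextualise(context_vectors):
    """Average a word's contextual vectors across contexts into one vector."""
    n, dim = len(context_vectors), len(context_vectors[0])
    return [sum(v[d] for v in context_vectors) / n for d in range(dim)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Candidate substitutes with per-context vectors (illustrative data).
candidates = {
    "happy": [[0.9, 0.1], [0.8, 0.2]],
    "tall":  [[0.1, 0.9], [0.2, 0.8]],
}
target_in_context = [0.85, 0.15]  # contextualised vector of the target word

ranked = sorted(candidates,
                key=lambda w: -cosine(target_in_context,
                                      decontextualise(candidates[w])))
print(ranked)  # ['happy', 'tall']
```

Ranking by retrieval rather than generation is what lets the method run without any supervision or fine-tuning.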


Overview Of The CLEF-2022 CheckThat! Lab Task 2 On Detecting Previously Fact-Checked Claims, Preslav Nakov, Giovanni Da San Martino, Firoj Alam, Shaden Shaar, Hamdy Mubarak, Nikolay Babulkov Sep 2022

Natural Language Processing Faculty Publications

We describe the fourth edition of the CheckThat! Lab, part of the 2022 Conference and Labs of the Evaluation Forum (CLEF). The lab evaluates technology supporting three tasks related to factuality, and it covers seven languages: Arabic, Bulgarian, Dutch, English, German, Spanish, and Turkish. Here, we present Task 2, which asks to detect previously fact-checked claims (in two languages). A total of six teams participated in this task and submitted a total of 37 runs, and most submissions managed to achieve sizable improvements over the baselines using Transformer-based models such as BERT and RoBERTa. In this paper, we …


Overview Of The CLEF-2022 CheckThat! Lab Task 1 On Identifying Relevant Claims In Tweets, Preslav Nakov, Alberto Barrón-Cedeño, Giovanni Da San Martino, Firoj Alam, Mucahid Kutlu, Wajdi Zaghouani, Chengkai Li, Shaden Shaar, Hamdy Mubarak, Alex Nikolov Sep 2022

Natural Language Processing Faculty Publications

We present an overview of CheckThat! Lab 2022 Task 1, part of the 2022 Conference and Labs of the Evaluation Forum (CLEF). Task 1 asked to predict which posts in a Twitter stream are worth fact-checking, focusing on COVID-19 and politics in six languages: Arabic, Bulgarian, Dutch, English, Spanish, and Turkish. A total of 19 teams participated, and most submissions managed to achieve sizable improvements over the baselines using Transformer-based models such as BERT and GPT-3. Across the four subtasks, approaches that targeted multiple languages (be it individually or in conjunction) in general obtained the best performance. We describe the …


NusaX: Multilingual Parallel Sentiment Dataset For 10 Indonesian Local Languages, Genta Indra Winata, Alham Fikri Aji, Samuel Cahyawijaya, Rahmad Mahendra, Fajri Koto, Ade Romadhony, Kemal Kurniawan, David Moeljadi, Radityo Eko Prasojo, Pascale Fung, Timothy Baldwin, Jey Han Lau May 2022

Natural Language Processing Faculty Publications

Natural language processing (NLP) has a significant impact on society via technologies such as machine translation and search engines. Despite its success, NLP technology is only widely available for high-resource languages such as English and Chinese, while it remains inaccessible to many languages due to the unavailability of data resources and benchmarks. In this work, we focus on developing resources for languages in Indonesia. Despite Indonesia being the second most linguistically diverse country, most of its languages are categorized as endangered and some are even extinct. We develop the first-ever parallel resource for 10 low-resource languages in Indonesia. Our resource includes …


What Does It Take To Bake A Cake? The RecipeRef Corpus And Anaphora Resolution In Procedural Text, Biaoyan Fang, Timothy Baldwin, Karin Verspoor May 2022

Natural Language Processing Faculty Publications

Procedural text contains rich anaphoric phenomena, yet has not received much attention in NLP. To fill this gap, we investigate the textual properties of two types of procedural text, recipes and chemical patents, and generalize an anaphora annotation framework developed for the chemical domain for modeling anaphoric phenomena in recipes. We apply this framework to annotate the RecipeRef corpus with both bridging and coreference relations. Through comparison to chemical patents, we show the complexity of anaphora resolution in recipes. We demonstrate empirically that transfer learning from the chemical domain improves resolution of anaphora in recipes, suggesting transferability of general procedural …


Improving Negation Detection With Negation-Focused Pre-Training, Hung Thinh Truong, Timothy Baldwin, Trevor Cohn, Karin Verspoor Apr 2022

Natural Language Processing Faculty Publications

Negation is a common linguistic feature that is crucial in many language understanding tasks, yet it remains a hard problem due to diversity in its expression in different types of text. Recent work has shown that state-of-the-art NLP models underperform on samples containing negation in various tasks, and that negation detection models do not transfer well across domains. We propose a new negation-focused pre-training strategy, involving targeted data augmentation and negation masking, to better incorporate negation information into language models. Extensive experiments on common benchmarks show that our proposed approach improves negation detection performance and generalizability over the strong baseline …
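The negation-masking component of such a strategy can be sketched as follows. This is an assumption about the general technique, not the paper's exact implementation: tokens drawn from a list of negation cues are replaced with a mask symbol, so that a masked language model is forced to attend to negation during pre-training. The cue list is a small illustrative subset.

```python
# Illustrative cue inventory; real systems use larger, curated lists.
NEGATION_CUES = {"not", "no", "never", "without"}

def mask_negation(tokens, mask="[MASK]"):
    """Replace negation-cue tokens with a mask token for MLM pre-training."""
    return [mask if t.lower() in NEGATION_CUES else t for t in tokens]

print(mask_negation(["The", "drug", "did", "not", "reduce", "pain"]))
# → ['The', 'drug', 'did', '[MASK]', 'reduce', 'pain']
```

Because the model must reconstruct the masked cue from the surrounding context, negation information is incorporated into the representations it learns.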


Unsupervised Automatic Speech Recognition: A Review, Hanan Aldarmaki, Asad Ullah, Sreepratha Ram, Nazar Zaki Apr 2022

Natural Language Processing Faculty Publications

Automatic Speech Recognition (ASR) systems can be trained to achieve remarkable performance given large amounts of manually transcribed speech, but large labeled data sets can be difficult or expensive to acquire for all languages of interest. In this paper, we review the research literature to identify models and ideas that could lead to fully unsupervised ASR, including unsupervised sub-word and word modeling, unsupervised segmentation of the speech signal, and unsupervised mapping from speech segments to text. The objective of the study is to identify the limitations of what can be learned from speech data alone and to understand the minimum …