Open Access. Powered by Scholars. Published by Universities.®

Databases and Information Systems Commons


Programming Languages and Compilers


Articles 1 - 30 of 150

Full-Text Articles in Databases and Information Systems

What Does One Billion Dollars Look Like?: Visualizing Extreme Wealth, William Mahoney Luckman Feb 2024


Dissertations, Theses, and Capstone Projects

The word “billion” is a mathematical abstraction related to “big,” but it is difficult to understand the vast difference in value between one million and one billion; even harder to understand the vast difference in purchasing power between one billion dollars, and the average U.S. yearly income. Perhaps most difficult to conceive of is what that purchasing power and huge mass of capital translates to in terms of power. This project blends design, text, facts, and figures into an interactive narrative website that helps the user better understand their position in relation to extreme wealth: https://whatdoesonebilliondollarslooklike.website/

The site incorporates …


Ensuring Non-Repudiation In Long-Distance Constrained Devices, Ethan Blum Dec 2023


Undergraduate Honors Theses

Satellite communication is essential for the exploration and study of space. Satellites allow communications with many devices and systems residing in space and on the surface of celestial bodies from ground stations on Earth. However, with the rise of Ground Station as a Service (GsaaS), the ability to efficiently send action commands to distant satellites must ensure non-repudiation such that an attacker is unable to send malicious commands to distant satellites. Distant satellites are also constrained devices and rely on limited power, meaning security on these devices is minimal. Therefore, this study attempted to propose a novel algorithm to allow …


LLM-Adapters: An Adapter Family For Parameter-Efficient Fine-Tuning Of Large Language Models, Zhiqiang Hu, Lei Wang, Yihuai Lan, Wanyu Xu, Ee-Peng Lim, Lidong Bing, Xing Xu, Soujanya Poria, Roy Ka-Wei Lee Dec 2023


Research Collection School Of Computing and Information Systems

The success of large language models (LLMs), like GPT-4 and ChatGPT, has led to the development of numerous cost-effective and accessible alternatives that are created by finetuning open-access LLMs with task-specific data (e.g., ChatDoctor) or instruction data (e.g., Alpaca). Among the various fine-tuning methods, adapter-based parameter-efficient fine-tuning (PEFT) is undoubtedly one of the most attractive topics, as it only requires fine-tuning a few external parameters instead of the entire LLMs while achieving comparable or even better performance. To enable further research on PEFT methods of LLMs, this paper presents LLM-Adapters, an easy-to-use framework that integrates various adapters into LLMs and …


Examining The Inter-Consistency Of Large Language Models: An In-Depth Analysis Via Debate, Kai Xiong, Xiao Ding, Yixin Cao, Ting Liu, Bing Qin Dec 2023


Research Collection School Of Computing and Information Systems

Large Language Models (LLMs) have shown impressive capabilities in various applications, but they still face various inconsistency issues. Existing works primarily focus on the inconsistency issues within a single LLM, while we complementarily explore the inter-consistency among multiple LLMs for collaboration. To examine whether LLMs can collaborate effectively to achieve a consensus for a shared goal, we focus on commonsense reasoning, and introduce a formal debate framework (FORD) to conduct a three-stage debate among LLMs with real-world scenarios alignment: fair debate, mismatched debate, and roundtable debate. Through extensive experiments on various datasets, LLMs can effectively collaborate to reach a consensus …


A Comprehensive Evaluation Of Large Language Models On Legal Judgment Prediction, Ruihao Shui, Yixin Cao, Xiang Wang, Tat-Seng Chua Dec 2023


Research Collection School Of Computing and Information Systems

Large language models (LLMs) have demonstrated great potential for domain-specific applications, such as the law domain. However, recent disputes over GPT-4’s law evaluation raise questions concerning their performance in real-world legal tasks. To systematically investigate their competency in the law, we design practical baseline solutions based on LLMs and test on the task of legal judgment prediction. In our solutions, LLMs can work alone to answer open questions or coordinate with an information retrieval (IR) system to learn from similar cases or solve simplified multi-choice questions. We show that similar cases and multi-choice options, namely label candidates, included in prompts …


Large Language Model Is Not A Good Few-Shot Information Extractor, But A Good Reranker For Hard Samples!, Yubo Ma, Yixin Cao, Yongchin Hong, Aixin Sun Dec 2023


Research Collection School Of Computing and Information Systems

Large Language Models (LLMs) have made remarkable strides in various tasks. However, whether they are competitive few-shot solvers for information extraction (IE) tasks and surpass fine-tuned small Pre-trained Language Models (SLMs) remains an open problem. This paper aims to provide a thorough answer to this problem, and moreover, to explore an approach towards effective and economical IE systems that combine the strengths of LLMs and SLMs. Through extensive experiments on nine datasets across four IE tasks, we show that LLMs are not effective few-shot information extractors in general, given their unsatisfactory performance in most settings and the high latency and …


Benchmarking Foundation Models With Language-Model-As-An-Examiner, Yushi Bai, Jiahao Ying, Yixin Cao, Xin Lv, Yuze He, Xiaozhi Wang, Jifan Yu, Kaisheng Zeng, Yijia Xiao, Haozhe Lyu, Jiayin Zhang, Juanzi Li, Lei Hou Dec 2023


Research Collection School Of Computing and Information Systems

Numerous benchmarks have been established to assess the performance of foundation models on open-ended question answering, which serves as a comprehensive test of a model’s ability to understand and generate language in a manner similar to humans. Most of these works focus on proposing new datasets; however, we see two main issues within previous benchmarking pipelines, namely testing leakage and evaluation automation. In this paper, we propose a novel benchmarking framework, Language-Model-as-an-Examiner, where the LM serves as a knowledgeable examiner that formulates questions based on its knowledge and evaluates responses in a reference-free manner. Our framework allows for effortless extensibility …


MolCA: Molecular Graph-Language Modeling With Cross-Modal Projector And Uni-Modal Adapter, Zhiyuan Liu, Sihang Li, Yanchen Luo, Hao Fei, Yixin Cao, Kenji Kawaguchi, Xiang Wang, Tat-Seng Chua Dec 2023


Research Collection School Of Computing and Information Systems

Language Models (LMs) have demonstrated impressive molecule understanding ability on various 1D text-related tasks. However, they inherently lack 2D graph perception — a critical ability of human professionals in comprehending molecules’ topological structures. To bridge this gap, we propose MolCA: Molecular Graph-Language Modeling with Cross-Modal Projector and Uni-Modal Adapter. MolCA enables an LM (i.e., Galactica) to understand both text- and graph-based molecular contents via the cross-modal projector. Specifically, the cross-modal projector is implemented as a QFormer to connect a graph encoder’s representation space and an LM’s text space. Further, MolCA employs a uni-modal adapter (i.e., LoRA) for the LM’s efficient …


Disentangling Transformer Language Models As Superposed Topic Models, Jia Peng Lim, Hady Wirawan Lauw Dec 2023


Research Collection School Of Computing and Information Systems

Topic Modelling is an established research area where the quality of a given topic is measured using coherence metrics. Often, we infer topics from Neural Topic Models (NTM) by interpreting their decoder weights, consisting of top-activated words projected from individual neurons. Transformer-based Language Models (TLM) similarly consist of decoder weights. However, due to its hypothesised superposition properties, the final logits originating from the residual path are considered uninterpretable. Therefore, we posit that we can interpret TLM as superposed NTM by proposing a novel weight-based, model-agnostic and corpus-agnostic approach to search and disentangle decoder-only TLM, potentially mapping individual neurons to multiple …


A Black-Box Attack On Code Models Via Representation Nearest Neighbor Search, Jie Zhang, Wei Ma, Qiang Hu, Shangqing Liu, Xiaofei Xie, Yves Le Traon, Yang Liu Dec 2023


Research Collection School Of Computing and Information Systems

Existing methods for generating adversarial code examples face several challenges: limited availability of substitute variables, high verification costs for these substitutes, and the creation of adversarial samples with noticeable perturbations. To address these concerns, our proposed approach, RNNS, uses a search seed based on historical attacks to find potential adversarial substitutes. Rather than directly using the discrete substitutes, they are mapped to a continuous vector space using a pre-trained variable name encoder. Based on the vector representation, RNNS predicts and selects better substitutes for attacks. We evaluated the performance of RNNS across six coding tasks encompassing three programming languages: Java, …


Safe MDP Planning By Learning Temporal Patterns Of Undesirable Trajectories And Averting Negative Side Effects, Siow Meng Low, Akshat Kumar, Scott Sanner Jul 2023


Research Collection School Of Computing and Information Systems

In safe MDP planning, a cost function based on the current state and action is often used to specify safety aspects. In the real world, the state representation used often lacks sufficient fidelity to specify such safety constraints. Operating based on an incomplete model can often produce unintended negative side effects (NSEs). To address these challenges, first, we associate safety signals with state-action trajectories (rather than just immediate state-action). This makes our safety model highly general. We also assume categorical safety labels are given for different trajectories, rather than a numerical cost function, which is harder to specify by the …


Plan-And-Solve Prompting: Improving Zero-Shot Chain-Of-Thought Reasoning By Large Language Models, Lei Wang, Wanyu Xu, Yihuai Lan, Zhiqiang Hu, Yunshi Lan, Roy Ka-Wei Lee, Ee-Peng Lim Jul 2023


Research Collection School Of Computing and Information Systems

Large language models (LLMs) have recently been shown to deliver impressive performance in various NLP tasks. To tackle multi-step reasoning tasks, few-shot chain-of-thought (CoT) prompting includes a few manually crafted step-by-step reasoning demonstrations which enable LLMs to explicitly generate reasoning steps and improve their reasoning task accuracy. To eliminate the manual effort, Zero-shot-CoT concatenates the target problem statement with “Let’s think step by step” as an input prompt to LLMs. Despite the success of Zero-shot-CoT, it still suffers from three pitfalls: calculation errors, missing-step errors, and semantic misunderstanding errors. To address the missing-step errors, we propose Plan-and-Solve (PS) Prompting. It …
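
The Zero-shot-CoT step described in this abstract, appending a fixed trigger phrase to the problem statement, can be sketched in a few lines of Python. This is only an illustration: the function name `build_zero_shot_cot_prompt` and the "Q:/A:" layout are assumptions, since the abstract says only that the two pieces are concatenated, and the sketch does not cover the Plan-and-Solve prompt itself, which the truncated abstract does not spell out.

```python
# Trigger phrase quoted in the abstract above.
ZERO_SHOT_COT_TRIGGER = "Let's think step by step"


def build_zero_shot_cot_prompt(problem: str, trigger: str = ZERO_SHOT_COT_TRIGGER) -> str:
    """Concatenate a problem statement with a reasoning trigger phrase.

    The "Q: ... A: ..." layout is an assumption made for illustration;
    only the concatenation itself comes from the abstract.
    """
    return f"Q: {problem.strip()}\nA: {trigger}."


if __name__ == "__main__":
    # The resulting string would be sent to an LLM as its input prompt.
    print(build_zero_shot_cot_prompt("A shop sells pens at 3 dollars each. How much do 4 pens cost?"))
```

Keeping the trigger as a parameter makes it straightforward to swap in a different instruction without changing the rest of the prompt-building code.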


Web Repository Of Southern’s Research Projects, Rebecca Zaldivar, Siegwart Mayr Apr 2023


Campus Research Day

A research repository was created so that Southern Adventist University has a central place for all past, current, and future research projects. This repository is a web application created with the use of the Yii framework that utilizes PHP and SQL. The repository has a user-friendly interface to let authorized users upload the information about their projects. Also, professors and students from different departments can see the list of projects per department.


CIE Text Analysis: Narrative Of The Life Of Frederick Douglass, The Declaration Of Independence, And The Declaration Of Sentiments, Arianna Knipe Apr 2023


Mathematics and Computer Science Presentations

Our STAT-451 class has worked with analyzing the words from CIE texts and assigning them to a sentiment or feeling and comparing them with one another using RStudio. This project analyzes texts from three sources: The Narrative of the Life of Frederick Douglass, The Declaration of Independence and the Declaration of Sentiments.


Code Will Tell: Visual Identification Of Ponzi Schemes On Ethereum, Xiaolin Wen, Kim Siang Yeo, Yong Wang, Ling Cheng, Feida Zhu, Min Zhu Apr 2023


Research Collection School Of Computing and Information Systems

Ethereum has become a popular blockchain with smart contracts for investors nowadays. Due to the decentralization and anonymity of Ethereum, Ponzi schemes have been easily deployed and caused significant losses to investors. However, there are still no explainable and effective methods to help investors easily identify Ponzi schemes and validate whether a smart contract is actually a Ponzi scheme. To fill the research gap, we propose PonziLens, a novel visualization approach to help investors achieve early identification of Ponzi schemes by investigating the operation codes of smart contracts. Specifically, we conduct symbolic execution of opcode and extract the control flow …


ChatGPT As Metamorphosis Designer For The Future Of Artificial Intelligence (AI): A Conceptual Investigation, Amarjit Kumar Singh (Library Assistant), Dr. Pankaj Mathur (Deputy Librarian) Mar 2023


Library Philosophy and Practice (e-journal)

Abstract

Purpose: The purpose of this research paper is to explore ChatGPT’s potential as an innovative design tool for the future development of artificial intelligence. Specifically, this conceptual investigation aims to analyze ChatGPT’s capabilities as a tool for designing and developing near-human intelligent systems for future use in the field of Artificial Intelligence (AI). The paper also analyzes the strengths and weaknesses of ChatGPT as a tool and identifies possible areas for improvement in its development and implementation. This investigation focused on the various features and functions of ChatGPT that …


Contrastive Learning Approach To Word-In-Context Task For Low-Resource Languages, Pei-Chi Lo, Yang-Yin Lee, Hsien-Hao Chen, Agus Trisnajaya Kwee, Ee-Peng Lim Feb 2023


Research Collection School Of Computing and Information Systems

The word in context (WiC) task aims to determine whether a target word’s occurrences in two sentences share the same sense. In this paper, we propose a Contrastive Learning WiC (CLWiC) framework to improve the learning of sentence/word representations and classification of target word senses in the sentence pair when performing WiC on low-resource languages. In representation learning, CLWiC trains a pre-trained language model’s ability to cope with low-resource languages using both unsupervised and supervised contrastive learning. The WiC classifier learning further finetunes the language model with WiC classification loss under two classifier architecture options, SGBERT and WiSBERT, which use single-encoder …
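
To make the WiC task definition concrete, the following Python sketch represents one WiC instance and decides "same sense" by thresholding the cosine similarity of two contextual vectors for the target word. It is a generic baseline for illustration only, not the CLWiC framework or its SGBERT/WiSBERT classifiers. The `WiCExample` class, the `same_sense` helper, the 0.5 threshold, and the hand-written two-dimensional vectors are all hypothetical; in practice the vectors would come from a language model.

```python
import math
from dataclasses import dataclass
from typing import Sequence


@dataclass
class WiCExample:
    """One WiC instance: a target word and the two sentences it occurs in."""
    target: str
    sentence_a: str
    sentence_b: str


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def same_sense(vec_a: Sequence[float], vec_b: Sequence[float], threshold: float = 0.5) -> bool:
    """Predict 'same sense' when the two target-word vectors are similar enough.

    The threshold is an illustrative assumption, not a value from the paper.
    """
    return cosine(vec_a, vec_b) >= threshold


if __name__ == "__main__":
    example = WiCExample("bank", "She sat on the river bank.", "He deposited cash at the bank.")
    # Hand-written toy vectors standing in for contextual embeddings of "bank".
    print(example.target, same_sense([1.0, 0.0], [0.0, 1.0]))
```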


Using Landsat Satellite Imagery To Estimate Groundcover In The Grainbelt Of Western Australia, Justin Laycock, Nick Middleton, Karen Holmes Dec 2022


Resource management technical reports

Maintaining vegetative groundcover is an important component of sustainable agricultural systems and plays a critical function for soil and land conservation in Western Australia’s (WA) grainbelt (the south-west cropping region). This report describes how satellite imagery can be used to quantitatively and objectively estimate total vegetative groundcover, both in near real time and historically across large areas. We used the Landsat seasonal fractional groundcover products developed by the Joint Remote Sensing Research Program from the extensive archive of Landsat imagery. These products provide an estimate of the percentage of green vegetation, non-green vegetation and bare soil for each 30 m …


R2F: A General Retrieval, Reading And Fusion Framework For Document-Level Natural Language Inference, Hao Wang, Yixin Cao, Yangguang Li, Zhen Huang, Kun Wang, Jing Shao Dec 2022


Research Collection School Of Computing and Information Systems

Document-level natural language inference (DocNLI) is a new challenging task in natural language processing, aiming at judging the entailment relationship between a pair of hypothesis and premise documents. Current datasets and baselines largely follow sentence-level settings, but fail to address the issues raised by longer documents. In this paper, we establish a general solution, named Retrieval, Reading and Fusion (R2F) framework, and a new setting, by analyzing the main challenges of DocNLI: interpretability, long-range dependency, and cross-sentence inference. The basic idea of the framework is to simplify document-level task into a set of sentence-level tasks, and improve both performance and …


CodeMatcher: A Tool For Large-Scale Code Search Based On Query Semantics Matching, Chao Liu, Xuanlin Bao, Xin Xia, Meng Yan, David Lo, Ting Zhang Nov 2022


Research Collection School Of Computing and Information Systems

Due to the emergence of large-scale codebases, such as GitHub and Gitee, searching and reusing existing code can help developers substantially improve software development productivity. Over the years, many code search tools have been developed. Early tools leveraged the information retrieval (IR) technique to perform an efficient code search for a frequently changed large-scale codebase. However, the search accuracy was low due to the semantic mismatch between query and code. In recent years, many tools have leveraged Deep Learning (DL) techniques to address this issue. But the DL-based tools are slow and the search accuracy is unstable. In this paper, we …


VLStereoSet: A Study Of Stereotypical Bias In Pre-Trained Vision-Language Models, Kankan Zhou, Yibin Lai, Jing Jiang Nov 2022


Research Collection School Of Computing and Information Systems

In this paper we study how to measure stereotypical bias in pre-trained vision-language models. We leverage a recently released text-only dataset, StereoSet, which covers a wide range of stereotypical bias, and extend it into a vision-language probing dataset called VLStereoSet to measure stereotypical bias in vision-language models. We analyze the differences between text and image and propose a probing task that detects bias by evaluating a model’s tendency to pick stereotypical statements as captions for anti-stereotypical images. We further define several metrics to measure both a vision-language model’s overall stereotypical bias and its intra-modal and inter-modal bias. Experiments on six …
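
The probing idea in this abstract, checking how often a model prefers a stereotypical statement as the caption for an anti-stereotypical image, can be illustrated with a simple pick rate. The Python sketch below is an assumption-laden illustration, not one of the metrics the paper defines, which the truncated abstract does not specify. The function `stereotype_pick_rate` and the caption scores are hypothetical; a real vision-language model would supply the image-caption scores.

```python
from typing import Sequence, Tuple


def stereotype_pick_rate(score_pairs: Sequence[Tuple[float, float]]) -> float:
    """Fraction of probes where the stereotypical caption outscores the alternative.

    Each pair holds the model's image-caption scores for one anti-stereotypical
    image: (score of stereotypical caption, score of anti-stereotypical caption).
    Higher values mean the model picks stereotypical captions more often.
    """
    if not score_pairs:
        raise ValueError("score_pairs must contain at least one probe")
    picks = sum(1 for stereo, anti in score_pairs if stereo > anti)
    return picks / len(score_pairs)


if __name__ == "__main__":
    # Hypothetical scores for four probes; not data from the paper.
    probes = [(0.9, 0.1), (0.2, 0.8), (0.6, 0.4), (0.3, 0.7)]
    print(stereotype_pick_rate(probes))
```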


Investigating Bloom's Cognitive Skills In Foundation And Advanced Programming Courses From Students' Discussions, Joel Jer Wei Lim, Gottipati Swapna, Kyong Jin Shim Nov 2022


Research Collection School Of Computing and Information Systems

Programming courses provide students with the skills to develop complex business applications. Teaching and learning programming is challenging, and collaborative learning is proposed to help with this challenge. Online discussion forums promote networking with other learners such that they can build knowledge collaboratively. This helps students broaden their thought processes and acquire cognitive skills. Cognitive analysis of discussion is critical to understanding students' learning process. In this paper, we propose a Bloom's taxonomy based cognitive model for programming discussion forums. We present a machine learning (ML) based solution to extract students' cognitive skills. Our evaluations on computing courses show that …


Robustness And Cross-Lingual Transfer: An Exploration Of Out-Of-Distribution Scenario In Natural Language Processing, Sicheng Yu Sep 2022


Dissertations and Theses Collection (Open Access)

Most traditional machine learning or deep learning methods are based on the premise that training data and test data are independent and identically distributed, i.e., IID. However, this is an idealized situation. In real-world applications, test and training data often follow different distributions, which we refer to as the out-of-distribution, i.e., OOD, setting. As a result, models trained with traditional methods always suffer from an undesirable performance drop on the OOD test set. It is necessary to develop techniques to solve this problem for real applications. In this dissertation, we present four pieces of work in the …


A Weakly Supervised Propagation Model For Rumor Verification And Stance Detection With Multiple Instance Learning, Ruichao Yang, Jing Ma, Hongzhan Lin, Wei Gao Jul 2022


Research Collection School Of Computing and Information Systems

The diffusion of rumors on social media generally follows a propagation tree structure, which provides valuable clues on how an original message is transmitted and responded to by users over time. Recent studies reveal that rumor verification and stance detection are two relevant tasks that can jointly enhance each other despite their differences. For example, rumors can be debunked by cross-checking the stances conveyed by their relevant posts, and stances are also conditioned on the nature of the rumor. However, stance detection typically requires a large training set of labeled stances at post level, which are rare and costly to annotate. …


Early Rumor Detection Using Neural Hawkes Process With A New Benchmark Dataset, Fengzhu Zeng, Wei Gao Jul 2022


Research Collection School Of Computing and Information Systems

Little attention has been paid to EArly Rumor Detection (EARD), and EARD performance was evaluated inappropriately on a few datasets where the actual early-stage information is largely missing. To reverse this situation, we construct BEARD, a new Benchmark dataset for EARD, based on claims from fact-checking websites by trying to gather as many early relevant posts as possible. We also propose HEARD, a novel model based on neural Hawkes process for EARD, which can guide a generic rumor detection model to make timely, accurate and stable predictions. Experiments show that HEARD achieves effective EARD performance on two commonly used general …
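
For readers unfamiliar with Hawkes processes, the Python sketch below computes the conditional intensity of the classical exponential-kernel Hawkes process, in which each past event temporarily raises the rate of future events. This is the textbook parametric form, shown only as background. It is not the neural Hawkes process used by HEARD, and the parameter values (`mu`, `alpha`, `beta`) and post timestamps are illustrative assumptions.

```python
import math
from typing import Sequence


def hawkes_intensity(t: float, event_times: Sequence[float],
                     mu: float = 0.1, alpha: float = 0.5, beta: float = 1.0) -> float:
    """Conditional intensity of a classical exponential-kernel Hawkes process.

    lambda(t) = mu + sum over past events t_i < t of alpha * exp(-beta * (t - t_i))

    mu is the baseline rate, alpha the jump added by each event, and beta the
    decay rate of that excitation. Events at or after t are ignored.
    """
    excitation = sum(alpha * math.exp(-beta * (t - ti)) for ti in event_times if ti < t)
    return mu + excitation


if __name__ == "__main__":
    posts = [0.0, 0.5, 0.6]  # hypothetical timestamps of posts about a claim
    for t in (0.7, 2.0, 10.0):
        print(t, round(hawkes_intensity(t, posts), 4))
```

The intensity is highest just after a burst of events and decays back toward the baseline `mu`, which is what makes the family a natural fit for modelling bursts of posts over time.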


BlockLens: Visual Analytics Of Student Coding Behaviors In Block-Based Programming Environments, Sean Tung, Huan Wei, Haotian Li, Yong Wang, Meng Xia, Huamin Qu Jun 2022


Research Collection School Of Computing and Information Systems

Block-based programming environments have been widely used to introduce K-12 students to coding. To guide students effectively, instructors and platform owners often need to understand behaviors like how students solve certain questions or where they get stuck and why. However, it is challenging for them to effectively analyze students’ coding data. To this end, we propose BlockLens, a novel visual analytics system to assist instructors and platform owners in analyzing students’ block-based coding behaviors, mistakes, and problem-solving patterns. BlockLens enables the grouping of students by question progress and performance, identification of common problem-solving strategies and pitfalls, and presentation of insights …


Using A BERT-Based Ensemble Network For Abusive Language Detection, Noah Ballinger May 2022


Computer Science and Computer Engineering Undergraduate Honors Theses

Over the past two decades, online discussion has skyrocketed in scope and scale. However, so has the amount of toxicity and offensive posts on social media and other discussion sites. Despite this rise in prevalence, the ability to automatically moderate online discussion platforms has seen minimal development. Recently, though, as the capabilities of artificial intelligence (AI) continue to improve, the potential of AI-based detection of harmful internet content has become a real possibility. In the past couple years, there has been a surge in performance on tasks in the field of natural language processing, mainly due to the development of …


Exploring And Adapting Chinese GPT To Pinyin Input Method, Minghuan Tan, Yong Dai, Duyu Tang, Zhangyin Feng, Guoping Huang, Jing Jiang, Jiwei Li, Shuming Shi May 2022


Research Collection School Of Computing and Information Systems

While GPT has become the de-facto method for text generation tasks, its application to the pinyin input method remains unexplored. In this work, we make the first exploration to leverage Chinese GPT for the pinyin input method. We find that a frozen GPT achieves state-of-the-art performance on perfect pinyin. However, the performance drops dramatically when the input includes abbreviated pinyin. One reason is that an abbreviated pinyin can be mapped to many perfect pinyin, which link to an even larger number of Chinese characters. We mitigate this issue with two strategies, including enriching the context with pinyin and optimizing the training process to …


Translate-Train Embracing Translationese Artifacts, Sicheng Yu, Qianru Sun, Hao Zhang, Jing Jiang May 2022


Research Collection School Of Computing and Information Systems

Translate-train is a general training approach to multilingual tasks. The key idea is to use the translator of the target language to generate training data to mitigate the gap between the source and target languages. However, its performance is often hampered by the artifacts in the translated texts (translationese). We discover that such artifacts have common patterns in different languages and can be modeled by deep learning, and subsequently propose an approach to conduct translate-train using Translationese Embracing the effect of Artifacts (TEA). TEA learns to mitigate such effect on the training data of a source language (whose original and …


On The Influence Of Biases In Bug Localization: Evaluation And Benchmark, Ratnadira Widyasari, Stefanus Agus Haryono, Ferdian Thung, Jieke Shi, Constance Tan, Fiona Wee, Jack Phan, David Lo Mar 2022


Research Collection School Of Computing and Information Systems

Bug localization is the task of identifying parts of the source code that need to be changed to resolve a bug report. As this task is difficult, automatic bug localization tools have been proposed. The development and evaluation of these tools rely on the availability of high-quality bug report datasets. In 2014, Kochhar et al. identified three biases in datasets used to evaluate bug localization techniques: (1) misclassified bug report, (2) already localized bug report, and (3) incorrect ground truth file in a bug report. They reported that already localized bug reports statistically significantly and substantially impact bug localization results, and thus should be removed. However, their evaluation is still limited, …