Open Access. Powered by Scholars. Published by Universities.®
Graphics and Human Computer Interfaces Commons™
Articles 1 - 12 of 12
Full-Text Articles in Graphics and Human Computer Interfaces
CATNet: Cross-Modal Fusion For Audio-Visual Speech Recognition, Xingmei Wang, Jianchen Mi, Boquan Li, Yixu Zhao, Jiaxiang Meng
Research Collection School Of Computing and Information Systems
Automatic speech recognition (ASR) is a typical pattern recognition technology that converts human speech into text. With the aid of advanced deep learning models, the performance of speech recognition has improved significantly. In particular, emerging Audio–Visual Speech Recognition (AVSR) methods achieve satisfactory performance by combining audio-modal and visual-modal information. However, various complex environments, especially noisy ones, limit the effectiveness of existing methods. To address the noise problem, in this paper we propose a novel cross-modal audio–visual speech recognition model, named CATNet. First, we devise a cross-modal bidirectional fusion model to analyze the close relationship between audio and visual modalities. Second, …
Causal Interventional Training For Image Recognition, Wei Qin, Hanwang Zhang, Richang Hong, Ee-Peng Lim, Qianru Sun
Research Collection School Of Computing and Information Systems
Deep learning models often fit undesired dataset bias in training. In this paper, we formulate the bias using causal inference, which helps us uncover the ever-elusive causalities among the key factors in training, and thus pursue the desired causal effect without the bias. We start by revisiting the process of building a visual recognition system, and then propose a structural causal model (SCM) for the key variables involved in dataset collection and the recognition model: object, common sense, bias, context, and label prediction. Based on the SCM, one can observe that there are “good” and “bad” biases. Intuitively, in the image …
Hierarchical Semantic-Aware Neural Code Representation, Yuan Jiang, Xiaohong Su, Christoph Treude, Tiantian Wang
Research Collection School Of Computing and Information Systems
Code representation is a fundamental problem in many software engineering tasks. Despite the effort made by many researchers, it is still hard for existing methods to fully extract syntactic, structural and sequential features of source code, which form the hierarchical semantics of the program and are necessary to achieve a deeper code understanding. To alleviate this difficulty, we propose a new supervised approach based on the novel use of Tree-LSTM to incorporate the sequential and the global semantic features of programs explicitly into the representation model. Unlike previous techniques, our proposed model can not only learn low-level syntactic information within …
ComAI: Enabling Lightweight, Collaborative Intelligence By Retrofitting Vision DNNs, Kasthuri Jayarajah, Dhanuja Wanniarachchige, Tarek Abdelzaher, Archan Misra
Research Collection School Of Computing and Information Systems
While Deep Neural Network (DNN) models have transformed machine vision capabilities, their extremely high computational complexity and model sizes present a formidable deployment roadblock for AIoT applications. We show that the complexity-vs-accuracy-vs-communication tradeoffs for such DNN models can be significantly addressed via a novel, lightweight form of “collaborative machine intelligence” that requires only runtime changes to the inference process. In our proposed approach, called ComAI, the DNN pipelines of different vision sensors share intermediate processing state with one another, effectively providing hints about objects located within their mutually overlapping Fields of View (FoVs). ComAI uses two novel techniques: (a) a secondary shallow ML …
Deep Graph-Level Anomaly Detection By Glocal Knowledge Distillation, Rongrong Ma, Guansong Pang, Ling Chen, Anton Van Den Hengel
Research Collection School Of Computing and Information Systems
Graph-level anomaly detection (GAD) describes the problem of detecting graphs that are abnormal in their structure and/or the features of their nodes, as compared to other graphs. One of the challenges in GAD is to devise graph representations that enable the detection of both locally- and globally-anomalous graphs, i.e., graphs that are abnormal in their fine-grained (node-level) or holistic (graph-level) properties, respectively. To tackle this challenge we introduce a novel deep anomaly detection approach for GAD that learns rich global and local normal pattern information by joint random distillation of graph and node representations. The random distillation is achieved by …
Smart Scribbles For Image Matting, Yang Xin, Yu Qiao, Shaozhe Chen, Shengfeng He, Baocai Yin, Qiang Zhang, Xiaopeng Wei, Rynson W. H. Lau
Research Collection School Of Computing and Information Systems
Image matting is an ill-posed problem that usually requires additional user input, such as trimaps or scribbles. Drawing a fine trimap requires a large amount of user effort, while using scribbles can hardly obtain satisfactory alpha mattes for non-professional users. Some recent deep learning-based matting networks rely on large-scale composite datasets for training to improve performance, resulting in the occasional appearance of obvious artifacts when processing natural images. In this article, we explore the intrinsic relationship between user input and alpha mattes and strike a balance between user effort and the quality of alpha mattes. In particular, we propose an …
A Study Of Multi-Task And Region-Wise Deep Learning For Food Ingredient Recognition, Jingjing Chen, Bin Zhu, Chong-Wah Ngo, Tat-Seng Chua, Yu-Gang Jiang
Research Collection School Of Computing and Information Systems
Food recognition has attracted considerable research attention because of its importance for health-related applications. Existing approaches mostly focus on categorizing food by dish name, while ignoring the underlying ingredient composition. In reality, two dishes with the same name do not necessarily share the exact list of ingredients. Therefore, dishes under the same food category are not necessarily equal in nutrition content. Nevertheless, due to the limited datasets available with ingredient labels, the problem of ingredient recognition is often overlooked. Furthermore, as the number of ingredients is expected to be much less than the number of food categories, …
Multi-Modal Cooking Workflow Construction For Food Recipes, Liangming Pan, Jingjing Chen, Jianlong Wu, Shaoteng Liu, Chong-Wah Ngo, Min-Yen Kan, Yugang Jiang, Tat-Seng Chua
Research Collection School Of Computing and Information Systems
Understanding a food recipe requires anticipating the implicit causal effects of cooking actions, such that the recipe can be converted into a graph describing its temporal workflow. This is a non-trivial task that involves common-sense reasoning. However, existing efforts rely on hand-crafted features to extract the workflow graph from recipes due to the lack of large-scale labeled datasets. Moreover, they fail to utilize the cooking images, which constitute an important part of food recipes. In this paper, we build MM-ReS, the first large-scale dataset for cooking workflow construction, consisting of 9,850 recipes with human-labeled workflow graphs. Cooking steps …
Deep Learning Of Facial Embeddings And Facial Landmark Points For The Detection Of Academic Emotions, Hua Leong Fwa
Research Collection School Of Computing and Information Systems
Automatic emotion recognition is an actively researched area, as emotion plays a pivotal role in effective human communication. Equipping a computer to understand and respond to human emotions has potential applications in many fields, including education, medicine, transport and hospitality. In a classroom or online learning context, the basic emotions do not occur frequently and do not influence the learning process itself. Academic emotions such as engagement, frustration, confusion and boredom are the ones that are pivotal to sustaining the motivation of learners. In this study, we evaluated the use of deep learning on FaceNet embeddings and facial landmark …
Fusion Of Multimodal Embeddings For Ad-Hoc Video Search, Danny Francis, Phuong Anh Nguyen, Benoit Huet, Chong-Wah Ngo
Research Collection School Of Computing and Information Systems
The challenge of Ad-Hoc Video Search (AVS) originates from free-form (i.e., no pre-defined vocabulary) and freestyle (i.e., natural language) query descriptions. Bridging the semantic gap between AVS queries and videos is highly difficult, as evidenced by the low retrieval accuracy of AVS benchmarking in TRECVID. In this paper, we study a new method to fuse multimodal embeddings that have been derived from completely disjoint datasets. This method is tested on two datasets for two distinct tasks: on MSR-VTT for unique video retrieval and on V3C1 for multiple-video retrieval.
Rotation Invariant Convolutions For 3D Point Clouds Deep Learning, Zhiyuan Zhang, Binh-Son Hua, David W. Rosen, Sai-Kit Yeung
Research Collection School Of Computing and Information Systems
Recent progress in 3D deep learning has shown that it is possible to design special convolution operators to consume point cloud data. However, a typical drawback is that rotation invariance is often not guaranteed, resulting in networks that generalize poorly to arbitrary rotations. In this paper, we introduce a novel convolution operator for point clouds that achieves rotation invariance. Our core idea is to use low-level rotation invariant geometric features such as distances and angles to design a convolution operator for point cloud learning. The well-known point ordering problem is also addressed by a binning approach seamlessly built into the …
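As a small illustration of why distances and angles make suitable inputs here, the sketch below checks numerically that they are unchanged when a set of points is rotated. It is not the authors' convolution operator; the helper names (`rotate_z`, `invariant_features`), the sample points, and the choice of a z-axis rotation are all assumptions made for the example.

```python
import math

def rotate_z(p, theta):
    """Rotate a 3D point about the z-axis by theta radians."""
    x, y, z = p
    c, s = math.cos(theta), math.sin(theta)
    return (c * x - s * y, s * x + c * y, z)

def sub(a, b):
    """Component-wise difference a - b."""
    return tuple(ai - bi for ai, bi in zip(a, b))

def norm(v):
    """Euclidean length of a vector."""
    return math.sqrt(sum(vi * vi for vi in v))

def angle(u, v):
    """Angle in radians between vectors u and v."""
    cos_uv = sum(ui * vi for ui, vi in zip(u, v)) / (norm(u) * norm(v))
    # Clamp to [-1, 1] to guard against floating-point overshoot.
    return math.acos(max(-1.0, min(1.0, cos_uv)))

def invariant_features(ref, p, q):
    """Distances from ref to p and to q, plus the angle at ref between them."""
    u, v = sub(p, ref), sub(q, ref)
    return (norm(u), norm(v), angle(u, v))

# Hypothetical sample points: a reference point and two neighbours.
ref, p, q = (0.3, -0.2, 1.0), (1.0, 2.0, 0.5), (-0.5, 1.0, 2.0)

before = invariant_features(ref, p, q)
after = invariant_features(*(rotate_z(x, 0.7) for x in (ref, p, q)))

# The features agree up to floating-point error, while raw coordinates do not.
features_match = all(abs(a - b) < 1e-9 for a, b in zip(before, after))
coords_changed = rotate_z(p, 0.7) != p
```

Raw coordinates change under the rotation while the three features do not, which is the property the abstract relies on. How these features are combined into a convolution, and how the binning step works, is not recoverable from the truncated text above.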
FormResNet: Formatted Residual Learning For Image Restoration, Jianbo Jiao, Wei-Chih Tu, Shengfeng He
Research Collection School Of Computing and Information Systems
In this paper, we propose a deep CNN to tackle the image restoration problem by learning the structured residual. Previous deep learning based methods directly learn the mapping from corrupted images to clean images, and may suffer from the gradient exploding/vanishing problems of deep neural networks. We propose to address the image restoration problem by learning the structured details and recovering the latent clean image together, from the shared information between the corrupted image and the latent image. In addition, instead of learning the pure difference (corruption), we propose to add a 'residual formatting layer' to format the residual to …