Open Access. Powered by Scholars. Published by Universities.®

Theory and Algorithms Commons

Articles 1 - 16 of 16

Full-Text Articles in Theory and Algorithms

Cross-Modal Food Retrieval: Learning A Joint Embedding Of Food Images And Recipes With Semantic Consistency And Attention Mechanism, Hao Wang, Doyen Sahoo, Chenghao Liu, Ke Shu, Achananuparp Palakorn, Ee Peng Lim, Steven Hoi May 2021

Research Collection School Of Computing and Information Systems

Food retrieval is an important task to perform analysis of food-related information, where we are interested in retrieving relevant information about the queried food item such as ingredients, cooking instructions, etc. In this paper, we investigate cross-modal retrieval between food images and cooking recipes. The goal is to learn an embedding of images and recipes in a common feature space, such that the corresponding image-recipe embeddings lie close to one another. Two major challenges in addressing this problem are 1) large intra-variance and small inter-variance across cross-modal food data; and 2) difficulties in obtaining discriminative recipe representations. To address these …
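
The retrieval step described above, where matching image and recipe embeddings lie close together in a common space, can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the embeddings are hand-made toy vectors rather than the output of the paper's model, cosine similarity is one common choice of closeness measure, and the names `cosine` and `rank_recipes` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_recipes(image_embedding, recipe_embeddings):
    """Return recipe ids ordered from most to least similar to the image.

    recipe_embeddings maps a recipe id to its vector in the shared space.
    """
    scores = {
        recipe_id: cosine(image_embedding, vector)
        for recipe_id, vector in recipe_embeddings.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Toy 2-D embeddings (made up for illustration only).
recipes = {"pasta": [1.0, 0.0], "salad": [0.0, 1.0], "soup": [1.0, 1.0]}
query_image = [0.9, 0.1]
ranking = rank_recipes(query_image, recipes)
```

In this toy setup the image vector points almost entirely along the first axis, so the recipe whose vector shares that direction is ranked first.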


Object Detection Meets Knowledge Graphs, Yuan Fang, Kingsley Kuan, Jie Lin, Cheston Tan, Vijay Chandrasekhar Aug 2017

Research Collection School Of Computing and Information Systems

Object detection in images is a crucial task in computer vision, with important applications ranging from security surveillance to autonomous vehicles. Existing state-of-the-art algorithms, including deep neural networks, only focus on utilizing features within an image itself, largely neglecting the vast amount of background knowledge about the real world. In this paper, we propose a novel framework of knowledge-aware object detection, which enables the integration of external knowledge such as knowledge graphs into any object detection algorithm. The framework employs the notion of semantic consistency to quantify and generalize knowledge, which improves object detection through a re-optimization process to achieve …


Event Detection With Zero Example: Select The Right And Suppress The Wrong Concepts, Yi-Jie Lu, Hao Zhang, Maaike De Boer, Chong-Wah Ngo Jun 2016

Research Collection School Of Computing and Information Systems

Complex video event detection without visual examples is a very challenging issue in multimedia retrieval. We present a state-of-the-art framework for event search without any need of exemplar videos and textual metadata in search corpus. To perform event search given only query words, the core of our framework is a large, pre-built bank of concept detectors which can understand the content of a video in the perspective of object, scene, action and activity concepts. Leveraging such knowledge can effectively narrow the semantic gap between textual query and the visual content of videos. Besides the large concept bank, this paper focuses …


Opinion Question Answering By Sentiment Clip Localization, Lei Pang, Chong-Wah Ngo Mar 2016

Research Collection School Of Computing and Information Systems

This article considers multimedia question answering beyond factoid and how-to questions. We are interested in searching videos for answering opinion-oriented questions that are controversial and hotly debated. Examples of questions include "Should Edward Snowden be pardoned?" and "Obamacare-unconstitutional or not?". These questions often invoke emotional response, either positively or negatively, hence are likely to be better answered by videos than texts, due to the vivid display of emotional signals visible through facial expression and speaking tone. Nevertheless, a potential answer of duration 60s may be embedded in a video of 10min, resulting in degraded user experience compared to reading the …


Dictionary Pair Learning On Grassmann Manifolds For Image Denoising, Xianhua Zeng, Wei Bian, Wei Liu, Jialie Shen, Dacheng Tao Nov 2015

Research Collection School Of Computing and Information Systems

Image denoising is a fundamental problem in computer vision and image processing that holds considerable practical importance for real-world applications. The traditional patch-based and sparse coding-driven image denoising methods convert 2D image patches into 1D vectors for further processing. Thus, these methods inevitably break down the inherent 2D geometric structure of natural images. To overcome this limitation pertaining to the previous image denoising methods, we propose a 2D image denoising model, namely, the dictionary pair learning (DPL) model, and we design a corresponding algorithm called the DPL on the Grassmann-manifold (DPLG) algorithm. The DPLG algorithm first learns an initial dictionary …


Placing Videos On A Semantic Hierarchy For Search Result Navigation, Song Tan, Yu-Gang Jiang, Chong-Wah Ngo Jun 2014

Research Collection School Of Computing and Information Systems

Organizing video search results in a list view is widely adopted by current commercial search engines, which cannot support efficient browsing for complex search topics that have multiple semantic facets. In this article, we propose to organize video search results in a highly structured way. Specifically, videos are placed on a semantic hierarchy that accurately organizes various facets of a given search topic. To pick the most suitable videos for each node of the hierarchy, we define and utilize three important criteria: relevance, uniqueness, and diversity. Extensive evaluations on a large YouTube video dataset demonstrate the effectiveness of our approach.


Snap-And-Ask: Answering Multimodal Question By Naming Visual Instance, Wei Zhang, Lei Pang, Chong-Wah Ngo Nov 2012

Research Collection School Of Computing and Information Systems

In real life, it is easier to provide a visual cue when asking a question about a possibly unfamiliar topic, for example, asking the question, “Where was this crop circle found?”. Providing an image of the instance is far more convenient than texting a verbose description of the visual properties, especially when the name of the query instance is not known. Nevertheless, having to identify the visual instance before processing the question and eventually returning the answer makes multimodal question-answering technically challenging. This paper addresses the problem of visual-to-text naming through the paradigm of answering-by-search in a two-stage computational framework, which …


Beyond Search: Event-Driven Summarization For Web Videos, Richard Hong, Jinhui Tang, Hung-Khoon Tan, Chong-Wah Ngo, Shuicheng Yan, Tat-Seng Chua Nov 2011

Research Collection School Of Computing and Information Systems

The explosive growth of Web videos brings out the challenge of how to efficiently browse hundreds or even thousands of videos at a glance. Given an event-driven query, social media Web sites usually return a large number of videos that are diverse and noisy in a ranking list. Exploring such results will be time-consuming and thus degrades user experience. This article presents a novel scheme that is able to summarize the content of video search results by mining and threading "key" shots, such that users can get an overview of main content of these videos at a glance. The proposed …


Accelerating Near-Duplicate Video Matching By Combining Visual Similarity And Alignment Distortion, Hung-Khoon Tan, Xiao Wu, Chong-Wah Ngo, Wan-Lei Zhao Oct 2008

Research Collection School Of Computing and Information Systems

In this paper, we investigate a novel approach to accelerate the matching of two video clips by exploiting the temporal coherence property inherent in the keyframe sequence of a video. Motivated by the fact that keyframe correspondences between near-duplicate videos typically follow certain spatial arrangements, such property could be employed to guide the alignment of two keyframe sequences. We set the alignment problem as an integer quadratic programming problem, where the cost function takes into account both the visual similarity of the corresponding keyframes as well as the alignment distortion among the set of correspondences. The set of keyframe-pairs found …


Fast Tracking Of Near-Duplicate Keyframes In Broadcast Domain With Transitivity Propagation, Chong-Wah Ngo, Wan-Lei Zhao, Yu-Gang Jiang Oct 2006

Research Collection School Of Computing and Information Systems

The identification of near-duplicate keyframe (NDK) pairs is a useful task for a variety of applications such as news story threading and content-based video search. In this paper, we propose a novel approach for the discovery and tracking of NDK pairs and threads in the broadcast domain. The detection of NDKs in a large data set is a challenging task because, as the data set grows linearly, the computational cost grows quadratically, and so does the number of false alarms. This paper explores the symmetric and transitive nature of near-duplicate for the effective …
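
Two ideas in this abstract lend themselves to a short sketch: the quadratic growth in candidate pairs, and the use of symmetry and transitivity to group keyframes into threads. The code below is an illustration under stated assumptions, not the authors' algorithm. It assumes pairwise detections are already available as index pairs and merges them with a standard union-find structure; the function names `candidate_pair_count` and `group_near_duplicates` are hypothetical.

```python
def candidate_pair_count(n):
    """Number of unordered keyframe pairs among n keyframes: n(n-1)/2."""
    return n * (n - 1) // 2


def group_near_duplicates(num_keyframes, pairs):
    """Group keyframes into threads from pairwise near-duplicate detections.

    Treats near-duplication as symmetric and transitive, so if (a, b) and
    (b, c) are detected, a, b and c end up in the same thread even though
    (a, c) was never compared.
    """
    parent = list(range(num_keyframes))

    def find(x):
        # Path halving keeps the trees shallow.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    threads = {}
    for k in range(num_keyframes):
        threads.setdefault(find(k), []).append(k)
    # Only groups with at least two members count as near-duplicate threads.
    return sorted(t for t in threads.values() if len(t) > 1)


# Six keyframes, three detected pairs (toy data).
threads = group_near_duplicates(6, [(0, 1), (1, 2), (4, 5)])
```

One caveat of plain transitive closure is that a single false detection can merge two unrelated threads, which is one reason false alarms matter as much as cost at scale.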


Gestalt-Based Feature Similarity Measure In Trademark Database, Hui Jiang, Chong-Wah Ngo, Hung-Khoon Tan May 2006

Research Collection School Of Computing and Information Systems

Motivated by the studies in Gestalt principle, this paper describes a novel approach on the adaptive selection of visual features for trademark retrieval. We consider five kinds of visual saliencies: symmetry, continuity, proximity, parallelism and closure property. The first saliency is based on Zernike moments, while the others are modeled by geometric elements extracted illusively as a whole from a trademark. Given a query trademark, we adaptively determine the features appropriate for retrieval by investigating its visual saliencies. We show that in most cases, either geometric or symmetric features can give us good enough accuracy. To measure the similarity of …


Structuring Home Video By Snippet Detection And Pattern Parsing, Zailiang Pan, Chong-Wah Ngo Oct 2004

Research Collection School Of Computing and Information Systems

Hand-held camcorders have been popularly used in capturing and documenting daily lives. Nonetheless, searching for personal memories in home videos is still a laborious task. This paper describes novel approaches in detecting snippets and patterns in home videos for content indexing. To deal with the fact that most shots are long and with handshake artifacts, a motion analysis algorithm based on Kalman filter and finite state machine is proposed to decompose videos into tables of snippets. Each snippet is represented by a set of moving and static patterns. The moving patterns are automatically detected and tracked, while the static patterns …


Indexing And Matching Of Polyphonic Songs For Query-By-Singing System, Tat-Wan Leung, Chong-Wah Ngo Oct 2004

Research Collection School Of Computing and Information Systems

This paper investigates the issues in polyphonic popular song retrieval. The problems that we consider include singing voice extraction, melodic curve representation, and database indexing. Initially, polyphonic songs are decomposed into singing voices and instrument sounds in both time and frequency domains based on SVM and ICA. The extracted singing voices are represented as two melodic curves that model the statistical mean and neighborhood similarity of notes. To speed up the matching between songs and query, we further adopt proportional transportation distance to index the songs as vantage point trees. Encouraging results have been obtained through experiments.
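
The indexing idea mentioned at the end of this abstract, a vantage point tree, can be sketched in a few lines. This is a generic illustration, not the paper's implementation: Euclidean distance on toy 2-D points stands in for the proportional transportation distance on melodic curves, and the names `VPNode`, `build_vp_tree` and `nearest` are hypothetical. The pruning step relies only on the triangle inequality, so any distance satisfying it could be substituted.

```python
import math
import random


def euclidean(a, b):
    """Stand-in metric; the abstract uses a different distance."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


class VPNode:
    def __init__(self, point, radius, inside, outside):
        self.point = point      # vantage point stored at this node
        self.radius = radius    # median distance from the vantage point
        self.inside = inside    # subtree of points with distance <= radius
        self.outside = outside  # subtree of points with distance > radius


def build_vp_tree(points, dist, rng=None):
    """Recursively split points by their distance to a random vantage point."""
    if rng is None:
        rng = random.Random(0)
    points = list(points)
    if not points:
        return None
    vantage = points.pop(rng.randrange(len(points)))
    if not points:
        return VPNode(vantage, 0.0, None, None)
    dists = [dist(vantage, p) for p in points]
    radius = sorted(dists)[len(dists) // 2]
    inside = [p for p, d in zip(points, dists) if d <= radius]
    outside = [p for p, d in zip(points, dists) if d > radius]
    return VPNode(vantage, radius,
                  build_vp_tree(inside, dist, rng),
                  build_vp_tree(outside, dist, rng))


def nearest(node, query, dist, best=None):
    """Return (distance, point) for the stored point closest to query."""
    if node is None:
        return best
    d = dist(query, node.point)
    if best is None or d < best[0]:
        best = (d, node.point)
    # Visit the side that contains the query first.
    if d <= node.radius:
        first, second = node.inside, node.outside
    else:
        first, second = node.outside, node.inside
    best = nearest(first, query, dist, best)
    # By the triangle inequality, the other side can hold a closer point
    # only if the query is within best[0] of the splitting radius.
    if abs(d - node.radius) <= best[0]:
        best = nearest(second, query, dist, best)
    return best


# Toy data set and query.
points = [((x * 0.37) % 5, (x * 0.91) % 7) for x in range(50)]
tree = build_vp_tree(points, euclidean)
found = nearest(tree, (1.0, 2.0), euclidean)
```

The speed-up comes from the pruning test: whenever the query lies farther from the splitting radius than the best match found so far, an entire subtree is skipped without computing any distances inside it.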


A Robust Dissolve Detector By Support Vector Machine, Chong-Wah Ngo Nov 2003

Research Collection School Of Computing and Information Systems

In this paper, we propose a novel approach for the robust detection and classification of dissolve sequences in videos. Our approach is based on the multi-resolution representation of temporal slices extracted from 3D image volume. At the low-resolution (LR) scale, the problem of dissolve detection is reduced to cut transition detection. At the high-resolution (HR) space, Gabor wavelet features are computed for regions that surround the cuts located at LR scale. The computed features are then input to support vector machines for pattern classification. Encouraging results have been obtained through experiments.


Synchronization Of Lecture Videos And Electronic Slides By Video Text Analysis, Feng Wang, Chong-Wah Ngo, Ting-Chuen Pong Nov 2003

Research Collection School Of Computing and Information Systems

An essential goal of structuring lecture videos captured in live presentation is to provide a synchronized view of video clips and electronic slides. This paper presents an automatic approach to match video clips and slides based on the analysis of text embedded in lecture videos. We describe a method to reconstruct high-resolution video texts from multiple keyframes for robust OCR recognition. A two-stage matching algorithm based on the title and content similarity measures between video clips and slides is also proposed.


Motion-Based Video Representation For Scene Change Detection, Chong-Wah Ngo, Ting-Chuen Pong, Hong-Jiang Zhang, Roland T. Chin Sep 2000

Research Collection School Of Computing and Information Systems

We present a newly developed scheme for automatically partitioning videos into scenes. A scene is generally referred to as a group of shots taking place at the same site. In this paper, we first propose a motion annotation algorithm based on the analysis of spatiotemporal image volumes. The algorithm characterizes the motions within shots by extracting and analyzing the motion trajectories encoded in the temporal slices of image volumes. A motion-based keyframe computing and selection strategy is thus proposed to compactly represent the content of shots. With these techniques, we further present a scene change detection algorithm …