Open Access. Powered by Scholars. Published by Universities.®

Physical Sciences and Mathematics Commons

Artificial Intelligence and Robotics

Research Collection School Of Computing and Information Systems

2023

Computational linguistics

Full-Text Articles in Physical Sciences and Mathematics

Modularized Zero-Shot VQA With Pre-Trained Models, Rui Cao, Jing Jiang Jul 2023

Research Collection School Of Computing and Information Systems

Large-scale pre-trained models (PTMs) show great zero-shot capabilities. In this paper, we study how to leverage them for zero-shot visual question answering (VQA). Our approach is motivated by a few observations. First, VQA questions often require multiple steps of reasoning, a capability that most PTMs still lack. Second, different steps in VQA reasoning chains require different skills such as object detection and relational reasoning, but a single PTM may not possess all these skills. Third, recent work on zero-shot VQA does not explicitly consider multi-step reasoning chains, which makes it less interpretable than a decomposition-based approach. We propose …


CONE: An Efficient Coarse-To-Fine Alignment Framework For Long Video Temporal Grounding, Zhijian Hou, Wanjun Zhong, Lei Ji, Difei Gao, Kun Yan, Wing-Kwong Chan, Chong-Wah Ngo, Mike Z. Shou, Nan Duan Jul 2023

Research Collection School Of Computing and Information Systems

This paper tackles an emerging and challenging problem of long video temporal grounding (VTG), which localizes video moments related to a natural language (NL) query. Compared with short videos, long videos are also in high demand but less explored, and they bring new challenges: higher inference computation cost and weaker multi-modal alignment. To address these challenges, we propose CONE, an efficient COarse-to-fiNE alignment framework. CONE is a plug-and-play framework on top of existing VTG models to handle long videos through a sliding window mechanism. Specifically, CONE (1) introduces a query-guided window selection strategy to speed up inference, and (2) proposes a …