Open Access. Powered by Scholars. Published by Universities.®

Physical Sciences and Mathematics Commons

Programming Languages and Compilers

Research Collection School Of Computing and Information Systems

2024

Fill-in-the-blank

Full-Text Articles in Physical Sciences and Mathematics

Enhancing Visual Grounding In Vision-Language Pre-Training With Position-Guided Text Prompts, Alex Jinpeng Wang, Pan Zhou, Mike Zheng Shou, Shuicheng Yan, May 2024

Research Collection School Of Computing and Information Systems

Vision-Language Pre-Training (VLP) has demonstrated remarkable potential in aligning image and text pairs, paving the way for a wide range of cross-modal learning tasks. Nevertheless, we have observed that VLP models often fall short in terms of visual grounding and localization capabilities, which are crucial for many downstream tasks, such as visual reasoning. In response, we introduce a novel Position-guided Text Prompt (PTP) paradigm to bolster the visual grounding abilities of cross-modal models trained with VLP. In the VLP phase, PTP divides an image into N × N blocks and employs a widely used object detector to identify objects …
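
The block-division step mentioned in the abstract can be illustrated with a minimal sketch. The abstract is truncated and does not say how a detected object is assigned to a block, so the rule used here (the block containing the center of the object's bounding box, with blocks numbered in row-major order) is an assumption, and the function name `block_index` and its parameters are hypothetical rather than taken from the paper.

```python
from typing import Tuple


def block_index(box: Tuple[float, float, float, float],
                image_size: Tuple[int, int],
                n: int) -> int:
    """Return the row-major index (0 .. n*n - 1) of the grid block
    that contains the center of a bounding box.

    box        -- (x_min, y_min, x_max, y_max) in pixels (assumed format)
    image_size -- (width, height) in pixels
    n          -- number of blocks along each side of the N x N grid
    """
    x_min, y_min, x_max, y_max = box
    width, height = image_size

    # Assumption: an object belongs to the block holding its box center.
    cx = (x_min + x_max) / 2.0
    cy = (y_min + y_max) / 2.0

    # Scale the center to grid coordinates, then clamp to [0, n - 1] so that
    # centers on or beyond the image border still map to a valid block.
    col = max(0, min(int(cx * n / width), n - 1))
    row = max(0, min(int(cy * n / height), n - 1))

    return row * n + col
```

For example, on a 90 × 90 image with N = 3, `block_index((30, 30, 60, 60), (90, 90), 3)` returns 4, the middle block. Other assignment rules, such as choosing the block with the largest overlap with the box, would fit the same description equally well.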