Open Access. Powered by Scholars. Published by Universities.®
Physical Sciences and Mathematics Commons™
Full-Text Articles in Physical Sciences and Mathematics
Enhancing Visual Grounding In Vision-Language Pre-Training With Position-Guided Text Prompts, Alex Jinpeng Wang, Pan Zhou, Mike Zheng Shou, Shuicheng Yan
Research Collection School Of Computing and Information Systems
Vision-Language Pre-Training (VLP) has demonstrated remarkable potential in aligning image and text pairs, paving the way for a wide range of cross-modal learning tasks. Nevertheless, we have observed that VLP models often fall short in visual grounding and localization, capabilities that are crucial for many downstream tasks such as visual reasoning. In response, we introduce a novel Position-guided Text Prompt (PTP) paradigm to bolster the visual grounding abilities of cross-modal models trained with VLP. In the VLP phase, PTP divides an image into N × N blocks and employs a widely used object detector to identify objects …
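The block division described in the abstract can be illustrated with a small sketch. It covers only the geometric step of assigning a detected object to one of the N × N blocks. Because the abstract is truncated, the rest is assumed rather than taken from the paper: the function names `box_center` and `block_index`, the `(x_min, y_min, x_max, y_max)` box format, the use of the box center to choose a block, and the row-major block numbering are all hypothetical choices made for illustration.

```python
def box_center(box):
    """Return the center (cx, cy) of a box given as (x_min, y_min, x_max, y_max).

    The box format is an assumption; detectors differ in their output layout.
    """
    x_min, y_min, x_max, y_max = box
    return (x_min + x_max) / 2.0, (y_min + y_max) / 2.0


def block_index(cx, cy, width, height, n):
    """Return the index of the block containing the point (cx, cy).

    The image is split into an n x n grid. Blocks are numbered in row-major
    order starting at 0 (an assumed convention, not one stated in the source).
    """
    if n <= 0 or width <= 0 or height <= 0:
        raise ValueError("n, width and height must be positive")
    if not (0 <= cx <= width and 0 <= cy <= height):
        raise ValueError("point lies outside the image")

    # Scale the coordinate into grid units, then clamp so that points on the
    # right or bottom edge fall into the last block instead of block n.
    col = min(int(cx * n / width), n - 1)
    row = min(int(cy * n / height), n - 1)
    return row * n + col


# Usage: an object detected in the middle of a 300 x 300 image, 3 x 3 grid.
cx, cy = box_center((100, 100, 200, 200))
print(block_index(cx, cy, 300, 300, 3))  # prints 4, the central block
```

Using the box center is only one possible assignment rule. An object that spans several blocks could instead be assigned to the block with the largest overlap, and the abstract as shown does not say which rule PTP uses.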