Open Access. Powered by Scholars. Published by Universities.®

Computer Vision Faculty Publications

Articles 1 - 6 of 6

Full-Text Articles in Physical Sciences and Mathematics

Intermediate Prototype Mining Transformer For Few-Shot Semantic Segmentation, Yuanwei Liu, Nian Liu, Xiwen Yao, Junwei Han Dec 2022

Few-shot semantic segmentation aims to segment the target objects in a query image given only a few annotated support images. Most previous works strive to mine more effective category information from the support set to match the corresponding objects in the query. However, they all ignore the category information gap between query and support images: when the objects in them show large intra-class diversity, forcibly migrating category information from the support to the query is ineffective. To solve this problem, we are the first to introduce an intermediate prototype for mining both deterministic category information from the support and adaptive …
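Prototype-based few-shot segmentation methods like the one above typically begin by compressing support features into a category prototype via masked average pooling. The sketch below illustrates only that common building block, not the paper's intermediate-prototype transformer; the shapes and names are illustrative:

```python
import numpy as np

def masked_average_pooling(features, mask):
    """Compress a support feature map (C, H, W) into a single category
    prototype (C,) by averaging only the positions inside the annotated
    object mask (H, W). A generic building block of prototype-based
    few-shot segmentation; shapes here are illustrative."""
    weights = mask / (mask.sum() + 1e-8)       # normalize mask weights to sum to 1
    return (features * weights[None, :, :]).sum(axis=(1, 2))

# Toy support example: 2-channel features; the object occupies the left column.
feats = np.array([[[1.0, 5.0], [1.0, 5.0]],
                  [[2.0, 9.0], [2.0, 9.0]]])
mask = np.array([[1.0, 0.0], [1.0, 0.0]])
proto = masked_average_pooling(feats, mask)    # only masked pixels contribute
```

The resulting prototype is then matched against query features; the gap the paper addresses arises precisely because this support-only average may not fit a diverse query object.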


Transresnet: Integrating The Strengths Of Vits And Cnns For High Resolution Medical Image Segmentation Via Feature Grafting, Muhammad Hamza Sharif, Dmitry Demidov, Asif Hanif, Mohammad Yaqub, Min Xu Nov 2022

High-resolution images are preferable in the medical imaging domain, as they significantly improve the diagnostic capability of the underlying method. In particular, high resolution helps substantially in improving automatic image segmentation. However, most existing deep learning-based techniques for medical image segmentation are optimized for input images with small spatial dimensions and perform poorly on high-resolution images. To address this shortcoming, we propose a parallel-in-branch architecture called TransResNet, which combines a Transformer and a CNN in parallel to extract features from multi-resolution images independently. In TransResNet, we introduce the Cross Grafting Module (CGM), which generates grafted features, enriched in both …
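The core idea of fusing two parallel branches can be sketched as a projection over the channel-wise concatenation of their feature maps. This is a simplified stand-in, not the actual Cross Grafting Module; the weights and shapes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def graft(cnn_feat, vit_feat, w):
    """Fuse same-sized CNN and Transformer feature maps (C, H, W each)
    by concatenating them along channels and applying a 1x1 projection
    w of shape (C, 2C). A simplified stand-in for cross-branch feature
    grafting; the real CGM is more involved."""
    stacked = np.concatenate([cnn_feat, vit_feat], axis=0)      # (2C, H, W)
    c2, h, wd = stacked.shape
    # A 1x1 convolution over channels is a matrix multiply per position.
    return (w @ stacked.reshape(c2, h * wd)).reshape(-1, h, wd)  # (C, H, W)

cnn_feat = rng.standard_normal((4, 8, 8))   # local-detail branch output
vit_feat = rng.standard_normal((4, 8, 8))   # global-context branch output
w = np.full((4, 8), 0.125)                  # toy weights; learned in practice
fused = graft(cnn_feat, vit_feat, w)
```

The grafted map keeps the spatial resolution of the inputs while mixing information from both branches at every position.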


Face Pyramid Vision Transformer, Khawar Islam, Muhammad Zaigham Zaheer, Arif Mahmood Nov 2022

A novel Face Pyramid Vision Transformer (FPVT) is proposed to learn discriminative multi-scale facial representations for face recognition and verification. In FPVT, Face Spatial Reduction Attention (FSRA) and Face Dimensionality Reduction (FDR) layers are employed to make the feature maps compact, thus reducing the computation. An Improved Patch Embedding (IPE) algorithm is proposed to exploit the benefits of CNNs in ViTs (e.g., shared weights, local context, and receptive fields), modeling everything from lower-level edges to higher-level semantic primitives. Within the FPVT framework, a Convolutional Feed-Forward Network (CFFN) is proposed that extracts locality information to learn low-level facial information. The proposed FPVT …
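Spatial-reduction attention, the general mechanism behind layers like FSRA, pools the keys and values before attention so the cost drops from O(N²) to O(N·N/r). A minimal single-head sketch, without the learned projections a real layer would use:

```python
import numpy as np

def spatial_reduction_attention(x, r):
    """Single-head self-attention over a token sequence x of shape
    (N, C) where keys/values are average-pooled by factor r first,
    shrinking the attention map from (N, N) to (N, N/r). N must be
    divisible by r. Illustrative only: learned Q/K/V projections
    are omitted."""
    n, c = x.shape
    kv = x.reshape(n // r, r, c).mean(axis=1)            # pooled keys/values
    scores = x @ kv.T / np.sqrt(c)                       # (N, N/r) similarity
    attn = np.exp(scores - scores.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)              # row-wise softmax
    return attn @ kv                                     # (N, C) output tokens

tokens = np.random.default_rng(1).standard_normal((16, 8))
out = spatial_reduction_attention(tokens, r=4)           # attends over 4 pooled tokens
```

With r = 4, each of the 16 query tokens attends over only 4 pooled tokens, which is what makes pyramid-style ViTs tractable at high resolution.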


Cmr3d: Contextualized Multi-Stage Refinement For 3d Object Detection, Dhanalaxmi Gaddam, Jean Lahoud, Fahad Shahbaz Khan, Rao Anwer, Hisham Cholakkal Sep 2022

Existing deep learning-based 3D object detectors typically rely on the appearance of individual objects and do not explicitly pay attention to the rich contextual information of the scene. In this work, we propose the Contextualized Multi-Stage Refinement for 3D Object Detection (CMR3D) framework, which takes a 3D scene as input and strives to explicitly integrate useful contextual information of the scene at multiple levels to predict a set of object bounding boxes along with their corresponding semantic labels. To this end, we propose to utilize a context enhancement network that captures contextual information at different levels of granularity, followed by a …
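One common way a refinement stage injects scene context is to append a global scene descriptor to every per-proposal feature before predicting box refinements. This is a hypothetical simplification of that pattern, not the CMR3D architecture:

```python
import numpy as np

def refine_with_context(box_feats, scene_feat):
    """Append a global scene context vector (C_ctx,) to each of the N
    per-proposal features (N, C_box), so a downstream head can refine
    boxes with scene-level information. A hypothetical sketch of
    context-aware second-stage refinement."""
    n = box_feats.shape[0]
    ctx = np.tile(scene_feat, (n, 1))                  # broadcast context to all proposals
    return np.concatenate([box_feats, ctx], axis=1)    # (N, C_box + C_ctx)

box_feats = np.zeros((5, 16))   # 5 first-stage proposal features
scene_feat = np.ones(8)         # global scene descriptor
enriched = refine_with_context(box_feats, scene_feat)
```

After this step, every proposal's prediction can depend on what else is in the scene, not just on the object's own appearance.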


Cocoa: Context-Conditional Adaptation For Recognizing Unseen Classes In Unseen Domains, Puneet Mangla, Shivam Chandhok, Vineeth N. Balasubramanian, Fahad Shahbaz Khan Feb 2022

Recent progress towards designing models that can generalize to unseen domains (i.e., domain generalization) or unseen classes (i.e., zero-shot learning) has sparked interest in building models that can tackle both domain shift and semantic shift simultaneously (i.e., zero-shot domain generalization). For models to generalize to unseen classes in unseen domains, it is crucial to learn feature representations that preserve class-level (domain-invariant) as well as domain-specific information. Motivated by the success of generative zero-shot approaches, we propose a feature-generative framework integrated with a COntext COnditional Adaptive (COCOA) Batch-Normalization layer to seamlessly integrate class-level semantic and domain-specific information. The generated visual features …
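Conditional batch normalization, the general idea underlying a context-conditional BN layer, normalizes features as usual but predicts the scale and shift from a context vector instead of learning them as free parameters. A minimal sketch with illustrative projection weights:

```python
import numpy as np

def context_conditional_bn(x, ctx, w_gamma, w_beta, eps=1e-5):
    """Batch-normalize features x of shape (N, C), then scale and shift
    with gamma/beta predicted from a context vector ctx via projections
    w_gamma, w_beta of shape (len(ctx), C). The projections here are
    illustrative; in practice they are learned."""
    mean = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    x_hat = (x - mean) / np.sqrt(var + eps)   # standard BN normalization
    gamma = ctx @ w_gamma                     # (C,) context-dependent scale
    beta = ctx @ w_beta                       # (C,) context-dependent shift
    return gamma * x_hat + beta

x = np.random.default_rng(2).standard_normal((32, 6))   # a batch of features
ctx = np.ones(4)                                        # domain/class context embedding
out = context_conditional_bn(x, ctx, np.zeros((4, 6)), np.ones((4, 6)) * 0.25)
```

Because gamma and beta depend on the context embedding, the same layer can modulate features differently for different domains or classes.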


Multi-Modal Transformers Excel At Class-Agnostic Object Detection, Muhammad Maaz, Hanoona Bangalath Rasheed, Salman Hameed Khan, Fahad Shahbaz Khan, Rao Muhammad Anwer, Ming-Hsuan Yang Nov 2021

What constitutes an object? This has been a longstanding question in computer vision. Towards this goal, numerous learning-free and learning-based approaches have been developed to score objectness. However, they generally do not scale well across new domains and for unseen objects. In this paper, we advocate that existing methods lack a top-down supervision signal governed by human-understandable semantics. To bridge this gap, we explore recent Multi-modal Vision Transformers (MViTs) that have been trained with aligned image-text pairs. Our extensive experiments across various domains and novel objects show the state-of-the-art performance of MViTs in localizing generic objects in images. Based on …
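With an image-text aligned model, class-agnostic objectness can be scored as the similarity between each region embedding and the embedding of a generic text query. A hypothetical sketch of that scoring step; the embeddings and query here are toy stand-ins, not outputs of any specific MViT:

```python
import numpy as np

def score_proposals(region_embs, text_emb):
    """Score N region embeddings (N, D) against the embedding (D,) of a
    generic text query (e.g., something like "all objects") by cosine
    similarity. A hypothetical sketch of text-driven objectness scoring
    with an image-text aligned model."""
    r = region_embs / np.linalg.norm(region_embs, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb)
    return r @ t                              # (N,) similarity scores in [-1, 1]

regions = np.array([[1.0, 0.0],              # aligned with the query direction
                    [0.0, 1.0],              # orthogonal to the query
                    [1.0, 1.0]])             # partially aligned
query = np.array([1.0, 0.0])
scores = score_proposals(regions, query)
```

Ranking proposals by these scores gives a top-down, semantics-driven objectness signal of the kind the paper argues prior bottom-up methods lack.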