COMPUTER VISION

Exploring Open-Vocabulary Semantic Segmentation from CLIP Vision Encoder Distillation Only

September 29, 2023

Abstract

Semantic segmentation is a crucial task in computer vision that involves segmenting images into semantically meaningful regions at the pixel level. However, existing approaches often rely on expensive human annotations as supervision for model training, limiting their scalability to large, unlabeled datasets. To address this challenge, we present ZeroSeg, a novel method that leverages the existing pretrained vision-language (VL) model (e.g. CLIP vision encoder) to train open-vocabulary zero-shot semantic segmentation models. Although acquired extensive knowledge of visual concepts, it is non-trivial to exploit knowledge from these VL models to the task of semantic segmentation, as they are usually trained at an image level. ZeroSeg overcomes this by distilling the visual concepts learned by VL models into a set of segment tokens, each summarizing a localized region of the target image. We evaluate ZeroSeg on multiple popular segmentation benchmarks, including PASCAL VOC 2012, PASCAL Context, and COCO, in a zero-shot manner. Our approach achieves state-of-the-art performance when compared to other zero-shot segmentation methods under the same training data, while also performing competitively compared to strongly supervised methods. Finally, we also demonstrated the effectiveness of ZeroSeg on open-vocabulary segmentation, through both human studies and qualitative visualizations. The code is publicly available at https://github.com/facebookresearch/ZeroSeg

Download the Paper

AUTHORS

Written by

Jun Chen

Zhicheng Yan

Chenchen Zhu

Fanyi Xiao

Sean Chang Culatana

MOHAMED ELHOSEINY

Publisher

ICCV

Research Topics

Computer Vision

Related Publications

April 18, 2024

COMPUTER VISION

Imagine Flash: Accelerating Emu Diffusion Models with Backward Distillation

Jonas Kohler, Albert Pumarola, Edgar Schoenfeld, Artsiom Sanakoyeu, Roshan Sumbaly, Peter Vajda, Ali Thabet

April 18, 2024

March 20, 2024

COMPUTER VISION

SceneScript: Reconstructing Scenes With An Autoregressive Structured Language Model

Armen Avetisyan, Chris Xie, Henry Howard-Jenkins, Tsun-Yi Yang, Samir Aroudj, Suvam Patra, Fuyang Zhang, Duncan Frost, Luke Holland, Campbell Orme, Jakob Julian Engel, Edward Miller, Richard Newcombe, Vasileios Balntas

March 20, 2024

February 13, 2024

GRAPHICS

COMPUTER VISION

IM-3D: Iterative Multiview Diffusion and Reconstruction for High-Quality 3D Generation

Luke Melas-Kyriazi, Iro Laina, Christian Rupprecht, Natalia Neverova, Andrea Vedaldi, Oran Gafni, Filippos Kokkinos

February 13, 2024

January 25, 2024

COMPUTER VISION

LRR: Language-Driven Resamplable Continuous Representation against Adversarial Tracking Attacks

Felix Xu, Di Lin, Jianjun Zhao, Jianlang Chen, Lei Ma, Qing Guo, Wei Feng, Xuhong Ren

January 25, 2024

Help Us Pioneer The Future of AI

We share our open source frameworks, tools, libraries, and models for everything from research exploration to large-scale production deployment.