AGO: Adaptive Grounding for Open World 3D Occupancy Prediction
Abstract
Open-world 3D semantic occupancy prediction aims to generate a voxelized 3D representation from sensor inputs while recognizing both known and unknown objects. Transferring open-vocabulary knowledge from vision-language models (VLMs) offers a promising direction but remains challenging: methods that supervise with VLM-derived 2D pseudo-labels are limited to a predefined label space and lack general prediction capabilities, while direct alignment with pretrained image embeddings fails to achieve reliable performance because image and text representations in VLMs are often inconsistent. To address these challenges, we propose AGO, a novel 3D occupancy prediction framework with adaptive grounding to handle diverse open-world scenarios. AGO first encodes surrounding images and class prompts into 3D and text embeddings, respectively, leveraging similarity-based grounding training with 3D pseudo-labels. Additionally, a modality adapter maps 3D embeddings into a space aligned with VLM-derived image embeddings, reducing modality gaps. Experiments on Occ3D-nuScenes show that AGO improves unknown object prediction in zero-shot and few-shot transfer while achieving state-of-the-art closed-world self-supervised performance, surpassing prior methods by 4.09 mIoU.
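To make the prompt-grounded prediction concrete, below is a minimal sketch of voxel-level open-vocabulary classification by text similarity. The tensor names (`voxel_feats`, `text_feats`), the similarity threshold for flagging unknowns, and the prompt template are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch: classifying voxel embeddings against arbitrary class prompts.
# Assumption (not from the paper): voxels whose best cosine similarity falls
# below a fixed threshold are flagged as "unknown".
import torch
import torch.nn.functional as F

@torch.no_grad()
def classify_voxels(voxel_feats, text_feats, unknown_thresh=0.2):
    """voxel_feats: (N, D) 3D embeddings; text_feats: (C, D) prompt embeddings."""
    v = F.normalize(voxel_feats, dim=-1)
    t = F.normalize(text_feats, dim=-1)
    sim = v @ t.T                       # (N, C) cosine similarities
    score, label = sim.max(dim=-1)      # best-matching prompt per voxel
    label[score < unknown_thresh] = -1  # low-similarity voxels -> unknown
    return label

# Usage: text_feats could come from a frozen VLM text encoder over prompts
# such as "a photo of a {class}", for any label set chosen at test time.
```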
Motivation
3D semantic occupancy prediction is central to scene understanding for autonomous driving, yet traditional approaches:
▸ heavily rely on extensive manual 3D annotations
▸ are constrained by predefined closed semantic spaces
Existing VLM-based methods:
▸ rely on fixed-class pseudo-labels → struggle to predict novel classes
▸ build on image-text alignment → suffer from severe mismatches due to issues like modality gaps

Goal: Enable open-world 3D semantic occupancy prediction with flexible adaptation to unknowns.
Method
The upper part illustrates how 3D pseudo-labels and image embeddings are generated from pre-trained VLMs during training. The lower part depicts the main architecture of our AGO framework, which comprises a frozen pre-trained text encoder, a vision-centric 3D encoder, a modality adapter, and an open-world identifier. The detailed view in the middle shows our training paradigm, which consists of noise-augmented grounding training and adaptive image-embedding alignment.
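As a rough illustration of the two training signals described above, the sketch below pairs a similarity-based grounding loss on pseudo-labeled voxels with a cosine-alignment loss between adapted 3D embeddings and VLM image embeddings. The Gaussian noise injection, temperature, adapter architecture, and all tensor names are assumptions for illustration; the paper defines the actual formulation.

```python
# Sketch of the two training objectives (illustrative, not the exact paper losses).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityAdapter(nn.Module):
    """Maps 3D voxel embeddings into the VLM image-embedding space."""
    def __init__(self, dim_3d, dim_vlm):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(dim_3d, dim_vlm),
                                  nn.GELU(),
                                  nn.Linear(dim_vlm, dim_vlm))

    def forward(self, x):
        return self.proj(x)

def grounding_loss(voxel_feats, text_feats, pseudo_labels, noise_std=0.1, tau=0.07):
    """Noise-augmented similarity grounding against frozen text embeddings.
    voxel_feats: (N, D); text_feats: (C, D); pseudo_labels: (N,) class ids."""
    noisy = voxel_feats + noise_std * torch.randn_like(voxel_feats)  # noise augmentation
    logits = F.normalize(noisy, dim=-1) @ F.normalize(text_feats, dim=-1).T / tau
    return F.cross_entropy(logits, pseudo_labels)

def alignment_loss(adapted_feats, image_feats):
    """Pull adapter outputs toward VLM-derived image embeddings."""
    return 1.0 - F.cosine_similarity(adapted_feats, image_feats, dim=-1).mean()
```

Keeping the grounding loss in the text space while a separate adapter absorbs the image-text gap is what lets the two signals coexist without the inconsistent VLM representations corrupting the grounded predictions.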

Quantitative Results

In closed-world scenarios, AGO demonstrates substantial improvements across both static and dynamic categories, surpassing prior self-supervised methods by 4.09 mIoU.

In open-world scenes, AGO exhibits superior zero-shot performance while rapidly adapting to novel categories with only a few shots.
Qualitative Results


Poster
BibTeX
@inproceedings{li2025ago,
  title={AGO: Adaptive Grounding for Open World 3D Occupancy Prediction},
  author={Li, Peizheng and Ding, Shuxiao and Zhou, You and Zhang, Qingwen and Inak, Onat and Triess, Larissa and Hanselmann, Niklas and Cordts, Marius and Zell, Andreas},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  year={2025}
}