AGO: Adaptive Grounding for Open World 3D Occupancy Prediction

Mercedes-Benz AG¹ · University of Tübingen² · Tübingen AI Center³
University of Bonn⁴ · RPL, KTH Royal Institute of Technology⁵ · TU Berlin⁶
Accepted to ICCV 2025
AGO Teaser Figure
In open-world 3D semantic occupancy prediction: (a) supervision based on fixed-class pseudo-labels struggles to predict novel categories such as “construction vehicle” and “vegetation”; (b) similarity-based alignment suffers from significant mismatches due to issues like the modality discrepancy, leading to confusion between, e.g., “sidewalk” and “driveable surface”; (c) our proposed adaptive grounding flexibly accommodates both known and unknown objects, achieving more precise open-world occupancy prediction.

Abstract

Open-world 3D semantic occupancy prediction aims to generate a voxelized 3D representation from sensor inputs while recognizing both known and unknown objects. Transferring open-vocabulary knowledge from vision-language models (VLMs) offers a promising direction but remains challenging: methods based on VLM-derived 2D pseudo-labels with traditional supervision are limited by a predefined label space and lack general prediction capabilities, while direct alignment with pretrained image embeddings often fails to achieve reliable performance due to inconsistent image and text representations in VLMs. To address these challenges, we propose AGO, a novel 3D occupancy prediction framework with adaptive grounding to handle diverse open-world scenarios. AGO first encodes surrounding images and class prompts into 3D and text embeddings, respectively, leveraging similarity-based grounding training with 3D pseudo-labels. Additionally, a modality adapter maps 3D embeddings into a space aligned with VLM-derived image embeddings, reducing modality gaps. Experiments on Occ3D-nuScenes show that AGO improves unknown-object prediction in zero-shot and few-shot transfer while achieving state-of-the-art closed-world self-supervised performance, surpassing prior methods by 4.09 mIoU.
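A minimal PyTorch-style sketch of the two training signals described above: a similarity-based grounding loss between voxel and class-prompt (text) embeddings supervised by VLM-derived 3D pseudo-labels, and an alignment loss that pulls adapted 3D embeddings toward VLM image embeddings. The names (ModalityAdapter, grounding_loss, alignment_loss), the MLP adapter design, the temperature value, and the cosine-distance form of the alignment loss are illustrative assumptions, not the paper's exact implementation.

import torch
import torch.nn.functional as F


class ModalityAdapter(torch.nn.Module):
    # Hypothetical MLP mapping 3D voxel embeddings into the VLM image-embedding space.
    def __init__(self, dim_3d: int, dim_vlm: int):
        super().__init__()
        self.mlp = torch.nn.Sequential(
            torch.nn.Linear(dim_3d, dim_vlm),
            torch.nn.GELU(),
            torch.nn.Linear(dim_vlm, dim_vlm),
        )

    def forward(self, voxel_emb: torch.Tensor) -> torch.Tensor:
        return self.mlp(voxel_emb)


def grounding_loss(voxel_emb, text_emb, pseudo_labels, tau=0.07):
    # Similarity-based grounding: classify each voxel against the class-prompt
    # embeddings and supervise with VLM-derived 3D pseudo-labels.
    voxel_emb = F.normalize(voxel_emb, dim=-1)   # (N, D)
    text_emb = F.normalize(text_emb, dim=-1)     # (C, D)
    logits = voxel_emb @ text_emb.T / tau        # (N, C)
    return F.cross_entropy(logits, pseudo_labels)


def alignment_loss(adapted_emb, image_emb):
    # Pull adapted 3D embeddings toward pre-trained VLM image embeddings
    # (cosine distance), reducing the image-text modality gap.
    return 1.0 - F.cosine_similarity(adapted_emb, image_emb, dim=-1).mean()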

Motivation

3D semantic occupancy prediction is central to scene understanding for autonomous driving, yet traditional approaches:
  ▸ rely heavily on extensive manual 3D annotations
  ▸ are constrained by predefined, closed semantic spaces
Existing VLM-based methods:
  ▸ rely on fixed-class pseudo-labels → struggle to predict novel classes
  ▸ rely on image-text alignment → suffer from severe mismatches due to issues like modality gaps

Modality Gaps Figure

Goal: Enable open-world 3D semantic occupancy prediction with flexible adaptation to unknowns.


Method

The upper part illustrates the generation of 3D pseudo-labels and image embeddings from pre-trained VLMs during training. The lower part depicts the main architecture of our AGO framework, which comprises a frozen pre-trained text encoder, a vision-centric 3D encoder, a modality adapter, and an open-world identifier. The detailed illustration in the middle shows our training paradigm, which consists of noise-augmented grounding training and adaptive image-embedding alignment.

AGO Architecture Figure
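Below is a hedged sketch of one training step combining the two components of this paradigm, reusing grounding_loss, alignment_loss, and ModalityAdapter from the sketch in the abstract section. Where the noise is injected, its scale, and how the two losses are weighted are assumptions for illustration, not the paper's exact formulation.

import torch


def training_step(encoder_3d, adapter, images, text_emb, image_emb,
                  pseudo_labels, noise_std=0.1, w_align=1.0):
    # Vision-centric 3D encoder: surround-view images -> per-voxel 3D embeddings.
    voxel_emb = encoder_3d(images)

    # Noise-augmented grounding: perturb the 3D embeddings before the
    # similarity-based classification against frozen class-prompt embeddings
    # (the injection point and scale are assumptions).
    noisy_emb = voxel_emb + noise_std * torch.randn_like(voxel_emb)
    l_ground = grounding_loss(noisy_emb, text_emb, pseudo_labels)

    # Adaptive image-embedding alignment through the modality adapter,
    # matching VLM-derived image embeddings associated with the voxels.
    l_align = alignment_loss(adapter(voxel_emb), image_emb)

    return l_ground + w_align * l_align

At inference, per-voxel predictions can then be obtained by comparing the (adapted) voxel embeddings against text embeddings of an arbitrary prompt list, which is what allows the model to handle categories beyond a fixed label space.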

Quantitative Results

Closed-world Benchmark
3D occupancy prediction performance under the self-supervised setting on the Occ3D-nuScenes dataset.

In closed-world scenarios, AGO demonstrates substantial improvements across both static and dynamic categories.

Open-world Benchmark
3D occupancy prediction performance under the open-world setting on the Occ3D-nuScenes dataset.

In open-world scenes, AGO exhibits superior zero-shot performance while rapidly adapting to novel categories with only a few shots.


Qualitative Results

Closed-world Visualization

Open-world Visualization

Poster

BibTeX

@inproceedings{li2025ago,
  title={AGO: Adaptive Grounding for Open World 3D Occupancy Prediction},
  author={Li, Peizheng and Ding, Shuxiao and Zhou, You and Zhang, Qingwen and Inak, Onat and Triess, Larissa and Hanselmann, Niklas and Cordts, Marius and Zell, Andreas},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  year={2025}
}