AGO: Adaptive Grounding for Open World 3D Occupancy Prediction
Abstract
Open-world 3D semantic occupancy prediction aims to generate a voxelized 3D representation from sensor inputs while recognizing both known and unknown objects. Transferring open-vocabulary knowledge from vision-language models (VLMs) offers a promising direction but remains challenging: methods that supervise with VLM-derived 2D pseudo-labels are limited to a predefined label space and lack general prediction capabilities, while direct alignment with pretrained image embeddings fails to achieve reliable performance because image and text representations in VLMs are often inconsistent. To address these challenges, we propose AGO, a novel 3D occupancy prediction framework with adaptive grounding to handle diverse open-world scenarios. AGO first encodes surrounding images and class prompts into 3D and text embeddings, respectively, leveraging similarity-based grounding training with 3D pseudo-labels. Additionally, a modality adapter maps 3D embeddings into a space aligned with VLM-derived image embeddings, reducing modality gaps. Experiments on Occ3D-nuScenes show that AGO improves unknown object prediction in zero-shot and few-shot transfer while achieving state-of-the-art closed-world self-supervised performance, surpassing prior methods by 4.09 mIoU.
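To make the prompt-grounded prediction concrete, below is a minimal sketch of voxel-level open-vocabulary classification by text similarity. The tensor names (`voxel_feats`, `text_feats`), the similarity threshold for flagging unknowns, and the prompt template are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch: classifying voxel embeddings against arbitrary class prompts.
# Assumption (not from the paper): voxels whose best cosine similarity falls
# below a fixed threshold are flagged as "unknown".
import torch
import torch.nn.functional as F

@torch.no_grad()
def classify_voxels(voxel_feats, text_feats, unknown_thresh=0.2):
    """voxel_feats: (N, D) 3D embeddings; text_feats: (C, D) prompt embeddings."""
    v = F.normalize(voxel_feats, dim=-1)
    t = F.normalize(text_feats, dim=-1)
    sim = v @ t.T                       # (N, C) cosine similarities
    score, label = sim.max(dim=-1)      # best-matching prompt per voxel
    label[score < unknown_thresh] = -1  # low-similarity voxels -> unknown
    return label

# Usage: text_feats could come from a frozen VLM text encoder over prompts
# such as "a photo of a {class}", for any label set chosen at test time.
```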
Motivation
3D semantic occupancy prediction is central to scene understanding for autonomous driving, yet traditional approaches:
▸ heavily rely on extensive manual 3D annotations
▸ are constrained by predefined closed semantic spaces
Existing VLM-based methods:
▸ rely on fixed-class pseudo-labels → struggle to predict novel classes
▸ build on image-text alignment → suffer from severe mismatches due to issues like modality gaps

Goal: Enable open-world 3D semantic occupancy prediction with flexible adaptation to unknowns.
Method
The upper part illustrates how 3D pseudo-labels and image embeddings are generated from pre-trained VLMs during training. The lower part depicts the main architecture of our AGO framework, which comprises a frozen pre-trained text encoder, a vision-centric 3D encoder, a modality adapter, and an open-world identifier. The detailed view in the middle shows our training paradigm, which consists of noise-augmented grounding training and adaptive image-embedding alignment.
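As a rough illustration of the two training signals described above, the sketch below pairs a similarity-based grounding loss on pseudo-labeled voxels with a cosine-alignment loss between adapted 3D embeddings and VLM image embeddings. The Gaussian noise injection, temperature, adapter architecture, and all tensor names are assumptions for illustration; the paper defines the actual formulation.

```python
# Sketch of the two training objectives (illustrative, not the exact paper losses).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityAdapter(nn.Module):
    """Maps 3D voxel embeddings into the VLM image-embedding space."""
    def __init__(self, dim_3d, dim_vlm):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(dim_3d, dim_vlm),
                                  nn.GELU(),
                                  nn.Linear(dim_vlm, dim_vlm))

    def forward(self, x):
        return self.proj(x)

def grounding_loss(voxel_feats, text_feats, pseudo_labels, noise_std=0.1, tau=0.07):
    """Noise-augmented similarity grounding against frozen text embeddings.
    voxel_feats: (N, D); text_feats: (C, D); pseudo_labels: (N,) class ids."""
    noisy = voxel_feats + noise_std * torch.randn_like(voxel_feats)  # noise augmentation
    logits = F.normalize(noisy, dim=-1) @ F.normalize(text_feats, dim=-1).T / tau
    return F.cross_entropy(logits, pseudo_labels)

def alignment_loss(adapted_feats, image_feats):
    """Pull adapter outputs toward VLM-derived image embeddings."""
    return 1.0 - F.cosine_similarity(adapted_feats, image_feats, dim=-1).mean()
```

Keeping the grounding loss in the text space while a separate adapter absorbs the image-text gap is what lets the two signals coexist without the inconsistent VLM representations corrupting the grounded predictions.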

Quantitative Results

In closed-world scenarios, AGO demonstrates substantial improvements across both static and dynamic categories, surpassing prior self-supervised methods by 4.09 mIoU.

In open-world scenes, AGO exhibits superior zero-shot performance while rapidly adapting to novel categories with only a few shots.
Qualitative Results


Poster
BibTeX
@inproceedings{li2025ago,
  title={AGO: Adaptive Grounding for Open World 3D Occupancy Prediction},
  author={Li, Peizheng and Ding, Shuxiao and Zhou, You and Zhang, Qingwen and Inak, Onat and Triess, Larissa and Hanselmann, Niklas and Cordts, Marius and Zell, Andreas},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  year={2025}
}