Panoptic Lifting for 3D Scene Understanding with Neural Fields

CVPR 2023 Highlight

Yawar Siddiqui¹,², Lorenzo Porzi², Samuel Rota Bulò², Norman Müller¹,², Matthias Nießner¹, Angela Dai¹, Peter Kontschieder²

¹Technical University of Munich, ²Meta Reality Labs

(Work was done during Yawar’s and Norman’s internships at Meta Reality Labs Zurich as well as at TUM)

Given only RGB images of an in-the-wild scene as input, Panoptic Lifting optimizes a panoptic radiance field that can be queried for color, depth, semantics, and instances at any point in space. Our method lifts noisy, view-inconsistent, machine-generated 2D segmentation masks into a consistent 3D panoptic radiance field, without requiring further tracking supervision or 3D bounding boxes.
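As a rough illustration of how such a field can be turned into per-pixel color, depth, and semantics, the sketch below alpha-composites samples along a single ray in the usual volumetric-rendering style. The function name `render_ray` and its argument layout are hypothetical and chosen for this example; this is a minimal sketch, not the released implementation.

```python
import math

def render_ray(densities, deltas, colors, class_probs, depths):
    """Alpha-composite per-sample quantities along one ray (illustrative sketch).

    densities:   volume density at each sample
    deltas:      distance between consecutive samples
    colors:      RGB triple at each sample
    class_probs: per-class probabilities at each sample
    depths:      distance of each sample from the camera
    """
    transmittance = 1.0  # fraction of light that has not been absorbed so far
    rgb = [0.0, 0.0, 0.0]
    semantics = [0.0] * len(class_probs[0])
    depth = 0.0
    for sigma, delta, c, p, t in zip(densities, deltas, colors, class_probs, depths):
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this sample
        weight = transmittance * alpha          # contribution to the pixel
        rgb = [acc + weight * ch for acc, ch in zip(rgb, c)]
        semantics = [acc + weight * pr for acc, pr in zip(semantics, p)]
        depth += weight * t
        transmittance *= 1.0 - alpha
    return rgb, semantics, depth
```

The same weights are reused for every quantity, so color, depth, and labels for a pixel all come from the same geometry.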

[Interactive figure: neural field optimization and volumetric rendering along a novel-view trajectory. Left: the RGB input frames nearest to the selected viewpoint and their machine-generated 2D semantic and instance labels. Middle: the optimized panoptic field representation (color, semantics, and instances), shown as a mesh together with the queried viewpoint (shown as a camera). Right: color, depth, semantics, and instances rendered by our method from the novel viewpoint.]

Abstract

We propose Panoptic Lifting, a novel approach for learning panoptic 3D volumetric representations from images of in-the-wild scenes. Once trained, our model can render color images together with 3D-consistent panoptic segmentation from novel viewpoints.

Unlike existing approaches, which use 3D input directly or indirectly, our method requires only machine-generated 2D panoptic segmentation masks inferred from a pre-trained network. Our core contribution is a panoptic lifting scheme based on a neural field representation that generates a unified, multi-view-consistent 3D panoptic representation of the scene. To account for inconsistencies of 2D instance identifiers across views, we solve a linear assignment problem with a cost based on the model's current predictions and the machine-generated segmentation masks, which lets us lift 2D instances to 3D in a consistent way. We further propose and ablate contributions that make our method more robust to noisy, machine-generated labels, including test-time augmentations for confidence estimates, a segment consistency loss, bounded segmentation fields, and gradient stopping.
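The instance-matching step can be sketched as follows. For one view, each 2D instance identifier is assigned to a distinct 3D instance slot so that the total agreement between the model's current per-pixel predictions and the 2D masks is maximal. The function `match_instances`, its inputs, and the agreement score are assumptions made for this illustration; the exact cost used by the method is not given on this page. The sketch brute-forces all assignments, which is only feasible for a handful of instances. In practice a Hungarian-style solver such as `scipy.optimize.linear_sum_assignment` would be the usual choice.

```python
from itertools import permutations

def match_instances(pred_probs, labels_2d, num_slots):
    """Map each 2D instance id in one view to a distinct 3D instance slot.

    pred_probs: per-pixel list of predicted probabilities over the 3D slots
    labels_2d:  per-pixel 2D instance id from the machine-generated mask
    num_slots:  number of 3D instance slots in the field (hypothetical name)
    """
    ids = sorted(set(labels_2d))
    if len(ids) > num_slots:
        raise ValueError("more 2D instances than 3D slots")
    # score[i][k]: total predicted probability of slot k over pixels with id ids[i]
    score = [[0.0] * num_slots for _ in ids]
    for probs, label in zip(pred_probs, labels_2d):
        row = score[ids.index(label)]
        for k in range(num_slots):
            row[k] += probs[k]
    # Exhaustive search over one-to-one assignments (fine for tiny examples only).
    best_total, best_perm = float("-inf"), None
    for perm in permutations(range(num_slots), len(ids)):
        total = sum(score[i][k] for i, k in enumerate(perm))
        if total > best_total:
            best_total, best_perm = total, perm
    return dict(zip(ids, best_perm))
```

Because the assignment is recomputed from the current predictions, the arbitrary identifiers produced by the 2D network in each view never need to agree with one another.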

Experimental results validate our approach on the challenging Hypersim, Replica, and ScanNet datasets, improving by 8.4%, 13.8%, and 10.6%, respectively, in scene-level PQ over the state of the art.
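For readers unfamiliar with the metric, the standard panoptic quality (PQ) is the sum of IoUs over matched segment pairs divided by the number of true positives plus half the false positives and half the false negatives. The helper below is a minimal sketch of that standard formula only; the function name is hypothetical, and the scene-level variant reported above is not spelled out on this page.

```python
def panoptic_quality(matched_ious, num_false_pos, num_false_neg):
    """Standard PQ for one class (illustrative sketch).

    matched_ious:  IoU of every predicted/ground-truth pair counted as a match
                   (a pair is conventionally a match when its IoU exceeds 0.5)
    num_false_pos: predicted segments without a match
    num_false_neg: ground-truth segments without a match
    """
    denom = len(matched_ious) + 0.5 * num_false_pos + 0.5 * num_false_neg
    if denom == 0:
        return 0.0  # no segments at all; convention chosen for this sketch
    return sum(matched_ious) / denom
```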

Results

Scene Editing

Once trained, our method can generate novel views of a scene with object instances deleted, duplicated or manipulated under affine transformations.
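One way to picture these edits is as a wrapper around field queries: deletion zeroes the density wherever the queried instance matches the target, and duplication looks up the field at a shifted location to paste a copy of the object. The sketch below uses a toy field and a pure translation (a special case of an affine transformation). `toy_field` and `query_edited` are hypothetical names, and this is an assumed illustration rather than the method's actual editing code.

```python
def toy_field(point):
    """Toy stand-in for a trained field: a unit sphere of instance 1 at the origin."""
    inside = sum(c * c for c in point) <= 1.0
    return (10.0, 1) if inside else (0.0, 0)  # (density, instance id)

def query_edited(field, point, target_id, delete=False, offset=None):
    """Query `field` at `point` with an edit applied to instance `target_id`.

    delete: if True, the target instance becomes empty space
    offset: if given, a copy of the target instance appears shifted by `offset`
    """
    density, inst = field(point)
    if delete and inst == target_id:
        return 0.0, inst
    if offset is not None:
        # Look up where this point would come from before the shift.
        source = tuple(p - o for p, o in zip(point, offset))
        src_density, src_inst = field(source)
        if src_inst == target_id and src_density > density:
            return src_density, src_inst
    return density, inst
```

Rendering the edited scene then amounts to using `query_edited` in place of the original field when sampling along each ray.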

[Figure panels: Optimized Scene, Deletion, Duplication, Manipulation]

Effect of Components

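We combine several components to make the method robust to noisy, machine-generated 2D labels: test-time augmentations for confidence estimates, a segment consistency loss, bounded segmentation fields, and gradient stopping.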

Related Links

For more work on similar tasks, please check out:

Semantic-NeRF extends neural radiance fields (NeRF) to jointly encode semantics with appearance and geometry, given ground-truth (possibly sparse) semantic annotations in addition to RGB images.

Panoptic Neural Fields proposes an object-aware neural scene representation that decomposes a scene into a set of objects (things) and background (stuff), using machine-generated object bounding boxes and machine-generated semantic labels.

DM-NeRF tackles scene decomposition by optimizing an object identifier field for a scene given instance annotations for input frames.

Panoptic NeRF tackles a label transfer task for a scene, given a coarse panoptically segmented mesh and machine-generated 2D semantic segmentations.

NeSF produces 3D semantic fields from posed RGB images alone, generalizing over novel scenes.

BibTeX

@InProceedings{Siddiqui_2023_CVPR,
    author    = {Siddiqui, Yawar and Porzi, Lorenzo and Bul\`o, Samuel Rota and M\"uller, Norman and Nie{\ss}ner, Matthias and Dai, Angela and Kontschieder, Peter},
    title     = {Panoptic Lifting for 3D Scene Understanding With Neural Fields},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2023},
    pages     = {9043-9052}
}