Diffusion Models for Monocular Depth Estimation: Overcoming Challenging Conditions

ECCV 2024


Fabio Tosi
Pierluigi Zama Ramirez
Matteo Poggi

University of Bologna
Paper | Supplement | Poster | Code | Dataset & Models

Our method demonstrates significant improvements in monocular depth estimation for challenging conditions. Top row: RGB images from various real-world datasets used for testing. Middle row: Depth predictions by the state-of-the-art monocular depth network Depth Anything. Bottom row: Results from the same Depth Anything network after fine-tuning using our method, which leverages images generated by conditional diffusion models. Note the enhanced performance across various challenging scenarios including adverse weather, low-light conditions, and non-Lambertian surfaces.




Abstract

We present a novel approach designed to address the complexities posed by challenging, out-of-distribution data in the single-image depth estimation task. Starting with images that facilitate depth prediction due to the absence of unfavorable factors, we systematically generate new, user-defined scenes with a comprehensive set of challenges and associated depth information. This is achieved by leveraging cutting-edge conditioned diffusion models with depth-aware controls, known for their ability to synthesize high-quality image content from textual prompts while preserving the coherence of the 3D structure between generated and source imagery. Subsequent fine-tuning of any monocular depth network is carried out through a self-distillation protocol that takes into account images generated using our strategy and its own depth predictions on simple, unchallenging scenes. Experiments on benchmarks tailored for our purposes demonstrate the effectiveness and versatility of our proposal.



Method


1 - Scene Generation with Diffusion Models

We start with images that are easy for depth estimation (e.g., clear daylight scenes). Using state-of-the-art text-to-image diffusion models with multi-modal controls (e.g., T2I-Adapter, ControlNet), we transform these into challenging scenarios while preserving the underlying 3D structure.
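Below is a minimal sketch of this step using the Hugging Face diffusers library with a publicly available depth-conditioned ControlNet. It is not the exact pipeline used in the paper; the checkpoint names, prompt, and file paths are illustrative placeholders.

import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

# Depth-conditioned ControlNet paired with Stable Diffusion (illustrative checkpoints).
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/control_v11f1p_sd15_depth", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

# The control signal is the depth map predicted on the easy image (see step 2),
# rendered as a 3-channel image.
depth_map = Image.open("easy_scene_depth.png").convert("RGB")

# The textual prompt injects the challenging condition, while the depth control
# keeps the generated scene geometrically consistent with the source image.
challenging = pipe(
    prompt="the same street at night, heavy rain, wet reflective asphalt",
    image=depth_map,
    num_inference_steps=30,
    guidance_scale=7.5,
).images[0]
challenging.save("challenging_scene.png")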

2 - Depth Estimation on Simple Scenes

We use a pre-trained monocular depth network (e.g., DPT, ZoeDepth, Depth Anything) to estimate depth for the original, unchallenging scenes. This provides us with reliable depth estimates for the easy scenarios.
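A minimal sketch of this step with the Hugging Face transformers depth-estimation pipeline; the checkpoint name and file paths are illustrative, and any relative-depth model (DPT, ZoeDepth, Depth Anything) can be plugged in the same way.

from PIL import Image
from transformers import pipeline

# Off-the-shelf monocular depth model (illustrative checkpoint).
depth_estimator = pipeline("depth-estimation", model="LiheYoung/depth-anything-small-hf")

easy_image = Image.open("easy_scene.png").convert("RGB")
out = depth_estimator(easy_image)

pseudo_label = out["predicted_depth"]      # raw prediction, later used as pseudo ground truth
out["depth"].save("easy_scene_depth.png")  # visualization, reused as the diffusion control image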

3 - Self-Distillation Protocol

We then fine-tune the depth network using a self-distillation protocol. This process involves using the generated challenging images as input, employing the depth estimates from the simple scenes as pseudo ground truth, and applying a scale-and-shift-invariant loss.
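The sketch below illustrates one possible implementation of a scale-and-shift-invariant loss (in the spirit of MiDaS) and of a single self-distillation step; names such as student, optimizer, challenging_batch, and pseudo_labels are assumptions, not part of the released code.

import torch

def ssi_loss(pred, target, mask=None, eps=1e-6):
    # Align pred to target with a per-image least-squares scale and shift, then L1.
    # pred and target are assumed to share the same shape, e.g. [B, H, W].
    if mask is None:
        mask = torch.ones_like(target, dtype=torch.bool)
    p, t, m = pred.flatten(1), target.flatten(1), mask.flatten(1).float()
    n = m.sum(1).clamp(min=1)
    mean_p = (p * m).sum(1) / n
    mean_t = (t * m).sum(1) / n
    var_p = ((p - mean_p[:, None]) ** 2 * m).sum(1) / n
    cov = ((p - mean_p[:, None]) * (t - mean_t[:, None]) * m).sum(1) / n
    s = cov / (var_p + eps)            # closed-form scale
    h = mean_t - s * mean_p            # closed-form shift
    aligned = s[:, None] * p + h[:, None]
    return (torch.abs(aligned - t) * m).sum() / m.sum().clamp(min=1)

# One fine-tuning step: the generated challenging image is the input, the depth
# predicted on its easy counterpart is the pseudo ground truth.
# pred = student(challenging_batch)
# loss = ssi_loss(pred, pseudo_labels)
# loss.backward(); optimizer.step(); optimizer.zero_grad()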




Generated Dataset

Our approach generates datasets for various challenging conditions (a sketch of possible conditioning prompts follows the list). Some examples are:

  • Adverse weather (rain, snow, fog)
  • Low-light conditions
  • Transparent and Mirror (ToM) surfaces
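The snippet below sketches how such conditions could be expressed as conditioning prompts; these prompts are hypothetical examples, not the ones used to build the released dataset.

# Hypothetical prompt templates, one per challenging condition. Each prompt is
# paired with the depth map of a source scene and fed to the depth-conditioned
# diffusion pipeline to generate one training sample.
CONDITION_PROMPTS = {
    "rain":  "the same scene during heavy rain, wet ground, overcast sky",
    "snow":  "the same scene covered in snow, falling snowflakes",
    "fog":   "the same scene in dense fog, low visibility",
    "night": "the same scene at night, dim street lights",
    "tom":   "the same objects made of clear glass and polished mirror surfaces",
}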


Arbitrary Weather Conditions in Driving Scenes

Our method demonstrates the ability to generate diverse driving scenes with challenging weather conditions. By leveraging diffusion models, we can create a wide range of adverse weather scenarios, allowing for comprehensive training and evaluation of depth estimation models in diverse environmental conditions.


Transforming Opaque Materials into ToM Surfaces

This figure illustrates our process of transforming easy scenes with opaque objects into challenging ones with transparent and mirrored surfaces. Text-to-image diffusion models with depth-aware control make it possible to preserve the underlying 3D structure while altering the visual appearance, creating complex scenarios for depth estimation that are traditionally difficult to capture or annotate.



Qualitative Results

Performance on the Booster Dataset

Columns: RGB | Depth Anything | Depth Anything (Ours)

Performance on the ClearGrasp Dataset

Columns: RGB | DPT | DPT (Ours)

Performance on Web Images

Columns: RGB | Depth Anything | Depth Anything (Ours)

Performance on the DrivingStereo Dataset

Columns: RGB | DPT | DPT (Ours) | Ground Truth

Quantitative Results

Performance on the nuScenes Dataset


Cross-Dataset Generalization


Performance on Transparent and Mirror (ToM) Objects



Citation

@inproceedings{tosi2024diffusion,
  title     = {Diffusion Models for Monocular Depth Estimation: {Overcoming} Challenging Conditions},
  author    = {Tosi, Fabio and {Zama Ramirez}, Pierluigi and Poggi, Matteo},
  booktitle = {European Conference on Computer Vision ({ECCV})},
  year      = {2024}
}