Paper | Supplement | Poster | Code | Dataset & Models
|
Our method demonstrates significant improvements in monocular depth estimation for challenging conditions. Top row: RGB images from various real-world datasets used for testing. Middle row: Depth predictions by the state-of-the-art monocular depth network Depth Anything. Bottom row: Results from the same Depth Anything network after fine-tuning using our method, which leverages images generated by conditional diffusion models. Note the enhanced performance across various challenging scenarios including adverse weather, low-light conditions, and non-Lambertian surfaces.
We present a novel approach designed to address the complexities posed by challenging, out-of-distribution data in the single-image depth estimation task. Starting with images that facilitate depth prediction due to the absence of unfavorable factors, we systematically generate new, user-defined scenes with a comprehensive set of challenges and associated depth information. This is achieved by leveraging cutting-edge conditioned diffusion models with depth-aware controls, known for their ability to synthesize high-quality image content from textual prompts while preserving the coherence of the 3D structure between generated and source imagery. Subsequent fine-tuning of any monocular depth network is carried out through a self-distillation protocol that takes into account images generated using our strategy and its own depth predictions on simple, unchallenging scenes. Experiments on benchmarks tailored for our purposes demonstrate the effectiveness and versatility of our proposal.
|
1 - Scene Generation with Diffusion Models
We start with images that are easy for depth estimation (e.g., clear daylight scenes). Using state-of-the-art text-to-image diffusion models with multi-modal controls (e.g., T2I-Adapter, ControlNet), we transform these into challenging scenarios while preserving the underlying 3D structure.

2 - Depth Estimation on Simple Scenes
We use a pre-trained monocular depth network (e.g., DPT, ZoeDepth, Depth Anything) to estimate depth for the original, unchallenging scenes. This provides us with reliable depth estimates for the easy scenarios.

3 - Self-Distillation Protocol
We then fine-tune the depth network using a self-distillation protocol: the generated challenging images serve as input, the depth estimates from the simple scenes act as pseudo ground truth, and a scale-and-shift-invariant loss supervises training (see the sketch below).
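To make steps 2 and 3 concrete, the sketch below shows a per-image least-squares scale-and-shift alignment followed by an L1 penalty, and a single self-distillation update in which a frozen copy of the depth network labels the easy image while the trainable copy is supervised on the corresponding generated challenging image. Function names, the exact loss formulation, and the training loop details are illustrative assumptions, not the precise recipe from the paper.

```python
import torch

def scale_and_shift_invariant_loss(pred, target, mask=None):
    """L1 loss after per-image least-squares scale/shift alignment of pred to target.

    pred, target: (B, H, W) depth maps; mask: optional (B, H, W) validity mask.
    Illustrative sketch of a scale-and-shift-invariant objective.
    """
    if mask is None:
        mask = torch.ones_like(target)
    B = pred.shape[0]
    pred, target, mask = pred.view(B, -1), target.view(B, -1), mask.view(B, -1)

    # Closed-form solution of min_{s,t} sum mask * (s * pred + t - target)^2
    a00 = (mask * pred * pred).sum(1)
    a01 = (mask * pred).sum(1)
    a11 = mask.sum(1)
    b0 = (mask * pred * target).sum(1)
    b1 = (mask * target).sum(1)
    det = (a00 * a11 - a01 * a01).clamp(min=1e-6)
    s = (a11 * b0 - a01 * b1) / det
    t = (a00 * b1 - a01 * b0) / det

    aligned = s.unsqueeze(1) * pred + t.unsqueeze(1)
    return (mask * (aligned - target).abs()).sum() / mask.sum().clamp(min=1.0)


def self_distillation_step(student, teacher, easy_img, hard_img, optimizer):
    """One fine-tuning step: pseudo label from the easy scene, loss on the hard one."""
    with torch.no_grad():
        pseudo_depth = teacher(easy_img)   # reliable prediction on the unchallenging image
    pred = student(hard_img)               # prediction on the diffusion-generated image
    loss = scale_and_shift_invariant_loss(pred, pseudo_depth)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Here `teacher` is a frozen copy of the pre-trained depth network and `student` is the copy being fine-tuned; because the generated image shares the source scene's 3D structure, the teacher's prediction on the easy image remains a valid (up-to-scale) target for the challenging one.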
Our approach generates datasets for various challenging conditions. Some examples are:
|
Our method demonstrates the ability to generate diverse driving scenes with challenging weather conditions. By leveraging diffusion models, we can create a wide range of adverse weather scenarios, allowing for comprehensive training and evaluation of depth estimation models in diverse environmental conditions.
|
This figure illustrates our process of transforming easy scenes with opaque objects into challenging ones with transparent and mirrored surfaces. Text-to-image diffusion models with depth-aware control make it possible to preserve the underlying 3D structure while altering the visual appearance, creating complex scenarios for depth estimation that are traditionally difficult to capture or annotate.
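For illustration, the snippet below sketches how an easy source image can be re-rendered under user-defined challenging conditions with a publicly available depth-conditioned ControlNet from the diffusers library. The chosen checkpoints, prompts, and pre-processing are assumptions made for this example, not necessarily the models or prompts used in our pipeline.

```python
import numpy as np
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline, UniPCMultistepScheduler
from transformers import pipeline

# 1) Estimate depth on the easy source image (any monocular depth model works here).
depth_estimator = pipeline("depth-estimation")            # generic depth pipeline from transformers
source = Image.open("easy_daylight_scene.png").convert("RGB")
depth = depth_estimator(source)["depth"]                  # single-channel PIL depth map

# ControlNet expects a 3-channel conditioning image.
depth = np.array(depth)[:, :, None].repeat(3, axis=2)
depth = Image.fromarray(depth)

# 2) Load a depth-conditioned ControlNet on top of Stable Diffusion.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-depth", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)

# 3) Re-render the same 3D layout under challenging, text-specified conditions.
prompts = [
    "the same street at night under heavy rain, wet reflective asphalt, fog",
    "the same room with glass tables and large mirrors on the walls",
]
for i, prompt in enumerate(prompts):
    image = pipe(prompt, image=depth, num_inference_steps=30).images[0]
    image.save(f"challenging_scene_{i}.png")
```

Because the depth map of the easy scene drives the generation, the synthesized adverse-weather or non-Lambertian image keeps the source geometry, so the depth estimated on the easy image can later serve as pseudo ground truth for it.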
Qualitative comparison: RGB, Depth Anything, Depth Anything (Ours)
Qualitative comparison: RGB, DPT, DPT (Ours)
Qualitative comparison: RGB, Depth Anything, Depth Anything (Ours)
Qualitative comparison: RGB, DPT, DPT (Ours), Ground Truth
@inproceedings{tosi2024diffusion,
title = {Diffusion Models for Monocular Depth Estimation: {Overcoming} Challenging Conditions},
author = {Tosi, Fabio and {Zama Ramirez}, Pierluigi and Poggi, Matteo},
booktitle = {European Conference on Computer Vision ({ECCV})},
year = {2024}
}