Depth Anything 3 ROS 2 Wrapper Development


This is a repost of something I shared earlier on openrobotics.zulipchat.com; it was suggested I post here, so here goes…

So, over the past couple of days we’ve been working on getting Depth Anything 3 (DA3 - the new monocular depth estimation model from ByteDance) running with ROS2. For those unfamiliar, Depth Anything 3 is basically a neural network that can estimate depth from a single camera image - no stereo rig or LiDAR needed. It’s pretty impressive compared to older methods like MiDaS.

Depth Anything 3 paper: https://arxiv.org/abs/2511.10647
Official DA3 repo: https://github.com/ByteDance-Seed/Depth-Anything-3
Our GitHub DA3 ROS2 Wrapper repo: https://github.com/GerdsenAI/GerdsenAI-Depth-Anything-3-ROS2-Wrapper

Here’s what the system looks like running on our Jetson:

You can see three terminals:

  • Left: USB camera node publishing at 640x480 @ 30 FPS
  • Middle: Depth estimation running with the colored depth output
  • Right: Depth viewer displaying the results

The depth visualization uses a turbo colormap (blue = close, red/orange = far). The quality is honestly better than we expected for monocular depth.
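
For anyone reproducing the viewer, the colorization is essentially a normalize-then-colormap step in OpenCV. A minimal sketch (the input array is just a placeholder for the model output; whether blue lands near or far depends on how you normalize):

```python
import cv2
import numpy as np

# Placeholder for the model's HxW relative-depth output
depth = np.random.rand(480, 640).astype(np.float32)

# Scale to 0-255, then apply OpenCV's turbo colormap for display
depth_u8 = cv2.normalize(depth, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
colored = cv2.applyColorMap(depth_u8, cv2.COLORMAP_TURBO)
cv2.imwrite("depth_turbo.png", colored)
```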

Platform: NVIDIA Jetson AGX Orin 64GB (Syslogic A4AGX64 industrial variant)

  • OS: Ubuntu 22.04 + JetPack 6.2.1
  • CUDA: 12.6
  • ROS2: Humble

Camera: Anker PowerConf C200 2K USB webcam

  • Running at 640x480 resolution
  • 30 FPS output (though depth processing can’t keep up; feel free to help! :slight_smile:)

Software:

  • PyTorch 2.8.0 (Jetson-optimized from nvidia-ai-lab)
  • Depth Anything 3 SMALL model (25M parameters)
  • Standard v4l2_camera for USB input

Current Performance (This is Where We Need Help)

Here’s what we’re seeing:

Inference Performance:

  • FPS: 6.35 (way slower than we hoped)
  • Inference time: 153ms per frame
  • GPU utilization: 35-69%
  • RAM usage: ~6 GB (out of 64 GB available)

  • Is PyTorch the problem? We’re running standard PyTorch with CUDA. Would TensorRT conversion give us a significant speedup? Has anyone done DA3 → TensorRT on Jetson? (There’s an export sketch further down.)
  • Memory bandwidth? Could we be hitting memory bandwidth limits moving tensors around?
  • Is the model just too big for real-time? The SMALL model is 25M params. Maybe we need to quantize to FP16 or INT8?
  • FP16 precision: the Ampere GPU supports FP16 tensor cores, and depth estimation might not need FP32 precision (see the autocast sketch right after this list).
  • Optimize the preprocessing: right now we’re doing image normalization and resizing in Python/PyTorch. Could we push this to GPU kernels?
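
On the FP16 point, the cheapest experiment before a full TensorRT conversion is autocast inference. A minimal sketch, assuming a CUDA-capable PyTorch install; the tiny Sequential is a stand-in for the actual DA3 model:

```python
import torch

# Stand-in for the DA3 model; swap in the real network here.
model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 16, 3, padding=1),
    torch.nn.Conv2d(16, 1, 3, padding=1),
).cuda().eval()

frame = torch.rand(1, 3, 480, 640, device="cuda")  # placeholder camera frame

# Autocast runs convs/matmuls in FP16 on the tensor cores while keeping
# numerically sensitive ops in FP32; no model surgery required.
with torch.inference_mode(), torch.autocast(device_type="cuda", dtype=torch.float16):
    depth = model(frame)
torch.cuda.synchronize()  # flush the GPU queue so timing stays honest
print(depth.shape, depth.dtype)
```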

Has anyone done any of this successfully? Especially interested if anyone’s gotten DA3 or similar transformers running fast on Jetson.
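
In case it helps the discussion, the usual Jetson route is PyTorch → ONNX → trtexec. A sketch of the export step under the same stand-in-model assumption (the real DA3 forward may need export tweaks):

```python
import torch

# Stand-in module; replace with the DA3 forward you actually deploy.
model = torch.nn.Conv2d(3, 1, 3, padding=1).eval()
dummy = torch.rand(1, 3, 480, 640)  # fixed input shape keeps the engine simple

torch.onnx.export(
    model, dummy, "da3.onnx",
    input_names=["image"], output_names=["depth"],
    opset_version=17,
)

# Then, on the Jetson, build an FP16 engine with trtexec (shell):
#   trtexec --onnx=da3.onnx --fp16 --saveEngine=da3.engine
```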

The paper claims real-time performance but they’re probably testing on desktop GPUs. Getting this fast on embedded hardware is the challenge.

Still, we got it working, which is cool, but 6 FPS is pretty far from real-time for most robotics applications. We’re probably doing something obviously wrong or inefficient; this is our first attempt at deploying a transformer model on Jetson.

Contact: https://github.com/GerdsenAI
License: MIT

Feel free to contribute! :slight_smile:


Just following up… has anyone had a chance to test the wrapper, or integrate it into a project yet?

Hi there,

thank you very much for your contribution! I am currently using DA3 and your wrapper for my project; I am trying to apply it to obstacle detection and collision avoidance for my robot. I am running it on an RTX 5060 and the results are very promising.

Though I am still wondering: how do you do the scale alignment? Could you share some experience with that? I am currently using piecewise scale alignment, i.e. I fit a linear function d_gt = a * d_mde + b for each depth interval. What is your opinion on that?
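
For reference, a minimal version of my per-interval fit looks like this (numpy only; the data and interval edges are placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder pairs of model depth (d_mde) and ground-truth depth (d_gt)
d_mde = rng.uniform(0.5, 10.0, 1000)
d_gt = 0.9 * d_mde + 0.2 + rng.normal(0.0, 0.05, d_mde.shape)

# Piecewise alignment: fit d_gt ≈ a * d_mde + b on each depth interval
edges = [0.0, 2.0, 5.0, 10.0]
for lo, hi in zip(edges[:-1], edges[1:]):
    mask = (d_mde >= lo) & (d_mde < hi)
    a, b = np.polyfit(d_mde[mask], d_gt[mask], deg=1)  # least-squares line
    print(f"[{lo:.1f}, {hi:.1f}): a={a:.3f}, b={b:.3f}")
```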

Looking forward to your reply, thank you :slight_smile:

Hi,

thank you for the repository, I have tested it out on a Jetson Orin Nano (8GB RAM) and it works. But I’m trying to increase the performance and I’m having trouble with the optimisation script. Currently the forward pass takes around 0.12 s. I tried the scripts you have provided and had many problems, some with loading the models and some with optimising. Do you have any tips? I have been able to quantise the model, but any optimisation using TensorRT fails and I couldn’t get it to work.
Otherwise the package is great and very useful. Thanks :slight_smile:

Sorry, just getting back to this @AljazJus . I have TensorRT working, it’s MUCH quicker. Will be making a post shortly.

Great news for everyone following this project! We’ve successfully implemented TensorRT 10.3 acceleration, and the results are significant:

Performance Improvement

Metric            Before (PyTorch)   After (TensorRT)   Improvement
FPS               6.35               43+                6.8x faster
Inference time    153 ms             ~23 ms             6.6x faster
GPU utilization   35-69%             85%+               More efficient

Test Platform: Jetson Orin NX 16GB (Seeed reComputer J4012), JetPack 6.2, TensorRT 10.3

Key Technical Achievement: Host-Container Split Architecture

We solved a significant Jetson deployment challenge - TensorRT Python bindings are broken in current Jetson container images (dusty-nv/jetson-containers#714). Our solution:

HOST (JetPack 6.x)
+--------------------------------------------------+
|  TRT Inference Service (trt_inference_shm.py)    |
|  - TensorRT 10.3, ~15ms inference                |
+--------------------------------------------------+
                    ↑
                    | /dev/shm/da3 (shared memory, ~8ms IPC)
                    ↓
+--------------------------------------------------+
|  Docker Container (ROS2 Humble)                  |
|  - Camera drivers, depth publisher               |
+--------------------------------------------------+

This architecture enables real-time TensorRT inference while keeping ROS2 in a clean container environment.
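
For anyone curious, the shared-memory hop is conceptually just a named segment that both sides map. A minimal stdlib sketch (the segment name is illustrative; the real service also handles framing and synchronization):

```python
import numpy as np
from multiprocessing import shared_memory

shape, dtype = (480, 640), np.float32  # one depth frame

# Writer side (host inference service). On Linux the segment appears
# under /dev/shm/, so a container that mounts /dev/shm can attach to it.
shm = shared_memory.SharedMemory(create=True, name="da3_depth",
                                 size=int(np.prod(shape)) * 4)
np.ndarray(shape, dtype=dtype, buffer=shm.buf)[:] = np.random.rand(*shape)

# Reader side (ROS2 node in the container): attach by name and copy out.
shm_r = shared_memory.SharedMemory(name="da3_depth")
frame = np.ndarray(shape, dtype=dtype, buffer=shm_r.buf).copy()
print(frame.mean())

shm_r.close()
shm.close()
shm.unlink()
```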

Call for Contributors

We’re looking for help with:

  • Test coverage for SharedMemory/TensorRT code paths

  • Validation on other Jetson platforms (AGX Orin, Orin Nano)

  • Point cloud generation (currently depth-only; a back-projection sketch below could be a starting point)
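
If someone wants to pick up the point-cloud item, the core is back-projecting each pixel through the pinhole intrinsics from camera_info. A minimal sketch with placeholder intrinsics:

```python
import numpy as np

# Pinhole back-projection: depth image -> N x 3 point cloud (camera frame).
# Intrinsics below are placeholders; use the values from camera_info.
fx, fy, cx, cy = 500.0, 500.0, 320.0, 240.0
depth = np.random.uniform(0.5, 5.0, (480, 640)).astype(np.float32)  # meters

v, u = np.indices(depth.shape)      # v = row index, u = column index
z = depth
x = (u - cx) * z / fx
y = (v - cy) * z / fy
points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
print(points.shape)  # (307200, 3)
```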

Repo: https://github.com/GerdsenAI/GerdsenAI-Depth-Anything-3-ROS2-Wrapper
License: MIT

@Phocidae @AljazJus - the TensorRT optimization should help significantly with your projects! Let me know if you run into any issues.