This is a repost from openrobotics.zulipchat.com that I made earlier; it was suggested I post here, so here goes…
So, over the past couple days we’ve been working on getting Depth Anything 3 (DA3 - the new monocular depth estimation model from ByteDance) running with ROS2. For those unfamiliar, Depth Anything 3 is basically a neural network that can estimate depth from a single camera image - no stereo rig or LiDAR needed. It’s pretty impressive compared to older methods like MiDaS.
Depth Anything 3 paper: Depth Anything 3 (arXiv)
Official DA3 repo: GitHub - ByteDance-Seed/Depth-Anything-3: Depth Anything 3
Our GitHub DA3 ROS2 Wrapper Repo: GitHub - GerdsenAI/GerdsenAI-Depth-Anything-3-ROS2-Wrapper: ROS2 wrapper for Depth Anything 3 (https://github.com/ByteDance-Seed/Depth-Anything-3)
Here’s what the system looks like running on our Jetson:
You can see three terminals:
- Left: USB camera node publishing at 640x480 @ 30 FPS
- Middle: Depth estimation running with the colored depth output
- Right: Depth viewer displaying the results
The depth visualization uses a turbo colormap (blue = close, red/orange = far). The quality is honestly better than we expected for monocular depth.
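For anyone curious what that colorization step looks like, here's a minimal sketch of mapping a raw depth map through OpenCV's turbo colormap. This is illustrative rather than the wrapper's actual code; the normalization details are an assumption.

```python
# Minimal sketch: colorize a raw depth/disparity map with the turbo colormap.
# Not the wrapper's actual code -- normalization details are an assumption.
import cv2
import numpy as np

def colorize_depth(depth: np.ndarray) -> np.ndarray:
    """Normalize a float depth map to 0-255 and apply OpenCV's turbo colormap."""
    d = depth.astype(np.float32)
    d = (d - d.min()) / (d.max() - d.min() + 1e-6)    # scale to [0, 1]
    d8 = (d * 255.0).astype(np.uint8)                 # 8-bit input for applyColorMap
    return cv2.applyColorMap(d8, cv2.COLORMAP_TURBO)  # BGR image for display
```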
Platform: NVIDIA Jetson AGX Orin 64GB (Syslogic A4AGX64 industrial variant)
- OS: Ubuntu 22.04 + JetPack 6.2.1
- CUDA: 12.6
- ROS2: Humble
Camera: Anker PowerConf C200 2K USB webcam
- Running at 640x480 resolution
- 30 FPS output (though depth processing can’t keep up; feel free to help!)
Software:
- PyTorch 2.8.0 (Jetson-optimized from nvidia-ai-lab)
- Depth Anything 3 SMALL model (25M parameters)
- Standard v4l2_camera for USB input
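To make the node wiring concrete, here's a rough ROS2 launch-file sketch of the pipeline described above. The v4l2_camera package and its image_size parameter are real; the package and executable names for the depth and viewer nodes are placeholders, so check the wrapper repo for the actual names and topics.

```python
# Rough launch sketch of the camera -> depth -> viewer pipeline.
# Package/executable names for the wrapper nodes are placeholders,
# not the repo's actual names.
from launch import LaunchDescription
from launch_ros.actions import Node

def generate_launch_description():
    return LaunchDescription([
        # USB webcam via the standard v4l2_camera driver
        Node(
            package='v4l2_camera',
            executable='v4l2_camera_node',
            parameters=[{'image_size': [640, 480]}],
        ),
        # Depth Anything 3 inference node (placeholder names)
        Node(
            package='depth_anything_3_ros2',
            executable='depth_node',
        ),
        # Viewer for the colorized depth output (placeholder names)
        Node(
            package='depth_anything_3_ros2',
            executable='depth_viewer',
        ),
    ])
```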
Current Performance (This is Where We Need Help)
Here’s what we’re seeing:
Inference Performance:
- FPS: 6.35 (way slower than we hoped)
- Inference time: 153ms per frame
- GPU utilization: 35-69%
- RAM usage: ~6 GB (out of 64 GB available)
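On measuring numbers like these: CUDA kernels launch asynchronously, so naive Python timing can under-report inference time. Below is a small, generic timing sketch (not the wrapper's actual benchmarking code) that synchronizes before reading the clock.

```python
# Generic GPU timing sketch -- not the wrapper's benchmarking code.
# torch.cuda.synchronize() matters: without it the asynchronous kernel
# launches make per-frame times look better than they really are.
import time
import torch

def mean_inference_time(model, x, warmup=10, iters=50):
    with torch.inference_mode():
        for _ in range(warmup):          # warm up caches / autotuning
            model(x)
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(iters):
            model(x)
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters   # seconds per frame
```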
Is PyTorch the problem? We’re running standard PyTorch with CUDA. Would TensorRT conversion give us a significant speedup? Has anyone done DA3 → TensorRT on Jetson?
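We haven't tried TensorRT yet; if someone has, we'd love to hear about it. As a starting point, the usual route is PyTorch → ONNX → trtexec, along the lines of the sketch below. The DA3 model-loading helper and the input resolution are placeholders, since the exact API and expected input size come from the official repo.

```python
# Sketch of the usual PyTorch -> ONNX -> TensorRT route. The model-loading
# helper and the input shape are placeholders; adapt them to the DA3 repo's API.
import torch

def load_da3_small() -> torch.nn.Module:
    raise NotImplementedError("load the DA3 SMALL checkpoint per the official repo")

model = load_da3_small().eval().cuda()
dummy = torch.randn(1, 3, 480, 640, device="cuda")   # placeholder input shape

torch.onnx.export(
    model,
    dummy,
    "da3_small.onnx",
    input_names=["image"],
    output_names=["depth"],
    opset_version=17,
)

# Then build an FP16 engine on the Jetson with trtexec (bundled with JetPack):
#   trtexec --onnx=da3_small.onnx --saveEngine=da3_small_fp16.engine --fp16
```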
Memory bandwidth? Could we be hitting memory bandwidth limits moving tensors around?
Is the model just too big for real-time? The SMALL model is 25M params. Maybe we need to quantize to FP16 or INT8?
FP16 precision - The Ampere GPU supports FP16 tensor cores. Depth estimation might not need FP32 precision.
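If anyone has tried this on DA3 specifically, we'd like to know whether half precision hurts the depth quality. The low-effort experiment would be something like the snippet below, where model and image_tensor stand in for whatever the inference node already holds.

```python
# Quick FP16 experiment via autocast -- `model` and `image_tensor` stand in for
# whatever the inference node already has; the accuracy impact needs checking.
import torch

with torch.inference_mode(), torch.autocast(device_type="cuda", dtype=torch.float16):
    depth = model(image_tensor)
```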
Optimize the preprocessing - Right now we’re doing image normalization and resizing in Python/PyTorch. Could we push this to GPU kernels?
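Something like the sketch below is what we have in mind: do the BGR→RGB conversion, resize, and normalization on the GPU so the CPU only hands over the raw frame. The 518×518 input size and the ImageNet mean/std are assumptions carried over from earlier Depth Anything releases; DA3's actual preprocessing may differ.

```python
# GPU-side preprocessing sketch. Input size and mean/std are assumptions
# carried over from earlier Depth Anything releases -- verify against DA3.
import numpy as np
import torch
import torch.nn.functional as F

MEAN = torch.tensor([0.485, 0.456, 0.406], device="cuda").view(1, 3, 1, 1)
STD = torch.tensor([0.229, 0.224, 0.225], device="cuda").view(1, 3, 1, 1)

def preprocess_gpu(bgr: np.ndarray, size=(518, 518)) -> torch.Tensor:
    """HWC uint8 BGR frame (e.g. from cv_bridge) -> normalized NCHW float tensor on GPU."""
    t = torch.from_numpy(bgr).cuda(non_blocking=True)                     # copy once, early
    t = t[..., [2, 1, 0]].permute(2, 0, 1).unsqueeze(0).float() / 255.0   # BGR->RGB, NCHW
    t = F.interpolate(t, size=size, mode="bilinear", align_corners=False)
    return (t - MEAN) / STD
```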
Has anyone done any of this successfully? Especially interested if anyone’s gotten DA3 or similar transformers running fast on Jetson.
The paper claims real-time performance but they’re probably testing on desktop GPUs. Getting this fast on embedded hardware is the challenge.
Still, we got it working, which is cool, but 6 FPS is pretty far from real-time for most robotics applications. We’re probably doing something obviously wrong or inefficient - this is our first attempt at deploying a transformer model on Jetson.
Contact: GerdsenAI on GitHub (https://github.com/GerdsenAI)
License: MIT
Feel free to contribute!

