I wanted to ask the community for some thoughts on low-latency teleoperation of a robot. Here’s my scenario: consider a robot with 3 video cameras. The cameras stream at 30 FPS in HD resolution. The video streams need to be rectified and sent as H264 or H265 via UDP to a control station with as little latency as possible.
So far I have used gscam2 and image_proc’s rectify node to get a rectified image. I use raw image topics to save some overhead between the nodes. A third node uses CVBridge, OpenCV and GStreamer to encode the image and send it via UDP to a receiving GStreamer pipeline on the control station. I noticed a latency of at least 300ms (glass to glass) in my setup and was wondering what could be conceptually improved:
Is it in general a good idea to process such an image stream with ROS2 in this scenario? Alternatively, I could also send it directly (unrectified, of course) with GStreamer and save some overhead.
What could be improved on the ROS2 side? My sending node is written in C++ and I use hardware acceleration for the encoding.
Are there any examples of low-latency ROS2 teleoperation scenarios? All my systems are in the same network.
You forgot to mention the most important thing: how is the camera connected? And what is the network layer (I hope at least 1 Gbps Ethernet).
For lowest latency, you’d use SDI protocol. We tested many last year (RTP, RTSP, MPEG-DASH, probably more), but only SDI could provide super low latencies. The problem is that SDI is proprietary and works mostly with Windows.
If you need to go the ROS way, I’d concentrate on the whole path of the image. Does the camera driver load it directly into GPU memory? Or could it? Does the rectify node work on the GPU? (spoiler alert: image_proc does not). Can the encoder take the image directly from GPU memory? In extreme cases: would it be possible to send the UDP packets directly from GPU memory to the network card via some DMA? Also, do you have a good encoder config for real-time streaming? That can make a big difference. The same goes for the decoder.
Also, if the image is not read all at once from the camera, you could try rectifying it line by line as they come from the driver.
In any case, I fear that the publicly available ROS packages will not help you with this too much.
The network layer is 1 Gbps, I can confirm that. As for the camera connection, so far I have tested it with cheap USB webcams (the better ones are still on the way). These will definitely also be responsible for a certain amount of latency, but I’m more curious about what I can already optimize on the software side while I’m waiting for the hardware. These hints are very helpful for further queries, thanks for that.
I’d like to go the ROS way because, once the images are within the ROS system, I’d love to extend the robot from purely teleoperated to partially autonomous, e.g. by using the camera streams for navigation or obstacle perception.
How to get the video to the controller (usually a web browser) with low latency and accounting for bad network (bandwidth fluctuation, packet loss) and ideally without requiring a VPN.
How to bake safety into the system such that lag in receiving the video will automatically prevent control (which would be based on outdated situation awareness).
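One way to sketch the safety point above in code, under stated assumptions: each operator command echoes back the capture timestamp (robot clock) of the newest frame the operator had on screen, and the robot zeroes any command whose frame is older than a threshold. The names `Command`, `StaleVideoGate`, `seen_frame_stamp` and the 0.3 s threshold are hypothetical, not taken from any existing teleop stack.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class Command:
    linear: float
    angular: float
    # Robot-clock capture time of the newest frame the operator had seen
    # when issuing this command (hypothetical field, echoed by the client).
    seen_frame_stamp: float


class StaleVideoGate:
    """Replaces commands that are based on outdated video with a stop command."""

    def __init__(self, max_age_s: float = 0.3,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_age_s = max_age_s  # assumed threshold, tune per robot and speed
        self._clock = clock         # must be the same clock that stamps frames

    def filter(self, cmd: Command) -> Command:
        age = self._clock() - cmd.seen_frame_stamp
        if age > self.max_age_s:
            # Operator was looking at a stale picture: command a stop instead.
            return Command(0.0, 0.0, cmd.seen_frame_stamp)
        return cmd


if __name__ == "__main__":
    gate = StaleVideoGate(max_age_s=0.3)
    now = time.monotonic()
    print(gate.filter(Command(1.0, 0.2, seen_frame_stamp=now)))        # passes
    print(gate.filter(Command(1.0, 0.2, seen_frame_stamp=now - 1.0)))  # stopped
```

Because both timestamps come from the robot's clock, no clock synchronization with the operator is needed. A real system would also need a watchdog for the case where commands stop arriving altogether, which this sketch does not cover.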
The issue in your setup is that it (presumably) doesn’t do congestion control (monitor available network bandwidth and reduce bitrate when necessary to avoid lag) nor packet loss mitigation.
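To illustrate what "reduce bitrate when necessary" can look like, here is a deliberately simplified additive-increase/multiplicative-decrease controller driven by the packet-loss fraction reported by the receiver. This is not how WebRTC's congestion control works internally (that also uses delay measurements); the class name, thresholds and step sizes are all assumptions for illustration.

```python
class AimdBitrateController:
    """Toy AIMD controller: back off on loss, probe upward otherwise."""

    def __init__(self, start_kbps: int = 4000, min_kbps: int = 500,
                 max_kbps: int = 8000, step_kbps: int = 200,
                 backoff: float = 0.5, loss_threshold: float = 0.02) -> None:
        self.bitrate_kbps = start_kbps
        self.min_kbps = min_kbps
        self.max_kbps = max_kbps
        self.step_kbps = step_kbps            # additive increase per report
        self.backoff = backoff                # multiplicative decrease factor
        self.loss_threshold = loss_threshold  # loss fraction treated as congestion

    def update(self, loss_fraction: float) -> int:
        """Feed one receiver report; returns the new encoder target in kbit/s."""
        if loss_fraction > self.loss_threshold:
            reduced = round(self.bitrate_kbps * self.backoff)
            self.bitrate_kbps = max(self.min_kbps, reduced)
        else:
            self.bitrate_kbps = min(self.max_kbps,
                                    self.bitrate_kbps + self.step_kbps)
        return self.bitrate_kbps


if __name__ == "__main__":
    ctrl = AimdBitrateController()
    for loss in (0.0, 0.0, 0.10, 0.0):
        print(loss, ctrl.update(loss))
```

The returned value would be pushed to the encoder's bitrate setting after each report; how that is done depends on the encoder and is left out here.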
I think WebRTC should largely solve the problem of getting the stream to the controller, especially to a web-based controller app. Upstream of it, would you still recommend going through ROS2? Where would you put the image rectification step with WebRTC?
200 ms was from USB camera to browser. But going through ROS should only be marginally more given that ROS image topics are just full image frames, i.e., producing these from the cameras does not involve processing of any kind that would require a jitterbuffer or similar that would add latency.
May I ask what resolution? Did you use any hardware acceleration? I saw that WebRTC can only decode H264 but not H265. Did you see any impact here? Can you give any recommendations for the onboard processing before handing it over to WebRTC? I need to at least rectify the image and would love to go with image_proc, as it’s available off the shelf. I haven’t measured how much latency this adds, as I’m still waiting for the final hardware.
Resolution: doesn’t affect latency, but we’ve seen users stream four 1920x1080 cameras at the same time.
Hardware acceleration: yes, our implementation supports it (Nvidia Jetson/Orin, Nvidia GPUs, RockChip, and Intel VA-API) and I recommend using it, but it doesn’t affect latency either.
H264 vs H265: WebRTC supports either (plus VP8, VP9, and AV1), but H264 is better for this application than H265. The latter gives you better image quality at the same bitrate, but costs more to encode and decode, I think. AV1 would be best, but isn’t well supported by browsers yet.
Rectifying: using image_proc in ROS should work fine. Since, again, this is a per-image operation, it shouldn’t add much latency (my guess would be less than 20ms).
But you can try all this out yourself if you want. Our remote-teleop capability only takes a few minutes to set up and it automatically finds available ROS topics on your robot. DM me if you need help.
Hey, we’ve spent a lot of time chasing exactly this problem at Adamo (full disclosure: we ship a teleop stack, adamohq.com), so here are a few things worth trying:
The 300ms is almost certainly not in the encoder. On HD@30 with a hardware encoder you should be looking at <10ms encode. The usual suspects in that 300ms budget are, in rough order:
CVBridge + raw image topics. Even on raw, you’re doing at least one copy per node hop, plus DDS serialization overhead. For three 1080p@30 streams that’s a lot of memory bandwidth and scheduler latency. If you stay in ROS2, switch to intra-process composition and put gscam2 → rectify → encoder in a single component container with use_intra_process_comms=true. That gets you zero-copy Image messages between nodes in the same process. If you can drop to a single process entirely, even better — we use iceoryx2 for true zero-copy IPC between capture and encode, which avoids DDS entirely.
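To put a rough number on the memory-bandwidth point above, here is a back-of-the-envelope calculation. It assumes 3 bytes per pixel (e.g. an 8-bit RGB/BGR encoding); the thread does not state the actual encoding, so treat that as an assumption.

```python
def raw_stream_bytes_per_second(width: int, height: int, bytes_per_pixel: int,
                                fps: int, streams: int = 1) -> int:
    """Bytes per second needed to move uncompressed frames once."""
    return width * height * bytes_per_pixel * fps * streams


if __name__ == "__main__":
    # Assumed: 1920x1080, 3 bytes per pixel, 30 FPS, 3 cameras.
    total = raw_stream_bytes_per_second(1920, 1080, 3, 30, streams=3)
    print(f"{total / 1e6:.1f} MB/s per copy")  # prints "559.9 MB/s per copy"
```

Every additional copy between nodes moves that amount again, which is why removing copies along the capture → rectify → encode path matters.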
Rectification on CPU. image_proc/rectify is usually CPU-bound and that alone can add 20–50ms per frame at HD. If you’re on Jetson, do rectification on the GPU (NPP remap or a small CUDA kernel). If you’re on a discrete GPU, you can rectify inside the NVENC pipeline using NPP and never round-trip to system memory.
The encoder itself. Make sure you’re set up for sub-frame latency:
tune=zerolatency, preset=ultrafast (x264) or equivalent NVENC low-latency preset
No B-frames (bframes=0)
Infinite GOP + on-demand keyframes (request a keyframe when the receiver reports loss, don’t blast one every 2s)
Slice-based encoding so you can emit partial frames
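As a rough illustration of the settings listed above, the helper below assembles a GStreamer pipeline description using the software x264enc element as a stand-in (a hardware encoder would use its own element and property names). The helper name, the test source, the bitrate, the large key-int-max value used to approximate an "infinite" GOP, and the host/port are all assumptions. Requesting keyframes on demand is not shown.

```python
def build_sender_pipeline(host: str, port: int, bitrate_kbps: int = 4000,
                          source: str = "videotestsrc is-live=true") -> str:
    """Return a gst-launch style pipeline description for a low-latency sender."""
    encoder = " ".join([
        "x264enc",
        "tune=zerolatency",        # disables lookahead and frame buffering
        "speed-preset=ultrafast",  # cheapest preset, lowest encode time
        "bframes=0",               # no B-frames, so no reordering delay
        "key-int-max=100000",      # assumed large value: very rare periodic keyframes
        "sliced-threads=true",     # slice-based threading
        f"bitrate={bitrate_kbps}", # x264enc takes kbit/s
    ])
    stages = [
        source,
        "videoconvert",
        encoder,
        "rtph264pay config-interval=1 pt=96",  # resend SPS/PPS every second
        f"udpsink host={host} port={port} sync=false",
    ]
    return " ! ".join(stages)


if __name__ == "__main__":
    # 192.168.1.10:5000 is a placeholder for the control station.
    print(build_sender_pipeline("192.168.1.10", 5000))
```

The printed string can be tried with gst-launch-1.0, or used with `appsrc` as the `source` when the frames come from application code. With keyframes this rare, a receiver that joins late or loses packets cannot recover until the next keyframe, which is why the on-demand keyframe request matters.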
The receiver’s jitter buffer is the silent killer. A fixed 100ms buffer adds 100ms whether you need it or not. On a LAN, jitter is microseconds — you can run the buffer at 0 or use an adaptive target (EWMA(inter-frame jitter) * multiplier). GStreamer’s rtpjitterbuffer defaults are tuned for the public internet and will eat your latency budget on a LAN.
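The adaptive target mentioned above, EWMA(inter-frame jitter) * multiplier, can be sketched as follows. The class name, smoothing factor, multiplier and cap are assumptions chosen for illustration; how the resulting target is applied to an actual receiver is left out.

```python
from typing import Optional


class AdaptiveJitterTarget:
    """Tracks an EWMA of inter-frame jitter and derives a buffer target from it."""

    def __init__(self, nominal_interval_s: float, alpha: float = 0.1,
                 multiplier: float = 3.0, max_target_s: float = 0.1) -> None:
        self.nominal_interval_s = nominal_interval_s  # e.g. 1/30 for 30 FPS
        self.alpha = alpha                # EWMA smoothing factor (assumed)
        self.multiplier = multiplier      # safety margin over measured jitter
        self.max_target_s = max_target_s  # never buffer more than this
        self.jitter_s = 0.0
        self._last_arrival: Optional[float] = None

    @property
    def target_s(self) -> float:
        return min(self.max_target_s, self.multiplier * self.jitter_s)

    def on_frame(self, arrival_s: float) -> float:
        """Feed one frame arrival time; returns the current buffer target."""
        if self._last_arrival is not None:
            interval = arrival_s - self._last_arrival
            deviation = abs(interval - self.nominal_interval_s)
            self.jitter_s += self.alpha * (deviation - self.jitter_s)
        self._last_arrival = arrival_s
        return self.target_s


if __name__ == "__main__":
    tracker = AdaptiveJitterTarget(nominal_interval_s=1 / 30)
    for i in range(10):
        tracker.on_frame(i / 30)
    # One frame arrives 20 ms late: the target rises to about 6 ms.
    print(f"{tracker.on_frame(10 / 30 + 0.020) * 1000:.1f} ms")
```

On a quiet LAN the measured jitter stays near zero, so the target stays near zero as well, and it only grows when frames actually start arriving unevenly.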
Bare UDP is fine on a LAN, but it’s a dead end the moment the link gets lossy. If you ever want this to work over LTE or wifi, you’ll want congestion control. We’ve moved to MoQ over QUIC for exactly this reason — you get a real congestion controller, multi-region relay, and the encoder can react to network state. Zenoh’s another option if you want pub/sub semantics.
On your meta-question: is ROS2 the right pipe? For the control side (commands, state, low-rate telemetry) - yes, absolutely. For the video path - increasingly people are bypassing it. The DDS serialization tax on raw HD frames is real, and once you’ve left ROS2 for the video transport you don’t get much back from putting it back in. We run video on Zenoh/MoQ and keep ROS2 for everything else.
If you want a reference: our robot-side stack (Rust + GStreamer, shm capture via iceoryx2, MoQ transport) is roughly what we landed on after chasing this same number down. Happy to share more if useful.