Telegraf_resource_monitor: System Monitoring Node for ROS 2

As a thank you to the ROS community for creating a great framework that empowers everyone to learn and implement robotics, I’ve created telegraf_resource_monitor. It is a ROS 2 package that integrates Telegraf monitoring with ROS 2 to provide comprehensive system resource monitoring for robotic applications.

Why This Matters

ROS 2 currently seems to lack a well-established system monitoring solution. Questions like “Which node is consuming too much memory?” or “Is my robot overheating?” are critical for deployments but do not seem easy to answer today. This package aims to answer these questions once and for all!

What It Does

Instead of reinventing monitoring tools, this package leverages the cloud-native Telegraf and bridges it to ROS 2, allowing for live monitoring and bagging for later inspection.
Its features are:

  • Dynamic topic creation - automatically creates topics based on the available metrics.

  • Zero ROS configuration - adapts to whatever Telegraf provides, so only Telegraf configuration is needed.

  • Rich monitoring - currently CPU, memory, disk, temperature, and ROS node resource usage, but easily extendable using Telegraf’s 100+ input plugins and its ability to interface with custom applications.

  • Reliable - Telegraf’s use in cloud environments proves its reliability.
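
As an illustration of how dynamic topic creation could work, here is a minimal Python sketch that derives a topic name from one metric in Telegraf's JSON output format. It is not the package's actual implementation: the function name topic_from_metric, the choice of tags, and the sanitizing rule are assumptions made for this example.

```python
import json
import re


def topic_from_metric(metric: dict, tag_keys=("cpu",)) -> str:
    """Build a topic name from a Telegraf metric (hypothetical naming scheme)."""
    # Start with the measurement name, e.g. "cpu" or "mem".
    parts = [metric["name"]]
    # Append the values of the selected tags, if the metric carries them.
    for key in tag_keys:
        value = metric.get("tags", {}).get(key)
        if value:
            parts.append(value)
    # Replace characters outside letters, digits, and underscores.
    sanitized = [re.sub(r"[^A-Za-z0-9_]", "_", part) for part in parts]
    return "/" + "/".join(sanitized)


# One metric as emitted by Telegraf's JSON serializer (sample values are made up).
line = (
    '{"name": "cpu", "tags": {"cpu": "cpu-total", "host": "robot"}, '
    '"fields": {"usage_user": 3.2}, "timestamp": 1700000000}'
)
print(topic_from_metric(json.loads(line)))  # prints /cpu/cpu_total
```

A metric without a matching tag, such as mem, maps to /mem under the same scheme.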

Get Started

The package is available on GitHub with full documentation, examples, and configuration details:

:link: https://github.com/Bart-van-Ingen/telegraf_resource_monitor

It also has a Visual Studio Code development container, making it easy to test on your computer without needing to install it.

I am excited to see what you think! Feedback and contributions welcome!

This looks quite interesting. How complicated is telegraf to set up?

However, I really, really dislike the ROS interface the package provides. custom_interfaces package? What’s that? It follows none of the package naming conventions (ok, it ends in _interfaces). Moreover, it seems to be just another replacement for diagnostic_msgs/DiagnosticStatus (yes, I know the match here is not 100%).

And if you know you’re reading temperatures from sensors, why not publish them as sensor_msgs/Temperature messages?

Last, the number of topics the package creates can get quite high. Each CPU core on its own topic? Why is that? Each topic in the system consumes some resources (which is much more true for ROS 2 than for ROS 1).

Hi @peci1, thanks for taking a look!

There are multiple ways to install it. The one I think is easiest is downloading the standalone binary and placing it in /usr/local/bin/.

That’s some heavy emotion for a package name! :wink: I agree it is not informative enough. I will change it! Would resource_monitoring_interfaces be sufficient?

I debated using diagnostic messages, but decided not to because of the semantics linked to ROS diagnostics. I based a lot of that decision on Christian Henkel’s presentation at ROSCon last year.
The following quotes are what made me think that using diagnostics was not appropriate:

  • Main purpose: Observe the current state of the robot
  • In general, diagnostics are not meant to be used functionally
  • Diagnostics are meant as a communication method from robot to human

This package doesn’t make assumptions about how the user consumes the information. It could be through diagnostics, but I believe that some of the monitored resources could also be acted upon functionally.

Also, according to REP 107, I would need to publish on the /diagnostics topic using the diagnostic_msgs/DiagnosticArray data type. I would then also need to specify operational levels (OK, WARN, ERROR), which I think depend on the use case.
These are the reasons why I didn’t go the diagnostics route, but I am definitely open to discussing it further!

That could be possible! I could parse the fields to determine whether it is a temperature and use the Temperature message instead. However, this is currently what the fields look like for a temperature topic:

fields:
- name: temp_crit
  value: 125.0
- name: temp_input
  value: 56.0

So you would lose the temp_crit information, which is not in the Temperature message definition.

This is a setting that I have set in config/telegraf.conf.

[[inputs.cpu]]
  ## Whether to report per-cpu stats or not
  percpu = true
  ## Whether to report total system cpu stats or not
  totalcpu = true

Setting percpu = false will remove the per-CPU topics! Telegraf seems to be super configurable, so it can be tuned to your heart’s delight!

Great, thanks! I think this name is much better.

Sure, as I said, the match is not 100%. On ROS 1 we used ethz-asl/ros-system-monitor (system monitoring tools for ROS), which publishes to diagnostics, and we were quite happy with it. But the package was kind of mixing collection of data with interpretation. I understand your package is only for collection. But it would be great to think about the downstream task of generating basic diagnostics from the telegraf ROS topics. Maybe a simple interpretation package could also be a part of the repo.

You can publish both your custom message and the Temperature message.
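
A rough Python sketch of that dual-publishing idea, using the field list shown earlier in the thread. To keep it runnable without a ROS installation, a small dataclass stands in for sensor_msgs/Temperature. The names TemperatureStandIn and split_temperature are hypothetical, and treating temp_input as the reading in degrees Celsius is an assumption for this example, not something confirmed by the package.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TemperatureStandIn:
    """Stand-in for sensor_msgs/Temperature (header omitted for brevity)."""
    temperature: float      # degrees Celsius
    variance: float = 0.0   # 0 is interpreted as "variance unknown"


def split_temperature(fields: list) -> tuple:
    """Return (temperature message or None, all fields as a dict)."""
    values = {field["name"]: field["value"] for field in fields}
    reading = values.get("temp_input")  # assumed to be the current reading
    message: Optional[TemperatureStandIn] = None
    if reading is not None:
        message = TemperatureStandIn(temperature=float(reading))
    # The full dict is returned as well, so temp_crit is not lost.
    return message, values


fields = [
    {"name": "temp_crit", "value": 125.0},
    {"name": "temp_input", "value": 56.0},
]
temperature_msg, all_values = split_temperature(fields)
```

Here temperature_msg would go out on a Temperature topic, while all_values (including temp_crit) would still feed the custom message.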

I have finally had some time to work on this package a little more!
@peci1 had the good idea of letting the resource monitoring plug into the ROS diagnostics system, so I set out to build it.

I have just updated the repository with the resource_diagnostics_updater package, which implements this new feature. You only need to define which resources you want diagnostics applied to in a configuration file that you pass to the node at launch. The node takes the defined resource topics and applies thresholds to determine the diagnostics level, after which the results are published on /diagnostics.
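
To make the threshold step concrete, here is a minimal Python sketch of mapping a measured value to a diagnostics level. It is only an illustration under stated assumptions: the configuration layout, the field name usage_active, and the function level_for are hypothetical and not taken from resource_diagnostics_updater. The numeric levels mirror the OK, WARN, and ERROR constants of diagnostic_msgs/DiagnosticStatus.

```python
# Numeric values of the level constants in diagnostic_msgs/DiagnosticStatus.
OK, WARN, ERROR = 0, 1, 2

# Hypothetical configuration: one entry per monitored resource topic.
THRESHOLDS = {
    "/cpu/cpu_total": {"field": "usage_active", "warn": 80.0, "error": 95.0},
}


def level_for(value: float, warn: float, error: float) -> int:
    """Map a value to a level, assuming higher values are worse."""
    if value >= error:
        return ERROR
    if value >= warn:
        return WARN
    return OK


def evaluate(topic: str, fields: dict) -> int:
    """Look up the thresholds for a topic and evaluate the configured field."""
    config = THRESHOLDS[topic]
    return level_for(fields[config["field"]], config["warn"], config["error"])


print(evaluate("/cpu/cpu_total", {"usage_active": 87.5}))  # prints 1 (WARN)
```

A real node would wrap the resulting level in a DiagnosticStatus and publish it inside a DiagnosticArray; that part is left out here so the sketch runs without ROS.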

Have a look and let me know what you think!

To be honest, I initially discarded the package upon reading the README.md.

But on re-reading it, I do recognize the potential. So I think it’s only fair to mention what put me off initially, since you are asking for feedback :slightly_smiling_face:

Custom installation instructions

# Add InfluxDB repository
curl -s https://repos.influxdata.com/influxdata-archive_compat.key | sudo apt-key add -
echo "deb https://repos.influxdata.com/ubuntu $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/influxdata.list

# Install Telegraf
sudo apt update
sudo apt install telegraf

So that is not going to happen for us. This is something every developer in our company needs to do. Or we need to script around it.
Also our deployment pipelines do not have these instructions. Again implementing or scripting.

Might not seem like a big deal, but if every package does it like this, it becomes a mess. Also when we remove a dependency we need to crawl our scripting to see if anything custom was made. You get the point.

Standard ROS way of working is:

rosdep install --ignore-src --from-paths src/telegraf_resource_monitor

This means you should probably create a vendor package telegraf_vendor and have an exec_depend on it.

And for convenience, a launch file would be nice that I could use as an example: one that launches everything I need to get /diagnostics and, for example, the CPU. This includes launching whatever process telegraf needs.

Spawning lots of topics

From the README:

/cpu/cpu0
/cpu/cpu1
/cpu/cpu2
/cpu/cpu3
/cpu/cpu_total
/disk/root
/mem
/procstat/telegraf_resource_monitor
/sensors/acpitz_acpi_0/temp1
/sensors/amdgpu_pci_0400/edge
/sensors/amdgpu_pci_0400/slowppt
/sensors/amdgpu_pci_0400/vddgfx
/sensors/amdgpu_pci_0400/vddnb
/sensors/bat1_acpi_0/in0
/sensors/iwlwifi_1_virtual_0/temp1
/sensors/k10temp_pci_00c3/tctl
/sensors/nvme_pci_0100/composite
/sensors/nvme_pci_0100/sensor_1

But like @peci1, we’re only interested in /diagnostics for now. So this is cluttering the topic list a lot.
I think the creation of N topics should be made optional. Or ideally we can pick some. For example we might have a special interest in the battery one, not the rest.

This probably does mean you need to find another way to interface between telegraf_resource_monitor and resource_diagnostics_updater, but not sure.

I was looking at this yesterday. Telegraf is in Ubuntu 22.04. But only there. I found the Debian mailing list saying something like “telegraf is adding new dependencies like hell, they try to monitor everything, and we don’t have time to keep up with this pace; thus we’re removing telegraf”. So I’m not sure how easy it would be to create a vendor package… But maybe its build script could be configured just for the basics…

Hi @Timple, thanks for the feedback!

As I indicated earlier in this thread:

I actually also do this in the dev container I use to develop this repository. This is the section of code that does it:

# Install Telegraf
RUN wget https://dl.influxdata.com/telegraf/releases/telegraf-1.35.3_linux_amd64.tar.gz \
    && tar -xzf telegraf-1.35.3_linux_amd64.tar.gz \
    && rm telegraf-1.35.3_linux_amd64.tar.gz \
    && mv telegraf-1.35.3/usr/bin/telegraf /usr/local/bin/telegraf \
    && chmod +x /usr/local/bin/telegraf

Since this lets you control the version, would this make it more interesting for your organization? Either way, I will add this as an installation method in the README.md!

Good point, I will add it!

So the generated topics are completely dependent on what you set in the telegraf.conf file, which gets used by Telegraf to determine what gets monitored. If you only want battery to be shown, then this is definitely configurable.
Do you see a scenario where you want to monitor resources through telegraf but do not want them to be injected into the ROS 2 system? This could be the case if you only want to send them as metrics to a remote monitoring solution.

Maybe the binary can be kept in the install folder of the ROS 2 workspace, automatically downloaded by the _vendor package. So no pollution on the system paths happens at all.
