Start prototyping with “drop in nrpe” idea: T350360: Evaluate "drop in" replacement for nrpe scripts and T384472: Candidate nrpe checks for compatibility layer icinga/prometheus/alertmanager
Description
Details
| Status | Subtype | Assigned | Task | ||
|---|---|---|---|---|---|
| Open | tappof | T395441 Port all Icinga checks to Prometheus/Alertmanager: preparation | |||
| Open | tappof | T350360 Evaluate "drop in" replacement for nrpe scripts | |||
| Open | tappof | T395446 Evaluate which solution we could adopt as a drop-in replacement for NRPE (and start prototyping) |
Event Timeline
Change #1168150 had a related patch set uploaded (by Tiziano Fogli; author: Tiziano Fogli):
[operations/puppet@production] nrpe wrapper: add wrapper to be invoked a systemd timer
@jcrespo, I'd like to thank you for the POC you provided for T350360: Evaluate "drop in" replacement for nrpe scripts. I used your code as the basis for the patchset I just pushed to Gerrit and to help move the discussion on the approach forward.
The patch is still incomplete (for example, it lacks error handling), but it implements some of the concepts we've discussed as a team for the NRPE drop-in replacement.
The script acts asynchronously and is meant to be triggered by a systemd timer. The resulting metrics will be scraped by Prometheus via the Node Exporter, which is already present on every host.
I removed the plugin output from all labels, as this data is not meant to be exported into Prometheus metrics as labels. Instead, the idea is to push the output to Logstash and include a label in any alert linking to the corresponding script logs. This way, we’ll hopefully get a cleaner and more readable version of the script’s debugging information and reduce the number of time series on the Prometheus side.
I also briefly implemented performance data parsing. The resulting series look like this:
root@phi-arclamp-01:/etc/nagios/nrpe.d# python3 /home/tappof/nrpe2nodexp.py --check-command check_disk_space --dry-run --debug
# HELP nagios_nrpe_check_result Result of Nagios-style check script execution
# TYPE nagios_nrpe_check_result gauge
nagios_nrpe_check_result{check_name="check_disk_space",status="OK"} 1.0
nagios_nrpe_check_result{check_name="check_disk_space",status="WARNING"} 0.0
nagios_nrpe_check_result{check_name="check_disk_space",status="CRITICAL"} 0.0
nagios_nrpe_check_result{check_name="check_disk_space",status="UNKNOWN"} 0.0
# HELP nagios_nrpe_check_perfdata PerfData generated by Nagios-style check script execution
# TYPE nagios_nrpe_check_perfdata gauge
nagios_nrpe_check_perfdata{check_name="check_disk_space",crit="994742108",entity="/dev",max="1025507328",min="0",unit="B",warn="963976888"} 0.0
nagios_nrpe_check_perfdata{check_name="check_disk_space",crit="200372387",entity="/run",max="206569472",min="0",unit="B",warn="194175303"} 0.0
nagios_nrpe_check_perfdata{check_name="check_disk_space",crit="20332203212",entity="/",max="20961034240",min="0",unit="B",warn="19703372185"} 4.258267136e+09
nagios_nrpe_check_perfdata{check_name="check_disk_space",crit="1005930414",entity="/dev/shm",max="1037041664",min="0",unit="B",warn="974819164"} 0.0
nagios_nrpe_check_perfdata{check_name="check_disk_space",crit="5085593",entity="/run/lock",max="5242880",min="0",unit="B",warn="4928307"} 0.0
nagios_nrpe_check_perfdata{check_name="check_disk_space",crit="125105602",entity="/boot/efi",max="128974848",min="0",unit="B",warn="121236357"} 1.1534336e+07
nagios_nrpe_check_perfdata{check_name="check_disk_space",crit="200372387",entity="/run/user/0",max="206569472",min="0",unit="B",warn="194175303"} 0.0
nagios_nrpe_check_perfdata{check_name="check_disk_space",crit="200372387",entity="/run/user/47319",max="206569472",min="0",unit="B",warn="194175303"} 0.0With the performance data series available, it will be possible to create graphs on dashboards without writing additional ad-hoc exporters.
One more thing about the nagios_nrpe_check_result metric: there will always be four series (one for each possible status), with only one of them set to 1.0—the one representing the script’s actual return value.
This approach simplifies alert definitions and also makes it easier to join series when needed.
We believe that this approach could partially automate the migration via Puppet by adding resources to the nrpe::monitor_service module, such as the wrapper script and the systemd::timer.
Please, let us know what you think about that.
Be careful, that is no longer a
Small application that listens on a given port for http requests and returns on the /metrics path a prometheus-compatible response
as the comment says in your patch. There is no http demon listening anymore. I would keep that, here or on a separate script that reads it and pushes it or serves it somewhere from file.
Also :
The script acts asynchronously
This is not true, until you show some puppet that implements such a thing 0:-) A different thing is if it is intended to be async.
The performance data also doesn't make sense to parse it- that is free text that other checks other than the disk space, (which doesn't make sense anyway to to leave it on icinga, as the data is already on prometheus), will have its own arbitrary format. It is better to discard for prometheus, as Filippo suggested.
Other than that, ok with the general idea, you can format it any way you want. Feel free to take over if you are going to work on this. I think the most important part change needed is to log the full output and store it/send it somewhere for debugging, even if you export to prometheus only a small subset. I think you need more familarity with the icinga checks typical output and its uses, such as that of the RAID check when it is failing, which is like ~4k of text. Filippo had already hinted at this on the patch and conversations on related patches. Remember that Riccardo will want to run some automation that outputs that text to Phabricator, among other things: T384837
Yeah, sure. I forgot to mention that my version is also just a POC, loosely adapted from your work... I haven't adjusted the comments and so on. Its purpose is simply to share some ideas.
Also :
The script acts asynchronously
This is not true, until you show some puppet that implements such a thing 0:-) A different thing is if it is intended to be async.
I'm not sure I got your point — let me try to say it another way: with this version, the Prometheus scraper doesn't wait for the NRPE script result in a synchronous way. Instead, it fetches cached data exported by the Node Exporter. By adjusting the systemd::timer parameters, we can define how often the cache should be updated.
Anyway, I'll complete the POC with some Puppet code implementing this, and test it on Pontoon.
The performance data also doesn't make sense to parse it- that is free text that other checks other than the disk space, (which doesn't make sense anyway to to leave it on icinga, as the data is already on prometheus), will have its own arbitrary format. It is better to discard for prometheus, as Filippo suggested.
I tried with check_disk_space just to quickly test the script and saw that its output complies with the development guidelines provided by Nagios which defines the structure for the performance data string as: 'label'=value[UOM];[warn];[crit];[min];[max]" (https://nagios-plugins.org/doc/guidelines.html#AEN200).
If not every script follows those guidelines and instead produces freeform text, I agree that parsing the performance data output is clearly not a reliable way to expose that information in Prometheus. However, if we want to support this feature, we could even add a parameter or flag to choose whether performance data parsing is required.
Other than that, ok with the general idea, you can format it any way you want. Feel free to take over if you are going to work on this.
Sure, I'd like to. It's part of our quarterly/fiscal year plan, but let me thank you once again for the effort you put in.
I think the most important part change needed is to log the full output and store it/send it somewhere for debugging, even if you export to prometheus only a small subset.
I agree. The plan we discussed is the one I outlined in the first comment, which involves sending the plugin output to Logstash.
I think you need more familarity with the icinga checks typical output and its uses, such as that of the RAID check when it is failing, which is like ~4k of text. Filippo had already hinted at this on the patch and conversations on related patches. Remember that Riccardo will want to run some automation that outputs that text to Phabricator, among other things: T384837
As far as I remember, though I could be wrong, the RAID check itself does not produce such a large amount of output text in <plugin output> or <performance data>. Instead, it triggers an 'Event Handler' (raid_handler) that collects all the data and pushes it to a Phabricator task. The automation described in T384837: Integration between alertmanager and cumin cookbooks is (also) the mechanism intended to replace the event handler in an Icinga-free environment.
Thanks again for sharing your thoughts with us.
the RAID check itself does not produce such a large amount of output text in <plugin output>
I believe it used to do it (if it is no longer in use, that I am not sure), that is why it needed to be compressed (it doesn't log it in plain text by default). So I think it is more important to test the compatibility on all checks than do 1 thoroughly.
In any case, talk directly to Riccardo- he will have the details, but I think it is a use case important for the architecture in terms of how he prefers to handle it and hook to it. Having said that- this is the perfect time to refactor and fix issues. When I said to familiarize with checks is because they apparently do very weird things and things not intended. For example, get-raid-status-broadcom returns: communication: 0 OK | controller: 0 OK | physical_disk: 0 OK | virtual_disk: 0 OK | bbu: 0 OK | enclosure: 0 OK which you can clearly see it will have unintended consequences. I was not critizising you, but warning you about the of the huge amounts of cruft and bugs that were still there piled up over the years. :-(
I think you have clear the idea of my prototype and next steps, so I will let you cook the details and take over. 0:-D
Change #1168176 had a related patch set uploaded (by Jcrespo; author: Jcrespo):
[operations/puppet@production] raid: Do not use the pipe symbol '|' as a separator for icinga checks
Change #1168176 merged by Jcrespo:
[operations/puppet@production] raid: Do not use the pipe symbol '|' as a separator for icinga checks
Change #1168150 merged by Tiziano Fogli:
[operations/puppet@production] nrpe wrapper: add wrapper to be invoked a systemd timer
Change #1174406 had a related patch set uploaded (by Tiziano Fogli; author: Tiziano Fogli):
[operations/puppet@production] nrpe wrapper: install nrpe2nodexp dependencies
Change #1174406 merged by Tiziano Fogli:
[operations/puppet@production] nrpe wrapper: install nrpe2nodexp dependencies
Change #1174411 had a related patch set uploaded (by Tiziano Fogli; author: Tiziano Fogli):
[operations/puppet@production] nrpe wrapper: enable nrpe2nodexp for check_disk_space (for testing)
Change #1174462 had a related patch set uploaded (by Tiziano Fogli; author: Tiziano Fogli):
[operations/puppet@production] nrpe wrapper: install dependencies only on bullseye and newer
Change #1174462 merged by Tiziano Fogli:
[operations/puppet@production] nrpe wrapper: install dependencies only on bullseye and newer
Change #1174411 merged by Tiziano Fogli:
[operations/puppet@production] nrpe wrapper: enable nrpe2nodexp for check_disk_space (for testing)
Change #1174729 had a related patch set uploaded (by Tiziano Fogli; author: Tiziano Fogli):
[operations/puppet@production] nrpe wrapper: define Prometheus alerts via Puppet resources
Change #1180501 had a related patch set uploaded (by Tiziano Fogli; author: Tiziano Fogli):
[operations/puppet@production] nrpe wrapper: enable nrpe2nodexp for check_ferm_active (testing)
Change #1174729 merged by Tiziano Fogli:
[operations/puppet@production] nrpe wrapper: define Prometheus alerts via Puppet resources
Change #1180501 merged by Tiziano Fogli:
[operations/puppet@production] nrpe wrapper: enable nrpe2nodexp for check_ferm_active (testing)
Change #1180566 had a related patch set uploaded (by Tiziano Fogli; author: Tiziano Fogli):
[operations/puppet@production] Reapply "nrpe wrapper: define Prometheus alerts via Puppet resources"
Change #1180565 had a related patch set uploaded (by Tiziano Fogli; author: Tiziano Fogli):
[operations/puppet@production] cloud: add missing hiera values to enable nrpe wrapper on vps
Change #1180565 merged by Tiziano Fogli:
[operations/puppet@production] cloud: add missing hiera values to enable nrpe wrapper on vps
Change #1180566 merged by Tiziano Fogli:
[operations/puppet@production] Reapply "nrpe wrapper: define Prometheus alerts via Puppet resources"
Change #1180826 had a related patch set uploaded (by Tiziano Fogli; author: Tiziano Fogli):
[operations/puppet@production] pdb_resource_exporter: add query to track migrated nrpe checks
Change #1180826 merged by Tiziano Fogli:
[operations/puppet@production] pdb_resource_exporter: add query to track migrated nrpe checks
Change #1180862 had a related patch set uploaded (by Tiziano Fogli; author: Tiziano Fogli):
[operations/puppet@production] pdb_resource_exporter: add query to track migrated nrpe checks
Change #1180862 merged by Tiziano Fogli:
[operations/puppet@production] pdb_resource_exporter: add query to track migrated nrpe checks
Change #1181650 had a related patch set uploaded (by Tiziano Fogli; author: Tiziano Fogli):
[operations/puppet@production] nrpe2nodexp: remove file under node.d for disabled checks
Change #1181650 merged by Tiziano Fogli:
[operations/puppet@production] nrpe2nodexp: remove file under node.d for disabled checks
Mentioned in SAL (#wikimedia-operations) [2025-08-26T00:07:25Z] <brett> Run systemctl reset-failed on disappeared nrpe2nodexp-disk_space.timer units (T395446)
There were oodles of servers with a failed nrpe2nodexp-disk_space.timer unit now - it didn't trigger any errors but I was noticing them while doing other stuff today. I went ahead and reset them with:
$ sudo -i cumin '*' 'systemctl -q is-failed nrpe2nodexp-disk_space.timer && [ $(systemctl show -P LoadState nrpe2nodexp-disk_space.timer) = 'not-found' ] && systemctl reset-failed nrpe2nodexp-disk_space.timer'Thank you, @BCornwall. I tried to manage the checks removal using the ensure param consistently… I’ll check if something is missing in systems::timer::job… or somewhere else.
Change #1182148 had a related patch set uploaded (by Tiziano Fogli; author: Tiziano Fogli):
[operations/puppet@production] nrpewrapper: correlate Prometheus "for:" duration with Icinga timing
Change #1182148 merged by Tiziano Fogli:
[operations/puppet@production] nrpewrapper: correlate Prometheus "for:" duration with Icinga timing
Change #1182511 had a related patch set uploaded (by Tiziano Fogli; author: Tiziano Fogli):
[operations/puppet@production] nrpewrapper: fix max parameters
Change #1182511 abandoned by Tiziano Fogli:
[operations/puppet@production] nrpewrapper: fix max parameters
Reason:
max not supported yet in puppet 5
Change #1182524 had a related patch set uploaded (by Tiziano Fogli; author: Tiziano Fogli):
[operations/puppet@production] nrpewrapper: correlate Prometheus "for:" duration with Icinga timing
Change #1182524 merged by Tiziano Fogli:
[operations/puppet@production] nrpewrapper: correlate Prometheus "for:" duration with Icinga timing
Still needs further investigation, but with ensure => false, the daemon-reload command is triggered after the unit file is removed, without explicitly disabling the timer beforehand. This would result in the "not found" status described.
Change #1184487 had a related patch set uploaded (by Tiziano Fogli; author: Tiziano Fogli):
[operations/puppet@production] nrpe2nodexp: add alertmanager_team param to override role_owner metric
Change #1184487 merged by Tiziano Fogli:
[operations/puppet@production] nrpe2nodexp: add alertmanager_team param to override role_owner metric
Change #1194632 had a related patch set uploaded (by Tiziano Fogli; author: Tiziano Fogli):
[operations/puppet@production] nrpewrapper: correlate Prometheus "for:" duration with timer interval
Change #1194632 merged by Tiziano Fogli:
[operations/puppet@production] nrpewrapper: correlate Prometheus "for:" duration with timer interval
Change #1197655 had a related patch set uploaded (by Tiziano Fogli; author: Tiziano Fogli):
[operations/puppet@production] nrpe2nodexp: add randomized delay to timers
Change #1197655 merged by Tiziano Fogli:
[operations/puppet@production] nrpe2nodexp: add randomized delay to timers
Change #1199242 had a related patch set uploaded (by Tiziano Fogli; author: Tiziano Fogli):
[operations/puppet@production] nrpe2nodexp: use service description as alertname
Change #1199242 merged by Tiziano Fogli:
[operations/puppet@production] nrpe2nodexp: use service description as alertname