As part of the icinga deprecation we'll need to port the following alerts to Prometheus/Alertmanager:
    class profile::query_service::monitor::categories {
        # categories are updated weekly, this is a low frequency check
        nrpe::monitor_service { 'Categories_Ping':
            description    => 'Categories endpoint',
            nrpe_command   => '/usr/local/lib/nagios/plugins/check_categories.py --ping',
            check_interval => 720, # every 6 hours
            retry_interval => 60,  # retry after 1 hour
            notes_url      => 'https://wikitech.wikimedia.org/wiki/Wikidata_query_service',
        }

        nrpe::monitor_service { 'Categories_Lag':
            description    => 'Categories update lag',
            nrpe_command   => '/usr/local/lib/nagios/plugins/check_categories.py --lag',
            check_interval => 720, # every 6 hours
            retry_interval => 60,  # retry after 1 hour
            notes_url      => 'https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook#Categories_update_lag',
        }
    }
My understanding from reading modules/query_service/files/nagios/check_categories.py is that a SPARQL query is run periodically, both to check whether the query can be run at all (ping) and whether the categories lag is within threshold (lag).
The ideal scenario IMHO would be that this lag value already exists as a Prometheus metric, in which case creating the alert is straightforward. Alternatively, is there a proxy metric we could use for the same purpose? Another ideal outcome would be that the check is obsolete by now and can be removed altogether. Lastly, failing all of the above, we can turn the lag into a Prometheus metric by adapting the check and running it periodically.
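For the last option, here is a minimal sketch of what exporting the lag could look like. It assumes the adapted check can hand over the timestamp of the last categories update (or nothing if the query failed), and that the result is published through node_exporter's textfile collector. The metric names and the helper functions are hypothetical and not taken from check_categories.py.

```python
import os
import tempfile
from datetime import datetime, timezone

# Hypothetical metric names; not taken from any existing exporter.
LAG_METRIC = "query_service_categories_lag_seconds"
PING_METRIC = "query_service_categories_ping_success"


def render_metrics(last_update, now):
    """Render ping and lag results in the Prometheus text exposition format.

    last_update is the timezone-aware datetime of the last categories update,
    or None if the SPARQL query could not be run (the "ping" failure case).
    """
    lines = [
        f"# HELP {PING_METRIC} Whether the categories query succeeded (1) or not (0).",
        f"# TYPE {PING_METRIC} gauge",
        f"{PING_METRIC} {0 if last_update is None else 1}",
    ]
    if last_update is not None:
        lag = (now - last_update).total_seconds()
        lines += [
            f"# HELP {LAG_METRIC} Seconds since the last categories update.",
            f"# TYPE {LAG_METRIC} gauge",
            f"{LAG_METRIC} {lag:.0f}",
        ]
    return "\n".join(lines) + "\n"


def write_textfile(path, content):
    """Write atomically so the collector never reads a half-written file."""
    directory = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        f.write(content)
    os.chmod(tmp, 0o644)  # mkstemp creates the file as 0600
    os.replace(tmp, path)
```

With something like this run from a timer, both existing checks would map to simple alert expressions: one on the ping metric being 0, and one on the lag metric exceeding whatever threshold the current check uses.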