
Port Categories lag / ping checks to Prometheus/Alertmanager
Open, In Progress, Medium, Public

Description

As part of the icinga deprecation we'll need to port the following alerts to Prometheus/Alertmanager:

class profile::query_service::monitor::categories {
    # categories are updated weekly, this is a low frequency check
    nrpe::monitor_service { 'Categories_Ping':
        description    => 'Categories endpoint',
        nrpe_command   => '/usr/local/lib/nagios/plugins/check_categories.py --ping',
        check_interval => 720, # every 6 hours
        retry_interval => 60,  # retry after 1 hour
        notes_url      => 'https://wikitech.wikimedia.org/wiki/Wikidata_query_service',
    }

    nrpe::monitor_service { 'Categories_Lag':
        description    => 'Categories update lag',
        nrpe_command   => '/usr/local/lib/nagios/plugins/check_categories.py --lag',
        check_interval => 720, # every 6 hours
        retry_interval => 60,  # retry after 1 hour
        notes_url      => 'https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook#Categories_update_lag',
    }
}

My understanding from reading modules/query_service/files/nagios/check_categories.py is that a SPARQL query is run periodically, both to check whether the query can be run at all (ping) and whether the categories lag is within a threshold (lag).

The ideal scenario, IMHO, would be if we already had this lag value as a Prometheus metric; creating the alert would then be straightforward. Alternatively, could we use a proxy metric for the same purpose? Another good outcome would be if the check turns out to be obsolete and can be removed altogether. Lastly, failing all of the above, we can turn the lag into a Prometheus metric by adapting the check and running it periodically.
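As a rough illustration of that last option, a minimal sketch of an adapted check, assuming prometheus_client and requests are available; the endpoint, query, metric name, port, and polling interval below are placeholders rather than the real values from check_categories.py:

import time

import requests
from prometheus_client import Gauge, start_http_server

# All of these are illustrative placeholders, not the real values.
SPARQL_ENDPOINT = "http://localhost:9990/bigdata/namespace/categories/sparql"
LAG_QUERY = "SELECT ?lastUpdated WHERE { ... }"  # stand-in for the query in check_categories.py

categories_lag = Gauge(
    "categories_update_lag_seconds",  # hypothetical metric name
    "Seconds since the categories namespace was last updated",
)

def fetch_lag_seconds():
    # Assumes the query returns an epoch timestamp; the real check may differ.
    resp = requests.get(
        SPARQL_ENDPOINT,
        params={"query": LAG_QUERY},
        headers={"Accept": "application/sparql-results+json"},
        timeout=30,
    )
    resp.raise_for_status()
    binding = resp.json()["results"]["bindings"][0]
    return time.time() - float(binding["lastUpdated"]["value"])

if __name__ == "__main__":
    start_http_server(9999)  # illustrative port; 9193-9195 are already in use
    while True:
        categories_lag.set(fetch_lag_seconds())
        time.sleep(300)  # polling interval is illustrative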

Event Timeline

We could perhaps adapt modules/query_service/files/monitor/prometheus-blazegraph-exporter.py to take care of running this query, possibly by re-using the same gauge blazegraph_lastupdated but adapting the query depending on the namespace it's running against.

Not sure that the check_categories.py --ping check is necessary; it could be dropped IMO, as it should already be covered by some other sensors.

> We could perhaps adapt modules/query_service/files/monitor/prometheus-blazegraph-exporter.py to take care of running this query, possibly by re-using the same gauge blazegraph_lastupdated but adapting the query depending on the namespace it's running against.

I like this idea, specifically generalizing blazegraph-exporter a little and making it configurable, such as "map this query to this metric" or something similar; definitely worth exploring.
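A rough sketch of that "map this query to this metric" idea; the namespace keys and queries are placeholders, not the exporter's actual configuration:

from prometheus_client import Gauge

# Placeholder queries; the real ones live in the exporter / check scripts.
LASTUPDATED_QUERIES = {
    "wdq": "SELECT ?t WHERE { ... }",         # wikidata-style lastUpdated query
    "categories": "SELECT ?t WHERE { ... }",  # categories-style lastUpdated query
}

def lastupdated_gauge_for(namespace):
    """Reuse the same gauge name, but pick the query per namespace."""
    gauge = Gauge("blazegraph_lastupdated", "Last update timestamp")
    return gauge, LASTUPDATED_QUERIES[namespace]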

> Not sure that the check_categories.py --ping check is necessary; it could be dropped IMO, as it should already be covered by some other sensors.

That is very good information to have, thank you! +1 to dropping the check altogether

bking changed the task status from Open to In Progress. Sep 17 2024, 6:31 PM
bking claimed this task.
bking triaged this task as Medium priority.
bking updated Other Assignee, added: RKemper.

Change #1073510 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] wdqs categories: remove redundant ping check

https://gerrit.wikimedia.org/r/1073510

Change #1073529 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] wdqs categories: ship lastUpdated metric

https://gerrit.wikimedia.org/r/1073529

Change #1073533 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/alerts@master] wdqs max lag: specify specific port

https://gerrit.wikimedia.org/r/1073533

@EBernhardson, @bking and I paired on implementing this. We took @dcausse's approach of using the namespace to decide which exact SPARQL query to run so that the lastUpdated value is extracted properly.

We've tested with REPLs and the like on a wdqs host, and this approach should work.

Once we merge the exporter change, we're going to have both the wikidata and categories metrics shipping with cluster=wdqs due to how our stack works, so we need to make sure downstream "users" of the metric (e.g. alerting, the wikidata maxlag backoff logic, etc.) properly specify the right instance, etc., in order to get just the wikidata-specific metrics.

I've uploaded a patch to the alerts repo accordingly. We still need to track down and change the maxlag backoff part though.
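For illustration, the kind of filtering a downstream consumer would need once both namespaces share the metric name, assuming the main blazegraph exporter is scraped on port 9193 and categories on 9194; the Prometheus URL is hypothetical:

import requests

PROMETHEUS_URL = "http://prometheus.example.org/ops/api/v1/query"  # hypothetical

# Narrow to the main (wikidata) blazegraph instances only, excluding categories,
# by matching the exporter port in the instance label.
PROMQL = 'time() - blazegraph_lastupdated{cluster="wdqs", instance=~".*:9193"}'

resp = requests.get(PROMETHEUS_URL, params={"query": PROMQL}, timeout=10)
resp.raise_for_status()
for result in resp.json()["data"]["result"]:
    print(result["metric"].get("instance"), result["value"][1])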

Change #1073510 merged by Ryan Kemper:

[operations/puppet@production] wdqs categories: remove redundant ping check

https://gerrit.wikimedia.org/r/1073510

Change #1073746 had a related patch set uploaded (by DCausse; author: DCausse):

[mediawiki/extensions/Wikidata.org@master] Better filtering of prom metrics for wdqs max lag

https://gerrit.wikimedia.org/r/1073746

Change #1073746 merged by jenkins-bot:

[mediawiki/extensions/Wikidata.org@master] Better filtering of prom metrics for wdqs max lag

https://gerrit.wikimedia.org/r/1073746

Change #1073533 merged by jenkins-bot:

[operations/alerts@master] wdqs max lag: target specific port

https://gerrit.wikimedia.org/r/1073533

Gehel removed bking as the assignee of this task. Nov 26 2024, 2:46 PM
Gehel updated Other Assignee, removed: RKemper.

Per today's DPE SRE standup, I've grabbed this ticket and will try to move it forward.

Unfortunately, I'm not finding any metrics for wdqs-categories in Grafana Explorer. The categories exporter runs on port 9194, but if I type out

blazegraph_lastupdated{instance=

I only see options for port 9193 (main blazegraph instance) or 9195 (wcqs). I can confirm the exporter is listening and producing metrics:

root@wdqs1011:~# curl -s http://0:9194/metrics | grep blaze
# HELP blazegraph_journal_commit_count_total
# TYPE blazegraph_journal_commit_count_total counter
blazegraph_journal_commit_count_total 1.112443e+06
# HELP blazegraph_journal_total_commit_seconds Total time spent in commit.
# TYPE blazegraph_journal_total_commit_seconds gauge
blazegraph_journal_total_commit_seconds 241.216368789
# HELP blazegraph_journal_flush_write_set_seconds
# TYPE blazegraph_journal_flush_write_set_seconds gauge
blazegraph_journal_flush_write_set_seconds 179.130189722

I ran a packet capture on wdqs1011 and can confirm that the Prometheus hosts are scraping the correct port.

Additionally, I checked /srv/prometheus/ops/targets/blazegraph_eqiad.yaml on prometheus1006, and it does look like the targets are configured correctly:

blazegraph_eqiad.yaml
15:  - wdqs1012:9194

The investigation continues...

@bking I can see the metrics in Grafana; see for example https://grafana.wikimedia.org/goto/LyGMnoIHR?orgId=1
Your exporter is not exporting the blazegraph_lastupdated value itself, though, just the HELP/TYPE comments for it:

$ curl -s http://0:9194/metrics | grep blazegraph_lastupdated
# HELP blazegraph_lastupdated Last update timestamp
# TYPE blazegraph_lastupdated gauge

Change #1073529 merged by Bking:

[operations/puppet@production] wdqs categories: ship lastUpdated metric

https://gerrit.wikimedia.org/r/1073529

Change #1105368 had a related patch set uploaded (by Bking; author: Bking):

[operations/alerts@master] team-data-platform: remove misconfigured alert

https://gerrit.wikimedia.org/r/1105368

Change #1105368 merged by Bking:

[operations/alerts@master] team-data-platform: remove misconfigured alert

https://gerrit.wikimedia.org/r/1105368

We're now shipping the metrics correctly (thanks volans and dcausse).

The next step is to create a Prometheus alert that makes sense for wdqs-categories, which doesn't use the wdqs streaming updater. Instead, categories is updated via daily and weekly timers. When things are working, the lag increases continually until the timer is run, then drops to zero.
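A minimal sketch of that threshold reasoning, assuming blazegraph_lastupdated carries an epoch timestamp for the categories namespace; the week-plus-slack values are placeholders, not what the final alert will use:

import time

WEEK = 7 * 24 * 3600
SLACK = 24 * 3600  # illustrative grace period on top of the weekly timer

def categories_lag_is_critical(last_updated_epoch, now=None):
    # Lag normally grows for up to a week between timer runs, so the alert
    # threshold has to sit above the longest normal update interval plus slack.
    now = time.time() if now is None else now
    return (now - last_updated_epoch) > WEEK + SLACK

# An update 6 days ago is healthy; one 9 days ago should fire.
assert not categories_lag_is_critical(time.time() - 6 * 24 * 3600)
assert categories_lag_is_critical(time.time() - 9 * 24 * 3600)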

Change #1105451 had a related patch set uploaded (by Bking; author: Bking):

[operations/alerts@master] team-search-platform: Add alert for wdqs-categories lag

https://gerrit.wikimedia.org/r/1105451