
Port Categories lag / ping checks to Prometheus/Alertmanager
Open, In Progress, Medium, Public

Description

As part of the icinga deprecation we'll need to port the following alerts to Prometheus/Alertmanager:

class profile::query_service::monitor::categories {
    # categories are updated weekly, this is a low frequency check
    nrpe::monitor_service { 'Categories_Ping':
        description    => 'Categories endpoint',
        nrpe_command   => '/usr/local/lib/nagios/plugins/check_categories.py --ping',
        check_interval => 720, # every 6 hours
        retry_interval => 60,  # retry after 1 hour
        notes_url      => 'https://wikitech.wikimedia.org/wiki/Wikidata_query_service',
    }

    nrpe::monitor_service { 'Categories_Lag':
        description    => 'Categories update lag',
        nrpe_command   => '/usr/local/lib/nagios/plugins/check_categories.py --lag',
        check_interval => 720, # every 6 hours
        retry_interval => 60,  # retry after 1 hour
        notes_url      => 'https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook#Categories_update_lag',
    }
}

My understanding from reading modules/query_service/files/nagios/check_categories.py is that a SPARQL query is run periodically, both to check whether the query can be run at all (ping) and whether the categories lag is within a threshold (lag).

The ideal scenario, IMHO, would be if we already had this lag value as a Prometheus metric; creating the alert would then be straightforward. Alternatively, could we use a proxy metric for the same purpose? Another good outcome would be if the check turns out to be obsolete and can be removed altogether. Lastly, failing all of the above, we can turn the lag into a Prometheus metric by adapting the check and running it periodically.
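As a rough illustration of that last option, a minimal sketch of an adapted check, assuming prometheus_client and requests are available; the endpoint, query, metric name, port, and polling interval below are placeholders rather than the real values from check_categories.py:

import time

import requests
from prometheus_client import Gauge, start_http_server

# All of these are illustrative placeholders, not the real values.
SPARQL_ENDPOINT = "http://localhost:9990/bigdata/namespace/categories/sparql"
LAG_QUERY = "SELECT ?lastUpdated WHERE { ... }"  # stand-in for the query in check_categories.py

categories_lag = Gauge(
    "categories_update_lag_seconds",  # hypothetical metric name
    "Seconds since the categories namespace was last updated",
)

def fetch_lag_seconds():
    # Assumes the query returns an epoch timestamp; the real check may differ.
    resp = requests.get(
        SPARQL_ENDPOINT,
        params={"query": LAG_QUERY},
        headers={"Accept": "application/sparql-results+json"},
        timeout=30,
    )
    resp.raise_for_status()
    binding = resp.json()["results"]["bindings"][0]
    return time.time() - float(binding["lastUpdated"]["value"])

if __name__ == "__main__":
    start_http_server(9999)  # illustrative port; 9193-9195 are already in use
    while True:
        categories_lag.set(fetch_lag_seconds())
        time.sleep(300)  # polling interval is illustrative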

Event Timeline

We could perhaps adapt modules/query_service/files/monitor/prometheus-blazegraph-exporter.py to take care of running this query, possibly by re-using the same gauge blazegraph_lastupdated but adapting the query depending on the namespace it's running against.

Not sure that the check_categories.py --ping check is necessary; it could be dropped IMO, as it should already be covered by some other sensors.

> We could perhaps adapt modules/query_service/files/monitor/prometheus-blazegraph-exporter.py to take care of running this query, possibly by re-using the same gauge blazegraph_lastupdated but adapting the query depending on the namespace it's running against.

I like this idea, specifically generalizing blazegraph-exporter a little and making it configurable, such as "map this query to this metric" or something similar; definitely worth exploring.
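A rough sketch of that "map this query to this metric" idea; the namespace keys and queries are placeholders, not the exporter's actual configuration:

from prometheus_client import Gauge

# Placeholder queries; the real ones live in the exporter / check scripts.
LASTUPDATED_QUERIES = {
    "wdq": "SELECT ?t WHERE { ... }",         # wikidata-style lastUpdated query
    "categories": "SELECT ?t WHERE { ... }",  # categories-style lastUpdated query
}

def lastupdated_gauge_for(namespace):
    """Reuse the same gauge name, but pick the query per namespace."""
    gauge = Gauge("blazegraph_lastupdated", "Last update timestamp")
    return gauge, LASTUPDATED_QUERIES[namespace]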

> Not sure that the check_categories.py --ping check is necessary; it could be dropped IMO, as it should already be covered by some other sensors.

That is very good information to have, thank you! +1 to dropping the check altogether

bking changed the task status from Open to In Progress. Sep 17 2024, 6:31 PM
bking claimed this task.
bking triaged this task as Medium priority.
bking updated Other Assignee, added: RKemper.

Change #1073510 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] wdqs categories: remove redundant ping check

https://gerrit.wikimedia.org/r/1073510

Change #1073529 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] wdqs categories: ship lastUpdated metric

https://gerrit.wikimedia.org/r/1073529

Change #1073533 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/alerts@master] wdqs max lag: specify specific port

https://gerrit.wikimedia.org/r/1073533

@EBernhardson, @bking and I paired on implementing this. We took @dcausse's approach of using the namespace to decide which exact SPARQL query to run so that the lastUpdated value is extracted properly.

We've tested with REPLs and the like on a wdqs host, and this approach should work.

Once we merge the exporter change, we're going to have both the wikidata and categories metrics shipping with cluster=wdqs due to how our stack works, so we need to make sure downstream "users" of the metric (e.g. alerting, the wikidata maxlag backoff logic, etc.) properly specify the right instance, etc., in order to get just the wikidata-specific metrics.

I've uploaded a patch to the alerts repo accordingly. We still need to track down and change the maxlag backoff part though.
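For illustration, the kind of filtering a downstream consumer would need once both namespaces share the metric name, assuming the main blazegraph exporter is scraped on port 9193 and categories on 9194; the Prometheus URL is hypothetical:

import requests

PROMETHEUS_URL = "http://prometheus.example.org/ops/api/v1/query"  # hypothetical

# Narrow to the main (wikidata) blazegraph instances only, excluding categories,
# by matching the exporter port in the instance label.
PROMQL = 'time() - blazegraph_lastupdated{cluster="wdqs", instance=~".*:9193"}'

resp = requests.get(PROMETHEUS_URL, params={"query": PROMQL}, timeout=10)
resp.raise_for_status()
for result in resp.json()["data"]["result"]:
    print(result["metric"].get("instance"), result["value"][1])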

Change #1073510 merged by Ryan Kemper:

[operations/puppet@production] wdqs categories: remove redundant ping check

https://gerrit.wikimedia.org/r/1073510

Change #1073746 had a related patch set uploaded (by DCausse; author: DCausse):

[mediawiki/extensions/Wikidata.org@master] Better filtering of prom metrics for wdqs max lag

https://gerrit.wikimedia.org/r/1073746

Change #1073746 merged by jenkins-bot:

[mediawiki/extensions/Wikidata.org@master] Better filtering of prom metrics for wdqs max lag

https://gerrit.wikimedia.org/r/1073746

Change #1073533 merged by jenkins-bot:

[operations/alerts@master] wdqs max lag: target specific port

https://gerrit.wikimedia.org/r/1073533

Gehel removed bking as the assignee of this task. Nov 26 2024, 2:46 PM
Gehel updated Other Assignee, removed: RKemper.

Per today's DPE SRE standup, I've grabbed this ticket and will try to move it forward.

Unfortunately, I'm not finding any metrics for wdqs-categories in Grafana Explorer. The categories exporter runs on port 9194, but if I type out

blazegraph_lastupdated{instance=

I only see options for port 9193 (main blazegraph instance) or 9195 (wcqs). I can confirm the exporter is listening and producing metrics:

root@wdqs1011:~# curl -s http://0:9194/metrics | grep blaze
# HELP blazegraph_journal_commit_count_total
# TYPE blazegraph_journal_commit_count_total counter
blazegraph_journal_commit_count_total 1.112443e+06
# HELP blazegraph_journal_total_commit_seconds Total time spent in commit.
# TYPE blazegraph_journal_total_commit_seconds gauge
blazegraph_journal_total_commit_seconds 241.216368789
# HELP blazegraph_journal_flush_write_set_seconds
# TYPE blazegraph_journal_flush_write_set_seconds gauge
blazegraph_journal_flush_write_set_seconds 179.130189722

I ran a packet capture on wdqs1011 and can confirm that the Prometheus hosts are scraping the correct port.

Additionally, I checked /srv/prometheus/ops/targets/blazegraph_eqiad.yaml on prometheus1006, and it does look like the targets are configured correctly:

blazegraph_eqiad.yaml
15:  - wdqs1012:9194

The investigation continues...

@bking I can see the metrics in Grafana; see for example https://grafana.wikimedia.org/goto/LyGMnoIHR?orgId=1
Your exporter is not exporting the blazegraph_lastupdated value itself, though, just the HELP/TYPE comments for it:

$ curl -s http://0:9194/metrics | grep blazegraph_lastupdated
# HELP blazegraph_lastupdated Last update timestamp
# TYPE blazegraph_lastupdated gauge

Change #1073529 merged by Bking:

[operations/puppet@production] wdqs categories: ship lastUpdated metric

https://gerrit.wikimedia.org/r/1073529

Change #1105368 had a related patch set uploaded (by Bking; author: Bking):

[operations/alerts@master] team-data-platform: remove misconfigured alert

https://gerrit.wikimedia.org/r/1105368

Change #1105368 merged by Bking:

[operations/alerts@master] team-data-platform: remove misconfigured alert

https://gerrit.wikimedia.org/r/1105368

We're now shipping the metrics correctly (thanks volans and dcausse).

The next step is to create a Prometheus alert that makes sense for wdqs-categories, which doesn't use the wdqs streaming updater. Instead, categories is updated via daily and weekly timers. When things are working, the lag increases continually until the timer is run, then drops to zero.
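A minimal sketch of that threshold reasoning, assuming blazegraph_lastupdated carries an epoch timestamp for the categories namespace; the week-plus-slack values are placeholders, not what the final alert will use:

import time

WEEK = 7 * 24 * 3600
SLACK = 24 * 3600  # illustrative grace period on top of the weekly timer

def categories_lag_is_critical(last_updated_epoch, now=None):
    # Lag normally grows for up to a week between timer runs, so the alert
    # threshold has to sit above the longest normal update interval plus slack.
    now = time.time() if now is None else now
    return (now - last_updated_epoch) > WEEK + SLACK

# An update 6 days ago is healthy; one 9 days ago should fire.
assert not categories_lag_is_critical(time.time() - 6 * 24 * 3600)
assert categories_lag_is_critical(time.time() - 9 * 24 * 3600)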

Change #1105451 had a related patch set uploaded (by Bking; author: Bking):

[operations/alerts@master] team-search-platform: Add alert for wdqs-categories lag

https://gerrit.wikimedia.org/r/1105451