WDQS: Alert on high thread count
Open, In Progress, Medium, Public

Description

The full graph WDQS hosts in eqiad appear to have suffered a cascading failure starting at about 19:00 UTC on 23 March 2025 (see this graph; when a specific host's lag metrics disappear from the graph, that host has stopped working).

This particular failure scenario did not trigger any alerts until the entire service was lagging, at which point we got "ElevatedMaxLagWDQS: WDQS lag is above 10 minutes" alerts. I'm creating this ticket to:

  • Decide on the best alert or alerts for this failure scenario

^^ We'll try a thread count alert

  • Implement the alerts and verify operation

Event Timeline

bking triaged this task as Medium priority.
bking renamed this task from WDQS: Alert on high thread count to WDQS: Alert on high thread count or no lag metrics reported. Mar 24 2025, 7:06 PM
bking updated the task description.
bking updated the task description.
bking added a subscriber: dcausse.

Change #1130730 had a related patch set uploaded (by Bking; author: Bking):

[operations/alerts@master] WIP: wdqs: Add alerts for no lag metrics reported

https://gerrit.wikimedia.org/r/1130730

bking changed the task status from Open to In Progress. Mar 25 2025, 1:15 PM
bking claimed this task.

Some notes from our discussion at today's Search Platform standup:

  • We got ProbeDown alerts for these hosts on Sunday night; SRE needs to be more responsive on Monday morning.
  • We could have an alert on thread count.
  • We've also talked about automatically depooling lagged hosts (see T270614). That ticket is already on the Data Platform SRE workboard.
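
For illustration only, here is a minimal sketch of the thread count idea from the notes above: fire when the thread count stays above a threshold for several consecutive samples, so a brief spike does not page anyone. The function name, the threshold, and the sample window are all hypothetical; nothing here reflects an actual rule in operations/alerts.

```python
from typing import Sequence


def thread_alert_fires(samples: Sequence[int], threshold: int, min_consecutive: int) -> bool:
    """Return True if the latest `min_consecutive` samples all exceed `threshold`.

    `samples` is a per-host series of thread counts, oldest first.
    All parameter values are assumptions chosen by the caller.
    """
    # Not enough data to satisfy the sustained-duration requirement.
    if min_consecutive <= 0 or len(samples) < min_consecutive:
        return False
    # Only the most recent window matters; older spikes are ignored.
    return all(count > threshold for count in samples[-min_consecutive:])
```

Requiring a sustained breach plays the same role as a hold duration on an alerting rule: it trades a little detection latency for fewer false positives.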
bking added a subscriber: RKemper.

@RKemper Per 1x1 with @Gehel today, I haven't done a good job of moving this forward. We were wondering if we could delegate this one to you? I'm putting it in your name, but feel free to ping back if that doesn't work for you.

Change #1130730 abandoned by Bking:

[operations/alerts@master] WIP: wdqs: Add alerts for no lag metrics reported

Reason:

won't work and probably not needed

https://gerrit.wikimedia.org/r/1130730

bking renamed this task from WDQS: Alert on high thread count or no lag metrics reported to WDQS: Alert on high thread count. Sep 10 2025, 5:42 PM

Change #1198161 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/alerts@master] (wip) wdqs: detect blazegraph deadlock

https://gerrit.wikimedia.org/r/1198161

Uploaded a proposed solution that alerts if the triples count metric is missing. A Blazegraph deadlock always causes the triples count metric to go missing.

I'm not sure whether the current proposed solution avoids firing spuriously when a previously existing host is decommissioned. We might need to make the alert more complex to account for this possibility.
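
To make the decommissioning concern concrete, here is a rough sketch of one possible approach, written as plain Python rather than as an alert rule. It flags a host only if its triples count metric was seen within a lookback window but has since gone stale, and it lets callers exclude hosts known to be decommissioned. The function name, parameters, and host names are hypothetical and do not come from the patch.

```python
from typing import AbstractSet, Dict, List


def hosts_missing_metric(
    last_seen: Dict[str, float],
    now: float,
    stale_after: float,
    lookback: float,
    decommissioned: AbstractSet[str] = frozenset(),
) -> List[str]:
    """Return hosts whose metric stopped reporting recently.

    `last_seen` maps host name to the timestamp (seconds) of its latest
    triples count sample. A host is flagged when its latest sample is
    older than `stale_after` but still within `lookback`, and the host
    is not listed in `decommissioned`.
    """
    flagged = []
    for host, timestamp in last_seen.items():
        age = now - timestamp
        if host in decommissioned:
            continue  # explicitly retired hosts never alert
        if stale_after < age <= lookback:
            flagged.append(host)  # was reporting recently, now silent
    return sorted(flagged)
```

With the lookback bound, a decommissioned host that nobody listed explicitly still ages out once its last sample is older than `lookback`, so the alert clears on its own instead of firing indefinitely. The cost is that a host deadlocked for longer than `lookback` would also stop alerting, so the window needs to be long enough for someone to respond.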