The full graph WDQS hosts in eqiad appear to have suffered a cascading failure starting at about 19:00 UTC on 23 March 2025 (ref this graph; when a specific host's lag metrics disappear from the graph, that host has stopped working).
This failure scenario did not trigger any alerts until the entire service was lagging, at which point we got "ElevatedMaxLagWDQS: WDQS lag is above 10 minutes" alerts. Creating this ticket to:
- Decide on the best alert or alerts for this failure scenario
^^ We'll try a thread count alert
- Implement the alerts and verify operation
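
As a starting point for discussion, here is a rough Python sketch of the per-host check a thread count alert could encode. It is not the actual alert implementation: the names (HostSample, hosts_to_alert), the host names, and both thresholds are hypothetical placeholders, and the real values would need to come from the host metrics around the incident. The sketch also flags hosts whose lag metric has stopped reporting, since that was the visible symptom on the graph.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional


@dataclass
class HostSample:
    """One point-in-time observation for a single host (hypothetical shape)."""
    host: str
    thread_count: int
    # Seconds since the host last reported a lag metric; None if never seen.
    lag_report_age_s: Optional[float]


def hosts_to_alert(
    samples: Iterable[HostSample],
    max_threads: int = 1000,          # assumed threshold, not a measured value
    max_report_age_s: float = 300.0,  # assumed threshold, not a measured value
) -> Dict[str, List[str]]:
    """Return a mapping of host -> reasons it should alert, per host."""
    alerts: Dict[str, List[str]] = {}
    for sample in samples:
        reasons: List[str] = []
        if sample.thread_count > max_threads:
            reasons.append("thread_count")
        if sample.lag_report_age_s is None or sample.lag_report_age_s > max_report_age_s:
            reasons.append("lag_metric_missing")
        if reasons:
            alerts[sample.host] = reasons
    return alerts


if __name__ == "__main__":
    # Hypothetical hosts: one healthy, one with runaway threads, one silent.
    demo = [
        HostSample("host-a", 200, 15.0),
        HostSample("host-b", 4000, 20.0),
        HostSample("host-c", 150, None),
    ]
    for host, reasons in hosts_to_alert(demo).items():
        print(host, reasons)
```

The point of checking each host individually is that a single failing host would alert on its own, rather than only once the whole service is lagging.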