WDQS: Making BlazegraphFailedServerRatioIncrease alerts less sensitive
Closed, Resolved (Public)

Description

I've noticed a lot of these alerts flapping in the #wikidata-data-platform-alerts IRC channel lately. Since they fire frequently and tend to clear within 10 minutes, I think we can make them less sensitive.

Creating this ticket to:

  • Make the alert less sensitive
  • Ensure the Wikidata platform team is comfortable with this change, and discuss further if not.

Event Timeline

bking renamed this task from WDQS: Detune BlazegraphFailedServerRatioIncrease alerts to WDQS: Making BlazegraphFailedServerRatioIncrease alerts less sensitive. Feb 3 2026, 6:30 PM
bking updated the task description.

Change #1236852 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/alerts@master] wdqs: detune BlazegraphFailedServerRatioIncrease

https://gerrit.wikimedia.org/r/1236852

Change #1236852 merged by jenkins-bot:

[operations/alerts@master] wdqs: detune BlazegraphFailedServerRatioIncrease

https://gerrit.wikimedia.org/r/1236852

RKemper claimed this task.
RKemper subscribed.

I'll update the Wikidata platform team on T414306 about this change. Given that it's a very simple change (30m -> 45m) and the original intent of the alert is preserved, I don't think we need to block this on formal approval, so I'll mark this resolved.
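
For illustration only, here is a rough sketch of why a longer window can stop short-lived breaches from alerting. It assumes the 30m -> 45m change refers to how long the condition must hold before the alert fires; the ticket does not say which parameter was changed, so treat that as a guess. The function name `firing_minutes` and the sample series are hypothetical, and the data is synthetic (one sample per minute), not taken from WDQS metrics.

```python
def firing_minutes(breached, hold):
    """Return the sample indices (minutes) at which the alert would fire.

    Simplified model: the alert fires once the condition has been true
    for at least `hold` consecutive one-minute samples, and clears as
    soon as the condition is false again.
    """
    firing = []
    streak = 0  # consecutive samples where the condition was true
    for minute, is_breached in enumerate(breached):
        streak = streak + 1 if is_breached else 0
        if streak >= hold:
            firing.append(minute)
    return firing


# Synthetic series: quiet for 20 minutes, breached for 40, quiet for 60.
series = [False] * 20 + [True] * 40 + [False] * 60

fired_30 = firing_minutes(series, 30)  # assumed old setting
fired_45 = firing_minutes(series, 45)  # assumed new setting

print(len(fired_30), len(fired_45))
```

In this toy model a 40-minute breach fires under the 30-minute setting and clears roughly ten minutes later, while the 45-minute setting never fires for it. A breach that lasts longer than 45 minutes would still alert under either setting.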

Change #1239088 had a related patch set uploaded (by Gmodena; author: Gmodena):

[operations/alerts@master] wdp: blazegraph: increase alert threshold for 5xx

https://gerrit.wikimedia.org/r/1239088

Change #1239088 merged by jenkins-bot:

[operations/alerts@master] wdp: blazegraph: increase alert threshold for 5xx

https://gerrit.wikimedia.org/r/1239088