
wdqs: alert on ratio of failed queries increase
Closed, Resolved, Public

Description

As a developer and operator of WDQS I would like to receive an alert at warning level when the ratio of failed queries increases compared to baseline.

While this metric is not entirely part of our SLIs (the SLO considers 403 and 419 acceptable responses), it would be useful to be aware of changes in traffic patterns to proactively monitor the services.

Note: the current implementation of "failed queries" in Grafana includes both 4xx and 5xx. These are different failure scenarios, so we should consider reporting and alerting on them separately. In particular, we are interested in tracking timeouts.

AC

  • we have a refined definition of "failed query", aligned with SLO.
  • we have defined a quantifiable baseline of expected "failed queries" ratio.
  • we have an alert (warning) for increased ratio of failed queries (and/or timeouts).
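
As a rough illustration of the last two criteria, here is a minimal sketch of comparing an observed failure ratio against a baseline. The function names, the `tolerance` multiplier, and the example numbers are assumptions made for illustration; they are not taken from the actual alert definition.

```python
def failed_ratio(failed: int, total: int) -> float:
    """Fraction of queries that failed; 0.0 when there was no traffic."""
    return failed / total if total > 0 else 0.0


def exceeds_baseline(failed: int, total: int,
                     baseline_ratio: float, tolerance: float = 2.0) -> bool:
    """True when the observed ratio is above `tolerance` times the baseline.

    `baseline_ratio` and `tolerance` are hypothetical parameters; real
    values would have to come from measured traffic.
    """
    return failed_ratio(failed, total) > baseline_ratio * tolerance


if __name__ == "__main__":
    # 5 failures out of 100 queries against an assumed 2% baseline.
    print(exceeds_baseline(failed=5, total=100, baseline_ratio=0.02))
```

Expressing the threshold as a multiple of the baseline keeps the check independent of absolute traffic volume; a fixed ratio threshold would work just as well if the baseline is stable.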

Event Timeline

gmodena renamed this task from wdqs: alert on ratio of failed queries increase to [NEEDS GROOMING] wdqs: alert on ratio of failed queries increase. Jan 12 2026, 9:38 AM

Note to self:

  • runbook update
  • mention that we cannot distinguish failures at the proxy level from failures at the Blazegraph level

trueg renamed this task from [NEEDS GROOMING] wdqs: alert on ratio of failed queries increase to wdqs: alert on ratio of failed queries increase. Jan 14 2026, 3:26 PM
trueg changed the task status from Open to In Progress.
trueg claimed this task.

Change #1227364 had a related patch set uploaded (by Trueg; author: Trueg):

[operations/alerts@master] blazegraph: alert on ratio of failed queries increase

https://gerrit.wikimedia.org/r/1227364

@gmodena I am not entirely sure what is meant by "aligned with SLO" when it comes to the "definition of "failed query"".

Change #1227364 merged by jenkins-bot:

[operations/alerts@master] blazegraph: alert on ratio of failed queries increase

https://gerrit.wikimedia.org/r/1227364

Follow-up from a chat we had earlier:

@gmodena I am not entirely sure what is meant by "aligned with SLO" when it comes to the "definition of "failed query"".

The SLO describes 403 and 419 as acceptable responses. To align, we decided not to alert on 403 and 419 (they don't count towards WDQS' error budget).
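
A minimal sketch of that definition, assuming per-status-code request counts are available as a plain mapping. The names `ACCEPTABLE_CODES` and `failure_ratios` are hypothetical, and keeping 403 and 419 responses in the denominator is a choice made for this sketch, not something stated in the task.

```python
# Status codes the SLO treats as acceptable responses (per the comment above).
ACCEPTABLE_CODES = {403, 419}


def failure_ratios(status_counts: dict[int, int]) -> dict[str, float]:
    """Return separate 4xx and 5xx failure ratios over all requests.

    403 and 419 are excluded from the 4xx failures, but still count as
    traffic in the denominator (an assumption of this sketch).
    """
    total = sum(status_counts.values())
    if total == 0:
        return {"4xx": 0.0, "5xx": 0.0}
    client_failures = sum(
        count for code, count in status_counts.items()
        if 400 <= code < 500 and code not in ACCEPTABLE_CODES
    )
    server_failures = sum(
        count for code, count in status_counts.items()
        if 500 <= code < 600
    )
    return {"4xx": client_failures / total, "5xx": server_failures / total}


if __name__ == "__main__":
    # Made-up counts, for illustration only.
    print(failure_ratios({200: 90, 403: 4, 419: 1, 400: 2, 500: 3}))
```

Reporting the two classes separately follows the note in the description that 4xx and 5xx are different failure scenarios.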

trueg reopened this task as In Progress.

Change #1236852 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/alerts@master] wdqs: detune BlazegraphFailedServerRatioIncrease

https://gerrit.wikimedia.org/r/1236852

Change #1236852 merged by jenkins-bot:

[operations/alerts@master] wdqs: detune BlazegraphFailedServerRatioIncrease

https://gerrit.wikimedia.org/r/1236852

RKemper subscribed.

We've tuned the alert slightly; a little over a third of the time we were seeing the alert fire for 1 to 14 minutes and then resolve. So we've bumped from 30m to 45m to make the alert more actionable on the SRE side. Let us know if there are any issues with the change; for the time being I've merged the patch.
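
To illustrate why a longer window filters out short-lived firings, here is a small sketch. It assumes the 30m and 45m values are the time the condition must hold continuously before the alert fires, which this comment does not state explicitly; `fires` and the sample data are hypothetical.

```python
def fires(breached_by_minute: list[bool], hold_minutes: int) -> bool:
    """True if the condition holds for `hold_minutes` consecutive samples.

    Each list element is one per-minute evaluation of the alert condition.
    This is a simplified model, not the alerting system's exact semantics.
    """
    streak = 0
    for breached in breached_by_minute:
        streak = streak + 1 if breached else 0
        if streak >= hold_minutes:
            return True
    return False


if __name__ == "__main__":
    # Condition breached for 40 minutes, then back to normal for 20.
    samples = [True] * 40 + [False] * 20
    print(fires(samples, hold_minutes=30))
    print(fires(samples, hold_minutes=45))
```

Under this model, a 40-minute breach would fire with a 30-minute hold and resolve about 10 minutes later, but would never fire with a 45-minute hold.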

Thanks @RKemper. The alert is still a bit spammy. Taking a look now.

Change #1239088 had a related patch set uploaded (by Gmodena; author: Gmodena):

[operations/alerts@master] wdp: blazegraph: increase alert threshold for 5xx

https://gerrit.wikimedia.org/r/1239088

Change #1239088 merged by jenkins-bot:

[operations/alerts@master] wdp: blazegraph: increase alert threshold for 5xx

https://gerrit.wikimedia.org/r/1239088