
Evaluate options to soften wdqs paging
Closed, Resolved (Public)

Description

In the incident documented at https://wikitech.wikimedia.org/wiki/Incidents/2022-12-12_wdqs_codfw_brief_outage, we had a brief codfw outage that ended up self-healing. However, our automated monitoring emitted a page before that self-healing could take place.

Given the Search team's recent efforts to formalize our WDQS uptime SLO, our monitoring should wait at least half an hour (potentially longer) before paging.
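
To make the "wait before paging" idea concrete, here is a minimal sketch in Python. It is not how our monitoring is actually implemented; the class name `PagingDelay`, its fields, and the 30-minute default are all assumptions chosen for illustration. It only shows the intended behaviour: a page is sent once the service has been failing continuously for the hold period, and any recovery resets the clock.

```python
from dataclasses import dataclass
from typing import Optional

# Assumption: "at least half an hour" from the task description.
PAGE_AFTER_SECONDS = 30 * 60


@dataclass
class PagingDelay:
    """Hypothetical helper: page only after sustained failure."""

    hold_seconds: float = PAGE_AFTER_SECONDS
    failing_since: Optional[float] = None  # timestamp of first failed check

    def observe(self, healthy: bool, now: float) -> bool:
        """Record one health observation; return True if a page is due."""
        if healthy:
            # Any recovery (e.g. self-healing) resets the timer.
            self.failing_since = None
            return False
        if self.failing_since is None:
            self.failing_since = now
        return now - self.failing_since >= self.hold_seconds
```

Under this sketch, an outage that self-heals within the hold period (as in the 2022-12-12 incident) would never produce a page, while one that persists past it still would.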

There's a technical limitation, however: we have generic pybal pages that fire when insufficient hosts are alive (as seen by pybal's configured health checks), based on the service's pybal depool threshold. We should see whether we can disable paging from these generic alerts for specific services (WDQS in this case). This will likely require some changes to the associated puppet code, but we'll have to talk to o11y (Observability) to understand more and see whether there's a reasonable and feasible way of relaxing pybal paging on a service-specific basis.
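
As a rough illustration of what a service-specific opt-out could look like, here is a minimal Python sketch. It is not the actual pybal or puppet logic: the `Service` class, the `page` flag, the `"ticket"` severity name, and the exact threshold arithmetic are all hypothetical, and the real generic alert may evaluate the depool threshold differently. The sketch assumes the threshold is the fraction of hosts that must stay alive.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Service:
    """Hypothetical per-service alert configuration."""

    name: str
    total_hosts: int
    depool_threshold: float  # assumed: fraction of hosts that must stay alive
    page: bool = True        # hypothetical flag: False downgrades the alert


def alert_severity(service: Service, alive_hosts: int) -> str:
    """Return "ok", "page", or "ticket" for the generic alert."""
    required = math.ceil(service.total_hosts * service.depool_threshold)
    if alive_hosts >= required:
        return "ok"
    # Same generic check for every service; only the notification differs.
    return "page" if service.page else "ticket"
```

For example, `alert_severity(Service("wdqs-ssl", 4, 0.5, page=False), 1)` would return `"ticket"` instead of `"page"`, while services that keep the default flag continue to page on the same condition.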

AC:

  • We know if it is possible to tune alerts around WDQS
    • Disable paging on the probe alert "Service wdqs-ssl:443 has failed probes"
  • We have a decision on how to move forward that is validated by Observability
  • Implementation is not part of this ticket

Event Timeline

Gehel removed the point value for this task.

Change 889662 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] wdqs: no longer page on failed probe

https://gerrit.wikimedia.org/r/889662

Change 889662 merged by Ryan Kemper:

[operations/puppet@production] wdqs: no longer page on failed probe

https://gerrit.wikimedia.org/r/889662

Change 889852 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] wdqs: don't page for wdqs-heavy or wdqs-ssl

https://gerrit.wikimedia.org/r/889852

Change 889852 merged by Ryan Kemper:

[operations/puppet@production] wdqs: don't page for wdqs-heavy or wdqs-ssl

https://gerrit.wikimedia.org/r/889852