Maniphest T325324

Evaluate options to soften wdqs paging
Closed, Resolved · Public

Assigned To

Authored By

	RKemper
	Dec 15 2022, 7:49 PM

Description

During the incident documented at https://wikitech.wikimedia.org/wiki/Incidents/2022-12-12_wdqs_codfw_brief_outage, WDQS had a brief outage in codfw that ended up self-healing. However, our automated monitoring emitted a page before the self-healing could take place.

Given the Search team's recent efforts to formalize our WDQS uptime SLO, our monitoring should wait at least half an hour (potentially longer) before paging.

There's a technical limitation, however: we have generic pybal pages that fire when insufficient hosts are alive (as seen by pybal's configured health checks), based on the service's pybal depool threshold. We should see whether we can disable paging from these generic alerts for specific services (WDQS in this case). This will likely require some changes to the associated puppet code, but we'll have to talk to o11y to understand more and see whether there's a reasonable and feasible way of relaxing pybal paging on a service-specific basis.
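
To make the desired behaviour concrete, here is a minimal Python sketch of the decision being asked for: a failing service only pages if paging is enabled for that service and the failure has outlasted a grace period. This is not how pybal, puppet, or the existing alerting is implemented; `ServiceAlertPolicy`, `alert_action`, and the field names are hypothetical, and the 30-minute value is simply the "half an hour or so" suggested above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class ServiceAlertPolicy:
    """Hypothetical per-service alerting policy (illustration only)."""
    page_enabled: bool = True                 # False: never page for this service
    grace_period: timedelta = timedelta(0)    # how long a failure may last before paging


def alert_action(policy: ServiceAlertPolicy,
                 failing_since: Optional[datetime],
                 now: datetime) -> str:
    """Return "ok", "notify" (non-paging alert), or "page"."""
    if failing_since is None:
        # The service is currently healthy.
        return "ok"
    if not policy.page_enabled:
        # Paging is disabled for this service; keep a non-paging alert.
        return "notify"
    if now - failing_since < policy.grace_period:
        # Still inside the grace period, so give the service time to self-heal.
        return "notify"
    return "page"


# Assumed example policy: wait 30 minutes before paging.
wdqs_policy = ServiceAlertPolicy(page_enabled=True,
                                 grace_period=timedelta(minutes=30))
```

With a policy like `wdqs_policy`, a brief outage that recovers within the grace period would produce a non-paging alert at most, which matches the behaviour wanted for the 2022-12-12 incident.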

AC:

- We know whether it is possible to tune alerts around WDQS
  - Disable paging upon probe: Service wdqs-ssl:443 has failed probes
- We have a decision on how to move forward that is validated by Observability
- Implementation is not part of this ticket

Details

	Subject	Repo	Branch	Lines +/-
	wdqs: don't page for wdqs-heavy or wdqs-ssl	operations/puppet	production	+2 -0
	wdqs: no longer page on failed probe	operations/puppet	production	+1 -0


Related Objects

		Status	Subtype	Assigned	Task
		Resolved		RKemper	T313751 Create WDQS uptime SLO
		Resolved		RKemper	T325324 Evaluate options to soften wdqs paging
		Open		None	T303134 Should wdqs LVS checks page

Event Timeline

RKemper created this task. · Dec 15 2022, 7:49 PM

Restricted Application added a subscriber: Aklapper. · Dec 15 2022, 7:49 PM

RKemper added projects: SRE-OnFire, Sustainability (Incident Followup). · Dec 15 2022, 7:51 PM

MPhamWMF moved this task from needs triage to Current work on the Discovery-Search board. · Dec 19 2022, 4:40 PM

MPhamWMF edited projects, added Discovery-Search (Current work); removed Discovery-Search.

MPhamWMF set the point value for this task to 3. · Jan 30 2023, 4:56 PM

MPhamWMF moved this task from Incoming to Ready for Dev -- SRE/Ops on the Discovery-Search (Current work) board.

Gehel updated the task description. · Jan 30 2023, 4:56 PM

Gehel removed the point value for this task.

Change 889662 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] wdqs: no longer page on failed probe

https://gerrit.wikimedia.org/r/889662

gerritbot added a project: Patch-For-Review. · Feb 16 2023, 7:33 AM

RKemper updated the task description. · Feb 16 2023, 7:35 AM

Gehel moved this task from Ready for Dev -- SRE/Ops to In Progress on the Discovery-Search (Current work) board. · Feb 16 2023, 7:32 PM

Change 889662 merged by Ryan Kemper:

[operations/puppet@production] wdqs: no longer page on failed probe

https://gerrit.wikimedia.org/r/889662

Change 889852 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] wdqs: don't page for wdqs-heavy or wdqs-ssl

https://gerrit.wikimedia.org/r/889852

Change 889852 merged by Ryan Kemper:

[operations/puppet@production] wdqs: don't page for wdqs-heavy or wdqs-ssl

https://gerrit.wikimedia.org/r/889852

Gehel assigned this task to RKemper. · Feb 16 2023, 7:49 PM

Maintenance_bot removed a project: Patch-For-Review. · Feb 16 2023, 8:10 PM

Gehel moved this task from In Progress to Needs Reporting on the Discovery-Search (Current work) board. · Feb 27 2023, 4:19 PM

RKemper updated the task description. · Feb 28 2023, 8:08 PM

RKemper added a parent task: T313751: Create WDQS uptime SLO.

RKemper added a subtask: T303134: Should wdqs LVS checks page.

RKemper updated the task description. · Feb 28 2023, 8:10 PM

Gehel closed this task as Resolved. · Mar 10 2023, 2:09 PM

RKemper updated the task description. · Apr 17 2023, 6:40 PM