Improve Search team alerting for missing masters
Closed, Resolved · Public · 2 Estimated Story Points

Assigned To

Authored By

	bking
	Jul 14 2022, 9:01 PM

Description

We lost an Elastic Psi master to hardware failure in T311939. The only notice we received was an email on Sunday at 08:53 UTC. During a subsequent reimage operation, we lost another master and the entire psi cluster in CODFW went offline for a few minutes. (Note that there was no discernible user impact.)

Creating this ticket so the Search team can decide on the proper urgency for failed masters and add the appropriate amount and type of alerting.
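For context on why losing a second master took the cluster offline, here is a minimal sketch of the arithmetic. It assumes the cluster had three master-eligible nodes (the count named in T313431) and that a simple majority of master-eligible nodes is needed to elect a master. The function name `has_quorum` is hypothetical and is not part of any existing tooling.

```python
def has_quorum(master_eligible: int, alive: int) -> bool:
    """Return True if the surviving master-eligible nodes form a majority.

    Assumption: a simple majority (more than half) of the configured
    master-eligible nodes is required to elect a master.
    """
    if master_eligible <= 0 or alive < 0 or alive > master_eligible:
        raise ValueError("invalid node counts")
    quorum = master_eligible // 2 + 1
    return alive >= quorum


# Three master-eligible nodes: one loss is survivable, a second is not.
print(has_quorum(3, 2))
print(has_quorum(3, 1))
# Five master-eligible nodes tolerate two simultaneous losses.
print(has_quorum(5, 3))
```

Under these assumptions, the first hardware failure left no remaining margin, which is one argument for alerting on a single missing master rather than waiting for the cluster to go red.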

Details

	Subject	Repo	Branch	Lines +/-
	wdqs: add bking as contact for wdqs alerts	operations/puppet	production	+1 -1


Related Objects

Mentioned In: T313431: Increase Elastic master-eligible nodes from 3 to 5
Mentioned Here: T311939: Degraded RAID on elastic2049

Event Timeline

bking created this task. Jul 14 2022, 9:01 PM

Restricted Application added a subscriber: Aklapper. Jul 14 2022, 9:01 PM

bking updated the task description. Jul 14 2022, 9:03 PM

bking updated the task description.

Aklapper added a project: Observability-Alerting. Jul 15 2022, 10:32 AM

Gehel edited projects, added Discovery-Search; removed Discovery-ARCHIVED. Jul 20 2022, 2:58 PM

Action item from today's meeting: reach out to Observability for recommendations on how to handle this.

bking mentioned this in T313431: Increase Elastic master-eligible nodes from 3 to 5. Jul 20 2022, 6:00 PM

Gehel triaged this task as High priority. Jul 25 2022, 3:32 PM

Gehel moved this task from needs triage to Ops / SRE on the Discovery-Search board.

Gehel moved this task from Ops / SRE to Current work on the Discovery-Search board.

Gehel edited projects, added Discovery-Search (Current work); removed Discovery-Search.

bking claimed this task. Jul 25 2022, 3:32 PM

Per @EBernhardson's comment at today's triage meeting, the secondary DC (usually CODFW) will start to receive Elastic read traffic. Had this been the case during the issue mentioned above, it would have had user impact, so this increases the priority of this ticket.

• MPhamWMF set the point value for this task to 2. Aug 1 2022, 3:48 PM

• MPhamWMF moved this task from Incoming to Ready for Dev -- SWE on the Discovery-Search (Current work) board.

bking renamed this task from Improve alerting for missing Elastic masters to Improve Search team alerting. Aug 18 2022, 4:57 PM

Change 824553 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] wdqs: add bking as contact for wdqs alerts

https://gerrit.wikimedia.org/r/824553

gerritbot added a project: Patch-For-Review. Aug 18 2022, 7:19 PM

lmata moved this task from Inbox to Radar on the Observability-Alerting board. Aug 22 2022, 3:12 PM

Change 824553 merged by Bking:

[operations/puppet@production] wdqs: add bking as contact for wdqs alerts

https://gerrit.wikimedia.org/r/824553

Maintenance_bot removed a project: Patch-For-Review. Sep 6 2022, 10:31 PM

Tying this in to a recent incident, as our monitoring did not catch a user-impacting autocomplete issue.

bking added projects: SRE-OnFire, Sustainability (Incident Followup). Sep 9 2022, 3:07 PM

Suggestion from @EBernhardson: "random guesses at what we need, the search reindex process looks at the old index and the new index, and if the counts are way off it declares failure. Here we build an index with 25k docs instead of 9.5M and the automation said 'that looks fine, promote it'" (that is, sanity checking during the reindex process).
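A minimal sketch of the kind of sanity check suggested above, comparing the document counts of the old and new index before promotion. The function name `counts_look_sane` and the 10% tolerance are assumptions chosen for illustration, not part of the existing reindex automation. How the two counts are obtained is left out.

```python
def counts_look_sane(old_count: int, new_count: int,
                     max_relative_diff: float = 0.10) -> bool:
    """Return True if new_count is within max_relative_diff of old_count.

    The 10% default is an illustrative assumption; a real threshold would
    need tuning per index.
    """
    if old_count < 0 or new_count < 0:
        raise ValueError("document counts must be non-negative")
    if old_count == 0:
        # Nothing to compare against, so the check cannot fail the reindex.
        return True
    relative_diff = abs(new_count - old_count) / old_count
    return relative_diff <= max_relative_diff


# The case from the incident: 25k docs built where 9.5M were expected.
if not counts_look_sane(9_500_000, 25_000):
    print("refusing to promote: document count is way off")
```

A relative threshold is used rather than an absolute one so that the same check applies to both small and large indices. Whether a symmetric tolerance is right (flagging large increases as well as drops) is a design choice left open here.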

bking renamed this task from Improve Search team alerting to Improve Search team alerting for missing masters. Sep 22 2022, 3:38 PM

bking updated Other Assignee, added: RKemper. Sep 26 2022, 4:35 PM

bking edited projects, added Discovery-Search; removed Discovery-Search (Current work).

bking edited projects, added Discovery-Search (Current work); removed Discovery-Search.

bking removed bking as the assignee of this task. Oct 12 2022, 1:42 PM

bking updated Other Assignee, removed: RKemper.

EBernhardson moved this task from Ready for Dev -- SWE to Ready for Dev -- SRE/Ops on the Discovery-Search (Current work) board. Oct 24 2022, 3:29 PM

• MPhamWMF moved this task from Ready for Dev -- SRE/Ops to Incoming on the Discovery-Search (Current work) board. Nov 14 2022, 10:32 PM

Gehel edited projects, added Discovery-Search; removed Discovery-Search (Current work). Nov 21 2022, 4:38 PM

Gehel moved this task from needs triage to Ops / SRE on the Discovery-Search board.

We do have an alert for missing masters.

lmata moved this task from Radar to Done on the Observability-Alerting board. Jan 16 2023, 5:58 PM
