
Improve Search team alerting for missing masters
Open, High | Public | 2 Estimated Story Points

Description

We lost an Elastic Psi master to hardware failure in T311939. The only notice we received was an email on Sunday at 08:53 UTC. During a subsequent reimage operation, we lost another master and the entire Psi cluster in CODFW went offline for a few minutes. (Note that there was no discernible user impact.)

Creating this ticket so the Search team can decide on the proper urgency for failed masters and add the appropriate amount and type of alerting.
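As a starting point for that discussion, here is a minimal sketch of how a missing-master check could grade severity. It is not existing Search team tooling: the host names, the function names, and the assumption that losing a strict majority of master-eligible nodes is what takes the cluster offline are all illustrative. The list of nodes currently present would come from the cluster itself (for example from the Elasticsearch cat nodes API); here it is passed in as plain data.

```python
from typing import Iterable, List


def missing_masters(expected: Iterable[str], present: Iterable[str]) -> List[str]:
    """Return the expected master-eligible hosts that are absent from the cluster."""
    present_set = set(present)
    return sorted(host for host in expected if host not in present_set)


def alert_level(missing_count: int, expected_count: int) -> str:
    """Map the number of missing masters to a severity string.

    Assumption: the cluster needs a strict majority of its expected
    master-eligible nodes to elect a master.
    """
    quorum = expected_count // 2 + 1
    remaining = expected_count - missing_count
    if remaining < quorum:
        return "critical"  # quorum lost, cluster cannot elect a master
    if missing_count > 0:
        return "warning"   # still has quorum, but one more loss may not
    return "ok"


# Hypothetical host names, for illustration only.
expected = ["master-a", "master-b", "master-c"]
gone = missing_masters(expected, ["master-a", "master-c", "data-1"])
print(gone, alert_level(len(gone), len(expected)))
```

Under these assumptions, the first master lost to hardware failure would have raised a warning, and losing the second during the reimage would have raised a critical alert rather than only an email.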

Event Timeline

bking updated the task description.

Action item from today's meeting: reach out to Observability for recommendations on how to handle this.

Gehel triaged this task as High priority. Jul 25 2022, 3:32 PM
Gehel moved this task from needs triage to Ops / SRE on the Discovery-Search board.
Gehel moved this task from Ops / SRE to Current work on the Discovery-Search board.
Gehel edited projects, added Discovery-Search (Current work); removed Discovery-Search.

Per @EBernhardson's comment at today's triage meeting, the secondary DC (usually CODFW) will start to receive Elastic read traffic. Had this been the case during the issue described above, the outage would have had user impact, which raises the priority of this ticket.

MPhamWMF set the point value for this task to 2. Aug 1 2022, 3:48 PM
bking renamed this task from Improve alerting for missing Elastic masters to Improve Search team alerting. Aug 18 2022, 4:57 PM

Change 824553 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] wdqs: add bking as contact for wdqs alerts

https://gerrit.wikimedia.org/r/824553

Change 824553 merged by Bking:

[operations/puppet@production] wdqs: add bking as contact for wdqs alerts

https://gerrit.wikimedia.org/r/824553

Tying this in to a recent incident, as our monitoring did not catch a user-impacting autocomplete issue.

Suggestion from @EBernhardson: "random guesses at what we need, the search reindex process looks at the old index and the new index, and if the counts are way off it declares failure. Here we build an index with 25k docs instead of 9.5M and the automation said 'that looks fine, promote it'" (Sanity checking during reindex process).
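The count comparison described in that suggestion can be sketched as follows. This is an illustration, not the actual reindex code: the function name and the 10% tolerance are assumptions, and the document counts are passed in as plain integers rather than fetched from Elasticsearch.

```python
def reindex_looks_sane(old_count: int, new_count: int, tolerance: float = 0.10) -> bool:
    """Return True if the new index's doc count is within `tolerance` of the old one.

    The 10% default is a hypothetical threshold; a real check would need a
    value tuned to how much an index normally changes between rebuilds.
    """
    if old_count < 0 or new_count < 0:
        raise ValueError("document counts must be non-negative")
    # Guard against division by zero when the old index was empty.
    relative_diff = abs(new_count - old_count) / max(old_count, 1)
    return relative_diff <= tolerance


# Counts from the incident: roughly 25k docs built instead of 9.5M.
if not reindex_looks_sane(9_500_000, 25_000):
    print("refusing to promote: doc counts are way off")
```

With the counts from the incident, this check refuses promotion, while an ordinary rebuild whose count drifts by a percent or two would pass.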

bking renamed this task from Improve Search team alerting to Improve Search team alerting for missing masters. Sep 22 2022, 3:38 PM
bking removed bking as the assignee of this task. Oct 12 2022, 1:42 PM
bking updated Other Assignee, removed: RKemper.