Elasticsearch: Alert on upstream errors for MW API
Closed, Resolved, Public

Description

Creating this as a follow-up task from this incident.

Data Platform SRE and Search Platform should be alerted when the connection error rate from MediaWiki app servers to the search clusters goes above a certain percentage. See this Envoy Telemetry dashboard for an example of what we could/should be alerting on.
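
To make the "error rate above a certain percentage" condition concrete, here is a minimal sketch of the check in Python. It is not the alert rule that was merged. The `UpstreamSample` shape, the function names, and the 5% default threshold are all hypothetical placeholders, and the real alert would be evaluated against Envoy telemetry rather than hand-built counts.

```python
from dataclasses import dataclass


@dataclass
class UpstreamSample:
    """Request counts for one evaluation window (hypothetical shape)."""
    total_requests: int
    failed_requests: int  # e.g. connection failures to the search cluster


def error_rate_percent(sample: UpstreamSample) -> float:
    """Return failed requests as a percentage of all requests in the window."""
    if sample.total_requests == 0:
        # No traffic in the window: report 0% instead of dividing by zero.
        return 0.0
    return 100.0 * sample.failed_requests / sample.total_requests


def should_alert(sample: UpstreamSample, threshold_percent: float = 5.0) -> bool:
    """True when the error rate is strictly above the threshold.

    The 5% default is a placeholder, not a value taken from this task.
    """
    return error_rate_percent(sample) > threshold_percent
```

A window with no traffic deliberately reports 0% here; whether "no requests at all" deserves its own alert is a separate decision.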

Creating this ticket to:

  • Create alerts
  • Confirm operation: decided against deliberately triggering the alert, as the recent incident and the promtool rules evaluation should be enough to give us confidence in it.

Event Timeline

Gehel triaged this task as High priority. Mon, Apr 29, 2:24 PM
Gehel moved this task from needs triage to Ops / SRE on the Discovery-Search board.
bking renamed this task from "Elasticsearch: Alert on downstream errors" to "Elasticsearch: Alert on upstream errors for MW API". Mon, Apr 29, 4:49 PM
bking claimed this task.
bking updated Other Assignee, added: RKemper.

Change #1025453 had a related patch set uploaded (by Bking; author: Bking):

[operations/alerts@master] search-platform: monitoring/alert on upstream MW API errors

https://gerrit.wikimedia.org/r/1025453

Change #1025453 merged by jenkins-bot:

[operations/alerts@master] search-platform: monitor/alert on elastic request failures

https://gerrit.wikimedia.org/r/1025453

Per the above CR, alerts are now in place. As such, I'm closing out this ticket.

Change #1026950 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] elastic: remove backend failure check

https://gerrit.wikimedia.org/r/1026950

Change #1026950 merged by Bking:

[operations/puppet@production] elastic: remove backend failure check

https://gerrit.wikimedia.org/r/1026950