Creating this as a follow up task from this incident .
Data Platform SRE and Search Platform should be alerted when the connection error rate from Mediawiki app servers to search clusters goes above a certain percentage. See this Envoy Telemetry dashboard for an example of what we could/should be alerting on.
Creating this ticket to:
- Create alerts
-
Confirm operationDecided against deliberately triggering, as the recent incident and the promtool rules evaluation should be enough to give us confidence in the alert.