Create incident-focused SquareOne dashboards (at least one, but no more than three) that serve as the primary entry points for on-call responders during incidents.
Design
- Infobox with useful on-call information
- Traffic status (text+upload)
- Precooked Turnilo URLs
- Edge Caches status
- MediaWiki status (user-facing, i.e. mw-web, mw-ext-int)
- Kubernetes status
- Database status
Motivation
Recently there was an alert for:
FIRING: ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in eqiad #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=eqiad&var-cluster=upload&var-origin=swift.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
While there were indeed errors in eqiad, the actual problem was excess traffic on eqsin and magru. We have seen other similar alerts with the same root cause (https://logstash.wikimedia.org/goto/6f87337a5da6ea02c30a2499f7711a14), so we should consider attaching a more diagnostic dashboard that helps quickly identify the underlying traffic issue rather than just the symptom of elevated errors.
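As a rough illustration of the gap, the dashboard link in the alert above is pinned to `var-site=eqiad`, while the excess traffic was on other sites. A minimal sketch of deriving per-site links from that same URL follows. It assumes the dashboard's `var-site` variable accepts the other site names (eqsin and magru here), which has not been verified; the helper names `with_site` and `per_site_links` are hypothetical and not part of any existing tooling.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Dashboard URL copied verbatim from the alert quoted above.
ALERT_DASHBOARD_URL = (
    "https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview"
    "?orgId=1&viewPanel=12&var-site=eqiad&var-cluster=upload"
    "&var-origin=swift.discovery.wmnet"
)


def with_site(url: str, site: str) -> str:
    """Return `url` with its `var-site` query parameter set to `site`.

    Assumption: the dashboard exposes a `var-site` template variable that
    accepts the given site name.
    """
    parts = urlsplit(url)
    params = [
        (key, site if key == "var-site" else value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
    ]
    return urlunsplit(parts._replace(query=urlencode(params)))


def per_site_links(url: str, sites: list[str]) -> dict[str, str]:
    """Build one dashboard link per site (hypothetical helper)."""
    return {site: with_site(url, site) for site in sites}


if __name__ == "__main__":
    # Site list taken from the incident described above.
    for site, link in per_site_links(
        ALERT_DASHBOARD_URL, ["eqiad", "eqsin", "magru"]
    ).items():
        print(f"{site}: {link}")
```

Links like these could sit in the infobox or alongside the traffic status panels, so a responder can check the other sites without editing the URL by hand.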
Drafts
- https://grafana-rw.wikimedia.org/d/abb02966-5ee7-48dc-8d81-2163492ad3d7/oncall-square-one
- Text/Upload are part of T414665