
Proof of Concept: SquareOne OnCall Dashboard
Open, In Progress, Medium, Public

Description

Create incident-focused SquareOne dashboards (at least one, at most three) that serve as the primary entry points for on-call responders during incidents.

Design

  • Infobox with useful on-call information
  • Traffic status (text+upload)
    • Precooked Turnilo URLs
  • Edge Caches status
  • MediaWiki status (user-facing, i.e. mw-web, mw-ext-int)
  • Kubernetes status
  • Database status

Motivation

Recently, the following alert fired:

FIRING: ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in eqiad #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=eqiad&var-cluster=upload&var-origin=swift.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh

While there were indeed errors on eqiad, the actual problem was excess traffic on eqsin and magru. Given that we have seen other similar alerts with the same root cause (https://logstash.wikimedia.org/goto/6f87337a5da6ea02c30a2499f7711a14), we should consider attaching a more diagnostic dashboard to these alerts, one that helps quickly identify the underlying traffic issue rather than just the symptom of elevated errors.
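As a rough illustration of the gap described above: the alert links to a single-site view (var-site=eqiad), while the cause was at other sites. The sketch below derives per-site variants of the alert's own Grafana link by swapping the var-site query parameter. The helper name per_site_links is hypothetical, and it is an assumption that this dashboard accepts eqsin and magru as var-site values; this is not a proposal for how the SquareOne dashboard itself should be built.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Dashboard link copied from the ATSBackendErrorsHigh alert quoted above.
ALERT_DASHBOARD_URL = (
    "https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview"
    "?orgId=1&viewPanel=12&var-site=eqiad&var-cluster=upload"
    "&var-origin=swift.discovery.wmnet"
)


def per_site_links(url, sites):
    """Hypothetical helper: return {site: url} with var-site replaced per site."""
    parts = urlsplit(url)
    query = parse_qsl(parts.query, keep_blank_values=True)
    links = {}
    for site in sites:
        # Keep every other parameter as-is; only the site selector changes.
        new_query = [(k, site if k == "var-site" else v) for k, v in query]
        links[site] = urlunsplit(parts._replace(query=urlencode(new_query)))
    return links


# Assumption: these site names are valid var-site values for this dashboard.
links = per_site_links(ALERT_DASHBOARD_URL, ["eqiad", "eqsin", "magru"])
for site, link in links.items():
    print(site, link)
```

A responder could open the three links side by side to compare the site named in the alert with the sites where traffic actually spiked.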

Drafts

Event Timeline

jijiki changed the task status from Open to In Progress. Jan 9 2026, 12:53 PM
jijiki triaged this task as Medium priority.
jijiki edited projects, added ServiceOps new; removed serviceops-deprecated.
jijiki renamed this task from "Create a dashboard to easily visualise upload/media issues" to "Create a dashboard to easily visualise capacity issues for OnCallers". Jan 12 2026, 5:15 PM
jijiki updated the task description.
jijiki renamed this task from "Create a dashboard to easily visualise capacity issues for OnCallers" to "Proof of Concept: SquareOne OnCall Dashboard". Jan 12 2026, 5:44 PM

I like the idea of having a full-stack view of traffic problems, especially on the media-serving side of things, which I'd focus on more at first. I would actually create two separate dashboards for text and media-serving.

I like a lot less the "square one" name, but I won't shave that yak :)

> I like the idea of having a full-stack view of traffic problems, especially on the media-serving side of things, which I'd focus on more at first. I would actually create two separate dashboards for text and media-serving.

I agree, it became quite evident as I was working on the OnCall one. I created some drafts last week, and opened T414665. I will add this to the description too for clarity. Thanks!

> I like a lot less the "square one" name, but I won't shave that yak :)

Naming things is hard. I was wondering too what better naming scheme would work for us, but decided to revisit it later.