SquareOne Dashboards
SquareOne dashboards are incident-focused monitoring interfaces designed for anyone, not just experts. They prioritise clarity, navigation, and accessibility for incident responders and for anyone else who needs them. Rather than presenting comprehensive technical data, these dashboards serve as guided entry points that help engineers quickly assess a system's health, understand the impact of a problem, and navigate logically towards root causes and resolution.
What? vs Why? Dashboards
Most of our dashboards are "What?" dashboards: they show a spike in error rates, degraded latency, or dropped throughput. We can tell that something is wrong, but not why. Our service-specific dashboards have three problems:
- They're not approachable or easily understood by non-experts
- They're not easily discoverable
- Most critically, they're missing context
A "Why?" dashboard flips this. By surfacing underlying conditions and root causes rather than just symptoms, and by building with accessibility in mind, we enable "black box" troubleshooting for anyone, regardless of service familiarity.
Why Now?
We have a 24x7 on-call rotation. During incidents, response capacity may be limited not by a lack of skilled people, but by knowledge silos. When on-call responders encounter unfamiliar systems, the mental load is high and the path forward is sometimes unclear. It is not uncommon for the root cause or other important information to be hidden in dashboards that most of the team were not aware of. This fragmentation costs us time and creates unnecessary pressure on domain experts to be present during every incident.
Key Principles
- Make system health and impact obvious at a glance
- Design for engineers with little familiarity with the system or service
- Help engineers answer "What's next?" (zooming into subsystems, runbooks, documentation, etc.)
- Link other dashboards logically so engineers can zoom in and out of individual systems
- Provide quick access to other monitoring systems (e.g. Logstash, Turnilo) as well as relevant Wikitech pages
- Include text boxes for helpful context (including temporary service announcements)
Design Approach
- Keep the dashboard layout clean
- Create library panels for consistency across SquareOne dashboards
- Use the same variables across dashboards for easy cross-dashboard navigation
- Create text panels with links related to individual components (e.g. "Network")
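The approach above can be sketched as a small generator for a dashboard definition. This is only an illustration and rests on assumptions: it supposes the dashboards are built in Grafana (the text mentions library panels and text panels but does not name the tool) and uses Grafana's dashboard JSON model. The variable names (site, cluster), the "squareone" tag, the library panel UID, and the example.org URL are all hypothetical placeholders.

```python
import json

# Hypothetical shared variables: placeholder names and values, not taken from real dashboards.
SHARED_VARIABLES = {"site": "a,b", "cluster": "x,y"}


def shared_templating(variables):
    """Declare the same variables on every dashboard so links can carry them over."""
    return {
        "list": [
            {"name": name, "label": name.capitalize(), "type": "custom", "query": values}
            for name, values in variables.items()
        ]
    }


def text_panel(panel_id, title, links):
    """Build a Markdown text panel listing links for one component (e.g. "Network")."""
    content = "\n".join(f"- [{label}]({url})" for label, url in links)
    return {
        "id": panel_id,
        "type": "text",
        "title": title,
        "gridPos": {"h": 4, "w": 6, "x": 0, "y": 0},
        "options": {"mode": "markdown", "content": content},
    }


def library_panel_ref(panel_id, uid, name):
    """Reference a shared library panel instead of copying its definition."""
    return {
        "id": panel_id,
        "gridPos": {"h": 4, "w": 18, "x": 6, "y": 0},
        "libraryPanel": {"uid": uid, "name": name},
    }


def squareone_dashboard(title, panels):
    """Assemble a dashboard with shared variables and tag-based navigation links."""
    return {
        "title": title,
        "tags": ["squareone"],  # hypothetical tag used to find sibling dashboards
        "templating": shared_templating(SHARED_VARIABLES),
        "links": [
            {
                "type": "dashboards",
                "title": "SquareOne",
                "tags": ["squareone"],
                "asDropdown": True,
                "includeVars": True,  # keep current variable values when following a link
                "keepTime": True,     # keep the selected time range as well
            }
        ],
        "panels": panels,
    }


dashboard = squareone_dashboard(
    "Network SquareOne",
    [
        text_panel(1, "Network", [("Runbook", "https://example.org/runbook")]),
        library_panel_ref(2, "squareone-health", "System health"),
    ],
)
print(json.dumps(dashboard, indent=2))
```

Declaring the same variables everywhere is what makes the tag-based link useful: with includeVars and keepTime set, an engineer who moves to a linked dashboard keeps the same selection and time window.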
Roadmap (provisional)
- OnCall SquareOne T414085
- MediaWiki SquareOne dashboards
- Kubernetes SquareOne dashboards
- Enable SRE teams to create "SquareOne" dashboards for the critical systems under their care, so that this work is distributed
Benefits
- Faster incident resolution: Responders can assess problems themselves rather than waiting for expert input
- Broader response capacity: More SREs can contribute effectively during incidents, reducing pressure on domain experts
- Better on-call experience: Lower friction and clearer paths reduce on-call stress
- Reduced knowledge silos: Context becomes discoverable, not dependent on individual engineers
- Scalability: These paths could also enable developers to troubleshoot, extending incident response beyond SREs