
Proof of Concept: Train Health Dashboard
Open, LowPublic

Description

What?

During the WE5-WE6 offsite, there was an insightful discussion among @dancy, @brennen, @jeena, and me regarding what we examine following a train deployment or backport. I would like to attempt creating a dashboard that serves as a useful reference point for both train conductors and deployers alike. In other words, a single source of truth to answer the question: did that go smoothly?

Why?

At present, deployers, train conductors, and SREs lack a unified perspective on post-deployment health across production systems. This fragmentation results in several challenges:

  • Misaligned understanding: Different teams interpret deployment success through different lenses
  • Information silos: Critical information remains scattered across dashboards and monitoring systems (including logspam)
  • Slower incident response: Teams may need to context-switch between dashboards/systems to obtain a comprehensive picture

By establishing a single point of reference, we enable deployers and train conductors to develop a shared understanding and visibility of what has occurred in production, on par with how SREs monitor systems. This unified dashboard can incorporate links to subsystems, allowing teams to zoom in and out between high-level and detailed component-level health indicators.

Note: scap has been effective at detecting deployment issues; this dashboard would complement that by providing broader post-deployment visibility.

Requirements

Panels
TBA

Event Timeline

JMeybohm edited projects, added ServiceOps-Mediawiki; removed serviceops.
JMeybohm added a project: ServiceOps new.
JMeybohm moved this task from Inbox to Backlog on the ServiceOps new board.

Tyler and I talked about it earlier this week. I am always looking at the MediaWiki log buckets on https://grafana.wikimedia.org/d/000000102/mediawiki-production-logging , a dashboard I have tweaked a few more times.

  • the breakdown per severity shows spikes in non-error levels, which can be the sign of an actual issue that is otherwise filtered out in the OpenSearch dashboard we use.
  • the per-channel histogram and per-channel filtering let me easily pinpoint which log bucket is the source. That is dramatically faster there than in the OpenSearch dashboard.
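To make the per-severity spike idea concrete, here is a minimal sketch of the kind of check such a panel encodes: bucket log records by (severity, channel) and flag buckets whose count jumps well above a baseline. All records, channel names, and thresholds below are invented for illustration; this is not real MediaWiki log data or the actual dashboard query.

```python
from collections import Counter

def count_by_severity_and_channel(records):
    """Tally log records into (severity, channel) buckets."""
    return Counter((r["severity"], r["channel"]) for r in records)

def find_spikes(current, baseline, factor=3.0):
    """Flag buckets whose current count exceeds `factor` times the baseline.

    A spike in a non-error level (e.g. WARNING) can signal a real issue
    that an error-only view would filter out.
    """
    return sorted(
        bucket
        for bucket, count in current.items()
        if count > factor * baseline.get(bucket, 1)
    )

# Hypothetical post-deploy window: WARNINGs in one channel jump 8x,
# while the ERROR channel stays flat.
records = (
    [{"severity": "WARNING", "channel": "DBQuery"}] * 40
    + [{"severity": "ERROR", "channel": "exception"}] * 2
)
baseline = Counter({("WARNING", "DBQuery"): 5, ("ERROR", "exception"): 2})
spikes = find_spikes(count_by_severity_and_channel(records), baseline)
print(spikes)  # only the WARNING/DBQuery bucket is flagged
```

The per-channel grouping is what makes pinpointing fast: the flagged bucket names the log channel directly, instead of requiring a drill-down search.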