What?
During the WE5-WE6 offsite, @dancy, @brennen, @jeena, and I had an insightful discussion about what we examine after a train deployment or backport. I would like to try building a dashboard that serves as a useful reference point for train conductors and deployers alike. In other words, a single source of truth to answer the question: did that go smoothly?
Why?
At present, deployers, train conductors, and SREs lack a unified perspective on post-deployment health across production systems. This fragmentation results in several challenges:
- Misaligned understanding: Different teams judge deployment success from different angles
- Information silos: Critical information remains scattered across dashboards and monitoring systems (including logspam)
- Slower incident response: Teams may need to context-switch between dashboards/systems to obtain a comprehensive picture
By establishing a single point of reference, we enable deployers and train conductors to develop a shared understanding of, and visibility into, what has occurred in production, on par with how SREs monitor systems. This unified dashboard can incorporate links to subsystems, allowing teams to zoom in and out between high-level and detailed component-level health indicators.
Note: scap has been effective at detecting deployment issues; this dashboard would complement that by providing broader post-deployment visibility.
Requirements
Panels
TBA