create a place (whiteboard) where SRE advertises current site status / things for awareness
Open, Needs Triage, Public

Description

This ticket is an outcome from the incident review meeting for incident 2024-09-28 cr2-eqsin down.

During this incident, one site was depooled due to a router failure while another site was already depooled because of a separate, unrelated router failure.

As part of the follow-up discussion, a need was identified for a central place (whiteboard) where SRE can quickly check the current site status.

It's supposed to be a place where the pool status for all sites is displayed and optionally other facts for general awareness can be advertised.
Unlike SAL, it should not be append-only but a single page that gets updated to current status.
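
To illustrate the difference from an append-only log, here is a minimal sketch of the "single page with current status" idea. Everything in it is an assumption for discussion only: the `Whiteboard` and `SiteStatus` names, the fields, and the placeholder site names `site-a` and `site-b` are hypothetical and do not correspond to any existing tool.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class SiteStatus:
    # Hypothetical fields; the real set of facts to show is still to be determined.
    pooled: bool
    note: str
    updated_at: datetime


@dataclass
class Whiteboard:
    # One entry per site. An update overwrites the previous entry,
    # so the board always reflects current state rather than history.
    sites: dict = field(default_factory=dict)

    def set_status(self, site, pooled, note=""):
        self.sites[site] = SiteStatus(pooled, note, datetime.now(timezone.utc))

    def depooled(self):
        # Sites that are currently depooled, in alphabetical order.
        return sorted(name for name, s in self.sites.items() if not s.pooled)

    def render(self):
        # Plain-text view: one line per site, with the optional note in parentheses.
        lines = []
        for name in sorted(self.sites):
            s = self.sites[name]
            state = "pooled" if s.pooled else "DEPOOLED"
            suffix = f" ({s.note})" if s.note else ""
            lines.append(f"{name}: {state}{suffix}")
        return "\n".join(lines)


wb = Whiteboard()
wb.set_status("site-a", pooled=False, note="router failure")
wb.set_status("site-b", pooled=True)
# A later update replaces the earlier entry for site-a instead of appending.
wb.set_status("site-a", pooled=False, note="router replaced, repool pending")
print(wb.render())
```

With a view like this, checking `wb.depooled()` before depooling another site would show at a glance whether a site is already out of rotation.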

Several people agreed we should create something like this.
All other details (internal vs. external, technology used, etc.) are still to be determined.

Event Timeline

Any SRE should feel free to edit the ticket description to clarify it or to add anything I missed. I wrote this follow-up from memory and my personal meeting notes.