
Expose servers production status
Open, Low, Public

Description

In T320696: Reduce the count of Netbox devices with incorrect status, and in the email thread to "sre-at-large", it was decided (for great reasons!) to track only the physical server's status in Netbox and not the status of its services (e.g. is the server seeing live traffic?).

However, having a standardized way to know whether a server is actually live (as in, taking it down would impact production) is valuable for maintenance purposes.

A suggestion raised during the I/F meeting is to expose it as a Prometheus metric. The main benefit is that it can be real time, compared to, for example, PuppetDB, which is updated every 30 minutes.

As one can't implement it for every possible service/server, it needs to be a documented, standardized framework or mechanism. That would let any service owner implement it based on the conditions of their choice (e.g. a Puppet variable, a running daemon, etcd status).
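
To make the idea concrete for discussion, here is a minimal sketch of what such a mechanism could look like. Everything in it is an assumption and not an agreed design: the metric name `server_production_status`, the `service` label, and the `register_check` / `render_metrics` helpers are all hypothetical. Each service owner registers a check of their choice, and the output is a gauge in the Prometheus text exposition format.

```python
from typing import Callable, Dict

# Hypothetical metric name; nothing in this task defines one yet.
METRIC_NAME = "server_production_status"

# Registry mapping a service name to a zero-argument check that
# returns True when the server is live for that service.
_CHECKS: Dict[str, Callable[[], bool]] = {}


def register_check(service: str, check: Callable[[], bool]) -> None:
    """Register the condition a service owner wants to use."""
    _CHECKS[service] = check


def render_metrics() -> str:
    """Render all registered checks as a Prometheus gauge (text format).

    Assumes service names contain no quotes, backslashes or newlines,
    so no label escaping is done here.
    """
    lines = [
        f"# HELP {METRIC_NAME} 1 if taking this server down would impact production, else 0.",
        f"# TYPE {METRIC_NAME} gauge",
    ]
    for service in sorted(_CHECKS):
        try:
            live = bool(_CHECKS[service]())
        except Exception:
            # Assumed policy: a failing check is reported as live (1),
            # which is the cautious answer for maintenance purposes.
            live = True
        lines.append(f'{METRIC_NAME}{{service="{service}"}} {int(live)}')
    return "\n".join(lines) + "\n"
```

A service owner would call something like `register_check("my_service", my_condition)`, where `my_condition` could look at a Puppet-managed file, a running daemon, or etcd state. How the rendered text reaches Prometheus (for example a file picked up by node_exporter's textfile collector, or a small HTTP endpoint) is left open here.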

Opening this task to gather feedback and ideas.

Event Timeline

ayounsi created this task.