We currently have no reliable monitoring or alerting that tells us when a service/isvc is completely down, as happened in T362503.
Proper SLO dashboards and alerting depend on tasks like T351390 making progress, so we need something to use in the meantime.
Overall steps:
- Decide what metric(s) to alert on, the simpler the better.
- Add generic monitors that check all isvcs.
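
As a rough illustration of the two steps above, here is a minimal Python sketch of one generic check applied to every isvc. It assumes, purely for illustration, that per-isvc request counters for a recent window are already available. The names `IsvcSample`, `find_down_isvcs` and `min_requests`, and the choice of "requests arrived but none succeeded" as the signal, are hypothetical and not something decided in this task.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class IsvcSample:
    """Hypothetical per-isvc counters over one evaluation window."""
    total_requests: int
    successful_requests: int


def find_down_isvcs(samples: Dict[str, IsvcSample], min_requests: int = 1) -> List[str]:
    """Return the names of isvcs that look completely down.

    An isvc is flagged when it received at least `min_requests` requests
    in the window and none of them succeeded. Names are returned sorted
    so the output is stable.
    """
    down = []
    for name, sample in sorted(samples.items()):
        if sample.total_requests >= min_requests and sample.successful_requests == 0:
            down.append(name)
    return down


if __name__ == "__main__":
    # Made-up example data, not real isvc names or traffic.
    window = {
        "example-isvc-a": IsvcSample(total_requests=120, successful_requests=118),
        "example-isvc-b": IsvcSample(total_requests=45, successful_requests=0),
        "example-isvc-c": IsvcSample(total_requests=0, successful_requests=0),
    }
    print(find_down_isvcs(window))
```

Note that an isvc with no traffic at all in the window (`example-isvc-c` above) is not flagged by this rule. Whether "no data" should alert is part of deciding which metric(s) to use.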