What/why:
Our current #aw-alerts Slack channel fires notifications that we do not pay much attention to, because our logs have been our most informative source of data so far. The alerts fire when certain errors or warnings reach a threshold, but that doesn't always mean there is an incident, which is one of the reasons we ignore the notifications.
Our recent Evaluator outages have shown us that alerting on more urgent events may be necessary. It may be worth muting, disabling, or removing the current alerts and replacing them with an alert that indicates a possible incident, such as an Evaluator outage. This will also give us a clear separation between that channel and the dashboards we monitor daily by way of Chores.
How:
- Ensure Orch has backwards compatibility with new Eval error codes
- Get Evaluator HTTP status codes live and firing
- Make sure the status codes are working and live again on our Grafana board
- Create new alerting rules
- Wire up the new rules in Grafana Alerting (which fires to #aw-alerts)
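
As a starting point for discussing the new rule, here is a minimal Python sketch of the condition we might want: alert only when a large share of recent Evaluator responses are 5xx, rather than whenever an error count crosses a threshold. The names `OutageRule` and `should_alert` and both threshold values are hypothetical, picked for illustration. The real rule would be expressed as a Grafana Alerting query, not as Python.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class OutageRule:
    # Hypothetical thresholds; tune them against real Evaluator traffic.
    min_requests: int = 20        # skip windows with too little traffic to judge
    max_error_ratio: float = 0.5  # share of 5xx responses that suggests an outage


def should_alert(status_codes: Iterable[int], rule: OutageRule = OutageRule()) -> bool:
    """Return True when the window of status codes looks like a possible outage."""
    codes = list(status_codes)
    if len(codes) < rule.min_requests:
        return False
    # Only server-side errors (5xx) count; 4xx responses are treated as client issues.
    server_errors = sum(1 for code in codes if 500 <= code <= 599)
    return server_errors / len(codes) >= rule.max_error_ratio
```

Using a ratio with a minimum request count, instead of a raw error count, is one way to keep the channel quiet during normal noise while still firing when the Evaluator is mostly failing.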