Background
We want to introduce alert monitoring for our services so that we not only know when something goes wrong or is about to go wrong, but also have a good pulse on the performance of our services.
Approach
- Use log severity levels to determine how or whether to alert
- Tools: Prometheus Alertmanager
- Determine the channels for alerting
- Verify that alerts actually arrive in these channels
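
As a rough illustration of the first point, severity-based alerting could be sketched as a lookup from log level to alert channel. This is only a sketch under assumptions: the channel names ("pager", "team-chat", "email-digest") and the thresholds are hypothetical placeholders, since the channels have not been decided yet, and the actual routing would presumably live in Alertmanager configuration rather than in application code.

```python
import logging
from typing import Optional

# Hypothetical routing table: minimum log level -> alert channel.
# Channel names and thresholds are placeholders, not decisions.
ROUTES = {
    logging.CRITICAL: "pager",
    logging.ERROR: "team-chat",
    logging.WARNING: "email-digest",
}


def channel_for(level: int) -> Optional[str]:
    """Return the alert channel for a log level, or None if no alert is sent."""
    # Check thresholds from most to least severe and use the first one met.
    for threshold in sorted(ROUTES, reverse=True):
        if level >= threshold:
            return ROUTES[threshold]
    return None
```

With this table, channel_for(logging.ERROR) returns "team-chat", while INFO and DEBUG return None and trigger no alert. Keeping the mapping in one table makes the "how or whether to alert" decision easy to review and change.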
Acceptance Criteria/Success Metrics
- Alerts let us detect when an outage is about to happen
- If there is an outage, we can resolve it more quickly than we did for the last P0
Stretch Goal
- Alerting based on keywords
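
One possible shape for keyword-based alerting, assuming the keywords are matched against log messages (the note above does not say where they would be matched), is a case-insensitive scan of each log line. The keyword list and the function name matched_keywords are hypothetical and only illustrate the idea.

```python
import re
from typing import List

# Hypothetical keyword list; the real list is still to be decided.
KEYWORDS = ["timeout", "out of memory", "connection refused"]

# One case-insensitive pattern covering all keywords, escaped so they match literally.
_PATTERN = re.compile("|".join(re.escape(k) for k in KEYWORDS), re.IGNORECASE)


def matched_keywords(log_line: str) -> List[str]:
    """Return the distinct keywords found in a log line, lowercased and sorted."""
    return sorted({m.group(0).lower() for m in _PATTERN.finditer(log_line)})
```

A line such as "Upstream Timeout after 30s: connection refused" would match two keywords, and a non-empty result could then be treated as a trigger for an alert.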