What/Why:
Last year, @colewhite helped us set up our Slack alerting in #aw-alerts, based on this dashboard. Since then, our team has officially kicked off its own 'Chores' rotation. However, because we scrutinize so many granular events in our logs, and many of them are ambiguous, we have hardly relied on the alerts firing in Slack. There are also a couple of other reasons to update our alerting:
- The original setup was based on the dashboard with minimal filtering. We should also filter on the 'WikiLambda' channel so we don't get lost in events outside of our product, i.e. use a dashboard with only the necessary filters.
- Just before the holiday 'code freeze' (late Nov. to early Dec. '24), logging for our backend services was migrated to ECS formatting, which led us to create a separate dashboard, here.
This is the inaugural config for our alerting, thanks to Cole!
How:
- Adjust the Prometheus config to add the channel field, set to WikiLambda. Also double-check that kubernetes.namespace_name: mw-wikifunctions is covered, in addition to kubernetes.namespace_name: wikifunctions.
- Adjust them to fire on severity levels WARNING or ERROR (for some reason, I don't see any levels higher than those like I did last year!)
- Adjust them to fire only at 5 events per 5 minutes (rather than per 1 minute).
- Add the logs from the new 'backend' dashboard, which uses ECS logging, to the Prometheus config.
- Set them to fire on: WARN, WARNING, ERR, ERROR, CRITICAL, at a similar frequency to the previous ones.
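
To make the intended behavior easier to review, here is a rough Python sketch of the filtering and threshold logic described in the steps above. It is not the actual Prometheus config, only an illustration under a few assumptions: events are treated as flat dicts, the MediaWiki-side level lives in a field called `level`, and ECS-formatted backend events carry theirs in `log.level` (the ECS field name). The helper names `matches` and `should_fire` are hypothetical, and whether backend events also need the channel filter is left open here.

```python
from datetime import datetime, timedelta

# Values taken from the steps above.
CHANNEL = "WikiLambda"
NAMESPACES = {"mw-wikifunctions", "wikifunctions"}
MW_LEVELS = {"WARNING", "ERROR"}
BACKEND_LEVELS = {"WARN", "WARNING", "ERR", "ERROR", "CRITICAL"}
THRESHOLD = 5                   # events needed before alerting
WINDOW = timedelta(minutes=5)   # rather than 1 minute


def matches(event: dict) -> bool:
    """Return True if a log event should count toward an alert."""
    if event.get("kubernetes.namespace_name") not in NAMESPACES:
        return False
    # Assumption: ECS-formatted backend events expose their level as "log.level".
    if "log.level" in event:
        return str(event["log.level"]).upper() in BACKEND_LEVELS
    # Assumption: non-ECS events expose "channel" and "level" as flat fields.
    return (
        event.get("channel") == CHANNEL
        and str(event.get("level", "")).upper() in MW_LEVELS
    )


def should_fire(timestamps: list, now: datetime) -> bool:
    """Return True once THRESHOLD matching events fall inside WINDOW."""
    recent = [t for t in timestamps if now - WINDOW < t <= now]
    return len(recent) >= THRESHOLD
```

For example, a `WikiLambda` event at level `ERROR` in `mw-wikifunctions` would count, while the same event on another channel would not, and nothing would fire until five counted events land within the same five-minute window.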