
adjust frequency of alerts
Closed, ResolvedPublic

Description

What/Why:
Last year, @colewhite helped us set up our Slack alerting in #aw-alerts, based on this dashboard. Since then, our team has officially kicked off its own rotation of 'Chores'. However, because of the number of granular events we scrutinize in our logs, and the ambiguity of many of them, we have hardly relied on the alerts firing in Slack. A couple of other reasons also compel us to update our alerting:

  1. The original setup was based on the dashboard with minimal filtering. The channel should also be filtered to 'WikiLambda' so we don't get lost in events outside our product, i.e. a dashboard with only the necessary filters.
  2. Just before the holiday 'code freeze' in late Nov–early Dec '24, our backend services' logging was migrated to ECS formatting, which led us to create a separate dashboard, here.

This is the inaugural config for our alerting, thanks to Cole!

How:

  1. Adjust the Prometheus config to add a channel field set to WikiLambda; also double-check that kubernetes.namespace_name is set to mw-wikifunctions, in addition to kubernetes.namespace_name: wikifunctions.
    1. Adjust alerts to fire on severity levels WARNING or ERROR (for some reason I don't see levels of higher severity than those, unlike last year!)
    2. Adjust them to fire only at 5 events per 5 minutes (rather than per 1 minute).
  2. Add to the Prometheus config the logs in the new 'backend' dashboard, which uses ECS logging.
    1. Set them to fire on WARN, WARNING, ERR, ERROR, and CRITICAL, at a frequency similar to the above.
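The filter changes in step 1 might look roughly like the following Elasticsearch query fragment. This is a hedged sketch, not the actual es_exporter configuration: the field names (channel, kubernetes.namespace_name, level) come from the task description above, but the surrounding query structure is an assumption.

```yaml
# Illustrative only -- the real es_exporter config lives in operations/puppet
# and may structure this differently.
query:
  bool:
    filter:
      - term:
          channel: WikiLambda          # new filter requested in step 1
      - terms:
          kubernetes.namespace_name:   # check both namespace values
            - mw-wikifunctions
            - wikifunctions
      - terms:
          level:                       # fire only on these severities
            - WARNING
            - ERROR
```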

Event Timeline

  • Would adding the log events from the ECS dashboard be pretty straightforward?
  • Please let me know if this is something I can be directed to handle myself; just tagging the Observability experts before I attempt something terrible :P

TY as always!

fgiunchedi subscribed.

Thank you for reaching out @ecarg! I'll move this task to our inbox and we'll triage it on Wednesday (if not earlier!)

Grace todo:

from David S:

Hey @ecarg ,
Let's start with the current alerts:
This alert fires every time we have a "warning" in our logs. Can we list what triggers a warning in the logs, to better understand the source of this alert?
Looking at the graph, it seems like we are still logging a lot of these. Reducing to an interval of 60000 ms (one minute) and a counter of 4, for example, would make this alert a little less noisy; we could focus on fixing those cases and then make it louder again. wdyt?
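In Prometheus terms, David's suggestion (count warnings over a 60000 ms window, alert when the count exceeds 4) could be sketched like this. The alert name and metric name are hypothetical; only the one-minute window and the threshold of 4 come from the comment above:

```yaml
# Hypothetical sketch; the actual metric exposed by es_exporter
# will have a different name.
- alert: WikiLambdaWarningSpike                           # made-up name
  expr: increase(wikilambda_warning_logs_total[1m]) > 4   # >4 warnings per 60 s
  labels:
    severity: warning
```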

  • we had WARNING and FATAL logs accumulating quite a lot around this time last year; from my recollection, they were mostly NodeJS events saying that JS memory allocation failed, plus the corresponding varied log noise about garbage collection.

However, since moving to ECS logging over the holidays, we don't see these events anymore, and I'm not sure why: we didn't ship a major service improvement that could have suddenly 'fixed' this. We were still observing these log events and creating 'spike' tasks on how to mitigate and improve memory usage.

Next steps:

  • Where did these warnings go? Did something really get solved?
  • If these warnings are no longer firing, where (if anywhere) are they occurring now, and why?

Note/observations:

  • Based on the logs in the dashboard, they don't seem to align exactly with the event spikes in the Grafana logs. Maybe this needs reconfiguring?

Side Q:

  • Is there a good way to read the events that fire into this alerting dashboard? And how does an alert get marked as 'Resolved' automatically?

Todos:

  • Adjust frequency (as described in the original task description)
  • Add another filter to the queries: channel: WikiLambda

Change #1129927 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] es_exporter: constrain wikifunctions query

https://gerrit.wikimedia.org/r/1129927

Change #1129936 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] es_exporter: add metric gathering for wikifunctions backend services

https://gerrit.wikimedia.org/r/1129936

Change #1129927 merged by Cwhite:

[operations/puppet@production] es_exporter: constrain wikifunctions query

https://gerrit.wikimedia.org/r/1129927

Change #1129936 merged by Cwhite:

[operations/puppet@production] es_exporter: add metric gathering for wikifunctions backend services

https://gerrit.wikimedia.org/r/1129936

Alert evaluation frequency is 5 minutes, and the alert only fires if it stays in violation of the threshold for 10 minutes.
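For reference, the evaluation cadence described above maps onto the standard Prometheus rule fields: the rule group's `interval` controls how often the expression is evaluated, and `for` requires the condition to hold continuously before the alert fires. Everything below except those two values (5m and 10m) is a hypothetical sketch:

```yaml
groups:
  - name: wikifunctions-alerts
    interval: 5m                           # evaluated every 5 minutes
    rules:
      - alert: WikifunctionsLogThreshold   # hypothetical rule name
        expr: wikifunctions_log_events > 5 # hypothetical metric/threshold
        for: 10m                           # must stay in violation for 10 minutes
```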

notes:

  • use 'Alerting' section in Grafana to adjust frequency and level-based alerting
  • find out how to make Alert notifications more informative, reference

✔️ hooked up the backend metrics to alerting dashboard
✔️ added alerting rule