
mediawiki-event-enrichment jobs alerting
Closed, Resolved (Public)

Description

As part of the GA internal release for T307959, we should set up (hopefully automated?) alerts for the deployed Flink-based enrichment jobs. Specifically, we need alerts now that will let us know if we are not meeting the SLOs for these jobs.

Since the metrics for these kinds of jobs should be standardized, it would be nice if we could define these alerts in an automated/parameterized way, or at least document a repeatable process to follow whenever a new enrichment job is created and deployed.

Alerts might include things like the following (a rough example rule is sketched after the list):

  • Input / output throughput ratio (should match some %)?
  • error event rate
  • lag / backpressure
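
A minimal sketch of what one of these could look like as a Prometheus alerting rule in the operations/alerts style; the metric names, threshold and annotations below are placeholders, not the metrics actually exposed by the enrichment jobs:

```yaml
groups:
  - name: mw_page_content_change_enrich
    rules:
      - alert: MediawikiPageContentChangeEnrichErrorRateHigh
        # Hypothetical metrics: ratio of error events to produced events over 15m.
        expr: |
          sum(rate(enrichment_error_events_total[15m]))
            /
          sum(rate(enrichment_output_events_total[15m]))
          > 0.05
        for: 15m
        labels:
          team: data-engineering
          severity: warning
        annotations:
          summary: "mw-page-content-change-enrich error event rate has been above 5% for 15 minutes"
          description: "The enrichment job is emitting an unusually high share of error events; check the job logs and the source/sink topics."
```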

There are also Flink-level operational alerts. These will likely be applicable to all Flink apps in k8s that use the flink-app helm chart (and the flink-kubernetes-operator).

  • # active TaskManagers and JobManagers
  • Job state
  • failed checkpoints
  • checkpoint rate
  • failovers?
  • watermark lag?

The above are just guesses at the types of things Flink app maintainers would like to be alerted on. We probably don't need all of these before the GA release of T307959, but we should cover the important ones first (like whether the app is running, the error rate, etc.); a rough sketch of one such Flink-level rule follows.
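
As an illustration, a failed-checkpoints rule could look roughly like this, assuming the job exposes the standard Flink checkpointing metrics through the Prometheus reporter; the exact metric and label names depend on the configured scope formats and relabelling, so they are assumptions here:

```yaml
- alert: FlinkAppFailedCheckpoints
  # Fires if any checkpoints failed in the last 30 minutes for the selected app.
  # The "app" selector label is illustrative; the real label depends on how the
  # metrics are relabelled when scraped from k8s.
  expr: increase(flink_jobmanager_job_numberOfFailedCheckpoints{app="mw-page-content-change-enrich"}[30m]) > 0
  for: 10m
  labels:
    team: data-engineering
    severity: warning
  annotations:
    summary: "Flink app mw-page-content-change-enrich had failed checkpoints in the last 30 minutes"
```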

Event Timeline

gmodena renamed this task from Flink Enrichment job alerting to mediawiki-event-enrichment jobs alerting. Jul 4 2023, 11:38 AM
gmodena claimed this task.

I consulted with SRE, and for Prometheus-based metrics we should be able to create alerts with Alertmanager.

We need to decide which email addresses and/or IRC channels these alerts will be routed to.

To start, maybe we could use the data-engineering group (cc @BTullis) https://gerrit.wikimedia.org/g/operations/puppet/+/refs/heads/production/modules/alertmanager/templates/alertmanager.yml.erb#73 and respective alert configs https://gerrit.wikimedia.org/r/plugins/gitiles/operations/alerts/+/refs/heads/master/team-data-engineering/
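
For reference, routing alerts to a team in Alertmanager essentially means matching on the team label and pointing it at a receiver, roughly like this; the receiver name and address below are illustrative, not the production values from the template linked above:

```yaml
route:
  routes:
    - match:
        team: data-engineering          # alert rules carry this label
      receiver: team-data-engineering
receivers:
  - name: team-data-engineering
    email_configs:
      - to: example-data-engineering@lists.example.org   # placeholder address
    # IRC delivery is typically handled by a separate webhook-based relay (not shown).
```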

Change 936096 had a related patch set uploaded (by Gmodena; author: Gmodena):

[operations/alerts@master] data-engineering: add alerts for mw-page-content-change-enrich.

https://gerrit.wikimedia.org/r/936096

The attached patch is a first WIP / learning exercise for setting up alerting. There are a couple of design decisions and best practices that I'd need guidance on. I did test locally with promtool before pushing, but I see CI is failing. I need to investigate.

@BTullis @fgiunchedi @bking does the attached config provide the basic info you'd expect to see? Would you need additional labels / details?

There are two types of systems we need to alert on, with different audiences:

  • flink kubernetes operator: this is the cluster management system. It is something that we (Event Platform) should have an overview of. If it is unhealthy, all Flink applications (client teams) could be impacted. This is a piece of infra we share ownership of (informally?) with Search, and we should liaise with them on what alerting level is required and who receives the alerts.
  • flink applications: these will be owned by client teams. For them, Flink is an implementation detail. Each app should have an SLO, and messages should be routed to the respective teams/DRIs (i.e. not necessarily Event Platform).

In the attached patch I added minimal configs for the operator (flink-kubernetes-operator) and an application (mw-page-content-change-enrich) under team-data-engineering. This is a starting point; I would like to get to a stage where we have a process in place that allows iteration.
Both are under team-data-engineering because, for this application, we (Event Platform) are the owners and responsible for the SLO.

I was thinking about the following:

  • the flink operator should be owned by data engineering / search, or by a new shared group.
  • each application should have its own config file and belong to the owners' group (this might require onboarding new teams to Alertmanager). To me this seems more manageable than a single file with multiple applications.
  • if we want to group applications together at a later stage, can we create a subdir inside the team path? Are there refactoring strategies I should look into for DRY configs?
  • would it be possible to route alerts to different targets within a team? For example: have Event Platform SWEs and SREs receive flink-kubernetes-operator alerts, but route mw-page-content-change-enrich alerts only to Event Platform SWEs.

What do you think?
Happy to discuss further.

> The attached patch is a first WIP / learning exercise for setting up alerting. There are a couple of design decisions and best practices that I'd need guidance on. I did test locally with promtool before pushing, but I see CI is failing. I need to investigate.
>
> @BTullis @fgiunchedi @bking does the attached config provide the basic info you'd expect to see? Would you need additional labels / details?

Thank you for reaching out; I gave your review a quick pass and overall LGTM.
re: CI failures, what I usually do is run tox locally, and that will run all the tests that CI does (there are a couple of requirements documented in README.md).

> There are two types of systems we need to alert on, with different audiences:
>
>   • flink kubernetes operator: this is the cluster management system. It is something that we (Event Platform) should have an overview of. If it is unhealthy, all Flink applications (client teams) could be impacted. This is a piece of infra we share ownership of (informally?) with Search, and we should liaise with them on what alerting level is required and who receives the alerts.
>   • flink applications: these will be owned by client teams. For them, Flink is an implementation detail. Each app should have an SLO, and messages should be routed to the respective teams/DRIs (i.e. not necessarily Event Platform).
>
> In the attached patch I added minimal configs for the operator (flink-kubernetes-operator) and an application (mw-page-content-change-enrich) under team-data-engineering. This is a starting point; I would like to get to a stage where we have a process in place that allows iteration.
> Both are under team-data-engineering because, for this application, we (Event Platform) are the owners and responsible for the SLO.

This all seems sensible to me! re: SLOs and alerting, I know @herron is investigating centralised SLOs under a single interface at T302995: Explore Pyrra for SLO Visualization and Management, though I don't know how/if that impacts anything here

> I was thinking about the following:
>
>   • the flink operator should be owned by data engineering / search, or by a new shared group.

The way alerts work now, we support only one team per alert; therefore, for shared ownership a new group/team will have to be defined.

>   • each application should have its own config file and belong to the owners' group (this might require onboarding new teams to Alertmanager). Or would it be possible to route alerts to different targets within a group?

I think a new team/group explicitly defined in the alerts will work better. We could technically do the routing within a single group, though that can get more complicated, becomes a maintenance burden, and isn't self-service for users (alerts.git has wider merge rights than puppet.git, where the routing configuration lives).
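
For context, "routing within a single group" would mean nesting sub-routes under the team route in the Alertmanager configuration, roughly like this; the component label and receiver names here are hypothetical:

```yaml
route:
  routes:
    - match:
        team: data-engineering
      receiver: team-data-engineering
      routes:
        # Hypothetical sub-route: operator alerts go to a dedicated receiver,
        # everything else falls through to the team default above.
        - match:
            component: flink-operator
          receiver: team-data-engineering-operator
```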

thanks for the feedback @fgiunchedi! I'll look into tox.

One follow-up question re structuring code:

  • if we want to group applications together at a later stage, can we create a subdir inside the team path? Are there refactoring strategies I should look into for DRY configs?

> This all seems sensible to me! re: SLOs and alerting, I know @herron is investigating centralised SLOs under a single interface at T302995: Explore dedicated (non-grafana) SLO Visualization and Management, though I don't know how/if that impacts anything here

This is really cool! Thanks for the pointer.

> I think a new team/group explicitly defined in the alerts will work better. We could technically do the routing within a single group, though that can get more complicated, becomes a maintenance burden, and isn't self-service for users (alerts.git has wider merge rights than puppet.git, where the routing configuration lives).

Ack. Thanks for clarifying. cc @BTullis

> thanks for the feedback @fgiunchedi! I'll look into tox.
>
> One follow-up question re structuring code:
>
>   • if we want to group applications together at a later stage, can we create a subdir inside the team path? Are there refactoring strategies I should look into for DRY configs?

Subdirectories within team directories are not supported at the moment (I haven't tried, though I don't think it'll work out of the box). To DRY configs, the easiest approach I can think of is using YAML anchors (examples will come up by grepping for <<: in alerts.git). Bear in mind that those come with their own caveats (e.g. they work only within the same file, and overriding keys within nested objects can't be done), but they can result in a fair amount of DRYness if e.g. you are only changing expr. Happy to discuss further when the time comes and we have a better idea of the "shape" of such per-app alerts.

HTH!
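
A minimal sketch of the anchor / merge-key pattern described above, with placeholder rule names and expressions; the anchor is defined on the first rule and merged into the others, so it stays within a single file as noted:

```yaml
groups:
  - name: flink_apps
    rules:
      - alert: FlinkAppHighKafkaConsumerLag
        expr: some_consumer_lag_metric > 10000   # placeholder expression
        <<: &flink_app_defaults
          for: 15m
          labels:
            team: data-engineering
            severity: warning
      - alert: FlinkAppNoOutput
        expr: rate(some_output_events_total[15m]) == 0   # placeholder expression
        <<: *flink_app_defaults   # reuse the shared for/labels defined above
```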

> To DRY configs, the easiest approach I can think of is using YAML anchors (examples will come up by grepping for <<: in alerts.git). Bear in mind that those come with their own caveats (e.g. they work only within the same file, and overriding keys within nested objects can't be done), but they can result in a fair amount of DRYness if e.g. you are only changing expr. Happy to discuss further when the time comes and we have a better idea of the "shape" of such per-app alerts.

Good to know. This def helps. I will ping you if / when we need to instrument a second application.

Thanks again!

@fgiunchedi do you maybe have any recommendation re handling maintenance windows that might fire alerts?

For example: we can expect high kafka consumer lag (above threshold) during a backfill. This can happen during planned maintenance. Would you snooze or temporarily disable the alert, or just live with it till the metric stabilizes?

So cool!

Perhaps for a future patch: I wonder if we can get some of the 'app is up' alerts from flink operator metrics, e.g. flink_k8soperator_namespace_JmDeploymentStatus_.*_Count. I think this should work for most cases, unless the app release is manually deleted from kubernetes and someone forgets to put it back.

> @fgiunchedi do you maybe have any recommendation re handling maintenance windows that might fire alerts?
>
> For example: we can expect high kafka consumer lag (above threshold) during a backfill. This can happen during planned maintenance. Would you snooze or temporarily disable the alert, or just live with it till the metric stabilizes?

Issuing silences for alerts ahead of maintenance is definitely preferred; you can do that from the https://alerts.wikimedia.org interface. You can schedule silences in the future, and they can match any tag in the alert, even if the alert doesn't exist yet or is not firing. See also https://wikitech.wikimedia.org/wiki/Alertmanager#Silences_&_acknowledgements

> Perhaps for a future patch: I wonder if we can get some of the 'app is up' alerts from flink operator metrics, e.g. flink_k8soperator_namespace_JmDeploymentStatus_.*_Count. I think this should work for most cases, unless the app release is manually deleted from kubernetes and someone forgets to put it back.

+1. I'd like an overview of what's happening on the platform (at least at WARN level).

I think we'll need to figure out the routing/group assignment first. Right now data-engineering would get deployment status alerts twice (from the app and from the operator).
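
A sketch of what such an operator-side rule could look like, based on the metric pattern quoted above; the exact status suffix (READY) and the label names are assumptions that would need to be checked against what the operator actually exports:

```yaml
- alert: FlinkOperatorJobManagerNotReady
  # Assumes the operator exports per-status deployment counts; "resource_ns" is a
  # guessed label name for the namespace of the managed FlinkDeployment.
  expr: sum by (resource_ns) (flink_k8soperator_namespace_JmDeploymentStatus_READY_Count) == 0
  for: 10m
  labels:
    team: data-engineering
    severity: warning
  annotations:
    summary: "No READY JobManager deployment reported by flink-kubernetes-operator in {{ $labels.resource_ns }}"
```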

Change 936096 merged by jenkins-bot:

[operations/alerts@master] data-engineering: add alerts for flink enrichment apps

https://gerrit.wikimedia.org/r/936096

Change 940341 had a related patch set uploaded (by Gmodena; author: Gmodena):

[operations/alerts@master] data-engineering: lower severity for flink enrichment app

https://gerrit.wikimedia.org/r/940341

Change 940341 merged by jenkins-bot:

[operations/alerts@master] data-engineering: lower severity for flink enrichment app

https://gerrit.wikimedia.org/r/940341

Change 951959 had a related patch set uploaded (by Gmodena; author: Gmodena):

[operations/alerts@master] data-engineering: flink: alert when TM is missing for 5m.

https://gerrit.wikimedia.org/r/951959

Change 951959 merged by jenkins-bot:

[operations/alerts@master] data-engineering: flink: alert when TM is missing for 5m.

https://gerrit.wikimedia.org/r/951959
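
For reference, the "TaskManager missing for 5m" rule from change 951959 would have roughly this shape; this is not the merged rule, and the metric name, selector and severity here are assumptions:

```yaml
- alert: FlinkAppNoRegisteredTaskManagers
  # numRegisteredTaskManagers is a standard Flink JobManager gauge; the exported
  # name and the "app" selector here are assumptions about how it appears in Prometheus.
  expr: flink_jobmanager_numRegisteredTaskManagers{app="mw-page-content-change-enrich"} == 0
  for: 5m   # the 5 minute grace period referenced in the change subject
  labels:
    team: data-engineering
    severity: warning
  annotations:
    summary: "mw-page-content-change-enrich has had no registered TaskManagers for 5 minutes"
```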