
Reduce alert noise associated with individual users' jupyterhub-singleuser services
Closed, ResolvedPublic

Description

We have recently observed a relatively high number of alerts for systemd unit failures on the stats servers relating to individual users' jupyterhub servers.

For example:

(SystemdUnitFailed) firing: (8) jupyter-aitolkyn-singleuser-conda-analytics.service Failed on stat1005:9100

These units might fail for a number of reasons, such as OOM errors. They are transient units, created and managed by jupyterhub-conda.service.

Generally, I think that these user units should be excluded from notification, since neither the Data Engineering team nor the wider SRE team needs to know about the status of individual users' jupyterhub servers.

I'm open to other suggestions for how we might manage these, though, rather than simply removing them from the alert.

I'm primarily tagging this with Observability-Alerting, although it relates to Data-Engineering servers, so I'll make sure it is seen and discussed within the Data Engineering team too.

Event Timeline

Thank you for reaching out @BTullis ! I'll enumerate what I think are our options:

  1. Make sure transient units are collected, i.e. make sure the units run with CollectMode=inactive-or-failed (upstream docs; introduced in systemd 236). This seems to be possible because systemdspawner supports passing custom unit properties (see the sketch after this list). This is by far my favorite option, since we get the behavior we want: systemd forgets about the unit once it exits, failed or otherwise.
  2. Exclude the unit name pattern from notifications only. We'd change the alertmanager routing configuration so it doesn't notify, and the alert would still show up on alerts.w.o. Not a great option, though slightly more maintainable than the option below.
  3. Exclude a unit name pattern from the SystemdUnitFailed alert itself (e.g. name!~<something>). This is my least favorite option, since we'd need to pack and maintain all exclusions in the alert's expr field.
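For concreteness, here is a minimal sketch of option 1 at the JupyterHub level, assuming systemdspawner's unit_extra_properties hook. The actual change would live in operations/puppet, so treat the jupyterhub_config.py layout below as illustrative rather than the real patch:

    # jupyterhub_config.py -- illustrative sketch only; the real config is Puppet-managed.
    # systemdspawner forwards each entry in unit_extra_properties to systemd-run
    # as --property=..., so every transient jupyter-<user>-singleuser unit is
    # created with the extra property set.
    c.JupyterHub.spawner_class = 'systemdspawner.SystemdSpawner'
    c.SystemdSpawner.unit_extra_properties = {
        # systemd garbage-collects the unit once it exits, failed or not,
        # so lingering failed units no longer trip the SystemdUnitFailed alert.
        'CollectMode': 'inactive-or-failed',
    }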

Hope that helps!

Thanks for the summary, @fgiunchedi.

I agree that option 1 is probably the best of these. I didn't know about the CollectMode property until now. I can make a patch and see if it works as we hope.

BTullis renamed this task from "Exclude jupyterhub singleuser services from the systemd unit failure alerts" to "Reduce alert noise associated with individual users' jupyterhub-singleuser services". May 19 2023, 3:16 PM
BTullis triaged this task as Low priority.
BTullis added a project: Data-Platform-SRE.
BTullis moved this task from Incoming (new tickets) to Ops Week on the Data-Engineering board.
BTullis moved this task from Incoming to In Progress on the Data-Platform-SRE board.

Change 921382 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Add an extra property 'CollectMode' to each user's jupyter service

https://gerrit.wikimedia.org/r/921382

Change 921382 merged by Btullis:

[operations/puppet@production] Add an extra property 'CollectMode' to each user's jupyter service

https://gerrit.wikimedia.org/r/921382

The change was deployed, and then jupyterhub broke.
It was reverted immediately, but jupyterhub remained broken.

It seems that this change was unrelated to the error, which appears to have been introduced with conda-analytics-0.0.13.
The trouble is that pushing out the conda-analytics package didn't restart the jupyterhub-conda service that depends on it, so we didn't find out about the failure at the time.

I will deploy this change again, now that we know it was unrelated to the failure of jupyterhub.

Change 926859 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] "Add an extra property 'CollectMode' to each user's jupyter service""

https://gerrit.wikimedia.org/r/926859

Change 926859 merged by Btullis:

[operations/puppet@production] Add an extra property 'CollectMode' to each user's jupyter service

https://gerrit.wikimedia.org/r/926859

I have now deployed this change again.
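
For the record, the effect can be checked on a live singleuser unit with systemctl show; e.g. systemctl show -p CollectMode jupyter-aitolkyn-singleuser-conda-analytics.service (unit name taken from the example alert above) should now report CollectMode=inactive-or-failed.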