
Reduce alert noise associated with individual users' jupyterhub-singleuser services
Closed, ResolvedPublic

Description

We have recently observed a relatively high number of alerts for systemd unit failures on the stats servers relating to individual users' jupyterhub servers.

For example:

(SystemdUnitFailed) firing: (8) jupyter-aitolkyn-singleuser-conda-analytics.service Failed on stat1005:9100

These units might fail for a number of reasons, such as OOM errors. They are transient units, created and managed by jupyterhub-conda.service.

Generally, I think that these user units should be excluded from notification, since neither the Data Engineering team nor the wider SRE team needs to know about the status of individual users' jupyterhub servers.

I'm open to other suggestions for how we might manage these, though, rather than simply removing them from the alert.

I'm primarily tagging this with Observability-Alerting, although it relates to Data-Engineering servers, so I'll make sure it is seen and discussed within the Data Engineering team too.

Event Timeline

Thank you for reaching out @BTullis ! I'll enumerate what I think are our options:

  1. Make sure transient units are collected, i.e. make sure the units run with CollectMode=inactive-or-failed (upstream docs; introduced in systemd 236). This seems to be possible because systemdspawner supports passing custom unit properties (see the sketch after this list). This is by far my favorite option, since we get the behavior we want: systemd forgets about the unit once it exits, failed or otherwise.
  2. Exclude the unit name pattern from notifications only. We'd change the alertmanager routing configuration so it doesn't notify, and the alert would still show up on alerts.w.o. Not a great option, though slightly more maintainable than the option below.
  3. Exclude a unit name pattern from the SystemdUnitFailed alert itself (e.g. name!~<something>). This is my least favorite option, since we'd need to pack and maintain all exclusions in the alert's expr field.
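For concreteness, here is a minimal sketch of option 1 at the JupyterHub level, assuming systemdspawner's unit_extra_properties hook. The actual change would live in operations/puppet, so treat the jupyterhub_config.py layout below as illustrative rather than the real patch:

    # jupyterhub_config.py -- illustrative sketch only; the real config is Puppet-managed.
    # systemdspawner forwards each entry in unit_extra_properties to systemd-run
    # as --property=..., so every transient jupyter-<user>-singleuser unit is
    # created with the extra property set.
    c.JupyterHub.spawner_class = 'systemdspawner.SystemdSpawner'
    c.SystemdSpawner.unit_extra_properties = {
        # systemd garbage-collects the unit once it exits, failed or not,
        # so lingering failed units no longer trip the SystemdUnitFailed alert.
        'CollectMode': 'inactive-or-failed',
    }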

Hope that helps!

Thanks for the summary, @fgiunchedi.

I agree that option 1 is probably the best of these. I didn't know about the CollectMode property until now. I can make a patch and see if it works as we hope.

BTullis renamed this task from "Exclude jupyterhub singleuser services from the systemd unit failure alerts" to "Reduce alert noise associated with individual users' jupyterhub-singleuser services". May 19 2023, 3:16 PM
BTullis triaged this task as Low priority.
BTullis added a project: Data-Platform-SRE.
BTullis moved this task from Incoming (new tickets) to Ops Week on the Data-Engineering board.
BTullis moved this task from Incoming to In Progress on the Data-Platform-SRE board.

Change 921382 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Add an extra property 'CollectMode' to each user's jupyter service

https://gerrit.wikimedia.org/r/921382

Change 921382 merged by Btullis:

[operations/puppet@production] Add an extra property 'CollectMode' to each user's jupyter service

https://gerrit.wikimedia.org/r/921382

The change was deployed, and then jupyterhub broke.
It was reverted immediately, but jupyterhub remained broken.

It seems that this change was unrelated to the error, which appears to have been introduced with conda-analytics-0.0.13.
The trouble is that pushing out the conda-analytics package didn't restart the jupyterhub-conda service that depends on it, so we didn't find out about the failure at the time.

I will deploy this change again, now that we know it was unrelated to the failure of jupyterhub.

Change 926859 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] "Add an extra property 'CollectMode' to each user's jupyter service""

https://gerrit.wikimedia.org/r/926859

Change 926859 merged by Btullis:

[operations/puppet@production] Add an extra property 'CollectMode' to each user's jupyter service

https://gerrit.wikimedia.org/r/926859

I have now deployed this change again.
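
For the record, the effect can be checked on a live singleuser unit with systemctl show; e.g. systemctl show -p CollectMode jupyter-aitolkyn-singleuser-conda-analytics.service (unit name taken from the example alert above) should now report CollectMode=inactive-or-failed.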