We have recently observed a relatively high number of alerts for systemd unit failures on the stats servers relating to individual users' jupyterhub servers.
For example:
(SystemdUnitFailed) firing: (8) jupyter-aitolkyn-singleuser-conda-analytics.service Failed on stat1005:9100
These units might fail for a number of reasons, such as OOM errors. They are transient units, created and managed by the service jupyterhub-conda.service.
Generally, I think that these user units should be excluded from notification, since neither the data engineering team nor the wider SRE team needs to know about the status of individual users' jupyterhub servers.
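To illustrate the kind of exclusion I have in mind, here is a minimal Python sketch of the matching logic. The pattern is an assumption inferred only from the single unit name in the example alert, and the function names are hypothetical; the real exclusion would more likely live in the alert rule or the exporter configuration than in a script like this.

```python
import re

# Assumed naming scheme, inferred from the one example unit name in this task:
#   jupyter-<username>-singleuser-<environment>.service
# Other per-user unit names may exist that this pattern does not cover.
USER_UNIT_PATTERN = re.compile(r"^jupyter-.+-singleuser(-.+)?\.service$")


def is_user_notebook_unit(unit_name: str) -> bool:
    """Return True if the unit name looks like a per-user jupyterhub server unit."""
    return USER_UNIT_PATTERN.match(unit_name) is not None


def units_to_notify(failed_units):
    """Drop per-user notebook units from a list of failed unit names."""
    return [unit for unit in failed_units if not is_user_notebook_unit(unit)]
```

With this pattern the managing service jupyterhub-conda.service itself would still be reported if it failed, which I think is what we want.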
I'm open to other suggestions for how we manage these, though, rather than simply removing them from the alert.
I'm primarily tagging this with Observability-Alerting although it relates to Data-Engineering servers, so I'll make sure it is seen and discussed within that team too.