
Alert in need of triage: SystemdUnitFailed (instance stat1008:9100)
Closed, Resolved (Public)

Description

The alert SystemdUnitFailed started firing 1 month ago.

Labels
alertname=SystemdUnitFailed
instance=stat1008:9100
name=uuidd.service
prometheus=ops
severity=critical
site=eqiad
source=prometheus
team=data-platform
Annotations
Name         Content
dashboard    https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status
description  uuidd.service on stat1008:9100
runbook      https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
summary      uuidd.service on stat1008:9100

Triage metadata. Do not delete.
fingerprint=314d7602912576ac

Event Timeline

Gehel triaged this task as High priority. Sep 2 2025, 1:42 PM
BTullis subscribed.

These alerts occur when the host is under stress from user activity. It is difficult to prevent this, but we also don't really want to start excluding systemd service failures from the alerts.

I have cleared them for now.

btullis@stat1008:~$ systemctl --failed
  UNIT                LOAD   ACTIVE SUB    DESCRIPTION
● session-c1408.scope loaded failed failed Session c1408 of user debmonitor
● session-c1409.scope loaded failed failed Session c1409 of user debmonitor
● session-c1804.scope loaded failed failed Session c1804 of user debmonitor
● uuidd.service       loaded failed failed Daemon for generating UUIDs

LOAD   = Reflects whether the unit definition was properly loaded.
ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
SUB    = The low-level unit activation state, values depend on unit type.
4 loaded units listed.
btullis@stat1008:~$ sudo systemctl reset-failed 
btullis@stat1008:~$ systemctl --failed
  UNIT LOAD ACTIVE SUB DESCRIPTION
0 loaded units listed.
btullis@stat1008:~$
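
Since `reset-failed` clears the failed state that `systemctl --failed` reports, it may be useful to capture some context for each unit first. The following is only a sketch of one way to do that, not what was run on stat1008. The helper names `failed_units` and `show_failure_context` are hypothetical, and the list of unit suffixes is an assumption that covers the unit types seen above.

```shell
#!/bin/sh
# Hypothetical helpers (names are illustrative, not existing tooling).

# Read `systemctl --failed` output on stdin and print one unit name per line.
# Handles both the default listing (leading "●") and --plain output.
failed_units() {
  awk '{
    unit = ($1 == "●") ? $2 : $1
    if (unit ~ /\.(service|scope|socket|timer|mount|path|slice|target)$/) print unit
  }'
}

# Print status and recent journal lines for every currently failed unit.
show_failure_context() {
  systemctl --failed --plain --no-legend | failed_units | while read -r unit; do
    echo "=== $unit ==="
    systemctl status --no-pager "$unit"
    journalctl --no-pager -n 20 -u "$unit"
  done
}

# Example use on the host (not run here):
#   show_failure_context > /tmp/failed-units.txt
#   sudo systemctl reset-failed
```

Parsing is kept separate from the `systemctl` calls so that `failed_units` can be checked against saved output, such as the listing above, without touching a live host.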