ipmiseld not running reliably
Open, MediumPublic

Description

After activating automated restarts for ipmiseld earlier today, various alerts popped up where wmf-auto-restart failed to restart the service because ipmiseld wasn't running in the first place.

This happened on stat1005/1008, wdqs2003/2005/2006/2008 and a handful of other servers.

This warrants further investigation: did ipmiseld fail to start due to some hardware/firmware issue in the first place, or did it crash and never get restarted?

Event Timeline

It looks like ipmiseld isn't enabled on a sampling of these hosts. Letting Puppet ensure the service is enabled and running seems like a good next step.

stat1005:~$ systemctl status ipmiseld.service
● ipmiseld.service - IPMI SEL syslog logging daemon
   Loaded: loaded (/lib/systemd/system/ipmiseld.service; disabled; vendor preset: enabled)
   Active: inactive (dead)
wdqs2003:~$ systemctl status ipmiseld.service
● ipmiseld.service - IPMI SEL syslog logging daemon
   Loaded: loaded (/lib/systemd/system/ipmiseld.service; disabled; vendor preset: enabled)
   Active: inactive (dead)

Change 775875 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] ipmiseld: ensure service enabled and running

https://gerrit.wikimedia.org/r/775875
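
For illustration, the manual equivalent of "enabled and running" on a single host could be sketched as below. This is not the content of the Gerrit change; `ensure_enabled_and_running` is a hypothetical helper name, and only standard `systemctl` verbs are used.

```shell
#!/bin/sh
# Hypothetical helper: bring a unit to "enabled and running" by hand.
# It only illustrates the intent of the Puppet change; it is not the patch itself.
ensure_enabled_and_running() {
    unit=$1
    # `systemctl is-enabled` exits non-zero for a disabled unit.
    if ! systemctl is-enabled --quiet "$unit"; then
        systemctl enable "$unit"
    fi
    # `systemctl is-active` exits non-zero for an inactive or failed unit.
    if ! systemctl is-active --quiet "$unit"; then
        systemctl start "$unit"
    fi
}
```

Run as root, a call such as `ensure_enabled_and_running ipmiseld.service` would only touch the unit when one of the two checks fails.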

herron triaged this task as Medium priority.Mar 31 2022, 3:05 PM

Looking more closely, I see that all bullseye hosts have the unit enabled, while none of the buster hosts do.

cumin1001:~$ sudo cumin -p 0 'F:has_ipmi = true and F:lsbdistcodename = buster' 'systemctl is-enabled ipmiseld.service'
----- OUTPUT of 'systemctl is-ena...ipmiseld.service' -----
disabled
cumin1001:~$ sudo cumin -p 0 'F:has_ipmi = true and F:lsbdistcodename = bullseye' 'systemctl is-enabled ipmiseld.service'
----- OUTPUT of 'systemctl is-ena...ipmiseld.service' -----
enabled

Also, each system in the description has an uptime of less than 1 week, with the wdqs hosts being closer to 1 day. It looks to me like the service simply did not start again after a recent reboot, and did not trigger icinga systemd state alerts because the units were inactive as opposed to failed.
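
The inactive-versus-failed distinction can be sketched as a small classifier over the strings printed by `systemctl is-enabled` and `systemctl is-active`. `classify_unit` is a hypothetical helper written for this comment, not part of the icinga check, and the verdict strings are illustrative only.

```shell
#!/bin/sh
# Hypothetical helper: classify a unit from the strings printed by
# `systemctl is-enabled` (first argument) and `systemctl is-active` (second).
classify_unit() {
    enabled_state=$1
    active_state=$2
    case "$active_state" in
        active)
            echo "ok" ;;
        failed)
            echo "failed: visible to a failed-unit check" ;;
        inactive)
            if [ "$enabled_state" = "disabled" ]; then
                # The case seen on the buster hosts: nothing has failed,
                # so a check that only looks for failed units stays quiet.
                echo "silent: disabled and inactive"
            else
                echo "stopped: enabled but inactive"
            fi ;;
        *)
            echo "other: $active_state" ;;
    esac
}

# Example on a live host:
#   classify_unit "$(systemctl is-enabled ipmiseld.service)" \
#                 "$(systemctl is-active ipmiseld.service)"
```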

I had a closer look at the source packages: this is caused by the debian/rules file in Buster, which fails to invoke the systemd addon for debhelper, so the service doesn't get enabled. Stretch doesn't have a systemd unit at all, so the auto-translated sysvinit script gets started correctly.
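
One rough way to confirm this on an installed host is to check whether the package's postinst script contains the `deb-systemd-helper` calls that debhelper's systemd support normally generates. The sketch below assumes that convention; `postinst_enables_unit` is a hypothetical helper, and the postinst path in the usage comment is an assumption about the package name.

```shell
#!/bin/sh
# Hypothetical helper: succeed if a maintainer script mentions
# deb-systemd-helper, which debhelper's systemd support inserts
# to enable units at install time.
postinst_enables_unit() {
    postinst=$1
    grep -q 'deb-systemd-helper' "$postinst"
}

# Example (path is an assumption about the package name):
#   if postinst_enables_unit /var/lib/dpkg/info/freeipmi-ipmiseld.postinst; then
#       echo "postinst enables the unit"
#   else
#       echo "postinst does not enable the unit"
#   fi
```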

There are still some puzzling things here: why wasn't this noticed before? Was the service manually started earlier? And if the service wasn't running on the vast majority of our fleet (2/3 are Buster after all), did we miss logged errors this way?
Anyway, let's simply fix this via Puppet for Buster hosts.

When this service was initially enabled, it caused a shower of systemd unit alerts because the package-provided configuration referenced a directory that did not exist. That was fixed in https://gerrit.wikimedia.org/r/c/operations/puppet/+/767223/ and I issued a mkdir and service restart across ipmi hosts to speed up recovery. That manual restart likely masked this problem until it resurfaced after reboots.

Gotcha, that explains it :-)

Change 775875 merged by Herron:

[operations/puppet@production] ipmiseld: ensure service enabled and running

https://gerrit.wikimedia.org/r/775875

@herron we've seen this alert flapping on db2180 a lot lately:

[08:15:42]  <jinxer-wm> (SystemdUnitFailed) firing: ipmiseld.service Failed on db2180:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:30:42]  <jinxer-wm> (SystemdUnitFailed) resolved: ipmiseld.service Failed on db2180:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed

Even while the alert is firing, ipmi-tool keeps working fine, so I am not sure what's going on. Also, there is no wikitech page associated with it, and I haven't been able to find one while searching. Can you give some insight here?
Thanks!

ipmiseld failed to query the IPMI state of the server multiple times (the error message is ipmi_sdr_cache_create: internal ipmi error). We've seen that before in https://phabricator.wikimedia.org/T167121, and given that the errors are reported by the hardware, I'd say a full firmware update of db2180 would be the next step (maybe we can bundle that with the current reboots).
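
To gauge how often this happens on a host, the error string can be counted in the unit's log output. A minimal sketch, assuming the message appears verbatim in whatever log stream is piped in; `count_sdr_cache_errors` is a hypothetical helper, and the `journalctl` invocation in the comment assumes the daemon's messages reach the journal.

```shell
#!/bin/sh
# Hypothetical helper: count lines on stdin that contain the
# ipmi_sdr_cache_create error quoted above.
count_sdr_cache_errors() {
    # grep -c prints 0 but exits 1 when nothing matches; `|| true`
    # keeps the function's exit status at 0 in that case.
    grep -c 'ipmi_sdr_cache_create: internal ipmi error' || true
}

# Example (assumes the messages end up in the journal):
#   journalctl -u ipmiseld.service --since "1 day ago" | count_sdr_cache_errors
```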

Thanks Moritz, I will work with DCOps on that.