ipmiseld not running reliably
Open, Medium, Public

Assigned To

None

Authored By

	MoritzMuehlenhoff
	Mar 31 2022, 1:24 PM

Description

After activating automated restarts for ipmiseld earlier today, various alerts popped up where wmf-auto-restart failed to restart the service because ipmiseld wasn't running in the first place.

This happened on stat1005/1008, wdqs2003/2005/2006/2008 and a handful of other servers.

This warrants some further investigation: did ipmiseld not start due to some hardware/firmware issue in the first place, or did it crash and not get restarted?

Details

	Subject	Repo	Branch	Lines +/-
	ipmiseld: ensure service enabled and running	operations/puppet	production	+8 -0


Related Objects

		Status	Subtype	Assigned	Task
		Open		None	T305147 ipmiseld not running reliably
		Resolved		Papaul	T336031 Update firmware for db2180

Event Timeline

MoritzMuehlenhoff created this task. Mar 31 2022, 1:24 PM

Restricted Application added a subscriber: Aklapper. Mar 31 2022, 1:24 PM

Looks like ipmiseld isn't enabled on a sampling of these hosts. Letting Puppet ensure the service is enabled and running seems like a good next step.

stat1005:~$ systemctl status ipmiseld.service
● ipmiseld.service - IPMI SEL syslog logging daemon
   Loaded: loaded (/lib/systemd/system/ipmiseld.service; disabled; vendor preset: enabled)
   Active: inactive (dead)

wdqs2003:~$ systemctl status ipmiseld.service
● ipmiseld.service - IPMI SEL syslog logging daemon
   Loaded: loaded (/lib/systemd/system/ipmiseld.service; disabled; vendor preset: enabled)
   Active: inactive (dead)

Change 775875 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] ipmiseld: ensure service enabled and running

https://gerrit.wikimedia.org/r/775875

gerritbot added a project: Patch-For-Review. Mar 31 2022, 2:42 PM

herron triaged this task as Medium priority. Mar 31 2022, 3:05 PM

Looking more closely I see all bullseye hosts have the unit enabled, while all buster hosts do not.

cumin1001:~$ sudo cumin -p 0 'F:has_ipmi = true and F:lsbdistcodename = buster' 'systemctl is-enabled ipmiseld.service'
----- OUTPUT of 'systemctl is-ena...ipmiseld.service' -----
disabled

cumin1001:~$ sudo cumin -p 0 'F:has_ipmi = true and F:lsbdistcodename = bullseye' 'systemctl is-enabled ipmiseld.service'
----- OUTPUT of 'systemctl is-ena...ipmiseld.service' -----
enabled

Also, each system in the description has an uptime of less than one week, with the wdqs hosts being closer to one day. It looks to me like the service simply did not start again after the recent reboots, and did not trigger Icinga systemd state alerts because the units were inactive as opposed to failed.
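
The inactive-versus-failed distinction can be sketched as a small classifier. This is only an illustration, assuming a check that alerts solely on units in the `failed` state; the function name and the returned labels are hypothetical and are not taken from the actual Icinga check.

```python
def classify_unit(active_state: str, unit_file_state: str) -> str:
    """Classify a unit from `systemctl is-active` / `systemctl is-enabled` output.

    The returned labels are hypothetical and exist only for this sketch.
    """
    active_state = active_state.strip()
    unit_file_state = unit_file_state.strip()
    if active_state == "failed":
        # A check that only looks for failed units catches this case.
        return "alerting"
    if active_state == "active":
        return "ok"
    if active_state == "inactive" and unit_file_state == "disabled":
        # Never started after boot: not failed, so no alert fires.
        return "silently-down"
    return "other"


# States reported for stat1005 and wdqs2003 in the status output above.
print(classify_unit("inactive", "disabled"))  # prints silently-down
```

A unit that is disabled and inactive after a reboot falls into the third branch, which a failed-only check never sees.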

I had a closer look at the source packages: this is caused by the debian/rules file in Buster, which fails to invoke the systemd addon for debhelper, so the service doesn't get enabled. Stretch doesn't have a systemd unit at all, so the auto-translated sysvinit script gets started correctly.

There are some things which are still puzzling here: why wasn't this noticed before? Was the service manually started before? And if the service wasn't running on the vast majority of our fleet (2/3 are Buster after all), did we miss logged errors this way?
Anyway, let's simply fix this via Puppet for Buster hosts.
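
As a rough sketch of what "enabled and running" amounts to on a single host: the helper below decides which standard `systemctl` commands would be needed, given the output of `systemctl is-enabled` and `systemctl is-active`. It is not the content of the Puppet change; `plan_actions` is a hypothetical name, and treating every state other than `enabled` as "needs enabling" is a simplifying assumption.

```python
from typing import List


def plan_actions(unit: str, is_enabled: str, is_active: str) -> List[List[str]]:
    """Return the systemctl commands needed to get a unit enabled and running.

    Simplification: any is-enabled output other than "enabled" is treated
    as needing `systemctl enable`.
    """
    actions: List[List[str]] = []
    if is_enabled.strip() != "enabled":
        actions.append(["systemctl", "enable", unit])
    if is_active.strip() != "active":
        actions.append(["systemctl", "start", unit])
    return actions


# State observed on the Buster hosts: disabled and inactive.
for cmd in plan_actions("ipmiseld.service", "disabled", "inactive"):
    print(" ".join(cmd))
```

On a Bullseye host where the unit is already enabled and active, the same helper returns an empty list, so nothing would be changed.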

In T305147#7824394, @MoritzMuehlenhoff wrote:

There are some things which are still puzzling here: why wasn't this noticed before? Was the service manually started before? And if the service wasn't running on the vast majority of our fleet (2/3 are Buster after all), did we miss logged errors this way?

When this service was initially enabled it caused a shower of systemd unit alerts because the package-provided configuration referenced a directory that did not exist. That was fixed in https://gerrit.wikimedia.org/r/c/operations/puppet/+/767223/ and I issued a mkdir and service restart across IPMI hosts to speed up recovery. That manual restart likely masked this problem until it resurfaced after reboots.

Gotcha, that explains it :-)

Change 775875 merged by Herron:

[operations/puppet@production] ipmiseld: ensure service enabled and running

https://gerrit.wikimedia.org/r/775875

Maintenance_bot removed a project: Patch-For-Review. Apr 6 2022, 5:31 PM

lmata moved this task from Inbox to Radar on the observability board. Jun 24 2022, 1:44 PM

@herron we've seen this alert flapping on db2180 a lot lately:

[08:15:42]  <jinxer-wm> (SystemdUnitFailed) firing: ipmiseld.service Failed on db2180:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:30:42]  <jinxer-wm> (SystemdUnitFailed) resolved: ipmiseld.service Failed on db2180:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed

Even during the time of the alert, ipmi-tool keeps working fine, so I am not sure what's going on. Also, there is no wikitech page associated with it, and I haven't been able to find one while searching. Can you give some insight here?
Thanks!

ipmiseld could not query the IPMI state of the server multiple times (the error message is ipmi_sdr_cache_create: internal ipmi error). We've seen that before at https://phabricator.wikimedia.org/T167121 and, given that the errors are reported by the hardware, I'd say a full firmware update of db2180 would be the next step (maybe we can bundle that with the current reboots).
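
To get a feel for how often a host hits this error, one could count matching log lines. The sketch below assumes only that the quoted error text appears verbatim somewhere in each affected line; the helper name and the sample lines are hypothetical and are not real log output from db2180.

```python
from typing import Iterable

# Error text as quoted in the comment above.
ERROR_MARKER = "ipmi_sdr_cache_create: internal ipmi error"


def count_sdr_cache_errors(lines: Iterable[str]) -> int:
    """Count log lines that contain the SDR cache error message."""
    return sum(1 for line in lines if ERROR_MARKER in line)


# Hypothetical sample lines, for illustration only.
sample = [
    "ipmiseld: ipmi_sdr_cache_create: internal ipmi error",
    "ipmiseld: some unrelated message",
    "ipmiseld: ipmi_sdr_cache_create: internal ipmi error",
]
print(count_sdr_cache_errors(sample))  # prints 2
```

The same function accepts any iterable of lines, for example an open file or the captured output of `journalctl -u ipmiseld.service`.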

Thanks Moritz, I will work with DCOps on that.

Papaul closed subtask T336031: Update firmware for db2180 as Resolved. May 9 2023, 3:06 PM

Marostegui reopened subtask T336031: Update firmware for db2180 as Open. May 9 2023, 3:21 PM

Marostegui closed subtask T336031: Update firmware for db2180 as Resolved. May 10 2023, 5:41 AM
