
Some systemd services appear to be broken on all VMs
Open, Low, Public

Description

All VMs I've checked have at least these persistent failures in systemd. We can monitor systemd from the metrics infra, but right now it would be extremely noisy because of this.

  • cloud-final.service
  • cloud-init.service
  • smartd.service
  • prometheus_ssh_open_sessions.timer (fails whenever you are *not* logged in, making troubleshooting interesting)
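
For reference, these show up on any given VM with something like:

  systemctl list-units --state=failed
  systemctl status cloud-init.service cloud-final.service smartd.service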

Event Timeline

Bstorm created this task.

I don't imagine the virtual hard drives are effectively monitored by smartd, so I think that probably should just be removed.

Jul 23 22:21:56 toolsbeta-harbor-1 smartd[15151]: Opened configuration file /etc/smartd.conf
Jul 23 22:21:56 toolsbeta-harbor-1 smartd[15151]: Drive: DEVICESCAN, implied '-a' Directive on line 21 of file /etc/smartd.conf
Jul 23 22:21:56 toolsbeta-harbor-1 smartd[15151]: Configuration file /etc/smartd.conf was parsed, found DEVICESCAN, scanning devices
Jul 23 22:21:56 toolsbeta-harbor-1 smartd[15151]: Device: /dev/sda, opened
Jul 23 22:21:56 toolsbeta-harbor-1 smartd[15151]: Device: /dev/sda, [QEMU     QEMU HARDDISK    2.5+], 21.4 GB
Jul 23 22:21:56 toolsbeta-harbor-1 smartd[15151]: Device: /dev/sda, IE (SMART) not enabled, skip device
Jul 23 22:21:56 toolsbeta-harbor-1 smartd[15151]: Try 'smartctl -s on /dev/sda' to turn on SMART features
Jul 23 22:21:56 toolsbeta-harbor-1 smartd[15151]: Device: /dev/sdb, opened
Jul 23 22:21:56 toolsbeta-harbor-1 smartd[15151]: Device: /dev/sdb, [QEMU     QEMU HARDDISK    2.5+], S/N: 15f73142-6067-49a7-94a8-24d6e9d22cae, 42.9 GB
Jul 23 22:21:56 toolsbeta-harbor-1 smartd[15151]: Device: /dev/sdb, IE (SMART) not enabled, skip device
Jul 23 22:21:56 toolsbeta-harbor-1 smartd[15151]: Try 'smartctl -s on /dev/sdb' to turn on SMART features
Jul 23 22:21:56 toolsbeta-harbor-1 smartd[15151]: Unable to monitor any SMART enabled devices. Try debug (-d) option. Exiting...
Jul 23 22:21:56 toolsbeta-harbor-1 systemd[1]: smartd.service: Main process exited, code=exited, status=17/n/a
Jul 23 22:21:56 toolsbeta-harbor-1 systemd[1]: smartd.service: Failed with result 'exit-code'.

Can confirm that you cannot enable SMART on our qemu disks. We should disable that unit.
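
On a live VM the manual cleanup would be something along these lines (sketch only; the actual fix went in through puppet, per the changes below):

  systemctl stop smartd.service
  systemctl mask smartd.service           # keep anything from starting it again
  systemctl reset-failed smartd.service   # clear the lingering failed state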

Change 708796 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] Don't install smart drive tools on cloud VMs

https://gerrit.wikimedia.org/r/708796

Change 708796 merged by Andrew Bogott:

[operations/puppet@production] Don't run smart drive check

https://gerrit.wikimedia.org/r/708796

Change 708807 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] Cloud instances: further attempt to clear the failed state of smartd

https://gerrit.wikimedia.org/r/708807

Change 708807 merged by Andrew Bogott:

[operations/puppet@production] Cloud instances: further attempt to clear the failed state of smartd

https://gerrit.wikimedia.org/r/708807

Change 708810 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] cloud instances: /qualify/path to systemd

https://gerrit.wikimedia.org/r/708810

Change 708810 merged by Andrew Bogott:

[operations/puppet@production] cloud instances: /qualify/path to systemd

https://gerrit.wikimedia.org/r/708810

Change 708826 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] cloud instance vendordata: exit 0 at the end of the custom boot script

https://gerrit.wikimedia.org/r/708826

Change 708826 merged by Andrew Bogott:

[operations/puppet@production] cloud instance vendordata: exit 0 at the end of the custom boot script

https://gerrit.wikimedia.org/r/708826

the smartd issue is now resolved, if clumsily -- future base images won't have it installed so the puppet code I added will be a no-op outside of the base-image build process.
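
(For anyone cleaning up an existing VM by hand, the change boils down to roughly the following; new base images simply never get the package, so there's nothing to do there.)

  apt-get -y purge smartmontools          # the Debian package that ships smartd
  systemctl reset-failed smartd.service   # drop the old failed-unit record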

The cloud-init things are annoying to me; cloud-init seems to report failure after succeeding.

Change 708868 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] cloud-vps cloud-init: don't mask the puppet service

https://gerrit.wikimedia.org/r/708868

Change 708868 merged by Andrew Bogott:

[operations/puppet@production] cloud-vps cloud-init: don't mask the puppet service

https://gerrit.wikimedia.org/r/708868

Change 708869 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] cloud-vps cloud-init: more tweaks to try to get a perfectly clean run

https://gerrit.wikimedia.org/r/708869

Change 708869 merged by Andrew Bogott:

[operations/puppet@production] cloud-vps cloud-init: more tweaks to try to get a perfectly clean run

https://gerrit.wikimedia.org/r/708869

Change 708879 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] Partially revert "cloud-vps cloud-init: more tweaks to try to get a perfectly clean run"

https://gerrit.wikimedia.org/r/708879

Change 708879 merged by Andrew Bogott:

[operations/puppet@production] Partially revert "cloud-vps cloud-init: more tweaks"

https://gerrit.wikimedia.org/r/708879

Cloud-init is not providing me a ton of info about why it's exiting with non-zero, but I'm pretty sure it does that any time any one of its modules fails. I've cleaned up the easy-to-fix bits, but that leaves the puppet module.

Cloud-init has a puppet module which I would like to use to set up puppet.conf -- it's a much more elegant process than the sed-heavy scripting we did before.

Unfortunately, the cloud-init puppet module also starts the puppet agent at the end. We do not currently use the puppet agent; rather we use a cron that calls run-puppet-agent. Our puppet catalog disables the puppet agent.

If we let cloud-init start the puppet agent then we get into all kinds of races -- cloud-init's final stage needs to force a couple of puppet runs to enable logins, but if the puppet agent has already launched puppet then those puppet runs don't complete, producing an inconsistent VM state.

To work around that, I've been masking the puppet agent so it can't be started by cloud-init. But when cloud-init finds the service masked it exits with an error code, which gets surfaced all the way up to systemd.
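
For context, the masking and the failure it triggers look roughly like this (the exact commands cloud-init runs internally may differ):

  # done early so cloud-init's puppet module can't start the agent
  systemctl mask puppet.service
  # later, cloud-init's puppet module tries to enable/start the unit, which
  # fails because the unit is masked; that non-zero exit is what ends up
  # surfaced through cloud-init's own exit status
  systemctl enable --now puppet.service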

So... here are some options:

  • Leave things as they are, and reset-failed the cloud-init errors with puppet. They aren't likely to re-raise since all of this is happening on first boot only.
  • Rework cloud-vps to rely on the puppet agent rather than the run-puppet-agent cron (this isn't 100% bad since it would make our VMs a little bit more like normal/upstream hosts. I don't know why the WMF abhors the puppet agent.)
  • I can spend yet more time trying to mitigate the race conditions -- maybe striking a balance where cloud-init starts the puppet agent and the final phase then waits on (or kills) any in-progress puppet runs before proceeding. This is the direction I'm inclined to go, but I've already burned a lot of time on this (it requires building new base images in some cases), and the reset-failed option is starting to look like the better part of valor.

Note that fixing puppet MIGHT NOT actually satisfy systemd; since I've never seen a clean run of all cloud-init stages, I'm not 100% sure that that's the solution. Seems likely though!

  • Rework cloud-vps to rely on the puppet agent rather than the run-puppet-agent cron (this isn't 100% bad since it would make our VMs a little bit more like normal/upstream hosts. I don't know why the WMF abhors the puppet agent.)

I believe it is avoided to prevent the situation where you reboot the entire fleet and then, half an hour later, the whole fleet does a puppet run at the same time, causing a potential DoS against the puppet server. With cron, you can introduce random jitter.
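
(For illustration, the cron approach amounts to a wrapper along these lines; the run-puppet-agent path and the ten-minute splay window are made up for the example.)

  #!/bin/bash
  # Splay the run so a fleet-wide reboot doesn't stampede the puppet server.
  sleep $(( RANDOM % 600 ))
  exec /usr/local/sbin/run-puppet-agent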

@Andrew, can cloud-init finish things up with systemd reset-failed puppet-agent or something like that?


Probably not, since that would happen before it exits, and it's the exit state that upsets systemd.

Either way, while it would be genuinely interesting and helpful one day to be able to monitor for widespread cloud-init failures, we don't do that right now and won't have the capability for a while. It seems fine to clean it up with a reset-failed. And since we don't normally see success as it is, monitoring would mostly just tell us that things are as mixed up as usual.

I installed a new Buster base image that builds with no systemd warnings. Going to let a VM sit for a while to see if the Prometheus issue appears.

I won't bother to build a new Stretch base image since new stretch VMs are pretty rare these days. Remaining things are:

  • clean up remaining cloud-init flags with cumin
  • uninstall smartmon and clean up smartmon unit failures with cumin

Mentioned in SAL (#wikimedia-cloud) [2021-07-31T00:08:51Z] <andrewbogott> "systemctl reset-failed cloud-final.service" on all VMs for T287309

Mentioned in SAL (#wikimedia-cloud) [2021-07-31T00:10:12Z] <andrewbogott> "systemctl reset-failed cloud-init.service" on all VMs for T287309
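
(Those two SAL entries correspond to fleet-wide cumin runs along the lines of the following; the 'A:all' alias is illustrative, not necessarily the exact selector used.)

  cumin 'A:all' 'systemctl reset-failed cloud-final.service'
  cumin 'A:all' 'systemctl reset-failed cloud-init.service'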

Update to the list above: the smartmon uninstall and unit-failure cleanup is already handled by puppet; the remaining item is cleaning up the leftover cloud-init flags with cumin.

401 VMs now show 0 failed units. The rest have failures, but the specific failures are all over the place, so I think they're valid candidates for monitoring. I don't see many prometheus issues, but maybe the act of checking for them is suppressing them.

@Bstorm, feel free to close this, or let me know if there are other failure clusters that you'd like me to poke at.