
Unexpected auditd service restart failure
Closed, Resolved · Public

Description

When running puppet agent for the first time on the doh* hosts after https://gerrit.wikimedia.org/r/q/Id438985ffe720dc630f0e43eed8bda4a47c9196c, the auditd service failed to start with the following message:

Created symlink /etc/systemd/system/multi-user.target.wants/auditd.service -> /lib/systemd/system/auditd.service.
Job for auditd.service failed because a timeout was exceeded.
See "systemctl status auditd.service" and "journalctl -xe" for details.
invoke-rc.d: initscript auditd, action "start" failed.
* auditd.service - Security Auditing Service
   Loaded: loaded (/lib/systemd/system/auditd.service; enabled; vendor preset: enabled)
   Active: failed

The failure happened on some hosts but not all of them, even though they all received the same change. This looks like a known auditd bug, https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=962451, in which the service times out when started right after installation and a second start attempt succeeds; this matches the behaviour we observed. A possible fix is available at https://github.com/linux-audit/audit-userspace/commit/ee6608eca034494fc2597b2990852adec236e486.

We should observe whether this happens again after subsequent restarts and then consider backporting the patch to our auditd build.

Event Timeline

AFAICT we aren't packaging auditd ourselves. It might be easiest to just notify a trigger to re-start the stupid service after install since it looks like Debian isn't going to fix it.
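For illustration only, the "start it again after install" workaround could be sketched roughly as below. This is a minimal sketch, not what was deployed: the helper name `start_with_retry`, its parameters, and the default delay are hypothetical, and it assumes `systemctl start` exits non-zero when the unit times out.

```python
import subprocess
import time
from typing import Callable, Sequence


def start_with_retry(
    unit: str = "auditd.service",
    attempts: int = 2,
    delay_seconds: float = 5.0,
    # Injectable so the logic can be exercised without touching systemd.
    run: Callable[[Sequence[str]], int] = lambda cmd: subprocess.run(cmd).returncode,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Run `systemctl start <unit>` up to `attempts` times.

    Returns True as soon as one attempt exits with status 0,
    and False if every attempt fails.
    """
    for attempt in range(1, attempts + 1):
        if run(["systemctl", "start", unit]) == 0:
            return True
        # Pause before retrying, but not after the final attempt.
        if attempt < attempts:
            sleep(delay_seconds)
    return False
```

With the defaults this makes at most two start attempts, which matches the "fails once, then works" pattern described in the Debian bug; whether a delay between attempts is needed at all is an assumption.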

Per the bug, that should be fixed in the auditd package in Bullseye; we'll be able to confirm when we reimage the doh* servers to Bullseye.

Ah, my bad, I thought this *was* affecting bullseye. Oops. Sounds good then.

ssingh claimed this task.

We reimaged two hosts to bullseye and didn't notice any auditd failure, so confirming what @MoritzMuehlenhoff said above and marking this as resolved.