Page MenuHomePhabricator

puppet-run is confused by stale lock files
Closed, ResolvedPublic

Description

Puppet runs on tools-webgrid-06 took longer than usual, probably due to some issue with virt1000; this caused puppet-run's timeout to kick in and kill the Puppet run:

[…]
Notice: Caught TERM; calling stop

Subsequent runs of puppet-run found the stale lock file and stopped short of starting Puppet:

Skipping this run, puppet agent already running at pid 4685

However, there was no process running as pid 4685, and manual puppet agent -tv invocations started happily (with their own locking working properly, i. e. "Notice: Run of Puppet configuration client already in progress; skipping (/var/lib/puppet/state/agent_catalog_run.lock exists)").

The locking check was added in https://gerrit.wikimedia.org/r/#/c/196162/; I don't understand what it is meant to protect against or how it would do that, but it's not working at the moment.

Event Timeline

scfc raised the priority of this task from to Needs Triage.
scfc updated the task description. (Show Details)
scfc added projects: Puppet, Cloud-Services.
scfc added subscribers: scfc, BBlack.

It's to protect against a scenario I've run into repeatedly on fresh machine installs: while the admin is executing the initial puppet run to configure the host (which can take several minutes), puppet deploys the cron that periodically executes puppet-run, and then by timing coincidence (for a 5 minute install, odds are ~25% due to 20-minute cron spacing) puppet-run kicks off while the initial puppet is still running. Puppet itself within puppet-run would fail the lock on its own, but the puppet-run scripts does an apt-get update before that, which clashes with the initial puppet run's attempt to install packages, causing failure of the initial puppet run. The same scenario can (and sometimes, does) also unfold any time an admin is manually running puppet on a host, although with lower probably due to shorter puppet runtimes.

Should probably just upgrade the lockfile check in puppet-run to actually check that (a) the given PID is alive and (b) the process name matches /puppet/ and (c) the mtime of the file is within, say, the past half-hour. That will at least eliminate common failures from stale lockfiles.

Even better would be to find a way to due true mutual locking between all possible invocations of puppet and puppet-run, but I haven't seen an elegant way to do that yet, given that puppet's lockfile code is not very robust (IMHO, it should be using fcntl() locks that endure for the duration of the agent run and expire automatically at process exit...). Either that or find a way to fix apt and/or puppet's "package" type such that concurrent installs and database updates wait on each other properly instead of failing.

Change 196963 had a related patch set uploaded (by BBlack):
puppet-run lock: check running pid cmdline match T92766

https://gerrit.wikimedia.org/r/196963

Change 196963 merged by BBlack:
puppet-run lock: check running pid cmdline match T92766

https://gerrit.wikimedia.org/r/196963

Note the above only implements (a) and (b); it lacks the mtime check, but that may not prove necessary anyways.

BBlack claimed this task.

I'm assuming the fixup from a week ago worked for labs as well, closing. Re-open if not! :)