Page MenuHomePhabricator

puppet stopped mysqld using orphan pid file from puppet agent
Closed, ResolvedPublic

Description

A puppet run on db1050 stopped mysqld using an orphan pid file for it own agent, /var/run/puppet/agent.pid, which happened to match the mysqld pid after that service restarted.

mysqld db1050.err:

150111 15:24:04 [Note] /opt/wmf-mariadb10/bin/mysqld: Normal shutdown

puppet.log:

Jan 11 15:24:04 db1050 puppet-agent[1849]: (/Stage[main]/Base::Puppet/Service[puppet]/ensure) ensure changed 'running' to 'stopped'

Unsure why this didn't trigger sooner since the orphan pid file timestamp was months old and mysqld had been running for three days.

How to prevent this in the future... sweep check for orphan pid files, or make puppet smarter about blindly stopping services, or treat as a fluke?

Event Timeline

Springle claimed this task.
Springle raised the priority of this task from to Needs Triage.
Springle updated the task description. (Show Details)
Springle added a project: acl*sre-team.
Springle added a subscriber: Springle.

Increasing the sysctl "kernel.pid_max" to 4M would probably work as a temporary hack, and would help to mitigate similar errors in other programs. Ideally puppet would check /proc/$pid/cmdline or something to make sure it is killing the right thing.

Unsure why this didn't trigger sooner since the orphan pid file timestamp was months old and mysqld had been running for three days.

The number in the PID file did not correspond to mysqld's PID, or else Puppet would have killed it sooner. It must have corresponded to the TID of one of mysqld's threads at the time Puppet ran.

Puppet checked the service status by running /etc/init.d/puppet status, which called /lib/lsb/init-function's status_of_proc, which confirmed that the service was running by ensuring that kill -0 succeeded when invoked with the number from the PID file. It then killed the process using start-stop-daemon, which called kill(2) with the same number as its argument.

[[ http://manpages.courier-mta.org/htmlman1/kill.1.html#kill-1_sect3 | Both kill(2) and kill(1) can be called with a TID instead of a PID ]]. When start-stop-daemon attempted to send a TERM signal to the TID in Puppet's pidfile, the signal was sent to the entire thread group, causing mysqld to shut down.

I think this can be prevented now by changing /etc/init.d/puppet to use --name and possibly --user in addition to --pidfile with start-stop-daemon.

In a future where we switch to Debian and thus systemd AFAIK this would not happen with a native service (it uses a cgroup per service for tracking and killing it) instead of a shell init script.

But to completely prevent anything like it we would need to change this for all services running on one system.

fgiunchedi triaged this task as Medium priority.Apr 1 2015, 11:24 AM
fgiunchedi added a subscriber: fgiunchedi.

agreed, in a systemd world services are shielded from each other, I've checked across the fleet but found no spurious puppet pid files. perhaps an artifact of upgrades? iirc db machines were upgraded in place to trusty

Marostegui added a subscriber: Marostegui.

After having a chat with Filippo we believe this hasn't happened again so we are closing it for now. Feel free to reopen if this is seen again.