When restarting ganglia-monitor for the expat update (along with other low-level system services using expat), the restart failed on approximately 5% of the cluster (full list at https://phabricator.wikimedia.org/P3144).
I sampled a few of the hosts and there are two distinct errors:
- Some systems (e.g. ytterbium) have two gmond processes running, one with "--pid-file /var/run/gmond.pid" and one with "--pid-file /var/run/ganglia-monitor.pid"
- Some systems (e.g. ms-be2* or db2001) have a stale pidfile, which makes the restart fail with "Unknown instance:" (this surprises me, since I'd expect upstart to track PIDs itself)
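
To tell the two cases apart on a given host, something like the following could help. This is only a rough sketch, not the check that was actually run: the helper names (read_proc_cmdlines, gmond_pidfiles, pidfile_is_stale) are made up for illustration, it assumes a Linux /proc layout, and it assumes gmond is started with "--pid-file" as seen in the process listings above.

```python
import os

# Pidfile paths taken from the two gmond command lines seen on ytterbium.
PIDFILES = ("/var/run/gmond.pid", "/var/run/ganglia-monitor.pid")


def read_proc_cmdlines(proc="/proc"):
    """Return {pid: argv} for every readable process under /proc (Linux only)."""
    procs = {}
    for entry in os.listdir(proc):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc, entry, "cmdline"), "rb") as fh:
                raw = fh.read()
        except OSError:
            continue  # process exited or is not readable
        argv = [a.decode(errors="replace") for a in raw.split(b"\0") if a]
        if argv:
            procs[int(entry)] = argv
    return procs


def gmond_pidfiles(procs):
    """Return {pid: pidfile path or None} for each gmond process in procs."""
    found = {}
    for pid, argv in procs.items():
        if os.path.basename(argv[0]) != "gmond":
            continue
        pidfile = None
        for i, arg in enumerate(argv):
            if arg == "--pid-file" and i + 1 < len(argv):
                pidfile = argv[i + 1]
            elif arg.startswith("--pid-file="):
                pidfile = arg.split("=", 1)[1]
        found[pid] = pidfile
    return found


def pidfile_is_stale(contents, gmond_pids):
    """A pidfile is treated as stale if it does not name a running gmond PID."""
    try:
        pid = int(contents.strip())
    except ValueError:
        return True
    return pid not in gmond_pids


if __name__ == "__main__":
    running = gmond_pidfiles(read_proc_cmdlines())
    if len(running) > 1:
        print("multiple gmond processes:", running)
    for path in PIDFILES:
        if os.path.exists(path):
            with open(path) as fh:
                if pidfile_is_stale(fh.read(), set(running)):
                    print("stale pidfile:", path)
```

Run as root on an affected host, this prints a line for each of the two symptoms it finds. It only inspects state and does not attempt to clean anything up.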