Restarts of ganglia-monitor are unreliable
Closed, Declined (Public)

Description

When restarting ganglia-monitor for the expat update (along with other low-level system services using expat), the restart failed on approximately 5% of the cluster (full list at https://phabricator.wikimedia.org/P3144).

I sampled a few of the hosts and there are two distinct errors:

  • Some systems (e.g. ytterbium) have two gmond processes running, one with "--pid-file /var/run/gmond.pid" and one with "--pid-file /var/run/ganglia-monitor.pid"
  • Some systems (e.g. ms-be2* or db2001) have a stale pidfile which makes the restart fail with "Unknown instance:" (which surprises me since I'd expect upstart to track PIDs itself)
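
The two failure modes above could be checked per host with something like the following. This is only a minimal sketch, not how the hosts were actually sampled: the function names `pidfile_status` and `gmond_pidfiles` are made up for illustration, and it assumes the gmond command lines have already been collected (e.g. from `ps` output) and that the pidfile contains a single integer PID.

```python
import os
import re
from pathlib import Path


def pidfile_status(path):
    """Classify a pidfile as 'missing', 'invalid', 'stale' or 'live'."""
    p = Path(path)
    if not p.exists():
        return "missing"
    try:
        pid = int(p.read_text().strip())
    except ValueError:
        return "invalid"
    if pid <= 0:
        return "invalid"
    try:
        # Signal 0 performs the existence/permission check only;
        # no signal is actually delivered to the process.
        os.kill(pid, 0)
    except ProcessLookupError:
        return "stale"  # pidfile points at a process that no longer exists
    except PermissionError:
        return "live"   # process exists but is owned by another user
    return "live"


def gmond_pidfiles(cmdlines):
    """Return the distinct --pid-file values used by gmond command lines."""
    found = set()
    for cmd in cmdlines:
        args = cmd.split()
        if not args or os.path.basename(args[0]) != "gmond":
            continue
        m = re.search(r"--pid-file[ =](\S+)", cmd)
        if m:
            found.add(m.group(1))
    return sorted(found)
```

More than one entry returned by `gmond_pidfiles` would correspond to the first case (two gmond processes with different pidfiles), and a `'stale'` result from `pidfile_status` to the second. Note that a "live" result only says that some process has that PID, not that it is gmond.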

Event Timeline

Joe triaged this task as Medium priority. May 19 2016, 2:28 PM
Dzahn lowered the priority of this task from Medium to Low. Dec 14 2016, 7:53 PM
Dzahn subscribed.

Lowering priority because in the meantime we have a goal to remove Ganglia.

fgiunchedi subscribed.

Ganglia is indeed going away