The safe service restart script doesn't detect failure when running with poolcounter.
Closed, Resolved (Public)

Description

ServiceRestarter.run returns an integer, while poolcounter.client.Client.run only reports an error if the callback raises an exception.

As a result, when the restart script runs with --max-concurrency set, it exits with a 0 exit code even when it fails to restart the service properly.
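The mismatch can be illustrated with a minimal, self-contained sketch. Everything below is a hypothetical stand-in written for illustration: FakeClient, restart_service, raise_on_failure and exit_code are not the real poolcounter client or the actual restart script, and the wrapper is only one possible way to bridge the two conventions, not necessarily what the merged patch does.

```python
class RestartFailed(Exception):
    """Raised when the wrapped callback returns a non-zero status."""


class FakeClient:
    """Hypothetical stand-in for a lock client.

    Like the behaviour described in this task, run() reports failure
    only when the callback raises; the return value is ignored.
    """

    def run(self, callback):
        try:
            callback()
        except Exception:
            return False
        return True


def restart_service():
    """Hypothetical restarter: signals failure by returning an integer."""
    return 1  # simulate a failed restart


def raise_on_failure(func):
    """Turn a non-zero integer return value into an exception."""

    def wrapper():
        rc = func()
        if rc != 0:
            raise RestartFailed(f"restart returned {rc}")

    return wrapper


def exit_code(client, callback):
    """Map the client's success flag to a process exit code."""
    return 0 if client.run(callback) else 1
```

With these stand-ins, exit_code(FakeClient(), restart_service) yields 0 even though the restart failed, while exit_code(FakeClient(), raise_on_failure(restart_service)) yields 1, which a caller such as systemd can detect.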

This is particularly bad when the script runs automatically, for example from systemd, because we cannot detect any kind of failure programmatically. It almost caused an outage when one of the lvs servers had a faulty cable and depooled a good chunk of the servers it couldn't reach.

Event Timeline

Joe triaged this task as Unbreak Now! priority.

Change 656838 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/puppet@production] safe-service-restart: proper error handling with poolcounter

https://gerrit.wikimedia.org/r/656838

Change 656838 merged by Giuseppe Lavagetto:
[operations/puppet@production] safe-service-restart: proper error handling with poolcounter

https://gerrit.wikimedia.org/r/656838

The change has been merged and will be deployed everywhere within the next 20 minutes or so.