jobrunner / jobchron systemd services are in error state after a stop
Closed, Resolved, Public

Description

On the beta cluster host deployment-jobrunner02.deployment-prep.eqiad.wmflabs, whenever one stops jobrunner or jobchron, the service is reported as failed by systemd:

# systemctl stop jobchron
# systemctl status jobchron
● jobchron.service - "Mediawiki job queue chron loop"
   Loaded: loaded (/lib/systemd/system/jobchron.service; enabled)
           vvvvvv
   Active: failed (Result: exit-code) since Fri 2017-06-16 09:27:07 UTC; 1s ago
           ^^^^^^
  Process: 29240 ExecStart=/usr/bin/php /srv/deployment/jobrunner/jobrunner/redisJobChronService --config-file=${JOBRUNNER_CONFIG} ${DAEMON_OPTS} (code=exited, status=143)
                                      vvv
 Main PID: 29240 (code=exited, status=143)
                                      ^^^

The reason is that both redisJobChronService and redisJobRunnerService catch the signals HUP, INT and TERM and exit() with 128 + <signal number>. systemctl stop sends SIGTERM (signal 15) by default, hence the status 143 above.
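
As a quick illustration of that arithmetic, the sketch below prints the exit status each of the three caught signals would produce. The helper name exit_status_for_signal is hypothetical, and the signal numbers (HUP=1, INT=2, TERM=15) are the usual Linux values rather than anything read from the jobrunner code.

```shell
#!/bin/sh
# Hypothetical helper: exit status used by a daemon that exits with 128 + signal number.
exit_status_for_signal() {
    echo $((128 + $1))
}

# Usual Linux signal numbers: HUP=1, INT=2, TERM=15.
for sig in 1 2 15; do
    echo "signal $sig -> exit status $(exit_status_for_signal "$sig")"
done
```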

We can make systemd treat these non-zero exit codes as a successful exit by using SuccessExitStatus.
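
A minimal sketch of what such a setting could look like as a systemd drop-in. This is an assumption for illustration, not the content of any actual patch: the file name exit-codes.conf is hypothetical, and the file is written to a scratch directory rather than to the real unit path.

```shell
#!/bin/sh
# Sketch only: write the drop-in to a scratch directory, not to /etc/systemd/system.
dropin_dir="$(mktemp -d)/jobchron.service.d"
mkdir -p "$dropin_dir"

cat > "$dropin_dir/exit-codes.conf" <<'EOF'
[Service]
# 129 = 128 + SIGHUP(1), 130 = 128 + SIGINT(2), 143 = 128 + SIGTERM(15)
SuccessExitStatus=129 130 143
EOF

cat "$dropin_dir/exit-codes.conf"
```

On a real host a drop-in like this would normally live under /etc/systemd/system/jobchron.service.d/ and take effect after systemctl daemon-reload.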

Event Timeline

Change 357362 had a related patch set uploaded (by Hashar; owner: Hashar):
[operations/puppet@production] jobrunner: add exit codes to services units

https://gerrit.wikimedia.org/r/357362

The alternative is to have redisJobChronService and redisJobRunnerService call exit( 0 ); when they catch a signal.
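
The effect of that alternative can be sketched with a small shell stand-in for the daemon. The real services are PHP and also handle HUP and INT; this hypothetical loop only traps TERM and is not the jobrunner code.

```shell
#!/bin/sh
# Hypothetical stand-in for the daemon: loop forever, exit 0 when TERM is caught.
sh -c 'trap "exit 0" TERM; while :; do sleep 1; done' &
pid=$!

sleep 1                 # give the child time to install its trap
kill -TERM "$pid"       # roughly what "systemctl stop" does by default

status=0
wait "$pid" || status=$?
echo "exit status: $status"
```

With the trap in place the stand-in exits with status 0 on TERM, which systemd would record as a clean stop. A process killed by a signal it cannot catch, such as KILL, would still be reported as failed.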

Marostegui added a subscriber: Marostegui.

I would rather go for your last solution (T168044#3354093) than somewhat fake the exit codes in systemd.
No strong opinion on either approach, as long as the service still reports a failure (exit 1) if it is killed or dies by itself.

Change 357362 abandoned by Hashar:
jobrunner: add exit codes to services units

Reason:
Per Marostegui's suggestion on T168044, let's just have the daemon exit(0) instead.

https://gerrit.wikimedia.org/r/357362

Change 359923 had a related patch set uploaded (by Hashar; owner: Hashar):
[mediawiki/services/jobrunner@master] Services now exit(0) when catching a signal

https://gerrit.wikimedia.org/r/359923

hashar lowered the priority of this task from Medium to Low. (Jun 19 2017, 12:45 PM)
hashar changed the task status from Open to Stalled. (Aug 30 2017, 8:17 PM)

This depends on the completion of T129148. It is currently waiting for a new version of scap to be deployed (for non-active DC support).

Change 359923 merged by jenkins-bot:
[mediawiki/services/jobrunner@master] Services now exit(0) when catching a signal

https://gerrit.wikimedia.org/r/359923

Mentioned in SAL (#wikimedia-operations) [2017-10-11T20:04:25Z] <hashar@tin> Started restart [jobrunner/jobrunner@a20d043]: Services now exit(0) when catching a signal - T168044

Mentioned in SAL (#wikimedia-operations) [2017-10-11T20:07:57Z] <hashar@tin> Started restart [jobrunner/jobrunner@a20d043]: Services now exit(0) when catching a signal - T168044

Mentioned in SAL (#wikimedia-operations) [2017-10-11T20:11:43Z] <hashar@tin> Started restart [jobrunner/jobrunner@a20d043]: Services now exit(0) when catching a signal - T168044

Mentioned in SAL (#wikimedia-operations) [2017-10-11T20:13:28Z] <hashar@tin> Started deploy [jobrunner/jobrunner@a20d043]: Services now exit(0) when catching a signal - T168044

Mentioned in SAL (#wikimedia-operations) [2017-10-11T20:13:35Z] <hashar@tin> Finished deploy [jobrunner/jobrunner@a20d043]: Services now exit(0) when catching a signal - T168044 (duration: 00m 02s)

Mentioned in SAL (#wikimedia-operations) [2017-10-11T20:14:09Z] <hashar@tin> Started deploy [jobrunner/jobrunner@a20d043]: Services now exit(0) when catching a signal - T168044

Mentioned in SAL (#wikimedia-operations) [2017-10-11T20:17:03Z] <hashar@tin> Finished deploy [jobrunner/jobrunner@a20d043]: Services now exit(0) when catching a signal - T168044 (duration: 02m 54s)

hashar added a subscriber: thcipriani.

Solved. All kudos/credits go to @thcipriani and the Scap developers!