
Figure out how to disable starting of jobrunner/jobchron in the non-active DC
Closed, Resolved, Public

Description

scap deploys the jobrunner service to all hosts and restarts it on all of them. However, it should not be started in the non-active datacenter. We still want to deploy code to all datacenters, though.

@akosiaris suggests to handle that in Puppet so that the service is marked disabled. That would prevent restarts.
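
The requirement above (deploy everywhere, restart only in the active DC) can be sketched as a small planning function. This is a minimal illustration only, not scap or Puppet code: the `Host` type, `plan_deploy`, and the datacenter names in the example are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Host:
    name: str
    datacenter: str


def plan_deploy(hosts: List[Host], active_dc: str) -> Tuple[List[str], List[str]]:
    """Return (hosts to deploy code to, hosts to restart jobrunner/jobchron on)."""
    # Code is deployed to every host in every datacenter.
    deploy = [h.name for h in hosts]
    # The services are only (re)started on hosts in the active datacenter.
    restart = [h.name for h in hosts if h.datacenter == active_dc]
    return deploy, restart


if __name__ == "__main__":
    # Hypothetical fleet: one host per datacenter.
    fleet = [Host("mw1001", "eqiad"), Host("mw2153", "codfw")]
    print(plan_deploy(fleet, active_dc="eqiad"))  # (['mw1001', 'mw2153'], ['mw1001'])
```

The Puppet approach suggested above moves the "restart only here" half of this decision out of the deploy tool and into host state.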

Event Timeline

I've had a quick look into systemd's mask feature. It should allow us to mark a service as masked so that it cannot be started, either manually or automatically. However, Puppet does not support it well before 4.2.0 (https://github.com/puppetlabs/puppet/commit/1e2a71604e184477f94d516d86366adf1fef2452). Relying on it also looks wrong for multiple reasons:

  • When we actually upgrade to Puppet 4.2.0, where mask is supported, manually masked services might very well transition to a different state (the one configured by Puppet)
  • We still have Trusty hosts, which don't have systemd, and mask is systemd-specific
  • Even on Jessie hosts we run Puppet 3.8, and upgrading to 4.x is not going to happen any time soon.
Krinkle renamed this task from figure out how to not restart jobrunner/jobchron in the non-active DC to Figure out how to disable starting of jobrunner/jobchron in the non-active DC. Jul 25 2017, 10:04 PM

I made a couple of patches that attempt to address this problem and attached them to T129148: Deploy jobrunner with scap3 (Trebuchet jobrunner/jobrunner).

A full description of how this might work is at T129148#3482379.

That said, doing this in a way that doesn't rely on scap logic is still my ideal, so feedback is welcome :)

FWIW, masking the service caused the restart command to exit non-zero, which caused scap to fail.
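
A minimal sketch of that failure mode, assuming the deploy step runs the restart command and treats any non-zero exit as fatal. The helper names `restart_service` and `deploy_step` are hypothetical, and a Python one-liner that exits 1 stands in for restarting a masked unit.

```python
import subprocess
import sys
from typing import List


def restart_service(cmd: List[str]) -> None:
    # check=True raises CalledProcessError when the command exits non-zero.
    subprocess.run(cmd, check=True)


def deploy_step(restart_cmd: List[str]) -> str:
    """Hypothetical deploy step: any restart failure fails the whole step."""
    try:
        restart_service(restart_cmd)
    except subprocess.CalledProcessError as exc:
        return f"deploy failed: restart exited with {exc.returncode}"
    return "deploy ok"


if __name__ == "__main__":
    # Stand-in for a restart of a masked unit, which exits non-zero.
    masked_restart = [sys.executable, "-c", "import sys; sys.exit(1)"]
    print(deploy_step(masked_restart))  # deploy failed: restart exited with 1
```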

@fgiunchedi gave me feedback on my previous plan that led to the creation of D743.

Now that D743 has merged, masking (or otherwise removing) the jobrunner/jobchron services in the non-active datacenter and adding require_valid_service: True to the scap config should prevent scap from attempting a restart in the non-active DC.
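
The intended decision can be sketched as follows. This is a hypothetical helper, not scap's actual code; it only assumes, as described above, that a masked unit (LoadState `masked`) or a removed one (LoadState `not-found`) should be skipped when `require_valid_service` is enabled.

```python
# LoadState values treated as "not a valid service" in this sketch:
# a masked unit, or one whose unit file has been removed.
SKIP_LOAD_STATES = frozenset({"masked", "not-found"})


def should_restart(load_state: str, require_valid_service: bool) -> bool:
    """Hypothetical check run before restarting a unit on a target host."""
    if not require_valid_service:
        # Without the option, the restart is always attempted.
        return True
    return load_state not in SKIP_LOAD_STATES


if __name__ == "__main__":
    print(should_restart("masked", require_valid_service=True))   # False
    print(should_restart("loaded", require_valid_service=True))   # True
    print(should_restart("masked", require_valid_service=False))  # True
```

Keeping the check opt-in means hosts that don't set the option keep the old always-restart behaviour.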

Change 374438 had a related patch set uploaded (by Thcipriani; owner: Thcipriani):
[operations/puppet@production] Mask jobchron and jobrunner in non-active DC

https://gerrit.wikimedia.org/r/374438

Change 374438 merged by Alexandros Kosiaris:
[operations/puppet@production] Mask jobchron and jobrunner in non-active DC

https://gerrit.wikimedia.org/r/374438

Change 374822 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] mask file should be in /etc directory

https://gerrit.wikimedia.org/r/374822

Change 374822 merged by Alexandros Kosiaris:
[operations/puppet@production] mask file should be in /etc directory

https://gerrit.wikimedia.org/r/374822
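
For context on what a mask is on disk: per the systemctl documentation, `systemctl mask` links a unit's file under /etc/systemd/system to /dev/null, and units in /etc/systemd/system take precedence over those in /lib/systemd/system. The sketch below mimics that in a scratch directory. `mask_unit` and `is_masked` are hypothetical helpers, and this is not a description of what the Puppet change itself does.

```python
import os
import tempfile


def mask_unit(unit_dir: str, unit: str) -> str:
    """Mask `unit` the way `systemctl mask` does: link its unit file to /dev/null."""
    path = os.path.join(unit_dir, unit)
    os.symlink(os.devnull, path)
    return path


def is_masked(unit_dir: str, unit: str) -> bool:
    """True if the unit file in `unit_dir` is a symlink pointing at /dev/null."""
    path = os.path.join(unit_dir, unit)
    return os.path.islink(path) and os.readlink(path) == os.devnull


if __name__ == "__main__":
    # Use a scratch directory instead of the real /etc/systemd/system.
    with tempfile.TemporaryDirectory() as unit_dir:
        mask_unit(unit_dir, "jobrunner.service")
        print(is_masked(unit_dir, "jobrunner.service"))  # True
        print(is_masked(unit_dir, "jobchron.service"))   # False
```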

\o/ thanks for the review and follow-up @akosiaris

Just double-checked some of the jobrunners this morning:

mwdeploy@mw2153:~$ /bin/systemctl show --property LoadState jobchron
LoadState=masked
mwdeploy@mw2153:~$ /bin/systemctl show --property LoadState jobrunner
LoadState=masked

which means that they can't be manually started, and the LoadState is exactly what scap will be looking for: https://github.com/wikimedia/scap/blob/master/scap/utils.py#L693-L715
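
The same check can be scripted. A minimal sketch, assuming `systemctl show` prints one `Key=Value` line per requested property, as in the output above; `parse_systemctl_show` and `load_state` are hypothetical helpers and are not the scap function linked above.

```python
import subprocess
from typing import Dict


def parse_systemctl_show(output: str) -> Dict[str, str]:
    """Turn `systemctl show` output (one Key=Value per line) into a dict."""
    props: Dict[str, str] = {}
    for line in output.splitlines():
        # Strip surrounding whitespace, e.g. trailing spaces from a terminal copy.
        line = line.strip()
        if not line or "=" not in line:
            continue
        # Split on the first "=" only; values may themselves contain "=".
        key, _, value = line.partition("=")
        props[key] = value
    return props


def load_state(unit: str) -> str:
    """Return the LoadState of `unit`; needs systemctl on the host."""
    result = subprocess.run(
        ["/bin/systemctl", "show", "--property", "LoadState", unit],
        capture_output=True, text=True, check=True,
    )
    return parse_systemctl_show(result.stdout).get("LoadState", "")


if __name__ == "__main__":
    # Parse captured output instead of calling systemctl.
    print(parse_systemctl_show("LoadState=masked\n"))  # {'LoadState': 'masked'}
```

On a host with systemd, `load_state("jobrunner")` runs the same command shown above and returns just the state string.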

\o/ Is there anything else left to do, or can we declare victory on this one?

Krinkle claimed this task.
Krinkle reassigned this task from Krinkle to thcipriani.
Krinkle subscribed.