Deploy jobrunner with scap3 (Trebuchet jobrunner/jobrunner)
Closed, Resolved (Public)

Description

The jobrunner service is deployed via Trebuchet jobrunner/jobrunner and must be migrated to Scap.

https://wikitech.wikimedia.org/wiki/Scap3/Migration_Guide

Event Timeline

Change 354199 had a related patch set uploaded (by Thcipriani; owner: Thcipriani):
[mediawiki/services/jobrunner@master] Scap3: deploy jobrunner with scap3

https://gerrit.wikimedia.org/r/354199

Change 354199 merged by jenkins-bot:
[mediawiki/services/jobrunner@master] Scap3: deploy jobrunner with scap3

https://gerrit.wikimedia.org/r/354199

Tested on beta cluster and all seems to be working. @hashar has volunteered to babysit the initial deployment. Now just need an opsen around for puppet wrangling during the deployment to production. @akosiaris or @Joe can you help merge these changes/get jobrunner off trebuchet?

Change 357240 had a related patch set uploaded (by Thcipriani; owner: Thcipriani):
[mediawiki/services/jobrunner@master] Scap3: Deploy to groups of 5 servers

https://gerrit.wikimedia.org/r/357240

Change 357240 merged by jenkins-bot:
[mediawiki/services/jobrunner@master] Scap3: Deploy to groups of 5 servers

https://gerrit.wikimedia.org/r/357240

Mentioned in SAL (#wikimedia-operations) [2017-06-06T09:04:39Z] <akosiaris> disable puppet on all jobrunners T129148

Change 354186 merged by Alexandros Kosiaris:
[operations/puppet@production] Scap3: deploy jobrunner with scap3

https://gerrit.wikimedia.org/r/354186

Mentioned in SAL (#wikimedia-operations) [2017-06-06T09:11:51Z] <akosiaris> git pull and scap deploy --init for jobrunner T129148

Mentioned in SAL (#wikimedia-operations) [2017-06-06T09:12:58Z] <akosiaris> running puppet on mw1161 T129148

Mentioned in SAL (#wikimedia-operations) [2017-06-06T09:19:42Z] <akosiaris> running puppet again on tin, after moving /serv/deployment/jobrunner/jobrunner T129148

Mentioned in SAL (#wikimedia-operations) [2017-06-06T09:25:33Z] <akosiaris> moving around jobrunner/jobrunner was probably not required T129148

Mentioned in SAL (#wikimedia-operations) [2017-06-06T09:25:43Z] <akosiaris> running puppet on videoscalers T129148

Mentioned in SAL (#wikimedia-operations) [2017-06-06T09:29:00Z] <akosiaris> running puppet on jobrunners T129148

Mentioned in SAL (#wikimedia-operations) [2017-06-06T09:33:45Z] <akosiaris> restart jobchron service across jobrunners T129148

Mentioned in SAL (#wikimedia-operations) [2017-06-06T09:35:15Z] <akosiaris> restart jobchron service across videoscalers T129148

thcipriani assigned this task to hashar.

Status?

Currently deployed via Scap3. There are some followup tasks to make deployments for jobrunner better. Closing this task to track there.

So that is not fully done. We need to restart both jobrunner and jobchron. Scap support for that has been done via T167098 / D677. We thus now need:

  • scap 3.6
  • In scap.cfg: service_name: jobrunner = reload, jobchron
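
To illustrate the service_name syntax mentioned in the list above, here is a minimal Python sketch of how a value such as "jobrunner = reload, jobchron" could be split into (service, action) pairs. This is not scap's parser: the helper name parse_service_names is hypothetical, and defaulting to "restart" when no action is given is an assumption made for the example.

```python
def parse_service_names(value, default_action="restart"):
    """Split 'svc = action, svc2' into [(service, action), ...].

    Hypothetical helper, not scap's implementation. The default
    action of 'restart' is an assumption for illustration.
    """
    pairs = []
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue  # tolerate stray or trailing commas
        name, _sep, action = item.partition("=")
        # An absent or empty action falls back to the default.
        pairs.append((name.strip(), action.strip() or default_action))
    return pairs


print(parse_service_names("jobrunner = reload, jobchron"))
```

With this reading, jobrunner would be reloaded and jobchron restarted during the same deploy.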

Change 360856 had a related patch set uploaded (by Hashar; owner: Hashar):
[mediawiki/services/jobrunner@master] scap: also restart jobchron

https://gerrit.wikimedia.org/r/360856

Just now I tried to deploy an update to mediawiki/services/jobrunner. Given the Trebuchet entry point still exists on tin:/srv/deployment/jobrunner/jobrunner, and the documentation still mentions Trebuchet on https://wikitech.wikimedia.org/wiki/Jobrunner, and this task wasn't closed, I assumed it was still on Trebuchet. However, I see now that it was actually already converted to Scap3.

The deployment failed at the git deploy sync step ("0/44 minions completed fetch"). So probably nothing happened.

After this, I used Scap3 instead. Unfortunately, it seems Jobrunner was in a state ready to explode for the next deployer.

Incident report at https://wikitech.wikimedia.org/wiki/Incident_documentation/20170718-JobQueue.

Change 367743 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[mediawiki/services/jobrunner@master] scap: Remove restart configuration

https://gerrit.wikimedia.org/r/367743

I guess that when deploying with trebuchet, any service restarts were handled with salt. This explains why we never ran into these issues when using trebuchet. We are currently in the process of removing salt and, as I understand it, its replacement tool cumin is not open to non-root users.

There are two separate issues stopping the deployment of Jobrunner.

No restart for non-active DC

I think there is a workaround for T167104: Figure out how to disable starting of jobrunner/jobchron in the non-active DC. I would like to see this solved somewhere in puppet/systemd. In the short term, separate scap environments (https://doc.wikimedia.org/mw-tools-scap/scap3/quickstart/setup.html#environments) could work.

The first environment, the default, would not define a service_name. The second environment would. Restarting the services would be a separate command (as it evidently was with salt previously).

The directory layout would look like:

scap
├── environments
│   └── active
│       └── scap.cfg
└── scap.cfg

scap.cfg
[global]
git_repo: jobrunner/jobrunner
git_repo_user: mwdeploy
ssh_user: mwdeploy
server_groups: default
dsh_targets: jobrunner
git_submodules: False

# Divide servers into groups of 5
group_size: 5

[deployment-prep.eqiad.wmflabs]
server_groups: default
dsh_targets: betacluster

environments/active/scap.cfg
[global]
server_groups: default
dsh_targets: jobrunner-active
service_name: jobrunner
service_port: 9005

The dsh_targets files are the key here. The dsh file jobrunner would list all jobrunners, so every jobrunner receives new code as part of a deployment. The dsh file jobrunner-active would list only the jobrunners in the active datacenter; because that environment also defines a service_name and service_port, scap will restart those jobrunners.
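
As a rough illustration of how the two files relate, the Python sketch below layers the [global] section of environments/active/scap.cfg over the top-level scap.cfg using the standard configparser module. It assumes the environment file overrides the top-level one (the active file omits git_repo, which suggests inheritance); the merge function effective_config is hypothetical and is not how scap itself loads configuration.

```python
import configparser

BASE_CFG = """
[global]
git_repo: jobrunner/jobrunner
server_groups: default
dsh_targets: jobrunner
group_size: 5
"""

ACTIVE_CFG = """
[global]
server_groups: default
dsh_targets: jobrunner-active
service_name: jobrunner
service_port: 9005
"""


def effective_config(*layers):
    """Merge the [global] section of each layer; later layers win.

    Hypothetical helper illustrating the layering idea only.
    """
    merged = {}
    for text in layers:
        parser = configparser.ConfigParser()
        parser.read_string(text)
        merged.update(parser["global"])
    return merged


default_env = effective_config(BASE_CFG)
active_env = effective_config(BASE_CFG, ACTIVE_CFG)

# Default environment: all jobrunners, no service to restart.
print(default_env["dsh_targets"], "service_name" in default_env)
# Active environment: active-DC jobrunners plus a service to restart.
print(active_env["dsh_targets"], active_env["service_name"])
```

Under that assumption, the default environment targets every jobrunner and never restarts anything, while the active environment narrows the targets and adds the restart.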

A deployment would be two steps: a deploy, and a restart of the active environment.

What a deploy would look like

cd /srv/deployment/jobrunner/jobrunner
scap deploy "deploy new jobrunner code in the default environment"
scap deploy --environment active --service-restart "restart jobrunner service for active environment"

Restarting multiple services

There are a few problems here. First, as pointed out on IRC by @Krinkle, the mwdeploy user only has access to restart the jobrunner service, not jobchron:

(root) NOPASSWD: /usr/sbin/service jobrunner *

This needs a puppet patch.
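
To make the gap concrete, the Python sketch below approximates the sudoers wildcard with fnmatch-style matching. It is only an approximation of how sudo evaluates the rule, and the extra jobchron pattern shown in the usage note is a hypothetical example of what a fix might allow, not the content of the actual puppet patch.

```python
from fnmatch import fnmatchcase

# The rule shown above. Real sudoers matching has more rules than this;
# fnmatch is used here only as an approximation.
ALLOWED = ["/usr/sbin/service jobrunner *"]


def is_allowed(command, patterns=ALLOWED):
    """Return True if the command matches any allowed pattern."""
    return any(fnmatchcase(command, pattern) for pattern in patterns)


print(is_allowed("/usr/sbin/service jobrunner restart"))
print(is_allowed("/usr/sbin/service jobchron restart"))
```

Adding a second pattern such as "/usr/sbin/service jobchron *" (hypothetical) to the list would make the jobchron restart match as well.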

On the deployment tooling side, while this can be worked around using some weird environment setup, scap 3.6.0-1 (which I recently tagged https://phabricator.wikimedia.org/T127762#3472190) should be out soon. This should enable a small tweak to the scap.cfg so that the line service_name: jobrunner, jobchron will restart both.

Change 367815 had a related patch set uploaded (by Thcipriani; owner: Thcipriani):
[operations/puppet@production] Allow mwdeploy user to restart jobchron

https://gerrit.wikimedia.org/r/367815

Mentioned in SAL (#wikimedia-operations) [2017-07-27T20:24:35Z] <Krinkle> Un-dirtying state of /srv/deployment/jobrunner/jobrunner on tin (from T129148). Checking-out https://gerrit.wikimedia.org/r/367743 instead.

Change 367743 merged by jenkins-bot:
[mediawiki/services/jobrunner@master] scap: Remove restart configuration

https://gerrit.wikimedia.org/r/367743

Change 368476 had a related patch set uploaded (by Thcipriani; owner: Thcipriani):
[operations/puppet@production] Jobrunner: create dsh groups per datacenter

https://gerrit.wikimedia.org/r/368476

This patch should be sufficient to ensure that we can restart in the active data center only. The setup for this requires only a slight modification to how I've described environments/active/scap.cfg above:

environments/active/scap.cfg
[global]
server_groups: default
service_name: jobrunner,jobchron
service_port: 9005

[eqiad.wmnet]
dsh_targets: jobrunner-eqiad

[codfw.wmnet]
dsh_targets: jobrunner-codfw

The sections of the config file are based on the domain of the deployment host. Using the above config file and issuing the command:

scap deploy --environment active --service-restart "Restart jobrunner for reasons™"

from a deployment host in codfw will restart targets in the codfw datacenter; likewise, issuing the same command from tin.eqiad.wmnet would only restart jobrunner in the eqiad datacenter.
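
A minimal Python sketch of that selection logic follows. It assumes the section name is compared against the domain suffix of the deployment host's FQDN; the function config_for_host and the host name deployhost.codfw.wmnet are hypothetical, and scap's real lookup may differ.

```python
import configparser

ACTIVE_CFG = """
[global]
server_groups: default
service_name: jobrunner,jobchron
service_port: 9005

[eqiad.wmnet]
dsh_targets: jobrunner-eqiad

[codfw.wmnet]
dsh_targets: jobrunner-codfw
"""


def config_for_host(cfg_text, fqdn):
    """Start from [global], then apply the section matching the host's domain.

    Hypothetical helper; the suffix comparison is an assumption.
    """
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    settings = dict(parser["global"])
    for section in parser.sections():
        if section != "global" and fqdn.endswith("." + section):
            settings.update(parser[section])
    return settings


print(config_for_host(ACTIVE_CFG, "tin.eqiad.wmnet")["dsh_targets"])
# Hypothetical codfw deployment host name, for illustration only.
print(config_for_host(ACTIVE_CFG, "deployhost.codfw.wmnet")["dsh_targets"])
```

Both hosts share the same service_name and service_port from [global]; only the dsh_targets file changes with the datacenter.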

Change 368808 had a related patch set uploaded (by Thcipriani; owner: Thcipriani):
[mediawiki/services/jobrunner@master] Scap: Create "active" environment for restart

https://gerrit.wikimedia.org/r/368808

Do we have a mechanism to automatically select the right environment? Having to remember to use --environment active seems less than ideal.

Change 368476 abandoned by Thcipriani:
Jobrunner: create dsh groups per datacenter

https://gerrit.wikimedia.org/r/368476

Indeed. @fgiunchedi helped me come up with a new plan. That should allow us to stick to the simple scap deploy -v route.

D743, after it lands, will create a configuration variable require_valid_service which will check the masked state of a service before attempting to restart the service. We'll need to ensure that in the non-active datacenter both jobrunner and jobchron are masked by systemd, and if so, scap won't attempt to restart them.

Change 367815 merged by Filippo Giunchedi:
[operations/puppet@production] Allow mwdeploy user to restart jobchron

https://gerrit.wikimedia.org/r/367815

Krinkle updated the task description.

From yesterday's Release-Engineering-Team meeting: Tyler is the one working on this, since the task involves a bunch of tweaks to Scap.

Change 368808 abandoned by Thcipriani:
Scap: Create "active" environment for restart

https://gerrit.wikimedia.org/r/368808

Change 376159 had a related patch set uploaded (by Thcipriani; owner: Thcipriani):
[mediawiki/services/jobrunner@master] Scap: restart services only in active dc

https://gerrit.wikimedia.org/r/376159

Now that scap 3.7.0-1 is live, the configuration variable require_valid_service should ensure that scap does not attempt to restart a service when systemctl show --property LoadState [service] comes back either not-found or masked.
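
The check described above can be sketched in Python as follows. This is not scap's code: the function names are hypothetical, and treating missing output as "do not restart" is a conservative assumption of the sketch. Only the systemctl invocation and the two skipped states (not-found, masked) come from the description above.

```python
import subprocess

# States for which the restart is skipped, per the description above.
SKIP_STATES = {"masked", "not-found"}


def parse_load_state(output):
    """Extract the value from 'LoadState=<value>' style output."""
    for line in output.splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip() == "LoadState":
            return value.strip()
    return None


def should_restart(output):
    """Return True only if a LoadState was reported and it is not skipped."""
    state = parse_load_state(output)
    return state is not None and state not in SKIP_STATES


def query_load_state(service):
    """Run systemctl on a real host (not exercised in this sketch)."""
    result = subprocess.run(
        ["systemctl", "show", "--property", "LoadState", service],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


print(should_restart("LoadState=loaded\n"))
print(should_restart("LoadState=masked\n"))
```

On a target host, should_restart(query_load_state("jobchron")) would combine the two steps.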

After https://gerrit.wikimedia.org/r/376159 is merged, to run a deploy:

cd /srv/deployment/jobrunner/jobrunner
scap deploy -vf "some message for the sal"

This will pull the code from tin:/srv/deployment/jobrunner/jobrunner to /srv/deployment/jobrunner/jobrunner-cache/cache on the targets, check out the code to /srv/deployment/jobrunner/jobrunner, restart jobrunner and jobchron and check for TCP connections on port 9005 (in the active datacenter only), and then run a cleanup. It will run through all of those steps on the canaries (mw1299.eqiad.wmnet and mw2247.codfw.wmnet), ask for confirmation, and then go through the same steps for all the jobrunners in groups of 5, prompting to continue between groups.
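
The staging order described above (canaries first, then groups of 5) can be sketched in Python as below. The function deployment_stages and the non-canary host names are hypothetical, and excluding the canaries from the later groups is an assumption of the sketch rather than documented scap behaviour.

```python
def deployment_stages(targets, canaries, group_size=5):
    """Return deploy stages: canaries first, then groups of group_size.

    Hypothetical helper; assumes canaries are not repeated in later groups.
    """
    canary_set = set(canaries)
    first = [host for host in targets if host in canary_set]
    rest = [host for host in targets if host not in canary_set]
    stages = [first] if first else []
    for start in range(0, len(rest), group_size):
        stages.append(rest[start:start + group_size])
    return stages


CANARIES = ["mw1299.eqiad.wmnet", "mw2247.codfw.wmnet"]
# Hypothetical non-canary targets, for illustration only.
OTHERS = ["jobrunner%02d.example" % n for n in range(1, 13)]

for number, stage in enumerate(deployment_stages(CANARIES + OTHERS, CANARIES)):
    print(number, len(stage), stage[0])
```

Each stage would be followed by a confirmation prompt before the next one starts.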

@Krinkle is there jobrunner code to be deployed that you can confirm? I'd like to ensure services are handled correctly and to be able to troubleshoot any problems during the first deployment.

Restore ability for humans to restart jobchron service (lost when limited Salt-ability was removed from Tin)

To restart jobrunner/jobchron from tin: scap deploy -v --service-restart "restarting jobrunners in active dc"

This can be limited to specific hosts via: scap deploy -v --service-restart --limit-hosts [hostname or range] "message for sal"

This is the last trebuchet thing! Let's verify this works and close this out and kill salt :)

Change 360856 abandoned by Krinkle:
scap: also restart jobchron

Reason:
Superseded by Iecdb5f2726a02a9f0

https://gerrit.wikimedia.org/r/360856

Change 376159 merged by jenkins-bot:
[mediawiki/services/jobrunner@master] Scap: restart services only in active dc

https://gerrit.wikimedia.org/r/376159

Mentioned in SAL (#wikimedia-operations) [2017-09-18T20:38:43Z] <krinkle@tin> Started deploy [jobrunner/jobrunner@57f5f47]: No-op sync - first time scap3 - T129148

Mentioned in SAL (#wikimedia-operations) [2017-09-18T20:42:06Z] <krinkle@tin> Finished deploy [jobrunner/jobrunner@57f5f47]: No-op sync - first time scap3 - T129148 (duration: 03m 23s)