⚓ T129148 Deploy jobrunner with scap3 (Trebuchet jobrunner/jobrunner)

Subject	Repo	Branch	Lines +/-
Scap: restart services only in active dc	mediawiki/services/jobrunner	master	+6 -0
scap: also restart jobchron	mediawiki/services/jobrunner	master	+1 -1
Scap: Create "active" environment for restart	mediawiki/services/jobrunner	master	+12 -2
Allow mwdeploy user to restart jobchron	operations/puppet	production	+8 -0
Jobrunner: create dsh groups per datacenter	operations/puppet	production	+8 -4
scap: Remove restart configuration	mediawiki/services/jobrunner	master	+0 -2
Scap3: deploy jobrunner with scap3	operations/puppet	production	+17 -7
Scap3: Deploy to groups of 5 servers	mediawiki/services/jobrunner	master	+3 -0
Scap3: deploy jobrunner with scap3	mediawiki/services/jobrunner	master	+16 -0

Status	Assigned	Task
Resolved	MoritzMuehlenhoff	T164780 Sunset our use of Salt
Resolved	thcipriani	T129290 [keyresult] Migrate remaining trebuchet deployed services
Resolved	hashar	T168044 jobrunner / jobchron systemd services are in error state after a stop
Resolved	thcipriani	T129148 Deploy jobrunner with scap3 (Trebuchet jobrunner/jobrunner)
Resolved	thcipriani	T167098 scap should allow restarting multiple services
Resolved	thcipriani	T167104 Figure out how to disable starting of jobrunner/jobchron in the non-active DC

gerritbot added a project: Patch-For-Review.May 18 2017, 8:36 AM

Change 354199 had a related patch set uploaded (by Thcipriani; owner: Thcipriani):
[mediawiki/services/jobrunner@master] Scap3: deploy jobrunner with scap3

https://gerrit.wikimedia.org/r/354199

thcipriani triaged this task as Medium priority.May 19 2017, 12:17 PM

thcipriani added a project: Release-Engineering-Team (Kanban).

thcipriani moved this task from Backlog to In-progress on the Release-Engineering-Team (Kanban) board.

Change 354199 merged by jenkins-bot:
[mediawiki/services/jobrunner@master] Scap3: deploy jobrunner with scap3

https://gerrit.wikimedia.org/r/354199

Tested on beta cluster and all seems to be working. @hashar has volunteered to babysit the initial deployment. Now just need an opsen around for puppet wrangling during the deployment to production. @akosiaris or @Joe can you help merge these changes/get jobrunner off trebuchet?

Change 357240 had a related patch set uploaded (by Thcipriani; owner: Thcipriani):
[mediawiki/services/jobrunner@master] Scap3: Deploy to groups of 5 servers

https://gerrit.wikimedia.org/r/357240

Change 357240 merged by jenkins-bot:
[mediawiki/services/jobrunner@master] Scap3: Deploy to groups of 5 servers

https://gerrit.wikimedia.org/r/357240

Mentioned in SAL (#wikimedia-operations) [2017-06-06T09:04:39Z] <akosiaris> disable puppet on all jobrunners T129148

Change 354186 merged by Alexandros Kosiaris:
[operations/puppet@production] Scap3: deploy jobrunner with scap3

https://gerrit.wikimedia.org/r/354186

Mentioned in SAL (#wikimedia-operations) [2017-06-06T09:08:24Z] <akosiaris> running puppet on tin T129148

Mentioned in SAL (#wikimedia-operations) [2017-06-06T09:11:51Z] <akosiaris> git pull and scap deploy --init for jobrunner T129148

Mentioned in SAL (#wikimedia-operations) [2017-06-06T09:12:58Z] <akosiaris> running puppet on mw1161 T129148

Mentioned in SAL (#wikimedia-operations) [2017-06-06T09:19:42Z] <akosiaris> running puppet again on tin, after moving /serv/deployment/jobrunner/jobrunner T129148

Mentioned in SAL (#wikimedia-operations) [2017-06-06T09:25:33Z] <akosiaris> moving around jobrunner/jobrunner was probably not required T129148

Mentioned in SAL (#wikimedia-operations) [2017-06-06T09:25:43Z] <akosiaris> running puppet on videoscalers T129148

Mentioned in SAL (#wikimedia-operations) [2017-06-06T09:29:00Z] <akosiaris> running puppet on jobrunners T129148

Mentioned in SAL (#wikimedia-operations) [2017-06-06T09:33:45Z] <akosiaris> restart jobchron service across jobrunners T129148

Mentioned in SAL (#wikimedia-operations) [2017-06-06T09:35:15Z] <akosiaris> restart jobchron service across videoscalers T129148

hashar created subtask T167098: scap should allow restarting multiple services.Jun 6 2017, 10:02 AM

hashar created subtask T167104: Figure out how to disable starting of jobrunner/jobchron in the non-active DC.Jun 6 2017, 10:11 AM

Status?

In T129148#3336047, @greg wrote:

Status?

Currently deployed via Scap3. There are some followup tasks to make deployments for jobrunner better. Closing this task to track there.

\o/

thcipriani closed subtask T167098: scap should allow restarting multiple services as Resolved.Jun 21 2017, 5:24 PM

So that is not fully done. We need to restart both jobrunner and jobchron. Scap support for that has been done via T167098 / D677. We thus now need:

scap 3.6
In scap.cfg: service_name: jobrunner = reload, jobchron

Change 360856 had a related patch set uploaded (by Hashar; owner: Hashar):
[mediawiki/services/jobrunner@master] scap: also restart jobchron

https://gerrit.wikimedia.org/r/360856

hashar moved this task from In-progress to Blocked (externally) on the Release-Engineering-Team (Kanban) board.Jul 7 2017, 3:05 PM

Krinkle edited projects, added Performance-Team; removed Patch-For-Review.Jul 18 2017, 9:47 PM

Krinkle moved this task from Inbox, needs triage to Radar on the Performance-Team board.

We also have to update the deployment section on https://wikitech.wikimedia.org/wiki/Jobrunner

Just now I tried to deploy an update to mediawiki/services/jobrunner. Given the Trebuchet entry point still exists on tin:/srv/deployment/jobrunner/jobrunner, and the documentation still mentions Trebuchet on https://wikitech.wikimedia.org/wiki/Jobrunner, and this task wasn't closed, I assumed it was still on Trebuchet. However, I see now that it was actually already converted to Scap3.

The deployment failed at the git deploy sync step ("0/44 minions completed fetch"). So probably nothing happened.

In T129148#3450253, @Krinkle wrote:

Just now I tried to deploy an update to mediawiki/services/jobrunner. Given the Trebuchet entry point still exists on tin:/srv/deployment/jobrunner/jobrunner, and the documentation still mentions Trebuchet on https://wikitech.wikimedia.org/wiki/Jobrunner, and this task wasn't closed, I assumed it was still on Trebuchet. However, I see now that it was actually already converted to Scap3.

The deployment failed at the git deploy sync step ("0/44 minions completed fetch"). So probably nothing happened.

After this, I used Scap3 instead. Unfortunately, it seems Jobrunner was in a state ready to explode for the next deployer.

Incident report at https://wikitech.wikimedia.org/wiki/Incident_documentation/20170718-JobQueue.

Change 367743 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[mediawiki/services/jobrunner@master] scap: Remove restart configuration

https://gerrit.wikimedia.org/r/367743

gerritbot added a project: Patch-For-Review.Jul 25 2017, 8:08 PM

I guess in deploying with trebuchet any service restarts were handled with salt. This explains why we never ran into these issues when using trebuchet. We are currently in the process of removing salt, and, as I understand it, its replacement tool cumin is not open to non-roots.

There are two separate issues stopping the deployment of Jobrunner.

No restart for non-active DC

I think there is a workaround for T167104: Figure out how to disable starting of jobrunner/jobchron in the non-active DC. I would like to see this solved somewhere in puppet/systemd. Over the short term separate scap environments (https://doc.wikimedia.org/mw-tools-scap/scap3/quickstart/setup.html#environments) could work.

The first environment, the default, would not define a service_name. The second environment would. Restarting the services would be a separate command (like it was, evidently, with salt previously)

The directory layout would look like:

scap                                      
├── environments                          
│   └── active                         
│       └── scap.cfg                      
└── scap.cfg

scap.cfg

[global]
git_repo: jobrunner/jobrunner
git_repo_user: mwdeploy
ssh_user: mwdeploy
server_groups: default
dsh_targets: jobrunner
git_submodules: False

# Divide servers into groups of 5
group_size: 5

[deployment-prep.eqiad.wmflabs]
server_groups: default
dsh_targets: betacluster

environments/active/scap.cfg

[global]
server_groups: default
dsh_targets: jobrunner-active
service_name: jobrunner
service_port: 9005

The dsh_targets files are the key here. The dsh file jobrunner would have all jobrunners: all jobrunners will receive new code as part of deployment. The dsh file jobrunner-active would have all jobrunners listed in the active datacenter AND because it defines a service_name and service_port it will restart those jobrunners.

A deployment would be two steps, a deploy, and a restart of the active environment.

What a deploy would look like

cd /srv/deployment/jobrunner/jobrunner
scap deploy "deploy new jobrunner code in the default environment"
scap deploy --environment active --service-restart "restart jobrunner service for active environment"

Restarting multiple services

There are a few problems here. First, as pointed out in IRC by @Krinkle, the mwdeploy user only has access to restart the jobrunner service, not jobchron.

(root) NOPASSWD: /usr/sbin/service jobrunner *

This needs a puppet patch.

On the deployment tooling side, while this can be worked around using some weird environment setup, scap 3.6.0-1 (which I recently tagged https://phabricator.wikimedia.org/T127762#3472190) should be out soon. This should enable a small tweak to the scap.cfg so that the line service_name: jobrunner, jobchron will restart both.

Change 367815 had a related patch set uploaded (by Thcipriani; owner: Thcipriani):
[operations/puppet@production] Allow mwdeploy user to restart jobchron

https://gerrit.wikimedia.org/r/367815

Krinkle updated the task description. (Show Details)Jul 25 2017, 10:06 PM

Krinkle updated the task description. (Show Details)Jul 25 2017, 10:10 PM

Mentioned in SAL (#wikimedia-operations) [2017-07-27T20:24:35Z] <Krinkle> Un-dirtying state of /srv/deployment/jobrunner/jobrunner on tin (from T129148). Checking-out https://gerrit.wikimedia.org/r/367743 instead.

Change 367743 merged by jenkins-bot:
[mediawiki/services/jobrunner@master] scap: Remove restart configuration

https://gerrit.wikimedia.org/r/367743

Change 368476 had a related patch set uploaded (by Thcipriani; owner: Thcipriani):
[operations/puppet@production] Jobrunner: create dsh groups per datacenter

https://gerrit.wikimedia.org/r/368476

In T129148#3482310, @gerritbot wrote:

Change 368476 had a related patch set uploaded (by Thcipriani; owner: Thcipriani):
[operations/puppet@production] Jobrunner: create dsh groups per datacenter

https://gerrit.wikimedia.org/r/368476

This patch should be sufficient to ensure that we can restart in the active data center only. The setup for this requires only a slight modification to how I've described environments/active/scap.cfg above:

environments/active/scap.cfg

[global]
server_groups: default
service_name: jobrunner,jobchron
service_port: 9005

[eqiad.wmnet]
dsh_targets: jobrunner-eqiad

[codfw.wmnet]
dsh_targets: jobrunner-codfw

The sections of the config file are based on the domain of the deployment host. Using the above config file and issuing the command:

scap deploy --environment active --service-restart "Restart jobrunner for reasons™"

from codfw will restart targets only in the codfw datacenter; likewise, issuing the same command from tin.eqiad.wmnet would restart jobrunner only in the eqiad datacenter.
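To illustrate the per-domain section idea, here is a minimal Python sketch of how a [global] section could be overlaid with the section matching the deployment host's domain. This is not scap's actual code: the function name effective_config and the suffix-matching rule are assumptions made for illustration, and only the config keys come from the example above.

```python
import configparser

# Same content as the environments/active/scap.cfg example above.
SAMPLE_CFG = """
[global]
server_groups: default
service_name: jobrunner,jobchron
service_port: 9005

[eqiad.wmnet]
dsh_targets: jobrunner-eqiad

[codfw.wmnet]
dsh_targets: jobrunner-codfw
"""


def effective_config(cfg_text, deploy_host_fqdn):
    """Return [global] settings overlaid with the section whose name
    is a domain suffix of the deployment host (hypothetical rule)."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    settings = dict(parser["global"])
    for section in parser.sections():
        if section == "global":
            continue
        # Assumption: a section applies when the host's FQDN ends with it.
        if deploy_host_fqdn.endswith("." + section):
            settings.update(parser[section])
    return settings


if __name__ == "__main__":
    print(effective_config(SAMPLE_CFG, "tin.eqiad.wmnet"))
```

Under that assumption, a deploy host in eqiad resolves dsh_targets to jobrunner-eqiad, while a host in neither domain gets no dsh_targets from this environment at all.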

Change 368808 had a related patch set uploaded (by Thcipriani; owner: Thcipriani):
[mediawiki/services/jobrunner@master] Scap: Create "active" environment for restart

https://gerrit.wikimedia.org/r/368808

thcipriani mentioned this in T167104: Figure out how to disable starting of jobrunner/jobchron in the non-active DC.Jul 31 2017, 4:03 PM

Do we have a mechanism to automatically select the right environment? Having to remember to use --environment active seems less than ideal.

Krinkle mentioned this in T172479: Collect error logs from jobchron/jobrunner services in Logstash.Aug 4 2017, 3:28 AM

Krinkle mentioned this in T172480: Add a jobrunner server to the Scap canary pool.Aug 4 2017, 3:40 AM

Krinkle mentioned this in T172447: Investigate 2017-08-02 Save Timing regression (+40-60%).Aug 4 2017, 10:49 PM

Krinkle edited projects, added Performance-Team (Radar); removed Performance-Team.Aug 8 2017, 3:15 AM

Change 368476 abandoned by Thcipriani:
Jobrunner: create dsh groups per datacenter

https://gerrit.wikimedia.org/r/368476

In T129148#3487115, @mmodell wrote:

Do we have a mechanism to automatically select the right environment? Having to remember to use --environment active seems less than ideal.

Indeed. @fgiunchedi helped me come up with a new plan. That should allow us to stick to the simple scap deploy -v route.

D743, after it lands, will create a configuration variable require_valid_service which will check the masked state of a service before attempting to restart the service. We'll need to ensure that in the non-active datacenter both jobrunner and jobchron are masked by systemd, and if so, scap won't attempt to restart them.

thcipriani updated the task description. (Show Details)Aug 9 2017, 5:19 PM

thcipriani mentioned this in rMSCA97a9610390a0: `require_valid_service` to check service mask.Aug 11 2017, 4:30 PM

Krinkle moved this task from Limbo to Watching on the Performance-Team (Radar) board.Aug 16 2017, 6:55 PM

Change 367815 merged by Filippo Giunchedi:
[operations/puppet@production] Allow mwdeploy user to restart jobchron

https://gerrit.wikimedia.org/r/367815

Krinkle updated the task description. (Show Details)Aug 28 2017, 3:19 PM

Krinkle updated the task description. (Show Details)

From yesterday's Release-Engineering-Team meeting: Tyler is the one working on this, since the task involves a bunch of tweaks to be made in Scap.

thcipriani updated the task description. (Show Details)Aug 30 2017, 3:58 PM

thcipriani updated the task description. (Show Details)

Krinkle closed subtask T167104: Figure out how to disable starting of jobrunner/jobchron in the non-active DC as Resolved.Aug 30 2017, 5:02 PM

hashar added a parent task: T168044: jobrunner / jobchron systemd services are in error state after a stop.Aug 30 2017, 8:17 PM

hashar mentioned this in T168044: jobrunner / jobchron systemd services are in error state after a stop.

Change 368808 abandoned by Thcipriani:
Scap: Create "active" environment for restart

https://gerrit.wikimedia.org/r/368808

Change 376159 had a related patch set uploaded (by Thcipriani; owner: Thcipriani):
[mediawiki/services/jobrunner@master] Scap: restart services only in active dc

https://gerrit.wikimedia.org/r/376159

thcipriani updated the task description. (Show Details)Sep 6 2017, 12:17 AM

Now that scap 3.7.0-1 is live, the configuration variable require_valid_service should ensure that scap does not attempt to restart a service when systemctl show --property LoadState [service] comes back either not-found or masked.
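The check described above can be sketched in Python roughly as follows. This is an illustration of the idea, not scap's implementation: the function names are hypothetical, and treating a missing LoadState line as "do not restart" is an assumption. Only the systemctl invocation and the two skipped states (not-found, masked) come from the comment above.

```python
import subprocess

# States for which a restart should not be attempted (from the comment above).
SKIP_STATES = {"not-found", "masked"}


def parse_load_state(show_output):
    """Extract the value from a 'LoadState=<value>' line, or None."""
    for line in show_output.splitlines():
        key, _, value = line.partition("=")
        if key.strip() == "LoadState":
            return value.strip()
    return None


def should_restart(show_output):
    """True only if a LoadState was reported and it is not a skipped state."""
    state = parse_load_state(show_output)
    return state is not None and state not in SKIP_STATES


def query_load_state(service):
    """Run systemctl on the local host and return its raw output."""
    result = subprocess.run(
        ["systemctl", "show", "--property", "LoadState", service],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


if __name__ == "__main__":
    for svc in ("jobrunner", "jobchron"):
        print(svc, should_restart(query_load_state(svc)))
```

With both services masked in the non-active datacenter, should_restart returns False there and the restart step is skipped without needing a separate scap environment.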

After https://gerrit.wikimedia.org/r/376159 is merged, to run a deploy:

cd /srv/deployment/jobrunner/jobrunner
scap deploy -vf "some message for the sal"

This will pull the code from tin:/srv/deployment/jobrunner/jobrunner to /srv/deployment/jobrunner/jobrunner-cache/cache on the targets, check out the code to /srv/deployment/jobrunner/jobrunner, restart jobrunner and jobchron and check for TCP connections on port 9005 (in the active datacenter only), and then run a cleanup. It will run through all those steps on the canaries (mw1299.eqiad.wmnet and mw2247.codfw.wmnet), ask for confirmation, and then go through the same steps for all the jobrunners in groups of 5, prompting to continue between groups.
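The rollout order and the port check described above can be sketched in Python like this. It is an illustrative sketch only, not scap code: the function names are hypothetical, the non-canary host names in the example are made up, and excluding canaries from the later groups is an assumption. The group size of 5, port 9005, and the two canary hosts come from the comment above.

```python
import socket


def batches(hosts, group_size=5):
    """Split a host list into consecutive groups of at most group_size."""
    return [hosts[i:i + group_size] for i in range(0, len(hosts), group_size)]


def rollout_plan(canaries, hosts, group_size=5):
    """Canaries first, then the remaining hosts in groups.

    Assumption: hosts already deployed as canaries are not repeated.
    """
    rest = [h for h in hosts if h not in canaries]
    return [list(canaries)] + batches(rest, group_size)


def port_is_open(host, port=9005, timeout=3.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    canaries = ["mw1299.eqiad.wmnet", "mw2247.codfw.wmnet"]
    # Hypothetical host names, for illustration only.
    hosts = canaries + ["jobrunner-example-%02d" % n for n in range(1, 13)]
    for step, group in enumerate(rollout_plan(canaries, hosts)):
        print(step, group)
```

A deployer would be prompted between each of the printed steps, which is what keeps a bad deploy from reaching more than one group at a time.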

@Krinkle is there jobrunner code to be deployed that you can confirm? I'd like to ensure services are handled correctly and to be able to troubleshoot any problems for the first deployment.

Restore ability for humans to restart jobchron service (lost when limited Salt-ability was removed from Tin)

To restart jobrunner/jobchron from tin: scap deploy -v --service-restart "restarting jobrunners in active dc"

This can be limited to specific hosts via: scap deploy -v --service-restart --limit-hosts [hostname or range] "message for sal"

In T129148#3582888, @thcipriani wrote:

@Krinkle is there jobrunner code to be deployed that you can confirm? I'd like to ensure services are handled correctly and to be able to troubleshoot any problems for the first deployment.

This is the last trebuchet thing! Let's verify this works and close this out and kill salt :)

itshappening

Change 360856 abandoned by Krinkle:
scap: also restart jobchron

Reason:
Superseded by Iecdb5f2726a02a9f0

https://gerrit.wikimedia.org/r/360856

Change 376159 merged by jenkins-bot:
[mediawiki/services/jobrunner@master] Scap: restart services only in active dc

https://gerrit.wikimedia.org/r/376159

Mentioned in SAL (#wikimedia-operations) [2017-09-18T20:38:43Z] <krinkle@tin> Started deploy [jobrunner/jobrunner@57f5f47]: No-op sync - first time scap3 - T129148

Mentioned in SAL (#wikimedia-operations) [2017-09-18T20:42:06Z] <krinkle@tin> Finished deploy [jobrunner/jobrunner@57f5f47]: No-op sync - first time scap3 - T129148 (duration: 03m 23s)

I've updated https://wikitech.wikimedia.org/wiki/Jobrunner#Deployment. Please review https://wikitech.wikimedia.org/w/index.php?title=Jobrunner&type=revision&diff=1770333&oldid=1766155.

Krinkle moved this task from Radar to Doing (old) on the Performance-Team board.Sep 18 2017, 8:51 PM

Krinkle edited projects, added Performance-Team; removed Performance-Team (Radar).

MoritzMuehlenhoff subscribed.Sep 18 2017, 8:57 PM

Docs lgtm

In T129148#3616198, @Krinkle wrote:

I've updated https://wikitech.wikimedia.org/wiki/Jobrunner#Deployment. Please review https://wikitech.wikimedia.org/w/index.php?title=Jobrunner&type=revision&diff=1770333&oldid=1766155.

In T129148#3616263, @demon wrote:

Docs lgtm

+1

All looks good.

Krinkle removed a project: Patch-For-Review.Sep 18 2017, 9:07 PM

Krinkle updated the task description. (Show Details)

Krinkle edited projects, added Performance-Team (Radar); removed Performance-Team.

thcipriani updated the task description. (Show Details)Sep 18 2017, 9:07 PM

thcipriani closed this task as Resolved.Sep 18 2017, 9:10 PM

• Phabricator_maintenance edited projects, added RelEng-Archive-FY201718-Q1; removed Release-Engineering-Team (Kanban).Sep 26 2017, 11:47 PM


	thcipriani
	Mar 7 2016, 9:18 PM
