
Add a jobrunner server to the Scap canary pool
Closed, ResolvedPublic

Description

Follows-up:

  1. Ensure that at least one job runner is included (if not already) in the list of canary servers that Scap uses for deploying MediaWiki code. This alone would already be an improvement, as any hits in the mediawiki/exception, mediawiki/error, or hhvm channels that occur only in the job runner context would then be caught early.
  2. Include ERROR (and higher) severity messages from the mediawiki/runJobs channel in the Logstash query for canary monitoring.
  3. Once the jobrunner and jobchron service logs are indexed by Logstash, include ERROR (and higher) severity messages in the Logstash query.

Note that the jobrunner and jobchron services are independent PHP CLI programs (not MediaWiki cli scripts) so their logs will have a different type, and are not presently included anywhere else.
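A canary query along the lines of steps 2 and 3 could be sketched as follows. This is an illustrative Elasticsearch-style bool query builder; the field names (`channel`, `level`), the severity list, and the function name are assumptions for the sketch, not the actual query Scap runs against Logstash.

```python
# Sketch of a Logstash (Elasticsearch) query for canary monitoring:
# match ERROR-or-higher messages in the given channels within a recent
# time window. Field names and severities are illustrative assumptions.

ERROR_AND_HIGHER = ["ERROR", "CRITICAL", "ALERT", "EMERGENCY"]

def canary_error_query(channels, since="now-5m"):
    """Build a bool query matching ERROR-or-higher log events in the
    given channels since the given relative time."""
    return {
        "bool": {
            "filter": [
                {"terms": {"channel": channels}},
                {"terms": {"level": ERROR_AND_HIGHER}},
                {"range": {"@timestamp": {"gte": since}}},
            ]
        }
    }

# Per step 2, mediawiki/runJobs is included alongside the existing channels.
query = canary_error_query(
    ["mediawiki/runJobs", "mediawiki/exception", "mediawiki/error"]
)
```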

Event Timeline

greg subscribed.

Adding our Release-Engineering-Team (Kanban) project as we would like to work on this in the coming quarter or two (no promises though, this is not a "goal" only "other hoped for work").

Krinkle renamed this task from "Add jobrunners to Scap canary process" to "Add jobrunner servers to Scap canary process". (Jul 12 2018, 3:59 AM)
Krinkle added a project: WMF-JobQueue.
Krinkle moved this task from Untriaged to Meta on the WMF-JobQueue board.
Krinkle renamed this task from "Add jobrunner servers to Scap canary process" to "Add a jobrunner server to the Scap canary pool". (May 7 2020, 10:28 PM)

Is this still relevant with the new K8s infrastructure?

I assume Scap still has the concept of applying the next image to a canary pool in mw-on-k8s first, waiting some time for a potential Logstash error rate increase, and then deciding whether to proceed.

Unless a canary pool was introduced for mw-jobrunner since then, this is presumably still limited to the mw-web and mw-api server groups, and thus still an issue.
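The gating process described above can be sketched as a small loop: deploy to the canary pool, wait, compare the Logstash error rate against a threshold, then either abort or proceed. The function names, parameters, and threshold here are illustrative assumptions, not Scap's actual implementation.

```python
# Sketch of the canary gating logic: apply the new image to canaries,
# wait for traffic (and jobs) to hit the new code, then check the error
# rate before rolling out to the full fleet. All names are hypothetical.
import time

def canary_gate(deploy, error_rate, wait_seconds=30, max_errors_per_min=10.0):
    deploy(group="canary")       # apply the next image to the canary pool only
    time.sleep(wait_seconds)     # allow a potential error-rate increase to show
    if error_rate() > max_errors_per_min:
        raise RuntimeError("canary check failed; aborting full deploy")
    deploy(group="production")   # proceed to the remaining server groups
```

The point of this task is that, without a jobrunner canary pool, `deploy(group="canary")` never exercises job-execution code paths, so job-only errors slip past the check.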

akosiaris claimed this task.
akosiaris subscribed.

> I assume Scap still has the concept of applying the next image to a canary pool in mw-on-k8s first, waiting some time for a potential Logstash error rate increase, and then deciding whether to proceed.
>
> Unless a canary pool was introduced for mw-jobrunner since then, this is presumably still limited to the mw-web and mw-api server groups, and thus still an issue.

There was. This is easy to confirm with a search in the deployment-charts repo; see https://github.com/wikimedia/operations-deployment-charts/blob/master/helmfile.d/services/mw-jobrunner/helmfile.yaml#L20

Note the canary release mentioned in lines 20 and 24 (for eqiad and codfw respectively).
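The relevant shape of that helmfile is roughly as follows. This is a paraphrased sketch of the pattern (a canary release listed alongside the main one for each datacenter); consult the linked helmfile.yaml for the authoritative structure and values.

```yaml
# Sketch of the mw-jobrunner helmfile pattern: each environment defines
# a "canary" release next to the main release. Key names and nesting are
# paraphrased for illustration; see the linked file for the real config.
environments:
  eqiad:
    values:
      - releases:
          - canary   # upgraded first, during Scap's canaries stage
          - main
  codfw:
    values:
      - releases:
          - canary
          - main
```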

The glue with Scap is defined in https://github.com/wikimedia/operations-puppet/blob/production/hieradata/role/common/deployment_server/kubernetes.yaml#L207, where the profile::kubernetes::deployment_server::mediawiki::release::mw_releases stanza defines kinds, flavours, and stages. At line 251, the stanza for mw-jobrunner names the canary release, which is upgraded during Scap's canaries stage. This is also how any future release in any service deployed by Scap could be used during the canaries stage.
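The hiera wiring could look roughly like this. The nested key names under the stanza are hypothetical, reconstructed from the description above; only the stanza name itself comes from the linked file, so check kubernetes.yaml for the real shape.

```yaml
# Sketch of how releases map to Scap stages in hiera. The keys beneath
# mw_releases are illustrative assumptions, not the actual schema.
profile::kubernetes::deployment_server::mediawiki::release::mw_releases:
  mw-jobrunner:
    canary:
      stage: canaries    # upgraded when Scap deploys to the canary pool
    main:
      stage: production  # upgraded during the full rollout
```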

Given the above, I'm gonna be bold and resolve this. I think the needs of this task have been met for at least a year now (probably more, but I'll avoid going down the successive git blame rabbit hole).