Scap should check errors coming from mw-on-k8s canaries during deployments
Closed, Resolved · Public

Description

As MW-on-K8s nears 50% of external traffic, scap should start checking for errors coming from its canary releases. The relevant Logstash entries can be selected with the following criteria:

  • kubernetes.labels.deployment: mw-web, mw-api-int, mw-api-ext, mw-jobrunner
  • kubernetes.labels.release: canary

mw-misc and mw-wikifunctions do not currently have canaries.

If mw-debug log checking is required as well, the criteria are:

  • kubernetes.labels.deployment: mw-debug
  • kubernetes.labels.release: pinkunicorn (this can be omitted, as it is the only mw-debug release)
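As an illustration only, the criteria above could be expressed as an Elasticsearch-style bool query. The field names and values come from this task. The function name `build_canary_filter`, the `include_debug` flag, and the overall query shape are assumptions made for this sketch; this is not the query used by logstash_checker.py, and the index mapping may require different field names (for example a `.keyword` suffix).

```python
import json

# Deployments that have a canary release, as listed in the task description.
CANARY_DEPLOYMENTS = ["mw-web", "mw-api-int", "mw-api-ext", "mw-jobrunner"]


def build_canary_filter(deployments=CANARY_DEPLOYMENTS, include_debug=False):
    """Build a query-DSL dict selecting canary logs (hypothetical helper).

    Both conditions must hold for a canary log entry: the deployment label
    is one of `deployments` AND the release label is "canary".
    """
    canary_clause = {
        "bool": {
            "filter": [
                {"terms": {"kubernetes.labels.deployment": list(deployments)}},
                {"term": {"kubernetes.labels.release": "canary"}},
            ]
        }
    }
    if not include_debug:
        return canary_clause

    # mw-debug has a single release, so the release label is omitted here.
    debug_clause = {"term": {"kubernetes.labels.deployment": "mw-debug"}}
    return {
        "bool": {
            "should": [canary_clause, debug_clause],
            "minimum_should_match": 1,
        }
    }


if __name__ == "__main__":
    print(json.dumps(build_canary_filter(include_debug=True), indent=2))
```

Keeping the canary and mw-debug conditions as separate clauses means the release filter applies only to the deployments that actually have a canary release.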

On the serviceops side, we will set the number of replicas in each canary release to around 3% of the deployment's total replicas, to roughly match the canary ratio that was used on bare metal.
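To make the 3% target concrete, here is a minimal sketch of the arithmetic. The function name `canary_replicas`, the rounding rule, and the floor of one replica are assumptions made for illustration; the actual replica counts are set in the deployment-charts change below and may be chosen differently.

```python
def canary_replicas(total_replicas, ratio=0.03, minimum=1):
    """Return a canary replica count of roughly `ratio` of `total_replicas`.

    Assumptions for this sketch: the result is rounded to the nearest
    integer and is never lower than `minimum`, so that a small deployment
    still gets at least one canary pod.
    """
    if total_replicas < 0:
        raise ValueError("total_replicas must be non-negative")
    return max(minimum, round(total_replicas * ratio))


if __name__ == "__main__":
    # Hypothetical totals, not the real deployment sizes.
    for total in (10, 100, 200):
        print(total, canary_replicas(total))
```

With these assumptions, a deployment of 100 replicas gets 3 canary replicas, and any deployment too small for 3% to round to a whole pod still gets one.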

Event Timeline

Change 1002973 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/deployment-charts@master] mw-on-k8s: Raise the number of canary replicas

https://gerrit.wikimedia.org/r/1002973

Change 1002973 merged by jenkins-bot:

[operations/deployment-charts@master] mw-on-k8s: Raise the number of canary replicas

https://gerrit.wikimedia.org/r/1002973

Mentioned in SAL (#wikimedia-operations) [2024-02-13T15:29:14Z] <cgoubert@deploy2002> Started scap: mw-on-k8s: Raise the number of canary replicas - T357402

Mentioned in SAL (#wikimedia-operations) [2024-02-13T15:32:12Z] <cgoubert@deploy2002> Finished scap: mw-on-k8s: Raise the number of canary replicas - T357402 (duration: 02m 58s)

dancy triaged this task as Medium priority.

Change 1003885 had a related patch set uploaded (by Ahmon Dancy; author: Ahmon Dancy):

[operations/puppet@production] logstash_checker.py: Add ability to check all MediaWiki canaries at once

https://gerrit.wikimedia.org/r/1003885

Change 1003885 merged by Clément Goubert:

[operations/puppet@production] logstash_checker.py: Add ability to check all MediaWiki canaries at once

https://gerrit.wikimedia.org/r/1003885

Mentioned in SAL (#wikimedia-operations) [2024-02-22T16:28:07Z] <dancy@deploy2002> Started scap: testing T357402

Mentioned in SAL (#wikimedia-operations) [2024-02-22T16:43:04Z] <dancy@deploy2002> sync-world aborted: testing T357402 (duration: 14m 57s)

Mentioned in SAL (#wikimedia-operations) [2024-02-22T16:45:29Z] <dancy@deploy2002> Started scap: testing T357402 again

Mentioned in SAL (#wikimedia-operations) [2024-02-22T16:54:27Z] <dancy@deploy2002> Finished scap: testing T357402 again (duration: 08m 58s)

Change 1007449 had a related patch set uploaded (by Ahmon Dancy; author: Ahmon Dancy):

[operations/puppet@production] logstash_checker.py: Handle missing mediawiki_deployments_file

https://gerrit.wikimedia.org/r/1007449

Change 1007449 merged by Dzahn:

[operations/puppet@production] logstash_checker.py: Handle missing mediawiki_deployments_file

https://gerrit.wikimedia.org/r/1007449