MediaWiki deployment to kubernetes fails on group1 promotion
Closed, Resolved (Public)

Description

While promoting group 1 wikis to 1.41.0-wmf.16 (T340244), scap failed during the deployment to Kubernetes, when deploying the mw-api-ext service with the commands:

helmfile -e eqiad --selector name=main apply in /srv/deployment-charts/helmfile.d/services/mw-api-ext
helmfile -e codfw --selector name=main apply in /srv/deployment-charts/helmfile.d/services/mw-api-ext

Output:

08:39:20 Command '['helmfile', '-e', 'eqiad', '--selector', 'name=main', 'apply']' returned non-zero exit status 1.
08:39:20 Stdout/stderr follows:
08:39:20 helmfile.yaml: basePath=.
skipping missing values file matching "values-main.yaml"
Comparing release=main, chart=wmf-stable/mediawiki
mw-api-ext, mw-api-ext.eqiad.main, Deployment (apps) has changed:
<lot of yaml>
          ### The apache httpd container
          # TODO: set up logging. See T265876
          # TODO: fix virtualhosts in puppet so that the port is set to APACHE_RUN_PORT
          - name: mediawiki-main-httpd
-           image: docker-registry.discovery.wmnet/restricted/mediawiki-webserver:2023-07-05-000244-webserver
+           image: docker-registry.discovery.wmnet/restricted/mediawiki-webserver:2023-07-05-082648-webserver
<lot of yaml>
          ### The MediaWiki container
          - name: mediawiki-main-app
-           image: docker-registry.discovery.wmnet/restricted/mediawiki-multiversion:2023-07-05-000226-publish
+           image: docker-registry.discovery.wmnet/restricted/mediawiki-multiversion:2023-07-05-082630-publish
<lot of yaml>
skipping missing values file matching "values-main.yaml"
Upgrading release=main, chart=wmf-stable/mediawiki


FAILED RELEASES:
NAME
main
helmfile.yaml: basePath=.
in ./helmfile.yaml: failed processing release main: command "/usr/bin/helm3" exited with non-zero status:

PATH:
  /usr/bin/helm3

ARGS:
  0: helm3 (5 bytes)
  1: upgrade (7 bytes)
  2: --install (9 bytes)
  3: --reset-values (14 bytes)
  4: main (4 bytes)
  5: wmf-stable/mediawiki (20 bytes)
  6: --timeout (9 bytes)
  7: 600s (4 bytes)
  8: --atomic (8 bytes)
  9: --namespace (11 bytes)
  10: mw-api-ext (10 bytes)
  11: --values (8 bytes)
  12: /tmp/values192667939 (20 bytes)
  13: --values (8 bytes)
  14: /tmp/values076077094 (20 bytes)
  15: --values (8 bytes)
  16: /tmp/values423039309 (20 bytes)
  17: --values (8 bytes)
  18: /tmp/values631538248 (20 bytes)
  19: --values (8 bytes)
  20: /tmp/values790591239 (20 bytes)
  21: --values (8 bytes)
  22: /tmp/values930876090 (20 bytes)
  23: --values (8 bytes)
  24: /tmp/values251004625 (20 bytes)
  25: --values (8 bytes)
  26: /tmp/values327713276 (20 bytes)
  27: --values (8 bytes)
  28: /tmp/values712276267 (20 bytes)
  29: --values (8 bytes)
  30: /tmp/values714975374 (20 bytes)
  31: --values (8 bytes)
  32: /tmp/values284296853 (20 bytes)
  33: --set (5 bytes)
  34: php.servergroup=kube-mw-api-ext (31 bytes)
  35: --history-max (13 bytes)
  36: 10 (2 bytes)
  37: --kubeconfig (12 bytes)
  38: /etc/kubernetes/mw-api-ext-deploy-eqiad.config (46 bytes)

ERROR:
  exit status 1

EXIT STATUS
  1

STDERR:
  WARNING: Kubernetes configuration file is group-readable. This is insecure. Location: /etc/kubernetes/mw-api-ext-deploy-eqiad.config
  Error: UPGRADE FAILED: release main failed, and has been rolled back due to atomic being set: timed out waiting for the condition

COMBINED OUTPUT:
  WARNING: Kubernetes configuration file is group-readable. This is insecure. Location: /etc/kubernetes/mw-api-ext-deploy-eqiad.config
  Error: UPGRADE FAILED: release main failed, and has been rolled back due to atomic being set: timed out waiting for the condition

08:39:20 Finished Running helmfile -e eqiad --selector name=main apply in /srv/deployment-charts/helmfile.d/services/mw-api-ext (duration: 10m 15s)

The same happens for codfw:

08:39:49 Command '['helmfile', '-e', 'codfw', '--selector', 'name=main', 'apply']' returned non-zero exit status 1.
...
mw-api-ext, mw-api-ext.codfw.main, Deployment (apps) has changed:
...
          - name: mediawiki-main-httpd
-           image: docker-registry.discovery.wmnet/restricted/mediawiki-webserver:2023-07-05-000244-webserver
+           image: docker-registry.discovery.wmnet/restricted/mediawiki-webserver:2023-07-05-082648-webserver
...
          ### The MediaWiki container
          - name: mediawiki-main-app
-           image: docker-registry.discovery.wmnet/restricted/mediawiki-multiversion:2023-07-05-000226-publish
+           image: docker-registry.discovery.wmnet/restricted/mediawiki-multiversion:2023-07-05-082630-publish
...
08:39:49 Finished Running helmfile -e codfw --selector name=main apply in /srv/deployment-charts/helmfile.d/services/mw-api-ext (duration: 10m 44s)

Event Timeline

hashar triaged this task as Unbreak Now! priority. Jul 5 2023, 8:59 AM

The helm output mentions a timeout:

COMBINED OUTPUT:
  WARNING: Kubernetes configuration file is group-readable. This is insecure. Location: /etc/kubernetes/mw-api-ext-deploy-eqiad.config
  Error: UPGRADE FAILED: release main failed, and has been rolled back due to atomic being set: timed out waiting for the condition

For other services, the deployment takes less than a minute:

08:29:26 Finished Running helmfile -e eqiad --selector name=main apply in /srv/deployment-charts/helmfile.d/services/mw-jobrunner (duration: 00m 21s)
08:29:28 Finished Running helmfile -e codfw --selector name=main apply in /srv/deployment-charts/helmfile.d/services/mw-jobrunner (duration: 00m 23s)
08:29:39 Finished Running helmfile -e codfw --selector name=main apply in /srv/deployment-charts/helmfile.d/services/mw-web (duration: 00m 34s)
08:29:41 Finished Running helmfile -e eqiad --selector name=main apply in /srv/deployment-charts/helmfile.d/services/mw-web (duration: 00m 36s)
08:29:46 Finished Running helmfile -e codfw --selector name=main apply in /srv/deployment-charts/helmfile.d/services/mw-api-int (duration: 00m 40s)
08:29:49 Finished Running helmfile -e eqiad --selector name=main apply in /srv/deployment-charts/helmfile.d/services/mw-api-int (duration: 00m 44s)

But mw-api-ext seems to hit a 10-minute timeout (helm is invoked with --timeout 600s):

08:39:20 Finished Running helmfile -e eqiad --selector name=main apply in /srv/deployment-charts/helmfile.d/services/mw-api-ext (duration: 10m 15s)
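The comparison above can be automated when skimming a long scap log. The following is only a rough sketch: it assumes the "Finished Running helmfile ..." line format shown in this task, takes the 600-second threshold from the `--timeout 600s` argument in the helm3 invocation, and uses hypothetical names (`LINE_RE`, `check_duration`) that are not part of scap. A duration at or above the threshold is treated as a hint that helm waited for its full timeout, not as proof.

```python
import re

# Taken from "--timeout 600s" in the helm3 arguments quoted in the task.
HELM_TIMEOUT_SECONDS = 600

# Matches the "Finished Running helmfile ..." lines as they appear in this task.
LINE_RE = re.compile(
    r"Finished Running helmfile -e (?P<env>\S+) .* in \S+/services/(?P<service>\S+) "
    r"\(duration: (?P<min>\d+)m (?P<sec>\d+)s\)"
)


def check_duration(line):
    """Return (service, env, seconds, likely_timed_out), or None if no match."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    seconds = int(m["min"]) * 60 + int(m["sec"])
    return m["service"], m["env"], seconds, seconds >= HELM_TIMEOUT_SECONDS


if __name__ == "__main__":
    log = [
        "08:29:26 Finished Running helmfile -e eqiad --selector name=main apply in "
        "/srv/deployment-charts/helmfile.d/services/mw-jobrunner (duration: 00m 21s)",
        "08:39:20 Finished Running helmfile -e eqiad --selector name=main apply in "
        "/srv/deployment-charts/helmfile.d/services/mw-api-ext (duration: 10m 15s)",
    ]
    for entry in log:
        print(check_duration(entry))
```

The total duration also includes the helmfile diff step, so a run slightly over 600 seconds is consistent with helm exhausting its timeout and then rolling back because of `--atomic`.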
09:07:19 <taavi> looks like the namespace is getting out of quota, which is blocking the updated pods from starting
09:07:21 <taavi> Error creating: pods "mw-api-ext.eqiad.main-5dfdb996cf-g5hsz" is forbidden: exceeded quota: quota-compute-resources, requested: limits.cpu=8250m, used: limits.cpu=82500m, limited: limits.cpu=90
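The numbers in that error message explain the rejection: the namespace already uses 82.5 CPUs of limits, the new pod asks for 8.25 more, and the quota caps the total at 90. A minimal sketch of that arithmetic follows. It assumes the usual Kubernetes quantity notation where a trailing `m` means millicores, and the helper names (`parse_cpu`, `fits_quota`) are made up for illustration, not taken from any Kubernetes client library.

```python
from decimal import Decimal


def parse_cpu(quantity):
    """Convert a CPU quantity such as "8250m" or "90" to whole cores."""
    if quantity.endswith("m"):
        # Millicores: 1000m equals one core.
        return Decimal(quantity[:-1]) / 1000
    return Decimal(quantity)


def fits_quota(requested, used, limit):
    """True if adding the request keeps usage within the quota limit."""
    return parse_cpu(used) + parse_cpu(requested) <= parse_cpu(limit)


if __name__ == "__main__":
    # Values copied from the "exceeded quota" message above.
    requested, used, limit = "8250m", "82500m", "90"
    headroom = parse_cpu(limit) - parse_cpu(used)
    print("headroom (cores):", headroom)
    print("requested (cores):", parse_cpu(requested))
    print("pod admitted:", fits_quota(requested, used, limit))
```

With these values the remaining headroom is smaller than a single pod's limits, so a rolling update cannot start even one replacement pod before an old one is removed. That matches the rollout stalling until helm's timeout expired.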

Change 935685 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/deployment-charts@master] admin_ng: Raise resource limits for mw-api-ext

https://gerrit.wikimedia.org/r/935685

Clement_Goubert changed the task status from Open to In Progress. Jul 5 2023, 9:16 AM
Clement_Goubert claimed this task.

Change 935685 merged by jenkins-bot:

[operations/deployment-charts@master] admin_ng: Raise resource limits for mw-api-ext

https://gerrit.wikimedia.org/r/935685

Mentioned in SAL (#wikimedia-operations) [2023-07-05T09:24:29Z] <claime> redeploy mw-on-k8s following quota update - T341114