
Scap deployments to mw-on-k8s timing out
Closed, Resolved (Public)

Description

Seen today during the train presync for 1.42.0-wmf.21 (executed manually during the morning UTC window).

Backscroll: P58470

Unlike in T359114, the deployment did not get as far as parsoid: the timeouts already occurred for the K8s testservers:

STDERR:
  WARNING: Kubernetes configuration file is group-readable. This is insecure. Location: /etc/kubernetes/mw-debug-deploy-eqiad.config
  Error: UPGRADE FAILED: release pinkunicorn failed, and has been rolled back due to atomic being set: timed out waiting for the condition

COMBINED OUTPUT:
  WARNING: Kubernetes configuration file is group-readable. This is insecure. Location: /etc/kubernetes/mw-debug-deploy-eqiad.config
  Error: UPGRADE FAILED: release pinkunicorn failed, and has been rolled back due to atomic being set: timed out waiting for the condition

11:30:12 Finished Running helmfile -e eqiad --selector name=pinkunicorn apply in /srv/deployment-charts/helmfile.d/services/mw-debug (duration: 10m 12s)
11:30:12 K8s deployment to stage testservers failed: K8s deployment had the following errors:
 codfw: Deployment of mw-misc-main failed: Command '['helmfile', '-e', 'codfw', '--selector', 'name=main', 'apply']' returned non-zero exit status 1.
Deployment of mw-debug-pinkunicorn failed: Command '['helmfile', '-e', 'codfw', '--selector', 'name=pinkunicorn', 'apply']' returned non-zero exit status 1.
eqiad: Deployment of mw-misc-main failed: Command '['helmfile', '-e', 'eqiad', '--selector', 'name=main', 'apply']' returned non-zero exit status 1.
Deployment of mw-debug-pinkunicorn failed: Command '['helmfile', '-e', 'eqiad', '--selector', 'name=pinkunicorn', 'apply']' returned non-zero exit status 1.
11:30:12 Rolling back to prior state...
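
For triage, the failure summary above can be reduced to a list of (datacenter, release) pairs. The following is only a minimal sketch, assuming the summary keeps the line format seen in this log; `parse_failures` and `FAILURE_RE` are hypothetical names for illustration and are not part of scap or helmfile.

```python
import re

# Matches lines such as:
#   "Deployment of mw-debug-pinkunicorn failed: Command '[...]' returned non-zero exit status 1."
# optionally prefixed with a datacenter label such as "codfw: ".
# The format is an assumption taken from the log in this task.
FAILURE_RE = re.compile(
    r"^\s*(?:(?P<dc>[a-z0-9]+):\s+)?Deployment of (?P<release>\S+) failed: "
    r"Command '(?P<cmd>\[.*\])' returned non-zero exit status (?P<status>\d+)\.$"
)


def parse_failures(summary: str) -> list[tuple[str, str, int]]:
    """Return (datacenter, release, exit_status) for each failed deployment."""
    failures = []
    current_dc = "unknown"
    for line in summary.splitlines():
        match = FAILURE_RE.match(line)
        if not match:
            continue
        # A line without a datacenter prefix belongs to the most recent one.
        if match.group("dc"):
            current_dc = match.group("dc")
        failures.append(
            (current_dc, match.group("release"), int(match.group("status")))
        )
    return failures


# The four failure lines from the log above, copied verbatim.
SAMPLE = """\
 codfw: Deployment of mw-misc-main failed: Command '['helmfile', '-e', 'codfw', '--selector', 'name=main', 'apply']' returned non-zero exit status 1.
Deployment of mw-debug-pinkunicorn failed: Command '['helmfile', '-e', 'codfw', '--selector', 'name=pinkunicorn', 'apply']' returned non-zero exit status 1.
eqiad: Deployment of mw-misc-main failed: Command '['helmfile', '-e', 'eqiad', '--selector', 'name=main', 'apply']' returned non-zero exit status 1.
Deployment of mw-debug-pinkunicorn failed: Command '['helmfile', '-e', 'eqiad', '--selector', 'name=pinkunicorn', 'apply']' returned non-zero exit status 1.
"""

if __name__ == "__main__":
    for dc, release, status in parse_failures(SAMPLE):
        print(f"{dc}: {release} (exit status {status})")
```

Applied to the log above, this shows that the same two releases (mw-misc-main and mw-debug-pinkunicorn) failed in both codfw and eqiad, which points at something common to both datacenters rather than a site-specific problem.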

Also, no spike of resource requests can be seen at https://grafana-rw.wikimedia.org/d/pz5A-vASz/kubernetes-resources?orgId=1&var-ds=thanos&var-site=codfw&var-prometheus=k8s&from=now-24h&to=now during that period (which would have been surprising for the testservers anyway):

image.png (332×921 px, 18 KB)

image.png (332×921 px, 24 KB)

Event Timeline

This was caused by an error in the php-fpm image, introduced in https://gerrit.wikimedia.org/r/c/operations/docker-images/production-images/+/994764.
@jijiki is reverting this change and rebuilding the image, followed by a full rebuild of the MediaWiki images.

Clement_Goubert claimed this task.

This is now resolved and the train is proceeding.