During the UTC late backport window on March 4, 2024, scap backport for the first patch got stuck while deploying to the test servers. My terminal reported Helm rollbacks, and the K8s deployments failed:
21:35:39 Finished Running helmfile -e eqiad --selector name=canary apply in /srv/deployment-charts/helmfile.d/services/mw-parsoid (duration: 10m 12s)
21:35:39 K8s deployment to stage canaries failed: K8s deployment had the following errors:
  codfw:
    Deployment of mw-jobrunner-canary failed: Command '['helmfile', '-e', 'codfw', '--selector', 'name=canary', 'apply']' returned non-zero exit status 1.
    Deployment of mw-api-ext-canary failed: Command '['helmfile', '-e', 'codfw', '--selector', 'name=canary', 'apply']' returned non-zero exit status 1.
    Deployment of mw-api-int-canary failed: Command '['helmfile', '-e', 'codfw', '--selector', 'name=canary', 'apply']' returned non-zero exit status 1.
    Deployment of mw-parsoid-canary failed: Command '['helmfile', '-e', 'codfw', '--selector', 'name=canary', 'apply']' returned non-zero exit status 1.
    Deployment of mw-web-canary failed: Command '['helmfile', '-e', 'codfw', '--selector', 'name=canary', 'apply']' returned non-zero exit status 1.
  eqiad:
    Deployment of mw-jobrunner-canary failed: Command '['helmfile', '-e', 'eqiad', '--selector', 'name=canary', 'apply']' returned non-zero exit status 1.
    Deployment of mw-web-canary failed: Command '['helmfile', '-e', 'eqiad', '--selector', 'name=canary', 'apply']' returned non-zero exit status 1.
    Deployment of mw-api-int-canary failed: Command '['helmfile', '-e', 'eqiad', '--selector', 'name=canary', 'apply']' returned non-zero exit status 1.
    Deployment of mw-api-ext-canary failed: Command '['helmfile', '-e', 'eqiad', '--selector', 'name=canary', 'apply']' returned non-zero exit status 1.
    Deployment of mw-parsoid-canary failed: Command '['helmfile', '-e', 'eqiad', '--selector', 'name=canary', 'apply']' returned non-zero exit status 1.
21:35:39 Rolling back to prior state...
I tried to sync and got the following errors in my terminal while running scap backport on deploy2002.codfw.wmnet:
- "/bin/sh" - "-c" - "sleep 8" - name: mediawiki-main-php-fpm-exporter image: docker-registry.discovery.wmnet/prometheus-php-fpm-exporter:0.0.3 imagePullPolicy: IfNotPresent args: ["--endpoint=http://127.0.0.1:9181/fpm-status", "--addr=0.0.0.0:9118"] ports: - name: fpm-metrics containerPort: 9118 livenessProbe: tcpSocket: port: 9118 resources: requests: cpu: 100m memory: 50Mi limits: cpu: 500m memory: 200Mi lifecycle: preStop: exec: command: - "/bin/sh" - "-c" - "sleep 8" # TODO: understand how to make mcrouter use the # application CA when connecting to memcached via TLS - name: mediawiki-main-mcrouter image: docker-registry.discovery.wmnet/mcrouter:0.41.0-4-20231022 imagePullPolicy: IfNotPresent env: - name: PORT value: "11213" - name: CONFIG value: "file:/etc/mcrouter/config.json" - name: ROUTE_PREFIX value: "eqiad/mw" - name: CROSS_REGION_TO value: "250" - name: CROSS_CLUSTER_TO value: "100" - name: NUM_PROXIES value: "5" - name: PROBE_TIMEOUT value: "60000" - name: TIMEOUTS_UNTIL_TKO value: "3" # We don't want to listen to TLS here. # TODO: check if it can connect with TLS without the TLS settings. - name: USE_SSL value: "no" ports: # Please note: this port is not exposed outside of the pod. - name: mcrouter containerPort: 11213 livenessProbe: tcpSocket: port: mcrouter readinessProbe: exec: command: - /bin/healthz lifecycle: preStop: exec: command: - "/bin/sh" - "-c" - "sleep 8" volumeMounts: - name: mediawiki-main-mcrouter mountPath: /etc/mcrouter resources: requests: cpu: 200m memory: 100Mi limits: cpu: 700m memory: 200Mi - name: mediawiki-main-mcrouter-exporter image: docker-registry.discovery.wmnet/prometheus-mcrouter-exporter:0.0.1-2-20211010 imagePullPolicy: IfNotPresent args: ["--mcrouter.address", "127.0.0.1:11213", "-mcrouter.server_metrics", "-web.listen-address", ":9151" ] ports: # Port names are limited to 15 characters. 
- name: mcr-metrics containerPort: 9151 livenessProbe: tcpSocket: port: mcr-metrics resources: requests: cpu: 100m memory: 50Mi limits: cpu: 500m memory: 200Mi lifecycle: preStop: exec: command: - "/bin/sh" - "-c" - "sleep 8" - name: mediawiki-main-tls-proxy image: docker-registry.discovery.wmnet/envoy:1.23.10-2-s4-20231203 imagePullPolicy: IfNotPresent env: - name: SERVICE_NAME value: main - name: SERVICE_ZONE value: "default" - name: CONCURRENCY value: "12" - name: ADMIN_PORT value: "1666" - name: DRAIN_TIME_S value: "600" - name: DRAIN_STRATEGY value: gradual ports: - containerPort: 4447 readinessProbe: httpGet: path: /healthz port: 9361 volumeMounts: - name: envoy-config-volume mountPath: /etc/envoy/ readOnly: true - name: tls-certs-volume mountPath: /etc/envoy/ssl readOnly: true lifecycle: preStop: exec: command: - "/bin/sh" - "-c" - "sleep 7" resources: limits: memory: 350Mi requests: cpu: 200m memory: 100Mi - name: mediawiki-main-rsyslog image: docker-registry.discovery.wmnet/rsyslog:8.2102.0-3 imagePullPolicy: IfNotPresent env: - name: KUBERNETES_NAMESPACE valueFrom: fieldRef: fieldPath: metadata.namespace - name: KUBERNETES_NODE valueFrom: fieldRef: fieldPath: spec.nodeName - name: KUBERNETES_POD_NAME valueFrom: fieldRef: fieldPath: metadata.name - name: KUBERNETES_RELEASE valueFrom: fieldRef: fieldPath: metadata.labels['release'] - name: KUBERNETES_DEPLOYMENT valueFrom: fieldRef: fieldPath: metadata.labels['deployment'] lifecycle: preStop: exec: command: - "/bin/sh" - "-c" - "sleep 8" resources: requests: cpu: 100m memory: 200Mi limits: cpu: 1 memory: 300Mi volumeMounts: # Mount the shared socket volume - name: mediawiki-main-rsyslog-config mountPath: /etc/rsyslog.d - name: php-logging mountPath: /var/log/php-fpm # We need to read files created by www-data and with mode # 0600 securityContext: runAsUser: 33 - name: statsd-exporter image: docker-registry.discovery.wmnet/prometheus-statsd-exporter:0.0.10-20231022 imagePullPolicy: IfNotPresent lifecycle: 
preStop: exec: command: - "/bin/sh" - "-c" - "sleep 8" resources: requests: cpu: 100m memory: 100M limits: cpu: 200m memory: 200M volumeMounts: - name: statsd-config mountPath: /etc/monitoring readOnly: true ports: - name: statsd-metrics containerPort: 9102 livenessProbe: tcpSocket: port: statsd-metrics volumes: # Apache sites - name: mediawiki-main-httpd-sites configMap: name: mediawiki-main-httpd-sites-config # Additional httpd debug configuration - name: mediawiki-main-httpd-early configMap: name: mediawiki-main-httpd-early-config # Datacenter - name: mediawiki-main-wikimedia-cluster configMap: name: mediawiki-main-wikimedia-cluster-config # sendmail configuration - name: mediawiki-main-mail configMap: name: mediawiki-main-mail-config # TLS configurations - name: envoy-config-volume configMap: name: mediawiki-main-envoy-config-volume - name: tls-certs-volume secret: secretName: mediawiki-main-tls-proxy-certs # Shared unix socket for php apps - name: shared-socket emptyDir: {} - name: mediawiki-main-wmerrors configMap: name: mediawiki-main-wmerrors # Mcrouter configuration if any # Mcrouter configuration - name: mediawiki-main-mcrouter configMap: name: mediawiki-main-mcrouter-config - name: mediawiki-main-rsyslog-config configMap: name: mediawiki-main-rsyslog-config - name: php-logging emptyDir: {} # GeoIP data - name: mediawiki-main-geoip hostPath: path: /usr/share/GeoIP - name: mediawiki-main-geoipinfo hostPath: path: /usr/share/GeoIPInfo - name: statsd-config configMap: name: mediawiki-main-statsd-configmap skipping missing values file matching "values-main.yaml" Upgrading release=main, chart=wmf-stable/mediawiki FAILED RELEASES: NAME main helmfile.yaml: basePath=. 
in ./helmfile.yaml: failed processing release main: command "/usr/bin/helm3" exited with non-zero status:
PATH:
  /usr/bin/helm3
ARGS:
  0: helm3 (5 bytes)
  1: upgrade (7 bytes)
  2: --install (9 bytes)
  3: --reset-values (14 bytes)
  4: main (4 bytes)
  5: wmf-stable/mediawiki (20 bytes)
  6: --timeout (9 bytes)
  7: 600s (4 bytes)
  8: --atomic (8 bytes)
  9: --namespace (11 bytes)
  10: mw-api-ext (10 bytes)
  11: --values (8 bytes)
  12: /tmp/values568600304 (20 bytes)
  13: --values (8 bytes)
  14: /tmp/values225672591 (20 bytes)
  15: --values (8 bytes)
  16: /tmp/values217764258 (20 bytes)
  17: --values (8 bytes)
  18: /tmp/values300505753 (20 bytes)
  19: --values (8 bytes)
  20: /tmp/values705690404 (20 bytes)
  21: --values (8 bytes)
  22: /tmp/values972405299 (20 bytes)
  23: --values (8 bytes)
  24: /tmp/values533020150 (20 bytes)
  25: --values (8 bytes)
  26: /tmp/values099846365 (20 bytes)
  27: --values (8 bytes)
  28: /tmp/values399121048 (20 bytes)
  29: --values (8 bytes)
  30: /tmp/values643515671 (20 bytes)
  31: --values (8 bytes)
  32: /tmp/values480956810 (20 bytes)
  33: --set (5 bytes)
  34: php.servergroup=kube-mw-api-ext (31 bytes)
  35: --history-max (13 bytes)
  36: 10 (2 bytes)
  37: --kubeconfig (12 bytes)
  38: /etc/kubernetes/mw-api-ext-deploy-eqiad.config (46 bytes)
ERROR:
  exit status 1
EXIT STATUS
  1
STDERR:
  WARNING: Kubernetes configuration file is group-readable. This is insecure. Location: /etc/kubernetes/mw-api-ext-deploy-eqiad.config
  Error: UPGRADE FAILED: release main failed, and has been rolled back due to atomic being set: timed out waiting for the condition
COMBINED OUTPUT:
  WARNING: Kubernetes configuration file is group-readable. This is insecure.
Location: /etc/kubernetes/mw-api-ext-deploy-eqiad.config Error: UPGRADE FAILED: release main failed, and has been rolled back due to atomic being set: timed out waiting for the condition 21:46:40 Finished Running helmfile -e eqiad --selector name=main apply in /srv/deployment-charts/helmfile.d/services/mw-api-ext (duration: 10m 14s) 21:46:41 Command '['helmfile', '-e', 'eqiad', '--selector', 'name=main', 'apply']' returned non-zero exit status 1. 21:46:41 Stdout/stderr follows: 21:46:41 helmfile.yaml: basePath=. skipping missing values file matching "/etc/helmfile-defaults/private/main_services/mw-parsoid/eqiad.yaml" skipping missing values file matching "values-main.yaml" Comparing release=main, chart=wmf-stable/mediawiki mw-parsoid, mw-parsoid.eqiad.main, Deployment (apps) has changed: # Source: mediawiki/templates/deployment.yaml.tpl apiVersion: apps/v1 kind: Deployment metadata: name: mw-parsoid.eqiad.main labels: app: mediawiki chart: mediawiki-0.6.20 release: main heritage: Helm spec: selector: matchLabels: app: mediawiki release: main replicas: 33 strategy: rollingUpdate: maxSurge: 10% maxUnavailable: 10% type: RollingUpdate template: metadata: labels: app: mediawiki release: main deployment: mw-parsoid routed_via: main annotations: checksum/sites: 797f6a05f681fdc18612dd35d27a01257cd6ca744bd52dffd1ef2925469359d2 checksum/rsyslog: 7d7443ef8588edd1f9f2f08939c56be288d0b687d759209056c89cdcb1d5936d prometheus.io/scrape_by_name: "true" checksum/prometheus-statsd: 1b90efc073a70aa3d97b1f2280969873e2dbc1a35c735da94580288291fb7214 checksum/tls-config: 03480fc8ed9a428202856c734f64eb185080b05a86c056c8bb258dfff704de87 envoyproxy.io/scrape: "true" envoyproxy.io/port: "9361" spec: # TODO: add affinity rules to ensure even distribution across rows nodeSelector: node.kubernetes.io/disk-type: ssd terminationGracePeriodSeconds: 10 containers: ### The apache httpd container # TODO: set up logging. 
See T265876 # TODO: fix virtualhosts in puppet so that the port is set to APACHE_RUN_PORT - name: mediawiki-main-httpd - image: docker-registry.discovery.wmnet/restricted/mediawiki-webserver:2024-02-29-215143-webserver + image: docker-registry.discovery.wmnet/restricted/mediawiki-webserver:2024-03-04-211219-webserver imagePullPolicy: IfNotPresent env: - name: FCGI_MODE value: FCGI_UNIX - name: SERVERGROUP value: kube-mw-parsoid - name: APACHE_RUN_PORT value: "8080" # Set the pod name as the value of the Server: header. - name: SERVER_SIGNATURE valueFrom: fieldRef: fieldPath: metadata.name # Log in ecs format - name: LOG_FORMAT value: ecs_rsyslog # Do not log monitoring requests - name: LOG_SKIP_SYSTEM value: "1" ports: - name: httpd containerPort: 8080 # PHP monitoring port - name: php-metrics containerPort: 9181 livenessProbe: tcpSocket: port: httpd readinessProbe: httpGet: # this is the simplest php script you can think of - it just returns OK. # This way, we're just testing that apache + php-fpm are ready. # mcrouter, if enabled, should have its own readiness probe probably. path: /healthz port: php-metrics lifecycle: preStop: exec: command: - "/bin/sh" - "-c" - "sleep 8" resources: requests: cpu: 200m memory: 200Mi limits: cpu: 500m memory: 400Mi volumeMounts: # Mount the shared socket volume - name: shared-socket mountPath: /run/shared # Note: we use subpaths here. Given subpaths are implemented with bind mounts, # they won't be updated when the configmap is updated. # This is ok because we're re-deploying the pods when that happens. 
- name: mediawiki-main-httpd-sites mountPath: /etc/apache2/sites-enabled/02-redirects.conf subPath: 02-redirects.conf - name: mediawiki-main-httpd-sites mountPath: /etc/apache2/sites-enabled/03-main.conf subPath: 03-main.conf - name: mediawiki-main-httpd-sites mountPath: /etc/apache2/sites-enabled/04-remnant.conf subPath: 04-remnant.conf - name: mediawiki-main-httpd-sites mountPath: /etc/apache2/sites-enabled/06-secure.wikimedia.conf subPath: 06-secure.wikimedia.conf - name: mediawiki-main-httpd-sites mountPath: /etc/apache2/sites-enabled/07-wikimania.conf subPath: 07-wikimania.conf - name: mediawiki-main-httpd-sites mountPath: /etc/apache2/sites-enabled/08-foundation.conf subPath: 08-foundation.conf - name: mediawiki-main-httpd-sites mountPath: /etc/apache2/sites-enabled/09-wikimedia.conf subPath: 09-wikimedia.conf - name: mediawiki-main-httpd-sites mountPath: /etc/apache2/sites-enabled/10-www.wikimedia.org.conf subPath: 10-www.wikimedia.org.conf - name: mediawiki-main-httpd-sites mountPath: /etc/apache2/sites-enabled/00-nonexistent.conf subPath: 00-nonexistent.conf - name: mediawiki-main-httpd-sites mountPath: /etc/apache2/sites-enabled/01-wwwportals.conf subPath: 01-wwwportals.conf # Allow us to inject configurations *before* everything else is evaluated # To this end we also pick a non-descriptive name that OTOH guarantees # the configuration will be loaded soon. # See apache.conf in the mediawiki-httpd image to see precisely when this is loaded. 
- name: mediawiki-main-httpd-early mountPath: /etc/apache2/conf-enabled/00-aaa.conf subPath: 00-aaa.conf ### The MediaWiki container - name: mediawiki-main-app - image: docker-registry.discovery.wmnet/restricted/mediawiki-multiversion:2024-02-29-215132-publish + image: docker-registry.discovery.wmnet/restricted/mediawiki-multiversion:2024-03-04-211208-publish imagePullPolicy: IfNotPresent securityContext: capabilities: add: ["SYS_PTRACE"] # This is needed to produce a slow log lifecycle: preStop: exec: command: - "/bin/sh" - "-c" - "sleep 8" env: - name: SERVERGROUP value: kube-mw-parsoid - name: FCGI_MODE value: FCGI_UNIX - name: FCGI_URL value: "unix:///run/shared/fpm-www.sock" - name: PHP__opcache__memory_consumption value: "500" - name: PHP__opcache__max_accelerated_files value: "32531" - name: PHP__opcache__interned_strings_buffer value: "50" - name: PHP__auto_prepend_file value: "/srv/mediawiki/wmf-config/PhpAutoPrepend.php" - name: FPM__request_terminate_timeout value: "60" - name: PHP__apc__shm_size value: 768M - name: FPM__pm__max_children value: "8" - name: FPM__request_slowlog_timeout value: "5" - name: PHP__display_errors value: "Off" - name: PHP__error_reporting value: "30719" - name: PHP__pcre__backtrack_limit value: "5000000" - name: PHP__max_execution_time value: "210" - name: PHP__error_log value: /var/log/php-fpm/error.log - name: FCGI_ALLOW value: "127.0.0.1" - name: FPM__slowlog value: /var/log/php-fpm/slowlog.log # See T276908 livenessProbe: exec: command: - /usr/bin/test - -S - /run/shared/fpm-www.sock initialDelaySeconds: 1 periodSeconds: 5 resources: requests: # CPU calculation: # Minimum 1 whole CPU # Multiply the amount of cpu_per_worker (float, unit: cpu, ex: 0.5 is half a CPU per worker) # by the number of configured workers + 1 (to take into account the main php-fpm process) cpu: 4.5 # RAM calculation: # Multiply 50% of the amount of memory_per_worker by the number of workers (ignoring the main php-fpm process) # Add 50% of the opcache 
size and the apc size (close to the average real consumption) memory: 2634Mi limits: # RAM calculation: # Multiply the amount of memory_per_worker by the number of workers (ignoring the main php-fpm process) # Add 50% of the opcache size and the apc size (close to the average real consumption) memory: 4634Mi volumeMounts: # TODO: use an env variable for this. - name: mediawiki-main-wikimedia-cluster mountPath: /etc/wikimedia-cluster subPath: wikimedia-cluster # mount the volume as the www-data home. This allows us to change # the configuration of this configmap without causing a mediawiki # redeployment as well, which we'd need if we were using a subpath. - name: mediawiki-main-mail mountPath: /var/www # Mount the shared socket volume - name: shared-socket mountPath: /run/shared # php-wmerrors configuration - name: mediawiki-main-wmerrors mountPath: /etc/wmerrors - name: php-logging mountPath: /var/log/php-fpm # GeoIP data - name: mediawiki-main-geoip mountPath: /usr/share/GeoIP/ readOnly: true - name: mediawiki-main-geoipinfo mountPath: /usr/share/GeoIPInfo/ readOnly: true # Add the following exporters: # php-fpm exporter # apache exporter on port 9117 - name: mediawiki-main-httpd-exporter image: docker-registry.discovery.wmnet/prometheus-apache-exporter:0.0.3 imagePullPolicy: IfNotPresent args: ["-scrape_uri", "http://127.0.0.1:9181/server-status?auto"] ports: - name: httpd-metrics containerPort: 9117 livenessProbe: tcpSocket: port: 9117 resources: requests: cpu: 100m memory: 50Mi limits: cpu: 500m memory: 200Mi lifecycle: preStop: exec: command: - "/bin/sh" - "-c" - "sleep 8" - name: mediawiki-main-php-fpm-exporter image: docker-registry.discovery.wmnet/prometheus-php-fpm-exporter:0.0.3 imagePullPolicy: IfNotPresent args: ["--endpoint=http://127.0.0.1:9181/fpm-status", "--addr=0.0.0.0:9118"] ports: - name: fpm-metrics containerPort: 9118 livenessProbe: tcpSocket: port: 9118 resources: requests: cpu: 100m memory: 50Mi limits: cpu: 500m memory: 200Mi lifecycle: 
preStop: exec: command: - "/bin/sh" - "-c" - "sleep 8" # TODO: understand how to make mcrouter use the # application CA when connecting to memcached via TLS - name: mediawiki-main-mcrouter image: docker-registry.discovery.wmnet/mcrouter:0.41.0-4-20231022 imagePullPolicy: IfNotPresent env: - name: PORT value: "11213" - name: CONFIG value: "file:/etc/mcrouter/config.json" - name: ROUTE_PREFIX value: "eqiad/mw" - name: CROSS_REGION_TO value: "250" - name: CROSS_CLUSTER_TO value: "100" - name: NUM_PROXIES value: "5" - name: PROBE_TIMEOUT value: "60000" - name: TIMEOUTS_UNTIL_TKO value: "3" # We don't want to listen to TLS here. # TODO: check if it can connect with TLS without the TLS settings. - name: USE_SSL value: "no" ports: # Please note: this port is not exposed outside of the pod. - name: mcrouter containerPort: 11213 livenessProbe: tcpSocket: port: mcrouter readinessProbe: exec: command: - /bin/healthz lifecycle: preStop: exec: command: - "/bin/sh" - "-c" - "sleep 8" volumeMounts: - name: mediawiki-main-mcrouter mountPath: /etc/mcrouter resources: requests: cpu: 200m memory: 100Mi limits: cpu: 700m memory: 200Mi - name: mediawiki-main-mcrouter-exporter image: docker-registry.discovery.wmnet/prometheus-mcrouter-exporter:0.0.1-2-20211010 imagePullPolicy: IfNotPresent args: ["--mcrouter.address", "127.0.0.1:11213", "-mcrouter.server_metrics", "-web.listen-address", ":9151" ] ports: # Port names are limited to 15 characters. 
- name: mcr-metrics containerPort: 9151 livenessProbe: tcpSocket: port: mcr-metrics resources: requests: cpu: 100m memory: 50Mi limits: cpu: 500m memory: 200Mi lifecycle: preStop: exec: command: - "/bin/sh" - "-c" - "sleep 8" - name: mediawiki-main-tls-proxy image: docker-registry.discovery.wmnet/envoy:1.23.10-2-s4-20231203 imagePullPolicy: IfNotPresent env: - name: SERVICE_NAME value: main - name: SERVICE_ZONE value: "default" - name: CONCURRENCY value: "12" - name: ADMIN_PORT value: "1666" - name: DRAIN_TIME_S value: "600" - name: DRAIN_STRATEGY value: gradual ports: - containerPort: 4452 readinessProbe: httpGet: path: /healthz port: 9361 volumeMounts: - name: envoy-config-volume mountPath: /etc/envoy/ readOnly: true - name: tls-certs-volume mountPath: /etc/envoy/ssl readOnly: true lifecycle: preStop: exec: command: - "/bin/sh" - "-c" - "sleep 7" resources: limits: memory: 350Mi requests: cpu: 200m memory: 100Mi - name: mediawiki-main-rsyslog image: docker-registry.discovery.wmnet/rsyslog:8.2102.0-3 imagePullPolicy: IfNotPresent env: - name: KUBERNETES_NAMESPACE valueFrom: fieldRef: fieldPath: metadata.namespace - name: KUBERNETES_NODE valueFrom: fieldRef: fieldPath: spec.nodeName - name: KUBERNETES_POD_NAME valueFrom: fieldRef: fieldPath: metadata.name - name: KUBERNETES_RELEASE valueFrom: fieldRef: fieldPath: metadata.labels['release'] - name: KUBERNETES_DEPLOYMENT valueFrom: fieldRef: fieldPath: metadata.labels['deployment'] lifecycle: preStop: exec: command: - "/bin/sh" - "-c" - "sleep 8" resources: requests: cpu: 100m memory: 200Mi limits: cpu: 1 memory: 300Mi volumeMounts: # Mount the shared socket volume - name: mediawiki-main-rsyslog-config mountPath: /etc/rsyslog.d - name: php-logging mountPath: /var/log/php-fpm # We need to read files created by www-data and with mode # 0600 securityContext: runAsUser: 33 - name: statsd-exporter image: docker-registry.discovery.wmnet/prometheus-statsd-exporter:0.0.10-20231022 imagePullPolicy: IfNotPresent lifecycle: 
preStop: exec: command: - "/bin/sh" - "-c" - "sleep 8" resources: requests: cpu: 100m memory: 100M limits: cpu: 200m memory: 200M volumeMounts: - name: statsd-config mountPath: /etc/monitoring readOnly: true ports: - name: statsd-metrics containerPort: 9102 livenessProbe: tcpSocket: port: statsd-metrics volumes: # Apache sites - name: mediawiki-main-httpd-sites configMap: name: mediawiki-main-httpd-sites-config # Additional httpd debug configuration - name: mediawiki-main-httpd-early configMap: name: mediawiki-main-httpd-early-config # Datacenter - name: mediawiki-main-wikimedia-cluster configMap: name: mediawiki-main-wikimedia-cluster-config # sendmail configuration - name: mediawiki-main-mail configMap: name: mediawiki-main-mail-config # TLS configurations - name: envoy-config-volume configMap: name: mediawiki-main-envoy-config-volume - name: tls-certs-volume secret: secretName: mediawiki-main-tls-proxy-certs # Shared unix socket for php apps - name: shared-socket emptyDir: {} - name: mediawiki-main-wmerrors configMap: name: mediawiki-main-wmerrors # Mcrouter configuration if any # Mcrouter configuration - name: mediawiki-main-mcrouter configMap: name: mediawiki-main-mcrouter-config - name: mediawiki-main-rsyslog-config configMap: name: mediawiki-main-rsyslog-config - name: php-logging emptyDir: {} # GeoIP data - name: mediawiki-main-geoip hostPath: path: /usr/share/GeoIP - name: mediawiki-main-geoipinfo hostPath: path: /usr/share/GeoIPInfo - name: statsd-config configMap: name: mediawiki-main-statsd-configmap skipping missing values file matching "/etc/helmfile-defaults/private/main_services/mw-parsoid/eqiad.yaml" skipping missing values file matching "values-main.yaml" Upgrading release=main, chart=wmf-stable/mediawiki FAILED RELEASES: NAME main helmfile.yaml: basePath=. 
in ./helmfile.yaml: failed processing release main: command "/usr/bin/helm3" exited with non-zero status:
PATH:
  /usr/bin/helm3
ARGS:
  0: helm3 (5 bytes)
  1: upgrade (7 bytes)
  2: --install (9 bytes)
  3: --reset-values (14 bytes)
  4: main (4 bytes)
  5: wmf-stable/mediawiki (20 bytes)
  6: --timeout (9 bytes)
  7: 600s (4 bytes)
  8: --atomic (8 bytes)
  9: --namespace (11 bytes)
  10: mw-parsoid (10 bytes)
  11: --values (8 bytes)
  12: /tmp/values618884843 (20 bytes)
  13: --values (8 bytes)
  14: /tmp/values401253198 (20 bytes)
  15: --values (8 bytes)
  16: /tmp/values183007317 (20 bytes)
  17: --values (8 bytes)
  18: /tmp/values921299376 (20 bytes)
  19: --values (8 bytes)
  20: /tmp/values029269839 (20 bytes)
  21: --values (8 bytes)
  22: /tmp/values952661090 (20 bytes)
  23: --values (8 bytes)
  24: /tmp/values893283417 (20 bytes)
  25: --values (8 bytes)
  26: /tmp/values685514724 (20 bytes)
  27: --values (8 bytes)
  28: /tmp/values397940723 (20 bytes)
  29: --values (8 bytes)
  30: /tmp/values223566006 (20 bytes)
  31: --set (5 bytes)
  32: php.servergroup=kube-mw-parsoid (31 bytes)
  33: --history-max (13 bytes)
  34: 10 (2 bytes)
  35: --kubeconfig (12 bytes)
  36: /etc/kubernetes/mw-parsoid-deploy-eqiad.config (46 bytes)
ERROR:
  exit status 1
EXIT STATUS
  1
STDERR:
  WARNING: Kubernetes configuration file is group-readable. This is insecure. Location: /etc/kubernetes/mw-parsoid-deploy-eqiad.config
  Error: UPGRADE FAILED: release main failed, and has been rolled back due to atomic being set: timed out waiting for the condition
COMBINED OUTPUT:
  WARNING: Kubernetes configuration file is group-readable. This is insecure.
Location: /etc/kubernetes/mw-parsoid-deploy-eqiad.config
Error: UPGRADE FAILED: release main failed, and has been rolled back due to atomic being set: timed out waiting for the condition
21:46:41 Finished Running helmfile -e eqiad --selector name=main apply in /srv/deployment-charts/helmfile.d/services/mw-parsoid (duration: 10m 15s)
21:46:41 K8s deployment to stage production failed: K8s deployment had the following errors:
  codfw:
    Deployment of mw-jobrunner-main failed: Command '['helmfile', '-e', 'codfw', '--selector', 'name=main', 'apply']' returned non-zero exit status 1.
    Deployment of mw-api-int-main failed: Command '['helmfile', '-e', 'codfw', '--selector', 'name=main', 'apply']' returned non-zero exit status 1.
    Deployment of mw-parsoid-main failed: Command '['helmfile', '-e', 'codfw', '--selector', 'name=main', 'apply']' returned non-zero exit status 1.
    Deployment of mw-wikifunctions-main failed: Command '['helmfile', '-e', 'codfw', '--selector', 'name=main', 'apply']' returned non-zero exit status 1.
    Deployment of mw-api-ext-main failed: Command '['helmfile', '-e', 'codfw', '--selector', 'name=main', 'apply']' returned non-zero exit status 1.
    Deployment of mw-web-main failed: Command '['helmfile', '-e', 'codfw', '--selector', 'name=main', 'apply']' returned non-zero exit status 1.
  eqiad:
    Deployment of mw-jobrunner-main failed: Command '['helmfile', '-e', 'eqiad', '--selector', 'name=main', 'apply']' returned non-zero exit status 1.
    Deployment of mw-wikifunctions-main failed: Command '['helmfile', '-e', 'eqiad', '--selector', 'name=main', 'apply']' returned non-zero exit status 1.
    Deployment of mw-api-int-main failed: Command '['helmfile', '-e', 'eqiad', '--selector', 'name=main', 'apply']' returned non-zero exit status 1.
    Deployment of mw-web-main failed: Command '['helmfile', '-e', 'eqiad', '--selector', 'name=main', 'apply']' returned non-zero exit status 1.
    Deployment of mw-api-ext-main failed: Command '['helmfile', '-e', 'eqiad', '--selector', 'name=main', 'apply']' returned non-zero exit status 1.
    Deployment of mw-parsoid-main failed: Command '['helmfile', '-e', 'eqiad', '--selector', 'name=main', 'apply']' returned non-zero exit status 1.
21:46:41 Rolling back to prior state...
21:46:41 Started Running helmfile -e codfw --selector name=main apply in /srv/deployment-charts/helmfile.d/services/mw-jobrunner
21:46:41 Started Running helmfile -e codfw --selector name=main apply in /srv/deployment-charts/helmfile.d/services/mw-web
21:46:41 Started Running helmfile -e codfw --selector name=main apply in /srv/deployment-charts/helmfile.d/services/mw-wikifunctions
21:46:42 Started Running helmfile -e codfw --selector name=main apply in /srv/deployment-charts/helmfile.d/services/mw-parsoid
21:46:42 Started Running helmfile -e codfw --selector name=main apply in /srv/deployment-charts/helmfile.d/services/mw-api-ext
21:46:42 Started Running helmfile -e eqiad --selector name=main apply in /srv/deployment-charts/helmfile.d/services/mw-jobrunner
21:46:42 Started Running helmfile -e codfw --selector name=main apply in /srv/deployment-charts/helmfile.d/services/mw-api-int
21:46:42 Started Running helmfile -e eqiad --selector name=main apply in /srv/deployment-charts/helmfile.d/services/mw-api-int
21:46:42 Started Running helmfile -e eqiad --selector name=main apply in /srv/deployment-charts/helmfile.d/services/mw-api-ext
21:46:42 Started Running helmfile -e eqiad --selector name=main apply in /srv/deployment-charts/helmfile.d/services/mw-parsoid
21:46:42 Started Running helmfile -e eqiad --selector name=main apply in /srv/deployment-charts/helmfile.d/services/mw-wikifunctions
21:46:42 Started Running helmfile -e eqiad --selector name=main apply in /srv/deployment-charts/helmfile.d/services/mw-web
21:46:44 Finished Running helmfile -e codfw --selector name=main apply in /srv/deployment-charts/helmfile.d/services/mw-wikifunctions (duration: 00m 02s)
21:46:44 Finished Running helmfile -e codfw --selector name=main apply in /srv/deployment-charts/helmfile.d/services/mw-jobrunner (duration: 00m 02s)
21:46:44 Finished Running helmfile -e codfw --selector name=main apply in /srv/deployment-charts/helmfile.d/services/mw-api-ext (duration: 00m 02s)
21:46:44 Finished Running helmfile -e codfw --selector name=main apply in /srv/deployment-charts/helmfile.d/services/mw-api-int (duration: 00m 02s)
21:46:44 Finished Running helmfile -e codfw --selector name=main apply in /srv/deployment-charts/helmfile.d/services/mw-web (duration: 00m 02s)
21:46:44 Finished Running helmfile -e codfw --selector name=main apply in /srv/deployment-charts/helmfile.d/services/mw-parsoid (duration: 00m 02s)
21:46:44 Finished Running helmfile -e eqiad --selector name=main apply in /srv/deployment-charts/helmfile.d/services/mw-jobrunner (duration: 00m 02s)
21:46:45 Finished Running helmfile -e eqiad --selector name=main apply in /srv/deployment-charts/helmfile.d/services/mw-api-int (duration: 00m 02s)
21:46:45 Finished Running helmfile -e eqiad --selector name=main apply in /srv/deployment-charts/helmfile.d/services/mw-parsoid (duration: 00m 02s)
21:46:45 Finished Running helmfile -e eqiad --selector name=main apply in /srv/deployment-charts/helmfile.d/services/mw-wikifunctions (duration: 00m 02s)
21:46:45 Finished Running helmfile -e eqiad --selector name=main apply in /srv/deployment-charts/helmfile.d/services/mw-api-ext (duration: 00m 03s)
21:46:45 Finished Running helmfile -e eqiad --selector name=main apply in /srv/deployment-charts/helmfile.d/services/mw-web (duration: 00m 02s)
21:46:45 Rollback completed
21:46:45 Finished sync-prod-k8s (duration: 10m 20s)
21:46:45 Started sync-proxies
21:46:52 sync-proxies: 100% (in-flight: 0; ok: 8; fail: 0; left: 0)
21:46:52 Per-host sync duration: average 5.3s, median 5.3s
21:46:52 rsync transfer: average 381,975 bytes/host, total 3,055,800 bytes
21:46:52 Finished sync-proxies (duration: 00m 06s)
21:46:52 Started sync-apaches
21:47:27 sync-apaches: 100% (in-flight: 0; ok: 209; fail: 0; left: 0)
21:47:27 Per-host sync duration: average 5.7s, median 4.9s
21:47:27 rsync transfer: average 381,975 bytes/host, total 79,832,775 bytes
21:47:27 Finished sync-apaches (duration: 00m 35s)
21:47:27 Started scap-cdb-rebuild
21:47:31 scap-cdb-rebuild: 100% (in-flight: 0; ok: 225; fail: 0; left: 0)
21:47:31 Finished scap-cdb-rebuild (duration: 00m 03s)
21:47:31 Started sync_wikiversions
21:47:34 sync_wikiversions: 100% (in-flight: 0; ok: 225; fail: 0; left: 0)
21:47:34 Finished sync_wikiversions (duration: 00m 03s)
21:47:34 Started php-fpm-restarts
21:47:34 Running '/usr/local/sbin/restart-php-fpm-all php7.4-fpm 9223372036854775807' on 183 host(s)
21:50:37 php-fpm-restart: 100% (in-flight: 0; ok: 183; fail: 0; left: 0)
21:50:37 Finished php-fpm-restarts (duration: 03m 02s)
21:50:37 Running /usr/local/bin/mwscript purgeMessageBlobStore.php
21:50:38 Finished scap: Backport for [[gerrit:994831|InitialiseSettings: Set wgSignatureValidation to disallow [enwiki] (T355462)]] (duration: 38m 34s)
21:50:38 backport failed: <CalledProcessError> Command '['/usr/bin/scap', 'sync-world', '--pause-after-testserver-sync', '--notify-user=houseblaster', 'Backport for [[gerrit:994831|InitialiseSettings: Set wgSignatureValidation to disallow [enwiki] (T355462)]]']' returned non-zero exit status 1.
I ended up aborting the backport window because of the errors above, which occurred while trying to deploy the first, simple config change in the queue.
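For triage, the "K8s deployment had the following errors" summaries in the output above can be condensed into a per-datacenter list of failed deployments. The following is only a minimal sketch and not part of scap: the function name failed_deployments is hypothetical, and it assumes the summary keeps the "codfw:" / "eqiad:" headers and the "Deployment of <name> failed" wording seen in this log.

```python
import re
from collections import defaultdict

# Either a datacenter header ("codfw:" / "eqiad:", the two names seen in the log)
# or a "Deployment of <name> failed" entry from the scap failure summary.
TOKEN = re.compile(r"\b(?P<dc>codfw|eqiad):|Deployment of (?P<name>[\w-]+) failed")


def failed_deployments(log_text):
    """Group failed deployment names under the datacenter header preceding them."""
    failures = defaultdict(list)
    current_dc = None
    for match in TOKEN.finditer(log_text):
        if match.group("dc"):
            current_dc = match.group("dc")
        elif current_dc is not None:
            failures[current_dc].append(match.group("name"))
    return dict(failures)


if __name__ == "__main__":
    # Shortened sample in the same shape as the summary above.
    sample = (
        "K8s deployment had the following errors: "
        "codfw: Deployment of mw-web-canary failed: Command '['helmfile', '-e', "
        "'codfw', '--selector', 'name=canary', 'apply']' returned non-zero exit status 1. "
        "eqiad: Deployment of mw-parsoid-canary failed: Command '['helmfile', '-e', "
        "'eqiad', '--selector', 'name=canary', 'apply']' returned non-zero exit status 1."
    )
    for dc, names in failed_deployments(sample).items():
        print(dc, ", ".join(names))
```

The quoted helmfile arguments ('codfw', 'eqiad') are followed by a quote rather than a colon, so they are not mistaken for datacenter headers.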