Slow and failed deployments
Closed, Resolved, Public

Description

During the UTC late backport window on March 4, 2024, running scap backport on the first patch got stuck while deploying to the test servers. My terminal reported Helm rollbacks, and the K8s deployments failed:

21:35:39 Finished Running helmfile -e eqiad --selector name=canary apply in /srv/deployment-charts/helmfile.d/services/mw-parsoid (duration: 10m 12s)
21:35:39 K8s deployment to stage canaries failed: K8s deployment had the following errors:
 codfw: Deployment of mw-jobrunner-canary failed: Command '['helmfile', '-e', 'codfw', '--selector', 'name=canary', 'apply']' returned non-zero exit status 1.
Deployment of mw-api-ext-canary failed: Command '['helmfile', '-e', 'codfw', '--selector', 'name=canary', 'apply']' returned non-zero exit status 1.
Deployment of mw-api-int-canary failed: Command '['helmfile', '-e', 'codfw', '--selector', 'name=canary', 'apply']' returned non-zero exit status 1.
Deployment of mw-parsoid-canary failed: Command '['helmfile', '-e', 'codfw', '--selector', 'name=canary', 'apply']' returned non-zero exit status 1.
Deployment of mw-web-canary failed: Command '['helmfile', '-e', 'codfw', '--selector', 'name=canary', 'apply']' returned non-zero exit status 1.
eqiad: Deployment of mw-jobrunner-canary failed: Command '['helmfile', '-e', 'eqiad', '--selector', 'name=canary', 'apply']' returned non-zero exit status 1.
Deployment of mw-web-canary failed: Command '['helmfile', '-e', 'eqiad', '--selector', 'name=canary', 'apply']' returned non-zero exit status 1.
Deployment of mw-api-int-canary failed: Command '['helmfile', '-e', 'eqiad', '--selector', 'name=canary', 'apply']' returned non-zero exit status 1.
Deployment of mw-api-ext-canary failed: Command '['helmfile', '-e', 'eqiad', '--selector', 'name=canary', 'apply']' returned non-zero exit status 1.
Deployment of mw-parsoid-canary failed: Command '['helmfile', '-e', 'eqiad', '--selector', 'name=canary', 'apply']' returned non-zero exit status 1.
21:35:39 Rolling back to prior state...
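
For triage, the failure lines above can be grouped by datacenter. The following is a minimal sketch, not part of scap: the function name failures_by_environment and the regular expression are my own assumptions, written only against the line format quoted in this task.

```python
import re
from collections import defaultdict

# Matches one failure line in the format quoted in this task, e.g.
# "Deployment of mw-web-canary failed: Command '[...]' returned non-zero exit status 1."
# The pattern is an assumption based on that sample, not a scap-defined format.
FAILURE_RE = re.compile(
    r"Deployment of (?P<name>[\w-]+) failed: "
    r"Command '\[(?P<argv>.*?)\]' returned non-zero exit status (?P<status>\d+)"
)


def failures_by_environment(log_text):
    """Group failed deployment names by the value passed to helmfile's -e flag."""
    grouped = defaultdict(list)
    for match in FAILURE_RE.finditer(log_text):
        # Recover the quoted argv entries: 'helmfile', '-e', 'codfw', ...
        argv = re.findall(r"'([^']*)'", match.group("argv"))
        env = argv[argv.index("-e") + 1] if "-e" in argv else "unknown"
        grouped[env].append(match.group("name"))
    return {env: sorted(names) for env, names in grouped.items()}


# Two lines copied from the output above, used as sample input.
sample = """\
 codfw: Deployment of mw-jobrunner-canary failed: Command '['helmfile', '-e', 'codfw', '--selector', 'name=canary', 'apply']' returned non-zero exit status 1.
Deployment of mw-web-canary failed: Command '['helmfile', '-e', 'eqiad', '--selector', 'name=canary', 'apply']' returned non-zero exit status 1.
"""

if __name__ == "__main__":
    print(failures_by_environment(sample))
```

Run against the full paste, this should list all five canary releases under both codfw and eqiad, which matches the reading that every canary deployment in both datacenters failed.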

I tried to sync and got the following errors in my terminal while running scap backport on deploy2002.codfw.wmnet:

                  - "/bin/sh"
                  - "-c"
                  - "sleep 8"
          - name: mediawiki-main-php-fpm-exporter
            image: docker-registry.discovery.wmnet/prometheus-php-fpm-exporter:0.0.3
            imagePullPolicy: IfNotPresent
            args: ["--endpoint=http://127.0.0.1:9181/fpm-status", "--addr=0.0.0.0:9118"]
            ports:
              - name: fpm-metrics
                containerPort: 9118
            livenessProbe:
              tcpSocket:
                port: 9118
            resources:
              requests:
                cpu: 100m
                memory: 50Mi
              limits:
                cpu: 500m
                memory: 200Mi

            lifecycle:
              preStop:
                exec:
                  command:
                  - "/bin/sh"
                  - "-c"
                  - "sleep 8"

          # TODO: understand how to make mcrouter use the
          # application CA when connecting to memcached via TLS
          - name: mediawiki-main-mcrouter
            image: docker-registry.discovery.wmnet/mcrouter:0.41.0-4-20231022
            imagePullPolicy: IfNotPresent
            env:
              - name: PORT
                value: "11213"
              - name: CONFIG
                value: "file:/etc/mcrouter/config.json"
              - name: ROUTE_PREFIX
                value: "eqiad/mw"
              - name: CROSS_REGION_TO
                value: "250"
              - name: CROSS_CLUSTER_TO
                value: "100"
              - name: NUM_PROXIES
                value: "5"
              - name: PROBE_TIMEOUT
                value: "60000"
              - name: TIMEOUTS_UNTIL_TKO
                value: "3"
              # We don't want to listen to TLS here.
              # TODO: check if it can connect with TLS without the TLS settings.
              - name: USE_SSL
                value: "no"
            ports:
            # Please note: this port is not exposed outside of the pod.
              - name: mcrouter
                containerPort: 11213
            livenessProbe:
              tcpSocket:
                port: mcrouter
            readinessProbe:
              exec:
                command:
                  - /bin/healthz

            lifecycle:
              preStop:
                exec:
                  command:
                  - "/bin/sh"
                  - "-c"
                  - "sleep 8"
            volumeMounts:
              - name: mediawiki-main-mcrouter
                mountPath: /etc/mcrouter
            resources:
              requests:
                cpu: 200m
                memory: 100Mi
              limits:
                cpu: 700m
                memory: 200Mi
          - name: mediawiki-main-mcrouter-exporter
            image: docker-registry.discovery.wmnet/prometheus-mcrouter-exporter:0.0.1-2-20211010
            imagePullPolicy: IfNotPresent
            args: ["--mcrouter.address", "127.0.0.1:11213", "-mcrouter.server_metrics", "-web.listen-address", ":9151" ]
            ports:
            # Port names are limited to 15 characters.
            - name: mcr-metrics
              containerPort: 9151
            livenessProbe:
              tcpSocket:
                port: mcr-metrics
            resources:
              requests:
                cpu: 100m
                memory: 50Mi
              limits:
                cpu: 500m
                memory: 200Mi

            lifecycle:
              preStop:
                exec:
                  command:
                  - "/bin/sh"
                  - "-c"
                  - "sleep 8"
          - name: mediawiki-main-tls-proxy
            image: docker-registry.discovery.wmnet/envoy:1.23.10-2-s4-20231203
            imagePullPolicy: IfNotPresent
            env:
              - name: SERVICE_NAME
                value: main
              - name: SERVICE_ZONE
                value: "default"
              - name: CONCURRENCY
                value: "12"
              - name: ADMIN_PORT
                value: "1666"
              - name: DRAIN_TIME_S
                value: "600"
              - name: DRAIN_STRATEGY
                value: gradual
            ports:
              - containerPort: 4447
            readinessProbe:
              httpGet:
                path: /healthz
                port: 9361
            volumeMounts:
              - name: envoy-config-volume
                mountPath: /etc/envoy/
                readOnly: true
              - name: tls-certs-volume
                mountPath: /etc/envoy/ssl
                readOnly: true

            lifecycle:
              preStop:
                exec:
                  command:
                  - "/bin/sh"
                  - "-c"
                  - "sleep 7"

            resources:
              limits:
                memory: 350Mi
              requests:
                cpu: 200m
                memory: 100Mi
          - name: mediawiki-main-rsyslog
            image: docker-registry.discovery.wmnet/rsyslog:8.2102.0-3
            imagePullPolicy: IfNotPresent
            env:
              - name: KUBERNETES_NAMESPACE
                valueFrom:
                  fieldRef:
                    fieldPath: metadata.namespace
              - name: KUBERNETES_NODE
                valueFrom:
                  fieldRef:
                    fieldPath: spec.nodeName
              - name: KUBERNETES_POD_NAME
                valueFrom:
                  fieldRef:
                    fieldPath: metadata.name
              - name: KUBERNETES_RELEASE
                valueFrom:
                  fieldRef:
                    fieldPath: metadata.labels['release']
              - name: KUBERNETES_DEPLOYMENT
                valueFrom:
                  fieldRef:
                    fieldPath: metadata.labels['deployment']

            lifecycle:
              preStop:
                exec:
                  command:
                  - "/bin/sh"
                  - "-c"
                  - "sleep 8"
            resources:
              requests:
                cpu: 100m
                memory: 200Mi
              limits:
                cpu: 1
                memory: 300Mi
            volumeMounts:
              # Mount the shared socket volume
            - name: mediawiki-main-rsyslog-config
              mountPath: /etc/rsyslog.d
            - name: php-logging
              mountPath: /var/log/php-fpm
            # We need to read files created by www-data and with mode
            # 0600
            securityContext:
              runAsUser: 33
          - name: statsd-exporter
            image: docker-registry.discovery.wmnet/prometheus-statsd-exporter:0.0.10-20231022
            imagePullPolicy: IfNotPresent

            lifecycle:
              preStop:
                exec:
                  command:
                  - "/bin/sh"
                  - "-c"
                  - "sleep 8"
            resources:
              requests:
                cpu: 100m
                memory: 100M
              limits:
                cpu: 200m
                memory: 200M
            volumeMounts:
              - name: statsd-config
                mountPath: /etc/monitoring
                readOnly: true
            ports:
            - name: statsd-metrics
              containerPort: 9102
            livenessProbe:
              tcpSocket:
                port: statsd-metrics
        volumes:

          # Apache sites
          - name: mediawiki-main-httpd-sites
            configMap:
              name: mediawiki-main-httpd-sites-config
          # Additional httpd debug configuration
          - name: mediawiki-main-httpd-early
            configMap:
              name: mediawiki-main-httpd-early-config
          # Datacenter
          - name: mediawiki-main-wikimedia-cluster
            configMap:
              name: mediawiki-main-wikimedia-cluster-config
          # sendmail configuration
          - name: mediawiki-main-mail
            configMap:
              name: mediawiki-main-mail-config
          # TLS configurations
          - name: envoy-config-volume
            configMap:
              name: mediawiki-main-envoy-config-volume
          - name: tls-certs-volume
            secret:
              secretName: mediawiki-main-tls-proxy-certs
          # Shared unix socket for php apps
          - name: shared-socket
            emptyDir: {}
          - name: mediawiki-main-wmerrors
            configMap:
              name: mediawiki-main-wmerrors
          # Mcrouter configuration if any

          # Mcrouter configuration
          - name: mediawiki-main-mcrouter
            configMap:
              name: mediawiki-main-mcrouter-config
          - name: mediawiki-main-rsyslog-config
            configMap:
              name: mediawiki-main-rsyslog-config
          - name: php-logging
            emptyDir: {}
          # GeoIP data
          - name: mediawiki-main-geoip
            hostPath:
              path: /usr/share/GeoIP
          - name: mediawiki-main-geoipinfo
            hostPath:
              path: /usr/share/GeoIPInfo

          - name: statsd-config
            configMap:
              name: mediawiki-main-statsd-configmap

skipping missing values file matching "values-main.yaml"
Upgrading release=main, chart=wmf-stable/mediawiki

FAILED RELEASES:
NAME
main
helmfile.yaml: basePath=.
in ./helmfile.yaml: failed processing release main: command "/usr/bin/helm3" exited with non-zero status:

PATH:
  /usr/bin/helm3

ARGS:
  0: helm3 (5 bytes)
  1: upgrade (7 bytes)
  2: --install (9 bytes)
  3: --reset-values (14 bytes)
  4: main (4 bytes)
  5: wmf-stable/mediawiki (20 bytes)
  6: --timeout (9 bytes)
  7: 600s (4 bytes)
  8: --atomic (8 bytes)
  9: --namespace (11 bytes)
  10: mw-api-ext (10 bytes)
  11: --values (8 bytes)
  12: /tmp/values568600304 (20 bytes)
  13: --values (8 bytes)
  14: /tmp/values225672591 (20 bytes)
  15: --values (8 bytes)
  16: /tmp/values217764258 (20 bytes)
  17: --values (8 bytes)
  18: /tmp/values300505753 (20 bytes)
  19: --values (8 bytes)
  20: /tmp/values705690404 (20 bytes)
  21: --values (8 bytes)
  22: /tmp/values972405299 (20 bytes)
  23: --values (8 bytes)
  24: /tmp/values533020150 (20 bytes)
  25: --values (8 bytes)
  26: /tmp/values099846365 (20 bytes)
  27: --values (8 bytes)
  28: /tmp/values399121048 (20 bytes)
  29: --values (8 bytes)
  30: /tmp/values643515671 (20 bytes)
  31: --values (8 bytes)
  32: /tmp/values480956810 (20 bytes)
  33: --set (5 bytes)
  34: php.servergroup=kube-mw-api-ext (31 bytes)
  35: --history-max (13 bytes)
  36: 10 (2 bytes)
  37: --kubeconfig (12 bytes)
  38: /etc/kubernetes/mw-api-ext-deploy-eqiad.config (46 bytes)

ERROR:
  exit status 1

EXIT STATUS
  1

STDERR:
  WARNING: Kubernetes configuration file is group-readable. This is insecure. Location: /etc/kubernetes/mw-api-ext-deploy-eqiad.config
  Error: UPGRADE FAILED: release main failed, and has been rolled back due to atomic being set: timed out waiting for the condition

COMBINED OUTPUT:
  WARNING: Kubernetes configuration file is group-readable. This is insecure. Location: /etc/kubernetes/mw-api-ext-deploy-eqiad.config
  Error: UPGRADE FAILED: release main failed, and has been rolled back due to atomic being set: timed out waiting for the condition

21:46:40 Finished Running helmfile -e eqiad --selector name=main apply in /srv/deployment-charts/helmfile.d/services/mw-api-ext (duration: 10m 14s)
21:46:41 Command '['helmfile', '-e', 'eqiad', '--selector', 'name=main', 'apply']' returned non-zero exit status 1.
21:46:41 Stdout/stderr follows:
21:46:41 helmfile.yaml: basePath=.
skipping missing values file matching "/etc/helmfile-defaults/private/main_services/mw-parsoid/eqiad.yaml"
skipping missing values file matching "values-main.yaml"
Comparing release=main, chart=wmf-stable/mediawiki
mw-parsoid, mw-parsoid.eqiad.main, Deployment (apps) has changed:
  # Source: mediawiki/templates/deployment.yaml.tpl
  apiVersion: apps/v1
  kind: Deployment
  metadata:
    name: mw-parsoid.eqiad.main
    labels:
      app: mediawiki
      chart: mediawiki-0.6.20
      release: main
      heritage: Helm

  spec:
    selector:
      matchLabels:
        app: mediawiki
        release: main
    replicas: 33
    strategy:
        rollingUpdate:
          maxSurge: 10%
          maxUnavailable: 10%
        type: RollingUpdate
    template:
      metadata:
        labels:
          app: mediawiki
          release: main
          deployment: mw-parsoid
          routed_via: main
        annotations:
          checksum/sites: 797f6a05f681fdc18612dd35d27a01257cd6ca744bd52dffd1ef2925469359d2

          checksum/rsyslog: 7d7443ef8588edd1f9f2f08939c56be288d0b687d759209056c89cdcb1d5936d

          prometheus.io/scrape_by_name: "true"
          checksum/prometheus-statsd: 1b90efc073a70aa3d97b1f2280969873e2dbc1a35c735da94580288291fb7214
          checksum/tls-config: 03480fc8ed9a428202856c734f64eb185080b05a86c056c8bb258dfff704de87
          envoyproxy.io/scrape: "true"
          envoyproxy.io/port: "9361"
      spec:
        # TODO: add affinity rules to ensure even distribution across rows
        nodeSelector:
          node.kubernetes.io/disk-type: ssd
        terminationGracePeriodSeconds: 10
        containers:

          ### The apache httpd container
          # TODO: set up logging. See T265876
          # TODO: fix virtualhosts in puppet so that the port is set to APACHE_RUN_PORT
          - name: mediawiki-main-httpd
-           image: docker-registry.discovery.wmnet/restricted/mediawiki-webserver:2024-02-29-215143-webserver
+           image: docker-registry.discovery.wmnet/restricted/mediawiki-webserver:2024-03-04-211219-webserver
            imagePullPolicy: IfNotPresent
            env:
            - name: FCGI_MODE
              value: FCGI_UNIX
            - name: SERVERGROUP
              value: kube-mw-parsoid
            - name: APACHE_RUN_PORT
              value: "8080"
            # Set the pod name as the value of the Server: header.
            - name: SERVER_SIGNATURE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            # Log in ecs format
            - name: LOG_FORMAT
              value: ecs_rsyslog
            # Do not log monitoring requests
            - name: LOG_SKIP_SYSTEM
              value: "1"
            ports:
            - name: httpd
              containerPort: 8080
            # PHP monitoring port
            - name: php-metrics
              containerPort: 9181
            livenessProbe:
              tcpSocket:
                port: httpd
            readinessProbe:
              httpGet:
                # this is the simplest php script you can think of - it just returns OK.
                # This way, we're just testing that apache + php-fpm are ready.
                # mcrouter, if enabled, should have its own readiness probe probably.
                path: /healthz
                port: php-metrics

            lifecycle:
              preStop:
                exec:
                  command:
                  - "/bin/sh"
                  - "-c"
                  - "sleep 8"
            resources:
              requests:
                cpu: 200m
                memory: 200Mi
              limits:
                cpu: 500m
                memory: 400Mi
            volumeMounts:
              # Mount the shared socket volume
            - name: shared-socket
              mountPath: /run/shared
            # Note: we use subpaths here. Given subpaths are implemented with bind mounts,
            # they won't be updated when the configmap is updated.
            # This is ok because we're re-deploying the pods when that happens.
            - name: mediawiki-main-httpd-sites
              mountPath: /etc/apache2/sites-enabled/02-redirects.conf
              subPath: 02-redirects.conf
            - name: mediawiki-main-httpd-sites
              mountPath: /etc/apache2/sites-enabled/03-main.conf
              subPath: 03-main.conf
            - name: mediawiki-main-httpd-sites
              mountPath: /etc/apache2/sites-enabled/04-remnant.conf
              subPath: 04-remnant.conf
            - name: mediawiki-main-httpd-sites
              mountPath: /etc/apache2/sites-enabled/06-secure.wikimedia.conf
              subPath: 06-secure.wikimedia.conf
            - name: mediawiki-main-httpd-sites
              mountPath: /etc/apache2/sites-enabled/07-wikimania.conf
              subPath: 07-wikimania.conf
            - name: mediawiki-main-httpd-sites
              mountPath: /etc/apache2/sites-enabled/08-foundation.conf
              subPath: 08-foundation.conf
            - name: mediawiki-main-httpd-sites
              mountPath: /etc/apache2/sites-enabled/09-wikimedia.conf
              subPath: 09-wikimedia.conf
            - name: mediawiki-main-httpd-sites
              mountPath: /etc/apache2/sites-enabled/10-www.wikimedia.org.conf
              subPath: 10-www.wikimedia.org.conf
            - name: mediawiki-main-httpd-sites
              mountPath: /etc/apache2/sites-enabled/00-nonexistent.conf
              subPath: 00-nonexistent.conf
            - name: mediawiki-main-httpd-sites
              mountPath: /etc/apache2/sites-enabled/01-wwwportals.conf
              subPath: 01-wwwportals.conf
            # Allow us to inject configurations *before* everything else is evaluated
            # To this end we also pick a non-descriptive name that OTOH guarantees
            # the configuration will be loaded soon.
            # See apache.conf in the mediawiki-httpd image to see precisely when this is loaded.
            - name: mediawiki-main-httpd-early
              mountPath: /etc/apache2/conf-enabled/00-aaa.conf
              subPath: 00-aaa.conf
          ### The MediaWiki container
          - name: mediawiki-main-app
-           image: docker-registry.discovery.wmnet/restricted/mediawiki-multiversion:2024-02-29-215132-publish
+           image: docker-registry.discovery.wmnet/restricted/mediawiki-multiversion:2024-03-04-211208-publish
            imagePullPolicy: IfNotPresent
            securityContext:
              capabilities:
                add: ["SYS_PTRACE"] # This is needed to produce a slow log

            lifecycle:
              preStop:
                exec:
                  command:
                  - "/bin/sh"
                  - "-c"
                  - "sleep 8"
            env:
            - name: SERVERGROUP
              value: kube-mw-parsoid
            - name: FCGI_MODE
              value: FCGI_UNIX
            - name: FCGI_URL
              value: "unix:///run/shared/fpm-www.sock"
            - name: PHP__opcache__memory_consumption
              value: "500"
            - name: PHP__opcache__max_accelerated_files
              value: "32531"
            - name: PHP__opcache__interned_strings_buffer
              value: "50"
            - name: PHP__auto_prepend_file
              value: "/srv/mediawiki/wmf-config/PhpAutoPrepend.php"
            - name: FPM__request_terminate_timeout
              value: "60"
            - name: PHP__apc__shm_size
              value: 768M
            - name: FPM__pm__max_children
              value: "8"
            - name: FPM__request_slowlog_timeout
              value: "5"
            - name: PHP__display_errors
              value: "Off"
            - name: PHP__error_reporting
              value: "30719"
            - name: PHP__pcre__backtrack_limit
              value: "5000000"
            - name: PHP__max_execution_time
              value: "210"
            - name: PHP__error_log
              value: /var/log/php-fpm/error.log
            - name: FCGI_ALLOW
              value: "127.0.0.1"
            - name: FPM__slowlog
              value: /var/log/php-fpm/slowlog.log
            # See T276908
            livenessProbe:

              exec:
                command:
                - /usr/bin/test
                - -S
                - /run/shared/fpm-www.sock
              initialDelaySeconds: 1
              periodSeconds: 5
            resources:
              requests:
                # CPU calculation:
                # Minimum 1 whole CPU
                # Multiply the amount of cpu_per_worker (float, unit: cpu, ex: 0.5 is half a CPU per worker)
                # by the number of configured workers + 1 (to take into account the main php-fpm process)
                cpu: 4.5
                # RAM calculation:
                # Multiply 50% of the amount of memory_per_worker by the number of workers (ignoring the main php-fpm process)
                # Add 50% of the opcache size and the apc size (close to the average real consumption)
                memory: 2634Mi
              limits:
                # RAM calculation:
                # Multiply the amount of memory_per_worker by the number of workers (ignoring the main php-fpm process)
                # Add 50% of the opcache size and the apc size (close to the average real consumption)
                memory: 4634Mi
            volumeMounts:
            # TODO: use an env variable for this.
            - name: mediawiki-main-wikimedia-cluster
              mountPath: /etc/wikimedia-cluster
              subPath: wikimedia-cluster
            # mount the volume as the www-data home. This allows us to change
            # the configuration of this configmap without causing a mediawiki
            # redeployment as well, which we'd need if we were using a subpath.
            - name: mediawiki-main-mail
              mountPath: /var/www

            # Mount the shared socket volume
            - name: shared-socket
              mountPath: /run/shared

            # php-wmerrors configuration
            - name: mediawiki-main-wmerrors
              mountPath: /etc/wmerrors

            - name: php-logging
              mountPath: /var/log/php-fpm
            # GeoIP data
            - name: mediawiki-main-geoip
              mountPath: /usr/share/GeoIP/
              readOnly: true
            - name: mediawiki-main-geoipinfo
              mountPath: /usr/share/GeoIPInfo/
              readOnly: true
          # Add the following exporters:
          # php-fpm exporter
          # apache exporter on port 9117
          - name: mediawiki-main-httpd-exporter
            image: docker-registry.discovery.wmnet/prometheus-apache-exporter:0.0.3
            imagePullPolicy: IfNotPresent
            args: ["-scrape_uri", "http://127.0.0.1:9181/server-status?auto"]
            ports:
              - name: httpd-metrics
                containerPort: 9117
            livenessProbe:
              tcpSocket:
                port: 9117
            resources:
              requests:
                cpu: 100m
                memory: 50Mi
              limits:
                cpu: 500m
                memory: 200Mi

            lifecycle:
              preStop:
                exec:
                  command:
                  - "/bin/sh"
                  - "-c"
                  - "sleep 8"
          - name: mediawiki-main-php-fpm-exporter
            image: docker-registry.discovery.wmnet/prometheus-php-fpm-exporter:0.0.3
            imagePullPolicy: IfNotPresent
            args: ["--endpoint=http://127.0.0.1:9181/fpm-status", "--addr=0.0.0.0:9118"]
            ports:
              - name: fpm-metrics
                containerPort: 9118
            livenessProbe:
              tcpSocket:
                port: 9118
            resources:
              requests:
                cpu: 100m
                memory: 50Mi
              limits:
                cpu: 500m
                memory: 200Mi

            lifecycle:
              preStop:
                exec:
                  command:
                  - "/bin/sh"
                  - "-c"
                  - "sleep 8"

          # TODO: understand how to make mcrouter use the
          # application CA when connecting to memcached via TLS
          - name: mediawiki-main-mcrouter
            image: docker-registry.discovery.wmnet/mcrouter:0.41.0-4-20231022
            imagePullPolicy: IfNotPresent
            env:
              - name: PORT
                value: "11213"
              - name: CONFIG
                value: "file:/etc/mcrouter/config.json"
              - name: ROUTE_PREFIX
                value: "eqiad/mw"
              - name: CROSS_REGION_TO
                value: "250"
              - name: CROSS_CLUSTER_TO
                value: "100"
              - name: NUM_PROXIES
                value: "5"
              - name: PROBE_TIMEOUT
                value: "60000"
              - name: TIMEOUTS_UNTIL_TKO
                value: "3"
              # We don't want to listen to TLS here.
              # TODO: check if it can connect with TLS without the TLS settings.
              - name: USE_SSL
                value: "no"
            ports:
            # Please note: this port is not exposed outside of the pod.
              - name: mcrouter
                containerPort: 11213
            livenessProbe:
              tcpSocket:
                port: mcrouter
            readinessProbe:
              exec:
                command:
                  - /bin/healthz

            lifecycle:
              preStop:
                exec:
                  command:
                  - "/bin/sh"
                  - "-c"
                  - "sleep 8"
            volumeMounts:
              - name: mediawiki-main-mcrouter
                mountPath: /etc/mcrouter
            resources:
              requests:
                cpu: 200m
                memory: 100Mi
              limits:
                cpu: 700m
                memory: 200Mi
          - name: mediawiki-main-mcrouter-exporter
            image: docker-registry.discovery.wmnet/prometheus-mcrouter-exporter:0.0.1-2-20211010
            imagePullPolicy: IfNotPresent
            args: ["--mcrouter.address", "127.0.0.1:11213", "-mcrouter.server_metrics", "-web.listen-address", ":9151" ]
            ports:
            # Port names are limited to 15 characters.
            - name: mcr-metrics
              containerPort: 9151
            livenessProbe:
              tcpSocket:
                port: mcr-metrics
            resources:
              requests:
                cpu: 100m
                memory: 50Mi
              limits:
                cpu: 500m
                memory: 200Mi

            lifecycle:
              preStop:
                exec:
                  command:
                  - "/bin/sh"
                  - "-c"
                  - "sleep 8"
          - name: mediawiki-main-tls-proxy
            image: docker-registry.discovery.wmnet/envoy:1.23.10-2-s4-20231203
            imagePullPolicy: IfNotPresent
            env:
              - name: SERVICE_NAME
                value: main
              - name: SERVICE_ZONE
                value: "default"
              - name: CONCURRENCY
                value: "12"
              - name: ADMIN_PORT
                value: "1666"
              - name: DRAIN_TIME_S
                value: "600"
              - name: DRAIN_STRATEGY
                value: gradual
            ports:
              - containerPort: 4452
            readinessProbe:
              httpGet:
                path: /healthz
                port: 9361
            volumeMounts:
              - name: envoy-config-volume
                mountPath: /etc/envoy/
                readOnly: true
              - name: tls-certs-volume
                mountPath: /etc/envoy/ssl
                readOnly: true

            lifecycle:
              preStop:
                exec:
                  command:
                  - "/bin/sh"
                  - "-c"
                  - "sleep 7"

            resources:
              limits:
                memory: 350Mi
              requests:
                cpu: 200m
                memory: 100Mi
          - name: mediawiki-main-rsyslog
            image: docker-registry.discovery.wmnet/rsyslog:8.2102.0-3
            imagePullPolicy: IfNotPresent
            env:
              - name: KUBERNETES_NAMESPACE
                valueFrom:
                  fieldRef:
                    fieldPath: metadata.namespace
              - name: KUBERNETES_NODE
                valueFrom:
                  fieldRef:
                    fieldPath: spec.nodeName
              - name: KUBERNETES_POD_NAME
                valueFrom:
                  fieldRef:
                    fieldPath: metadata.name
              - name: KUBERNETES_RELEASE
                valueFrom:
                  fieldRef:
                    fieldPath: metadata.labels['release']
              - name: KUBERNETES_DEPLOYMENT
                valueFrom:
                  fieldRef:
                    fieldPath: metadata.labels['deployment']

            lifecycle:
              preStop:
                exec:
                  command:
                  - "/bin/sh"
                  - "-c"
                  - "sleep 8"
            resources:
              requests:
                cpu: 100m
                memory: 200Mi
              limits:
                cpu: 1
                memory: 300Mi
            volumeMounts:
              # Mount the shared socket volume
            - name: mediawiki-main-rsyslog-config
              mountPath: /etc/rsyslog.d
            - name: php-logging
              mountPath: /var/log/php-fpm
            # We need to read files created by www-data and with mode
            # 0600
            securityContext:
              runAsUser: 33
          - name: statsd-exporter
            image: docker-registry.discovery.wmnet/prometheus-statsd-exporter:0.0.10-20231022
            imagePullPolicy: IfNotPresent

            lifecycle:
              preStop:
                exec:
                  command:
                  - "/bin/sh"
                  - "-c"
                  - "sleep 8"
            resources:
              requests:
                cpu: 100m
                memory: 100M
              limits:
                cpu: 200m
                memory: 200M
            volumeMounts:
              - name: statsd-config
                mountPath: /etc/monitoring
                readOnly: true
            ports:
            - name: statsd-metrics
              containerPort: 9102
            livenessProbe:
              tcpSocket:
                port: statsd-metrics
        volumes:

          # Apache sites
          - name: mediawiki-main-httpd-sites
            configMap:
              name: mediawiki-main-httpd-sites-config
          # Additional httpd debug configuration
          - name: mediawiki-main-httpd-early
            configMap:
              name: mediawiki-main-httpd-early-config
          # Datacenter
          - name: mediawiki-main-wikimedia-cluster
            configMap:
              name: mediawiki-main-wikimedia-cluster-config
          # sendmail configuration
          - name: mediawiki-main-mail
            configMap:
              name: mediawiki-main-mail-config
          # TLS configurations
          - name: envoy-config-volume
            configMap:
              name: mediawiki-main-envoy-config-volume
          - name: tls-certs-volume
            secret:
              secretName: mediawiki-main-tls-proxy-certs
          # Shared unix socket for php apps
          - name: shared-socket
            emptyDir: {}
          - name: mediawiki-main-wmerrors
            configMap:
              name: mediawiki-main-wmerrors
          # Mcrouter configuration if any

          # Mcrouter configuration
          - name: mediawiki-main-mcrouter
            configMap:
              name: mediawiki-main-mcrouter-config
          - name: mediawiki-main-rsyslog-config
            configMap:
              name: mediawiki-main-rsyslog-config
          - name: php-logging
            emptyDir: {}
          # GeoIP data
          - name: mediawiki-main-geoip
            hostPath:
              path: /usr/share/GeoIP
          - name: mediawiki-main-geoipinfo
            hostPath:
              path: /usr/share/GeoIPInfo

          - name: statsd-config
            configMap:
              name: mediawiki-main-statsd-configmap

skipping missing values file matching "/etc/helmfile-defaults/private/main_services/mw-parsoid/eqiad.yaml"
skipping missing values file matching "values-main.yaml"
Upgrading release=main, chart=wmf-stable/mediawiki

FAILED RELEASES:
NAME
main
helmfile.yaml: basePath=.
in ./helmfile.yaml: failed processing release main: command "/usr/bin/helm3" exited with non-zero status:

PATH:
  /usr/bin/helm3

ARGS:
  0: helm3 (5 bytes)
  1: upgrade (7 bytes)
  2: --install (9 bytes)
  3: --reset-values (14 bytes)
  4: main (4 bytes)
  5: wmf-stable/mediawiki (20 bytes)
  6: --timeout (9 bytes)
  7: 600s (4 bytes)
  8: --atomic (8 bytes)
  9: --namespace (11 bytes)
  10: mw-parsoid (10 bytes)
  11: --values (8 bytes)
  12: /tmp/values618884843 (20 bytes)
  13: --values (8 bytes)
  14: /tmp/values401253198 (20 bytes)
  15: --values (8 bytes)
  16: /tmp/values183007317 (20 bytes)
  17: --values (8 bytes)
  18: /tmp/values921299376 (20 bytes)
  19: --values (8 bytes)
  20: /tmp/values029269839 (20 bytes)
  21: --values (8 bytes)
  22: /tmp/values952661090 (20 bytes)
  23: --values (8 bytes)
  24: /tmp/values893283417 (20 bytes)
  25: --values (8 bytes)
  26: /tmp/values685514724 (20 bytes)
  27: --values (8 bytes)
  28: /tmp/values397940723 (20 bytes)
  29: --values (8 bytes)
  30: /tmp/values223566006 (20 bytes)
  31: --set (5 bytes)
  32: php.servergroup=kube-mw-parsoid (31 bytes)
  33: --history-max (13 bytes)
  34: 10 (2 bytes)
  35: --kubeconfig (12 bytes)
  36: /etc/kubernetes/mw-parsoid-deploy-eqiad.config (46 bytes)

ERROR:
  exit status 1

EXIT STATUS
  1

STDERR:
  WARNING: Kubernetes configuration file is group-readable. This is insecure. Location: /etc/kubernetes/mw-parsoid-deploy-eqiad.config
  Error: UPGRADE FAILED: release main failed, and has been rolled back due to atomic being set: timed out waiting for the condition

COMBINED OUTPUT:
  WARNING: Kubernetes configuration file is group-readable. This is insecure. Location: /etc/kubernetes/mw-parsoid-deploy-eqiad.config
  Error: UPGRADE FAILED: release main failed, and has been rolled back due to atomic being set: timed out waiting for the condition

21:46:41 Finished Running helmfile -e eqiad --selector name=main apply in /srv/deployment-charts/helmfile.d/services/mw-parsoid (duration: 10m 15s)
21:46:41 K8s deployment to stage production failed: K8s deployment had the following errors:
 codfw: Deployment of mw-jobrunner-main failed: Command '['helmfile', '-e', 'codfw', '--selector', 'name=main', 'apply']' returned non-zero exit status 1.
Deployment of mw-api-int-main failed: Command '['helmfile', '-e', 'codfw', '--selector', 'name=main', 'apply']' returned non-zero exit status 1.
Deployment of mw-parsoid-main failed: Command '['helmfile', '-e', 'codfw', '--selector', 'name=main', 'apply']' returned non-zero exit status 1.
Deployment of mw-wikifunctions-main failed: Command '['helmfile', '-e', 'codfw', '--selector', 'name=main', 'apply']' returned non-zero exit status 1.
Deployment of mw-api-ext-main failed: Command '['helmfile', '-e', 'codfw', '--selector', 'name=main', 'apply']' returned non-zero exit status 1.
Deployment of mw-web-main failed: Command '['helmfile', '-e', 'codfw', '--selector', 'name=main', 'apply']' returned non-zero exit status 1.
eqiad: Deployment of mw-jobrunner-main failed: Command '['helmfile', '-e', 'eqiad', '--selector', 'name=main', 'apply']' returned non-zero exit status 1.
Deployment of mw-wikifunctions-main failed: Command '['helmfile', '-e', 'eqiad', '--selector', 'name=main', 'apply']' returned non-zero exit status 1.
Deployment of mw-api-int-main failed: Command '['helmfile', '-e', 'eqiad', '--selector', 'name=main', 'apply']' returned non-zero exit status 1.
Deployment of mw-web-main failed: Command '['helmfile', '-e', 'eqiad', '--selector', 'name=main', 'apply']' returned non-zero exit status 1.
Deployment of mw-api-ext-main failed: Command '['helmfile', '-e', 'eqiad', '--selector', 'name=main', 'apply']' returned non-zero exit status 1.
Deployment of mw-parsoid-main failed: Command '['helmfile', '-e', 'eqiad', '--selector', 'name=main', 'apply']' returned non-zero exit status 1.
21:46:41 Rolling back to prior state...
21:46:41 Started Running helmfile -e codfw --selector name=main apply in /srv/deployment-charts/helmfile.d/services/mw-jobrunner
21:46:41 Started Running helmfile -e codfw --selector name=main apply in /srv/deployment-charts/helmfile.d/services/mw-web
21:46:41 Started Running helmfile -e codfw --selector name=main apply in /srv/deployment-charts/helmfile.d/services/mw-wikifunctions
21:46:42 Started Running helmfile -e codfw --selector name=main apply in /srv/deployment-charts/helmfile.d/services/mw-parsoid
21:46:42 Started Running helmfile -e codfw --selector name=main apply in /srv/deployment-charts/helmfile.d/services/mw-api-ext
21:46:42 Started Running helmfile -e eqiad --selector name=main apply in /srv/deployment-charts/helmfile.d/services/mw-jobrunner
21:46:42 Started Running helmfile -e codfw --selector name=main apply in /srv/deployment-charts/helmfile.d/services/mw-api-int
21:46:42 Started Running helmfile -e eqiad --selector name=main apply in /srv/deployment-charts/helmfile.d/services/mw-api-int
21:46:42 Started Running helmfile -e eqiad --selector name=main apply in /srv/deployment-charts/helmfile.d/services/mw-api-ext
21:46:42 Started Running helmfile -e eqiad --selector name=main apply in /srv/deployment-charts/helmfile.d/services/mw-parsoid
21:46:42 Started Running helmfile -e eqiad --selector name=main apply in /srv/deployment-charts/helmfile.d/services/mw-wikifunctions
21:46:42 Started Running helmfile -e eqiad --selector name=main apply in /srv/deployment-charts/helmfile.d/services/mw-web
21:46:44 Finished Running helmfile -e codfw --selector name=main apply in /srv/deployment-charts/helmfile.d/services/mw-wikifunctions (duration: 00m 02s)
21:46:44 Finished Running helmfile -e codfw --selector name=main apply in /srv/deployment-charts/helmfile.d/services/mw-jobrunner (duration: 00m 02s)
21:46:44 Finished Running helmfile -e codfw --selector name=main apply in /srv/deployment-charts/helmfile.d/services/mw-api-ext (duration: 00m 02s)
21:46:44 Finished Running helmfile -e codfw --selector name=main apply in /srv/deployment-charts/helmfile.d/services/mw-api-int (duration: 00m 02s)
21:46:44 Finished Running helmfile -e codfw --selector name=main apply in /srv/deployment-charts/helmfile.d/services/mw-web (duration: 00m 02s)
21:46:44 Finished Running helmfile -e codfw --selector name=main apply in /srv/deployment-charts/helmfile.d/services/mw-parsoid (duration: 00m 02s)
21:46:44 Finished Running helmfile -e eqiad --selector name=main apply in /srv/deployment-charts/helmfile.d/services/mw-jobrunner (duration: 00m 02s)
21:46:45 Finished Running helmfile -e eqiad --selector name=main apply in /srv/deployment-charts/helmfile.d/services/mw-api-int (duration: 00m 02s)
21:46:45 Finished Running helmfile -e eqiad --selector name=main apply in /srv/deployment-charts/helmfile.d/services/mw-parsoid (duration: 00m 02s)
21:46:45 Finished Running helmfile -e eqiad --selector name=main apply in /srv/deployment-charts/helmfile.d/services/mw-wikifunctions (duration: 00m 02s)
21:46:45 Finished Running helmfile -e eqiad --selector name=main apply in /srv/deployment-charts/helmfile.d/services/mw-api-ext (duration: 00m 03s)
21:46:45 Finished Running helmfile -e eqiad --selector name=main apply in /srv/deployment-charts/helmfile.d/services/mw-web (duration: 00m 02s)
21:46:45 Rollback completed
21:46:45 Finished sync-prod-k8s (duration: 10m 20s)
21:46:45 Started sync-proxies
21:46:52 sync-proxies: 100% (in-flight: 0; ok: 8; fail: 0; left: 0)
21:46:52 Per-host sync duration: average 5.3s, median 5.3s
21:46:52 rsync transfer: average 381,975 bytes/host, total 3,055,800 bytes
21:46:52 Finished sync-proxies (duration: 00m 06s)
21:46:52 Started sync-apaches
21:47:27 sync-apaches: 100% (in-flight: 0; ok: 209; fail: 0; left: 0)
21:47:27 Per-host sync duration: average 5.7s, median 4.9s
21:47:27 rsync transfer: average 381,975 bytes/host, total 79,832,775 bytes
21:47:27 Finished sync-apaches (duration: 00m 35s)
21:47:27 Started scap-cdb-rebuild
21:47:31 scap-cdb-rebuild: 100% (in-flight: 0; ok: 225; fail: 0; left: 0)
21:47:31 Finished scap-cdb-rebuild (duration: 00m 03s)
21:47:31 Started sync_wikiversions
21:47:34 sync_wikiversions: 100% (in-flight: 0; ok: 225; fail: 0; left: 0)
21:47:34 Finished sync_wikiversions (duration: 00m 03s)
21:47:34 Started php-fpm-restarts
21:47:34 Running '/usr/local/sbin/restart-php-fpm-all php7.4-fpm 9223372036854775807' on 183 host(s)
21:50:37 php-fpm-restart: 100% (in-flight: 0; ok: 183; fail: 0; left: 0)
21:50:37 Finished php-fpm-restarts (duration: 03m 02s)
21:50:37 Running /usr/local/bin/mwscript purgeMessageBlobStore.php
21:50:38 Finished scap: Backport for [[gerrit:994831|InitialiseSettings: Set wgSignatureValidation to disallow [enwiki] (T355462)]] (duration: 38m 34s)
21:50:38 backport failed: <CalledProcessError> Command '['/usr/bin/scap', 'sync-world', '--pause-after-testserver-sync', '--notify-user=houseblaster', 'Backport for [[gerrit:994831|InitialiseSettings: Set wgSignatureValidation to disallow [enwiki] (T355462)]]']' returned non-zero exit status 1.

I ended up aborting the backport window because of the errors above, which occurred while trying to deploy the first simple config change in the queue.

Event Timeline

cjming updated the task description.

@Clement_Goubert Looks like this wasn't just a one-off problem.

Mentioned in SAL (#wikimedia-operations) [2024-03-04T23:37:52Z] <dancy@deploy2002> Locking from deployment [mediawiki]: Mediawiki deployments locked pending resolution of T359114

dancy triaged this task as High priority. Mar 4 2024, 11:41 PM
dancy raised the priority of this task from High to Unbreak Now!. Mar 5 2024, 12:09 AM
dancy added a project: serviceops.

Tagging in serviceops to help take a look at this. Setting to UBN since we're pausing deploys pending investigation so k8s hosts won't further diverge from bare metal hosts.

I have mediawiki deployments locked until we know what's going on. You can unlock by killing my scap process on deploy2002. It is pid 3272.

Looking at logstash in the Kubernetes events dashboard and fiddling a bit with the filtering, I finally see:

0/145 nodes are available: 132 Insufficient cpu, 2 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 4 node(s) had taint {dedicated: kask}, that the pod didn't tolerate, 7 node(s) were unschedulable.

The above is for eqiad, but codfw didn't fare any better:

0/136 nodes are available: 1 node(s) were unschedulable, 129 Insufficient cpu, 2 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 4 node(s) had taint {dedicated: kask}, that the pod didn't tolerate.

Both of those are for mw-api-int (randomly picked).
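To compare scheduler messages like the two quoted above across clusters, they can be broken into per-reason counts. The following is a minimal sketch that assumes the message format is exactly as quoted; the function and constant names are made up for illustration and are not part of any Kubernetes tooling.

```python
import re
from collections import Counter

# "0/145 nodes are available" prefix of the scheduler message.
HEADER = re.compile(r"\s*(\d+)/(\d+) nodes are available")
# One "<count> <reason>" item; an item ends right before ", <count> " or at the end.
REASON = re.compile(r"(\d+) (.*?)(?=, \d+ |\.?\s*$)")


def parse_scheduling_message(message):
    """Return (available, total, reasons) for a 'nodes are available' message."""
    head, _, tail = message.partition(":")
    match = HEADER.match(head)
    if match is None:
        raise ValueError("not a 'nodes are available' scheduler message")
    available, total = int(match.group(1)), int(match.group(2))
    reasons = Counter()
    for count, reason in REASON.findall(tail.strip()):
        reasons[reason] += int(count)
    return available, total, reasons


# The eqiad message quoted above, copied verbatim.
EQIAD = (
    "0/145 nodes are available: 132 Insufficient cpu, "
    "2 node(s) had taint {node-role.kubernetes.io/master: }, "
    "that the pod didn't tolerate, "
    "4 node(s) had taint {dedicated: kask}, that the pod didn't tolerate, "
    "7 node(s) were unschedulable."
)

available, total, reasons = parse_scheduling_message(EQIAD)
for reason, count in reasons.most_common():
    print(f"{count:4d}  {reason}")
```

The per-reason counts add up to the node total, which is a quick sanity check that every node was rejected for exactly one listed reason.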

I've crafted this dashboard to visualize resources in our Kubernetes clusters:

https://grafana-rw.wikimedia.org/d/pz5A-vASz/kubernetes-resources?orgId=1&var-ds=thanos&var-site=codfw&var-prometheus=k8s&from=now-24h&to=now

It's still under the User Dashboards Grafana folder because there is a little more work to do.

image.png (376×1 px, 19 KB)

Apparently CPU requests went above the allocatable CPU. The timing doesn't exactly match (it's almost 1 hour off), but that may be because I haven't yet accounted for something in the dashboard.

I'll be adding some 200 CPUs today (already in progress); that should get us out of the woods.

The alternative is to scale down deployments a bit.
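As a rough sketch of the scale-down option: the CPU freed is the per-pod sum of container CPU requests multiplied by the number of replicas removed. The request values and replica counts below are hypothetical and only illustrate the arithmetic; they are not the real figures for any mw-* deployment.

```python
from decimal import Decimal


def cpu_to_millicores(quantity):
    """Convert a Kubernetes CPU quantity such as "200m" or "1" to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(Decimal(quantity) * 1000)


def cpu_freed_by_scale_down(container_requests, old_replicas, new_replicas):
    """Millicores of requested CPU released by lowering the replica count."""
    per_pod = sum(cpu_to_millicores(q) for q in container_requests)
    return per_pod * (old_replicas - new_replicas)


# Hypothetical per-container requests and replica counts, for illustration only.
EXAMPLE_REQUESTS = ["200m", "100m", "100m"]
freed = cpu_freed_by_scale_down(EXAMPLE_REQUESTS, old_replicas=10, new_replicas=8)
print(f"{freed / 1000} CPUs of requests freed")
```

Only requests matter here, since the scheduler places pods based on requests rather than limits.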

I've accounted for the cordoned nodes and indeed...

image.png (521×1 px, 23 KB)
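The effect of cordoned nodes on that comparison can be sketched as follows. This is a toy model with made-up node data, not how the Grafana dashboard is built: cordoned nodes contribute no schedulable capacity, and a pod is only schedulable if a single non-cordoned node has enough unrequested CPU for it.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    allocatable_mcpu: int   # allocatable CPU, in millicores
    requested_mcpu: int     # sum of CPU requests of pods already on the node
    cordoned: bool = False  # cordoned (unschedulable) nodes accept no new pods


def schedulable_free_mcpu(nodes):
    """Unrequested CPU on nodes that can still accept pods."""
    return sum(n.allocatable_mcpu - n.requested_mcpu for n in nodes if not n.cordoned)


def fits_somewhere(nodes, pod_request_mcpu):
    """True if at least one non-cordoned node has room for the pod's CPU request."""
    return any(
        not n.cordoned and n.allocatable_mcpu - n.requested_mcpu >= pod_request_mcpu
        for n in nodes
    )


# Hypothetical cluster: two nearly full nodes and one empty but cordoned node.
NODES = [
    Node("node-a", 4000, 3900),
    Node("node-b", 4000, 3800),
    Node("node-c", 4000, 0, cordoned=True),
]

print("free on schedulable nodes:", schedulable_free_mcpu(NODES), "mCPU")
print("500m pod fits:", fits_somewhere(NODES, 500))
```

Counting the cordoned node's capacity would make this cluster look roughly one third empty, even though no 500m pod can be placed.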

akosiaris claimed this task.

I've added another 220 CPUs for codfw and 300 for eqiad, so we should be good on this front. I'll resolve in the interest of sparing someone else from doing so; feel free to reopen.

Ugh, and I even tested that the deployments worked after upping the number of replicas for T357508: Move 60% of mediawiki external requests to mw on k8s.
We have a bigger-than-necessary mw-api-int deployment at the moment due to T356497: Raise mw-api-int replicas for increased load from mobileapps for T339865: PCS should use parsoid endpoints in MediaWiki, not RESTbase.

I can scale it back a bit, and prepare a future patch in case load from PCS is bigger than expected. That'll give us even more breathing room.

Thanks for the help everyone.