During the UTC late backport window on March 4, 2024, running scap backport on the first patch got stuck while deploying to the test servers. My terminal reported Helm rollbacks, and the K8s deployments failed:
21:35:39 Finished Running helmfile -e eqiad --selector name=canary apply in /srv/deployment-charts/helmfile.d/services/mw-parsoid (duration: 10m 12s)
21:35:39 K8s deployment to stage canaries failed: K8s deployment had the following errors:
codfw: Deployment of mw-jobrunner-canary failed: Command '['helmfile', '-e', 'codfw', '--selector', 'name=canary', 'apply']' returned non-zero exit status 1.
Deployment of mw-api-ext-canary failed: Command '['helmfile', '-e', 'codfw', '--selector', 'name=canary', 'apply']' returned non-zero exit status 1.
Deployment of mw-api-int-canary failed: Command '['helmfile', '-e', 'codfw', '--selector', 'name=canary', 'apply']' returned non-zero exit status 1.
Deployment of mw-parsoid-canary failed: Command '['helmfile', '-e', 'codfw', '--selector', 'name=canary', 'apply']' returned non-zero exit status 1.
Deployment of mw-web-canary failed: Command '['helmfile', '-e', 'codfw', '--selector', 'name=canary', 'apply']' returned non-zero exit status 1.
eqiad: Deployment of mw-jobrunner-canary failed: Command '['helmfile', '-e', 'eqiad', '--selector', 'name=canary', 'apply']' returned non-zero exit status 1.
Deployment of mw-web-canary failed: Command '['helmfile', '-e', 'eqiad', '--selector', 'name=canary', 'apply']' returned non-zero exit status 1.
Deployment of mw-api-int-canary failed: Command '['helmfile', '-e', 'eqiad', '--selector', 'name=canary', 'apply']' returned non-zero exit status 1.
Deployment of mw-api-ext-canary failed: Command '['helmfile', '-e', 'eqiad', '--selector', 'name=canary', 'apply']' returned non-zero exit status 1.
Deployment of mw-parsoid-canary failed: Command '['helmfile', '-e', 'eqiad', '--selector', 'name=canary', 'apply']' returned non-zero exit status 1.
21:35:39 Rolling back to prior state...
I then tried to sync and got the following errors in my terminal while running scap backport on deploy2002.codfw.wmnet:
- "/bin/sh"
- "-c"
- "sleep 8"
- name: mediawiki-main-php-fpm-exporter
image: docker-registry.discovery.wmnet/prometheus-php-fpm-exporter:0.0.3
imagePullPolicy: IfNotPresent
args: ["--endpoint=http://127.0.0.1:9181/fpm-status", "--addr=0.0.0.0:9118"]
ports:
- name: fpm-metrics
containerPort: 9118
livenessProbe:
tcpSocket:
port: 9118
resources:
requests:
cpu: 100m
memory: 50Mi
limits:
cpu: 500m
memory: 200Mi
lifecycle:
preStop:
exec:
command:
- "/bin/sh"
- "-c"
- "sleep 8"
# TODO: understand how to make mcrouter use the
# application CA when connecting to memcached via TLS
- name: mediawiki-main-mcrouter
image: docker-registry.discovery.wmnet/mcrouter:0.41.0-4-20231022
imagePullPolicy: IfNotPresent
env:
- name: PORT
value: "11213"
- name: CONFIG
value: "file:/etc/mcrouter/config.json"
- name: ROUTE_PREFIX
value: "eqiad/mw"
- name: CROSS_REGION_TO
value: "250"
- name: CROSS_CLUSTER_TO
value: "100"
- name: NUM_PROXIES
value: "5"
- name: PROBE_TIMEOUT
value: "60000"
- name: TIMEOUTS_UNTIL_TKO
value: "3"
# We don't want to listen to TLS here.
# TODO: check if it can connect with TLS without the TLS settings.
- name: USE_SSL
value: "no"
ports:
# Please note: this port is not exposed outside of the pod.
- name: mcrouter
containerPort: 11213
livenessProbe:
tcpSocket:
port: mcrouter
readinessProbe:
exec:
command:
- /bin/healthz
lifecycle:
preStop:
exec:
command:
- "/bin/sh"
- "-c"
- "sleep 8"
volumeMounts:
- name: mediawiki-main-mcrouter
mountPath: /etc/mcrouter
resources:
requests:
cpu: 200m
memory: 100Mi
limits:
cpu: 700m
memory: 200Mi
- name: mediawiki-main-mcrouter-exporter
image: docker-registry.discovery.wmnet/prometheus-mcrouter-exporter:0.0.1-2-20211010
imagePullPolicy: IfNotPresent
args: ["--mcrouter.address", "127.0.0.1:11213", "-mcrouter.server_metrics", "-web.listen-address", ":9151" ]
ports:
# Port names are limited to 15 characters.
- name: mcr-metrics
containerPort: 9151
livenessProbe:
tcpSocket:
port: mcr-metrics
resources:
requests:
cpu: 100m
memory: 50Mi
limits:
cpu: 500m
memory: 200Mi
lifecycle:
preStop:
exec:
command:
- "/bin/sh"
- "-c"
- "sleep 8"
- name: mediawiki-main-tls-proxy
image: docker-registry.discovery.wmnet/envoy:1.23.10-2-s4-20231203
imagePullPolicy: IfNotPresent
env:
- name: SERVICE_NAME
value: main
- name: SERVICE_ZONE
value: "default"
- name: CONCURRENCY
value: "12"
- name: ADMIN_PORT
value: "1666"
- name: DRAIN_TIME_S
value: "600"
- name: DRAIN_STRATEGY
value: gradual
ports:
- containerPort: 4447
readinessProbe:
httpGet:
path: /healthz
port: 9361
volumeMounts:
- name: envoy-config-volume
mountPath: /etc/envoy/
readOnly: true
- name: tls-certs-volume
mountPath: /etc/envoy/ssl
readOnly: true
lifecycle:
preStop:
exec:
command:
- "/bin/sh"
- "-c"
- "sleep 7"
resources:
limits:
memory: 350Mi
requests:
cpu: 200m
memory: 100Mi
- name: mediawiki-main-rsyslog
image: docker-registry.discovery.wmnet/rsyslog:8.2102.0-3
imagePullPolicy: IfNotPresent
env:
- name: KUBERNETES_NAMESPACE
valueFrom:
fieldRef:
fieldPath: metadata.namespace
- name: KUBERNETES_NODE
valueFrom:
fieldRef:
fieldPath: spec.nodeName
- name: KUBERNETES_POD_NAME
valueFrom:
fieldRef:
fieldPath: metadata.name
- name: KUBERNETES_RELEASE
valueFrom:
fieldRef:
fieldPath: metadata.labels['release']
- name: KUBERNETES_DEPLOYMENT
valueFrom:
fieldRef:
fieldPath: metadata.labels['deployment']
lifecycle:
preStop:
exec:
command:
- "/bin/sh"
- "-c"
- "sleep 8"
resources:
requests:
cpu: 100m
memory: 200Mi
limits:
cpu: 1
memory: 300Mi
volumeMounts:
# Mount the shared socket volume
- name: mediawiki-main-rsyslog-config
mountPath: /etc/rsyslog.d
- name: php-logging
mountPath: /var/log/php-fpm
# We need to read files created by www-data and with mode
# 0600
securityContext:
runAsUser: 33
- name: statsd-exporter
image: docker-registry.discovery.wmnet/prometheus-statsd-exporter:0.0.10-20231022
imagePullPolicy: IfNotPresent
lifecycle:
preStop:
exec:
command:
- "/bin/sh"
- "-c"
- "sleep 8"
resources:
requests:
cpu: 100m
memory: 100M
limits:
cpu: 200m
memory: 200M
volumeMounts:
- name: statsd-config
mountPath: /etc/monitoring
readOnly: true
ports:
- name: statsd-metrics
containerPort: 9102
livenessProbe:
tcpSocket:
port: statsd-metrics
volumes:
# Apache sites
- name: mediawiki-main-httpd-sites
configMap:
name: mediawiki-main-httpd-sites-config
# Additional httpd debug configuration
- name: mediawiki-main-httpd-early
configMap:
name: mediawiki-main-httpd-early-config
# Datacenter
- name: mediawiki-main-wikimedia-cluster
configMap:
name: mediawiki-main-wikimedia-cluster-config
# sendmail configuration
- name: mediawiki-main-mail
configMap:
name: mediawiki-main-mail-config
# TLS configurations
- name: envoy-config-volume
configMap:
name: mediawiki-main-envoy-config-volume
- name: tls-certs-volume
secret:
secretName: mediawiki-main-tls-proxy-certs
# Shared unix socket for php apps
- name: shared-socket
emptyDir: {}
- name: mediawiki-main-wmerrors
configMap:
name: mediawiki-main-wmerrors
# Mcrouter configuration if any
# Mcrouter configuration
- name: mediawiki-main-mcrouter
configMap:
name: mediawiki-main-mcrouter-config
- name: mediawiki-main-rsyslog-config
configMap:
name: mediawiki-main-rsyslog-config
- name: php-logging
emptyDir: {}
# GeoIP data
- name: mediawiki-main-geoip
hostPath:
path: /usr/share/GeoIP
- name: mediawiki-main-geoipinfo
hostPath:
path: /usr/share/GeoIPInfo
- name: statsd-config
configMap:
name: mediawiki-main-statsd-configmap
skipping missing values file matching "values-main.yaml"
Upgrading release=main, chart=wmf-stable/mediawiki
FAILED RELEASES:
NAME
main
helmfile.yaml: basePath=.
in ./helmfile.yaml: failed processing release main: command "/usr/bin/helm3" exited with non-zero status:
PATH:
/usr/bin/helm3
ARGS:
0: helm3 (5 bytes)
1: upgrade (7 bytes)
2: --install (9 bytes)
3: --reset-values (14 bytes)
4: main (4 bytes)
5: wmf-stable/mediawiki (20 bytes)
6: --timeout (9 bytes)
7: 600s (4 bytes)
8: --atomic (8 bytes)
9: --namespace (11 bytes)
10: mw-api-ext (10 bytes)
11: --values (8 bytes)
12: /tmp/values568600304 (20 bytes)
13: --values (8 bytes)
14: /tmp/values225672591 (20 bytes)
15: --values (8 bytes)
16: /tmp/values217764258 (20 bytes)
17: --values (8 bytes)
18: /tmp/values300505753 (20 bytes)
19: --values (8 bytes)
20: /tmp/values705690404 (20 bytes)
21: --values (8 bytes)
22: /tmp/values972405299 (20 bytes)
23: --values (8 bytes)
24: /tmp/values533020150 (20 bytes)
25: --values (8 bytes)
26: /tmp/values099846365 (20 bytes)
27: --values (8 bytes)
28: /tmp/values399121048 (20 bytes)
29: --values (8 bytes)
30: /tmp/values643515671 (20 bytes)
31: --values (8 bytes)
32: /tmp/values480956810 (20 bytes)
33: --set (5 bytes)
34: php.servergroup=kube-mw-api-ext (31 bytes)
35: --history-max (13 bytes)
36: 10 (2 bytes)
37: --kubeconfig (12 bytes)
38: /etc/kubernetes/mw-api-ext-deploy-eqiad.config (46 bytes)
ERROR:
exit status 1
EXIT STATUS
1
STDERR:
WARNING: Kubernetes configuration file is group-readable. This is insecure. Location: /etc/kubernetes/mw-api-ext-deploy-eqiad.config
Error: UPGRADE FAILED: release main failed, and has been rolled back due to atomic being set: timed out waiting for the condition
COMBINED OUTPUT:
WARNING: Kubernetes configuration file is group-readable. This is insecure. Location: /etc/kubernetes/mw-api-ext-deploy-eqiad.config
Error: UPGRADE FAILED: release main failed, and has been rolled back due to atomic being set: timed out waiting for the condition
21:46:40 Finished Running helmfile -e eqiad --selector name=main apply in /srv/deployment-charts/helmfile.d/services/mw-api-ext (duration: 10m 14s)
21:46:41 Command '['helmfile', '-e', 'eqiad', '--selector', 'name=main', 'apply']' returned non-zero exit status 1.
21:46:41 Stdout/stderr follows:
21:46:41 helmfile.yaml: basePath=.
skipping missing values file matching "/etc/helmfile-defaults/private/main_services/mw-parsoid/eqiad.yaml"
skipping missing values file matching "values-main.yaml"
Comparing release=main, chart=wmf-stable/mediawiki
mw-parsoid, mw-parsoid.eqiad.main, Deployment (apps) has changed:
# Source: mediawiki/templates/deployment.yaml.tpl
apiVersion: apps/v1
kind: Deployment
metadata:
name: mw-parsoid.eqiad.main
labels:
app: mediawiki
chart: mediawiki-0.6.20
release: main
heritage: Helm
spec:
selector:
matchLabels:
app: mediawiki
release: main
replicas: 33
strategy:
rollingUpdate:
maxSurge: 10%
maxUnavailable: 10%
type: RollingUpdate
template:
metadata:
labels:
app: mediawiki
release: main
deployment: mw-parsoid
routed_via: main
annotations:
checksum/sites: 797f6a05f681fdc18612dd35d27a01257cd6ca744bd52dffd1ef2925469359d2
checksum/rsyslog: 7d7443ef8588edd1f9f2f08939c56be288d0b687d759209056c89cdcb1d5936d
prometheus.io/scrape_by_name: "true"
checksum/prometheus-statsd: 1b90efc073a70aa3d97b1f2280969873e2dbc1a35c735da94580288291fb7214
checksum/tls-config: 03480fc8ed9a428202856c734f64eb185080b05a86c056c8bb258dfff704de87
envoyproxy.io/scrape: "true"
envoyproxy.io/port: "9361"
spec:
# TODO: add affinity rules to ensure even distribution across rows
nodeSelector:
node.kubernetes.io/disk-type: ssd
terminationGracePeriodSeconds: 10
containers:
### The apache httpd container
# TODO: set up logging. See T265876
# TODO: fix virtualhosts in puppet so that the port is set to APACHE_RUN_PORT
- name: mediawiki-main-httpd
- image: docker-registry.discovery.wmnet/restricted/mediawiki-webserver:2024-02-29-215143-webserver
+ image: docker-registry.discovery.wmnet/restricted/mediawiki-webserver:2024-03-04-211219-webserver
imagePullPolicy: IfNotPresent
env:
- name: FCGI_MODE
value: FCGI_UNIX
- name: SERVERGROUP
value: kube-mw-parsoid
- name: APACHE_RUN_PORT
value: "8080"
# Set the pod name as the value of the Server: header.
- name: SERVER_SIGNATURE
valueFrom:
fieldRef:
fieldPath: metadata.name
# Log in ecs format
- name: LOG_FORMAT
value: ecs_rsyslog
# Do not log monitoring requests
- name: LOG_SKIP_SYSTEM
value: "1"
ports:
- name: httpd
containerPort: 8080
# PHP monitoring port
- name: php-metrics
containerPort: 9181
livenessProbe:
tcpSocket:
port: httpd
readinessProbe:
httpGet:
# this is the simplest php script you can think of - it just returns OK.
# This way, we're just testing that apache + php-fpm are ready.
# mcrouter, if enabled, should have its own readiness probe probably.
path: /healthz
port: php-metrics
lifecycle:
preStop:
exec:
command:
- "/bin/sh"
- "-c"
- "sleep 8"
resources:
requests:
cpu: 200m
memory: 200Mi
limits:
cpu: 500m
memory: 400Mi
volumeMounts:
# Mount the shared socket volume
- name: shared-socket
mountPath: /run/shared
# Note: we use subpaths here. Given subpaths are implemented with bind mounts,
# they won't be updated when the configmap is updated.
# This is ok because we're re-deploying the pods when that happens.
- name: mediawiki-main-httpd-sites
mountPath: /etc/apache2/sites-enabled/02-redirects.conf
subPath: 02-redirects.conf
- name: mediawiki-main-httpd-sites
mountPath: /etc/apache2/sites-enabled/03-main.conf
subPath: 03-main.conf
- name: mediawiki-main-httpd-sites
mountPath: /etc/apache2/sites-enabled/04-remnant.conf
subPath: 04-remnant.conf
- name: mediawiki-main-httpd-sites
mountPath: /etc/apache2/sites-enabled/06-secure.wikimedia.conf
subPath: 06-secure.wikimedia.conf
- name: mediawiki-main-httpd-sites
mountPath: /etc/apache2/sites-enabled/07-wikimania.conf
subPath: 07-wikimania.conf
- name: mediawiki-main-httpd-sites
mountPath: /etc/apache2/sites-enabled/08-foundation.conf
subPath: 08-foundation.conf
- name: mediawiki-main-httpd-sites
mountPath: /etc/apache2/sites-enabled/09-wikimedia.conf
subPath: 09-wikimedia.conf
- name: mediawiki-main-httpd-sites
mountPath: /etc/apache2/sites-enabled/10-www.wikimedia.org.conf
subPath: 10-www.wikimedia.org.conf
- name: mediawiki-main-httpd-sites
mountPath: /etc/apache2/sites-enabled/00-nonexistent.conf
subPath: 00-nonexistent.conf
- name: mediawiki-main-httpd-sites
mountPath: /etc/apache2/sites-enabled/01-wwwportals.conf
subPath: 01-wwwportals.conf
# Allow us to inject configurations *before* everything else is evaluated
# To this end we also pick a non-descriptive name that OTOH guarantees
# the configuration will be loaded soon.
# See apache.conf in the mediawiki-httpd image to see precisely when this is loaded.
- name: mediawiki-main-httpd-early
mountPath: /etc/apache2/conf-enabled/00-aaa.conf
subPath: 00-aaa.conf
### The MediaWiki container
- name: mediawiki-main-app
- image: docker-registry.discovery.wmnet/restricted/mediawiki-multiversion:2024-02-29-215132-publish
+ image: docker-registry.discovery.wmnet/restricted/mediawiki-multiversion:2024-03-04-211208-publish
imagePullPolicy: IfNotPresent
securityContext:
capabilities:
add: ["SYS_PTRACE"] # This is needed to produce a slow log
lifecycle:
preStop:
exec:
command:
- "/bin/sh"
- "-c"
- "sleep 8"
env:
- name: SERVERGROUP
value: kube-mw-parsoid
- name: FCGI_MODE
value: FCGI_UNIX
- name: FCGI_URL
value: "unix:///run/shared/fpm-www.sock"
- name: PHP__opcache__memory_consumption
value: "500"
- name: PHP__opcache__max_accelerated_files
value: "32531"
- name: PHP__opcache__interned_strings_buffer
value: "50"
- name: PHP__auto_prepend_file
value: "/srv/mediawiki/wmf-config/PhpAutoPrepend.php"
- name: FPM__request_terminate_timeout
value: "60"
- name: PHP__apc__shm_size
value: 768M
- name: FPM__pm__max_children
value: "8"
- name: FPM__request_slowlog_timeout
value: "5"
- name: PHP__display_errors
value: "Off"
- name: PHP__error_reporting
value: "30719"
- name: PHP__pcre__backtrack_limit
value: "5000000"
- name: PHP__max_execution_time
value: "210"
- name: PHP__error_log
value: /var/log/php-fpm/error.log
- name: FCGI_ALLOW
value: "127.0.0.1"
- name: FPM__slowlog
value: /var/log/php-fpm/slowlog.log
# See T276908
livenessProbe:
exec:
command:
- /usr/bin/test
- -S
- /run/shared/fpm-www.sock
initialDelaySeconds: 1
periodSeconds: 5
resources:
requests:
# CPU calculation:
# Minimum 1 whole CPU
# Multiply the amount of cpu_per_worker (float, unit: cpu, ex: 0.5 is half a CPU per worker)
# by the number of configured workers + 1 (to take into account the main php-fpm process)
cpu: 4.5
# RAM calculation:
# Multiply 50% of the amount of memory_per_worker by the number of workers (ignoring the main php-fpm process)
# Add 50% of the opcache size and the apc size (close to the average real consumption)
memory: 2634Mi
limits:
# RAM calculation:
# Multiply the amount of memory_per_worker by the number of workers (ignoring the main php-fpm process)
# Add 50% of the opcache size and the apc size (close to the average real consumption)
memory: 4634Mi
volumeMounts:
# TODO: use an env variable for this.
- name: mediawiki-main-wikimedia-cluster
mountPath: /etc/wikimedia-cluster
subPath: wikimedia-cluster
# mount the volume as the www-data home. This allows us to change
# the configuration of this configmap without causing a mediawiki
# redeployment as well, which we'd need if we were using a subpath.
- name: mediawiki-main-mail
mountPath: /var/www
# Mount the shared socket volume
- name: shared-socket
mountPath: /run/shared
# php-wmerrors configuration
- name: mediawiki-main-wmerrors
mountPath: /etc/wmerrors
- name: php-logging
mountPath: /var/log/php-fpm
# GeoIP data
- name: mediawiki-main-geoip
mountPath: /usr/share/GeoIP/
readOnly: true
- name: mediawiki-main-geoipinfo
mountPath: /usr/share/GeoIPInfo/
readOnly: true
# Add the following exporters:
# php-fpm exporter
# apache exporter on port 9117
- name: mediawiki-main-httpd-exporter
image: docker-registry.discovery.wmnet/prometheus-apache-exporter:0.0.3
imagePullPolicy: IfNotPresent
args: ["-scrape_uri", "http://127.0.0.1:9181/server-status?auto"]
ports:
- name: httpd-metrics
containerPort: 9117
livenessProbe:
tcpSocket:
port: 9117
resources:
requests:
cpu: 100m
memory: 50Mi
limits:
cpu: 500m
memory: 200Mi
lifecycle:
preStop:
exec:
command:
- "/bin/sh"
- "-c"
- "sleep 8"
- name: mediawiki-main-php-fpm-exporter
image: docker-registry.discovery.wmnet/prometheus-php-fpm-exporter:0.0.3
imagePullPolicy: IfNotPresent
args: ["--endpoint=http://127.0.0.1:9181/fpm-status", "--addr=0.0.0.0:9118"]
ports:
- name: fpm-metrics
containerPort: 9118
livenessProbe:
tcpSocket:
port: 9118
resources:
requests:
cpu: 100m
memory: 50Mi
limits:
cpu: 500m
memory: 200Mi
lifecycle:
preStop:
exec:
command:
- "/bin/sh"
- "-c"
- "sleep 8"
# TODO: understand how to make mcrouter use the
# application CA when connecting to memcached via TLS
- name: mediawiki-main-mcrouter
image: docker-registry.discovery.wmnet/mcrouter:0.41.0-4-20231022
imagePullPolicy: IfNotPresent
env:
- name: PORT
value: "11213"
- name: CONFIG
value: "file:/etc/mcrouter/config.json"
- name: ROUTE_PREFIX
value: "eqiad/mw"
- name: CROSS_REGION_TO
value: "250"
- name: CROSS_CLUSTER_TO
value: "100"
- name: NUM_PROXIES
value: "5"
- name: PROBE_TIMEOUT
value: "60000"
- name: TIMEOUTS_UNTIL_TKO
value: "3"
# We don't want to listen to TLS here.
# TODO: check if it can connect with TLS without the TLS settings.
- name: USE_SSL
value: "no"
ports:
# Please note: this port is not exposed outside of the pod.
- name: mcrouter
containerPort: 11213
livenessProbe:
tcpSocket:
port: mcrouter
readinessProbe:
exec:
command:
- /bin/healthz
lifecycle:
preStop:
exec:
command:
- "/bin/sh"
- "-c"
- "sleep 8"
volumeMounts:
- name: mediawiki-main-mcrouter
mountPath: /etc/mcrouter
resources:
requests:
cpu: 200m
memory: 100Mi
limits:
cpu: 700m
memory: 200Mi
- name: mediawiki-main-mcrouter-exporter
image: docker-registry.discovery.wmnet/prometheus-mcrouter-exporter:0.0.1-2-20211010
imagePullPolicy: IfNotPresent
args: ["--mcrouter.address", "127.0.0.1:11213", "-mcrouter.server_metrics", "-web.listen-address", ":9151" ]
ports:
# Port names are limited to 15 characters.
- name: mcr-metrics
containerPort: 9151
livenessProbe:
tcpSocket:
port: mcr-metrics
resources:
requests:
cpu: 100m
memory: 50Mi
limits:
cpu: 500m
memory: 200Mi
lifecycle:
preStop:
exec:
command:
- "/bin/sh"
- "-c"
- "sleep 8"
- name: mediawiki-main-tls-proxy
image: docker-registry.discovery.wmnet/envoy:1.23.10-2-s4-20231203
imagePullPolicy: IfNotPresent
env:
- name: SERVICE_NAME
value: main
- name: SERVICE_ZONE
value: "default"
- name: CONCURRENCY
value: "12"
- name: ADMIN_PORT
value: "1666"
- name: DRAIN_TIME_S
value: "600"
- name: DRAIN_STRATEGY
value: gradual
ports:
- containerPort: 4452
readinessProbe:
httpGet:
path: /healthz
port: 9361
volumeMounts:
- name: envoy-config-volume
mountPath: /etc/envoy/
readOnly: true
- name: tls-certs-volume
mountPath: /etc/envoy/ssl
readOnly: true
lifecycle:
preStop:
exec:
command:
- "/bin/sh"
- "-c"
- "sleep 7"
resources:
limits:
memory: 350Mi
requests:
cpu: 200m
memory: 100Mi
- name: mediawiki-main-rsyslog
image: docker-registry.discovery.wmnet/rsyslog:8.2102.0-3
imagePullPolicy: IfNotPresent
env:
- name: KUBERNETES_NAMESPACE
valueFrom:
fieldRef:
fieldPath: metadata.namespace
- name: KUBERNETES_NODE
valueFrom:
fieldRef:
fieldPath: spec.nodeName
- name: KUBERNETES_POD_NAME
valueFrom:
fieldRef:
fieldPath: metadata.name
- name: KUBERNETES_RELEASE
valueFrom:
fieldRef:
fieldPath: metadata.labels['release']
- name: KUBERNETES_DEPLOYMENT
valueFrom:
fieldRef:
fieldPath: metadata.labels['deployment']
lifecycle:
preStop:
exec:
command:
- "/bin/sh"
- "-c"
- "sleep 8"
resources:
requests:
cpu: 100m
memory: 200Mi
limits:
cpu: 1
memory: 300Mi
volumeMounts:
# Mount the shared socket volume
- name: mediawiki-main-rsyslog-config
mountPath: /etc/rsyslog.d
- name: php-logging
mountPath: /var/log/php-fpm
# We need to read files created by www-data and with mode
# 0600
securityContext:
runAsUser: 33
- name: statsd-exporter
image: docker-registry.discovery.wmnet/prometheus-statsd-exporter:0.0.10-20231022
imagePullPolicy: IfNotPresent
lifecycle:
preStop:
exec:
command:
- "/bin/sh"
- "-c"
- "sleep 8"
resources:
requests:
cpu: 100m
memory: 100M
limits:
cpu: 200m
memory: 200M
volumeMounts:
- name: statsd-config
mountPath: /etc/monitoring
readOnly: true
ports:
- name: statsd-metrics
containerPort: 9102
livenessProbe:
tcpSocket:
port: statsd-metrics
volumes:
# Apache sites
- name: mediawiki-main-httpd-sites
configMap:
name: mediawiki-main-httpd-sites-config
# Additional httpd debug configuration
- name: mediawiki-main-httpd-early
configMap:
name: mediawiki-main-httpd-early-config
# Datacenter
- name: mediawiki-main-wikimedia-cluster
configMap:
name: mediawiki-main-wikimedia-cluster-config
# sendmail configuration
- name: mediawiki-main-mail
configMap:
name: mediawiki-main-mail-config
# TLS configurations
- name: envoy-config-volume
configMap:
name: mediawiki-main-envoy-config-volume
- name: tls-certs-volume
secret:
secretName: mediawiki-main-tls-proxy-certs
# Shared unix socket for php apps
- name: shared-socket
emptyDir: {}
- name: mediawiki-main-wmerrors
configMap:
name: mediawiki-main-wmerrors
# Mcrouter configuration if any
# Mcrouter configuration
- name: mediawiki-main-mcrouter
configMap:
name: mediawiki-main-mcrouter-config
- name: mediawiki-main-rsyslog-config
configMap:
name: mediawiki-main-rsyslog-config
- name: php-logging
emptyDir: {}
# GeoIP data
- name: mediawiki-main-geoip
hostPath:
path: /usr/share/GeoIP
- name: mediawiki-main-geoipinfo
hostPath:
path: /usr/share/GeoIPInfo
- name: statsd-config
configMap:
name: mediawiki-main-statsd-configmap
skipping missing values file matching "/etc/helmfile-defaults/private/main_services/mw-parsoid/eqiad.yaml"
skipping missing values file matching "values-main.yaml"
Upgrading release=main, chart=wmf-stable/mediawiki
FAILED RELEASES:
NAME
main
helmfile.yaml: basePath=.
in ./helmfile.yaml: failed processing release main: command "/usr/bin/helm3" exited with non-zero status:
PATH:
/usr/bin/helm3
ARGS:
0: helm3 (5 bytes)
1: upgrade (7 bytes)
2: --install (9 bytes)
3: --reset-values (14 bytes)
4: main (4 bytes)
5: wmf-stable/mediawiki (20 bytes)
6: --timeout (9 bytes)
7: 600s (4 bytes)
8: --atomic (8 bytes)
9: --namespace (11 bytes)
10: mw-parsoid (10 bytes)
11: --values (8 bytes)
12: /tmp/values618884843 (20 bytes)
13: --values (8 bytes)
14: /tmp/values401253198 (20 bytes)
15: --values (8 bytes)
16: /tmp/values183007317 (20 bytes)
17: --values (8 bytes)
18: /tmp/values921299376 (20 bytes)
19: --values (8 bytes)
20: /tmp/values029269839 (20 bytes)
21: --values (8 bytes)
22: /tmp/values952661090 (20 bytes)
23: --values (8 bytes)
24: /tmp/values893283417 (20 bytes)
25: --values (8 bytes)
26: /tmp/values685514724 (20 bytes)
27: --values (8 bytes)
28: /tmp/values397940723 (20 bytes)
29: --values (8 bytes)
30: /tmp/values223566006 (20 bytes)
31: --set (5 bytes)
32: php.servergroup=kube-mw-parsoid (31 bytes)
33: --history-max (13 bytes)
34: 10 (2 bytes)
35: --kubeconfig (12 bytes)
36: /etc/kubernetes/mw-parsoid-deploy-eqiad.config (46 bytes)
ERROR:
exit status 1
EXIT STATUS
1
STDERR:
WARNING: Kubernetes configuration file is group-readable. This is insecure. Location: /etc/kubernetes/mw-parsoid-deploy-eqiad.config
Error: UPGRADE FAILED: release main failed, and has been rolled back due to atomic being set: timed out waiting for the condition
COMBINED OUTPUT:
WARNING: Kubernetes configuration file is group-readable. This is insecure. Location: /etc/kubernetes/mw-parsoid-deploy-eqiad.config
Error: UPGRADE FAILED: release main failed, and has been rolled back due to atomic being set: timed out waiting for the condition
21:46:41 Finished Running helmfile -e eqiad --selector name=main apply in /srv/deployment-charts/helmfile.d/services/mw-parsoid (duration: 10m 15s)
21:46:41 K8s deployment to stage production failed: K8s deployment had the following errors:
codfw: Deployment of mw-jobrunner-main failed: Command '['helmfile', '-e', 'codfw', '--selector', 'name=main', 'apply']' returned non-zero exit status 1.
Deployment of mw-api-int-main failed: Command '['helmfile', '-e', 'codfw', '--selector', 'name=main', 'apply']' returned non-zero exit status 1.
Deployment of mw-parsoid-main failed: Command '['helmfile', '-e', 'codfw', '--selector', 'name=main', 'apply']' returned non-zero exit status 1.
Deployment of mw-wikifunctions-main failed: Command '['helmfile', '-e', 'codfw', '--selector', 'name=main', 'apply']' returned non-zero exit status 1.
Deployment of mw-api-ext-main failed: Command '['helmfile', '-e', 'codfw', '--selector', 'name=main', 'apply']' returned non-zero exit status 1.
Deployment of mw-web-main failed: Command '['helmfile', '-e', 'codfw', '--selector', 'name=main', 'apply']' returned non-zero exit status 1.
eqiad: Deployment of mw-jobrunner-main failed: Command '['helmfile', '-e', 'eqiad', '--selector', 'name=main', 'apply']' returned non-zero exit status 1.
Deployment of mw-wikifunctions-main failed: Command '['helmfile', '-e', 'eqiad', '--selector', 'name=main', 'apply']' returned non-zero exit status 1.
Deployment of mw-api-int-main failed: Command '['helmfile', '-e', 'eqiad', '--selector', 'name=main', 'apply']' returned non-zero exit status 1.
Deployment of mw-web-main failed: Command '['helmfile', '-e', 'eqiad', '--selector', 'name=main', 'apply']' returned non-zero exit status 1.
Deployment of mw-api-ext-main failed: Command '['helmfile', '-e', 'eqiad', '--selector', 'name=main', 'apply']' returned non-zero exit status 1.
Deployment of mw-parsoid-main failed: Command '['helmfile', '-e', 'eqiad', '--selector', 'name=main', 'apply']' returned non-zero exit status 1.
21:46:41 Rolling back to prior state...
21:46:41 Started Running helmfile -e codfw --selector name=main apply in /srv/deployment-charts/helmfile.d/services/mw-jobrunner
21:46:41 Started Running helmfile -e codfw --selector name=main apply in /srv/deployment-charts/helmfile.d/services/mw-web
21:46:41 Started Running helmfile -e codfw --selector name=main apply in /srv/deployment-charts/helmfile.d/services/mw-wikifunctions
21:46:42 Started Running helmfile -e codfw --selector name=main apply in /srv/deployment-charts/helmfile.d/services/mw-parsoid
21:46:42 Started Running helmfile -e codfw --selector name=main apply in /srv/deployment-charts/helmfile.d/services/mw-api-ext
21:46:42 Started Running helmfile -e eqiad --selector name=main apply in /srv/deployment-charts/helmfile.d/services/mw-jobrunner
21:46:42 Started Running helmfile -e codfw --selector name=main apply in /srv/deployment-charts/helmfile.d/services/mw-api-int
21:46:42 Started Running helmfile -e eqiad --selector name=main apply in /srv/deployment-charts/helmfile.d/services/mw-api-int
21:46:42 Started Running helmfile -e eqiad --selector name=main apply in /srv/deployment-charts/helmfile.d/services/mw-api-ext
21:46:42 Started Running helmfile -e eqiad --selector name=main apply in /srv/deployment-charts/helmfile.d/services/mw-parsoid
21:46:42 Started Running helmfile -e eqiad --selector name=main apply in /srv/deployment-charts/helmfile.d/services/mw-wikifunctions
21:46:42 Started Running helmfile -e eqiad --selector name=main apply in /srv/deployment-charts/helmfile.d/services/mw-web
21:46:44 Finished Running helmfile -e codfw --selector name=main apply in /srv/deployment-charts/helmfile.d/services/mw-wikifunctions (duration: 00m 02s)
21:46:44 Finished Running helmfile -e codfw --selector name=main apply in /srv/deployment-charts/helmfile.d/services/mw-jobrunner (duration: 00m 02s)
21:46:44 Finished Running helmfile -e codfw --selector name=main apply in /srv/deployment-charts/helmfile.d/services/mw-api-ext (duration: 00m 02s)
21:46:44 Finished Running helmfile -e codfw --selector name=main apply in /srv/deployment-charts/helmfile.d/services/mw-api-int (duration: 00m 02s)
21:46:44 Finished Running helmfile -e codfw --selector name=main apply in /srv/deployment-charts/helmfile.d/services/mw-web (duration: 00m 02s)
21:46:44 Finished Running helmfile -e codfw --selector name=main apply in /srv/deployment-charts/helmfile.d/services/mw-parsoid (duration: 00m 02s)
21:46:44 Finished Running helmfile -e eqiad --selector name=main apply in /srv/deployment-charts/helmfile.d/services/mw-jobrunner (duration: 00m 02s)
21:46:45 Finished Running helmfile -e eqiad --selector name=main apply in /srv/deployment-charts/helmfile.d/services/mw-api-int (duration: 00m 02s)
21:46:45 Finished Running helmfile -e eqiad --selector name=main apply in /srv/deployment-charts/helmfile.d/services/mw-parsoid (duration: 00m 02s)
21:46:45 Finished Running helmfile -e eqiad --selector name=main apply in /srv/deployment-charts/helmfile.d/services/mw-wikifunctions (duration: 00m 02s)
21:46:45 Finished Running helmfile -e eqiad --selector name=main apply in /srv/deployment-charts/helmfile.d/services/mw-api-ext (duration: 00m 03s)
21:46:45 Finished Running helmfile -e eqiad --selector name=main apply in /srv/deployment-charts/helmfile.d/services/mw-web (duration: 00m 02s)
21:46:45 Rollback completed
21:46:45 Finished sync-prod-k8s (duration: 10m 20s)
21:46:45 Started sync-proxies
21:46:52 sync-proxies: 100% (in-flight: 0; ok: 8; fail: 0; left: 0)
21:46:52 Per-host sync duration: average 5.3s, median 5.3s
21:46:52 rsync transfer: average 381,975 bytes/host, total 3,055,800 bytes
21:46:52 Finished sync-proxies (duration: 00m 06s)
21:46:52 Started sync-apaches
21:47:27 sync-apaches: 100% (in-flight: 0; ok: 209; fail: 0; left: 0)
21:47:27 Per-host sync duration: average 5.7s, median 4.9s
21:47:27 rsync transfer: average 381,975 bytes/host, total 79,832,775 bytes
21:47:27 Finished sync-apaches (duration: 00m 35s)
21:47:27 Started scap-cdb-rebuild
21:47:31 scap-cdb-rebuild: 100% (in-flight: 0; ok: 225; fail: 0; left: 0)
21:47:31 Finished scap-cdb-rebuild (duration: 00m 03s)
21:47:31 Started sync_wikiversions
21:47:34 sync_wikiversions: 100% (in-flight: 0; ok: 225; fail: 0; left: 0)
21:47:34 Finished sync_wikiversions (duration: 00m 03s)
21:47:34 Started php-fpm-restarts
21:47:34 Running '/usr/local/sbin/restart-php-fpm-all php7.4-fpm 9223372036854775807' on 183 host(s)
21:50:37 php-fpm-restart: 100% (in-flight: 0; ok: 183; fail: 0; left: 0)
21:50:37 Finished php-fpm-restarts (duration: 03m 02s)
21:50:37 Running /usr/local/bin/mwscript purgeMessageBlobStore.php
21:50:38 Finished scap: Backport for [[gerrit:994831|InitialiseSettings: Set wgSignatureValidation to disallow [enwiki] (T355462)]] (duration: 38m 34s)
21:50:38 backport failed: <CalledProcessError> Command '['/usr/bin/scap', 'sync-world', '--pause-after-testserver-sync', '--notify-user=houseblaster', 'Backport for [[gerrit:994831|InitialiseSettings: Set wgSignatureValidation to disallow [enwiki] (T355462)]]']' returned non-zero exit status 1.

I ended up aborting the backport window because of the errors above, which occurred while trying to deploy the first simple config change in the queue.
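
For anyone triaging this, the "Deployment of ... failed" lines in the output above can be condensed into a per-datacenter list. The following is only a rough sketch, not part of scap: the function name summarize_failures and the regular expression are my own assumptions, based purely on the line format visible in this paste.

```python
import re
from collections import defaultdict

# Hypothetical pattern, derived only from the lines pasted in this report, e.g.:
#   Deployment of mw-web-main failed: Command '['helmfile', '-e', 'eqiad', ...
FAILURE_RE = re.compile(
    r"Deployment of (?P<release>\S+) failed: "
    r"Command '\['helmfile', '-e', '(?P<env>[^']+)'"
)


def summarize_failures(log_text):
    """Return {datacenter: sorted list of release names that failed}."""
    failures = defaultdict(set)
    for match in FAILURE_RE.finditer(log_text):
        # The datacenter is taken from the helmfile '-e' argument, because
        # only the first line of each group carries a "codfw:"/"eqiad:" prefix.
        failures[match.group("env")].add(match.group("release"))
    return {env: sorted(names) for env, names in failures.items()}


if __name__ == "__main__":
    # Two lines copied from the output above, used as sample input.
    sample = """\
codfw: Deployment of mw-jobrunner-main failed: Command '['helmfile', '-e', 'codfw', '--selector', 'name=main', 'apply']' returned non-zero exit status 1.
Deployment of mw-web-main failed: Command '['helmfile', '-e', 'eqiad', '--selector', 'name=main', 'apply']' returned non-zero exit status 1.
"""
    for env, names in sorted(summarize_failures(sample).items()):
        print(f"{env}: {', '.join(names)}")
```

Run against the full paste, this should show whether the same set of releases failed in both datacenters, which is what the log above suggests.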

