During the UTC late backport window on 2024-03-04, running `scap backport` on the first patch got stuck while deploying to the test servers. My terminal reported Helm rollbacks, and the K8s deployments failed:
```
21:35:39 Finished Running helmfile -e eqiad --selector name=canary apply in /srv/deployment-charts/helmfile.d/services/mw-parsoid (duration: 10m 12s)
21:35:39 K8s deployment to stage canaries failed: K8s deployment had the following errors:
codfw: Deployment of mw-jobrunner-canary failed: Command '['helmfile', '-e', 'codfw', '--selector', 'name=canary', 'apply']' returned non-zero exit status 1.
Deployment of mw-api-ext-canary failed: Command '['helmfile', '-e', 'codfw', '--selector', 'name=canary', 'apply']' returned non-zero exit status 1.
Deployment of mw-api-int-canary failed: Command '['helmfile', '-e', 'codfw', '--selector', 'name=canary', 'apply']' returned non-zero exit status 1.
Deployment of mw-parsoid-canary failed: Command '['helmfile', '-e', 'codfw', '--selector', 'name=canary', 'apply']' returned non-zero exit status 1.
Deployment of mw-web-canary failed: Command '['helmfile', '-e', 'codfw', '--selector', 'name=canary', 'apply']' returned non-zero exit status 1.
eqiad: Deployment of mw-jobrunner-canary failed: Command '['helmfile', '-e', 'eqiad', '--selector', 'name=canary', 'apply']' returned non-zero exit status 1.
Deployment of mw-web-canary failed: Command '['helmfile', '-e', 'eqiad', '--selector', 'name=canary', 'apply']' returned non-zero exit status 1.
Deployment of mw-api-int-canary failed: Command '['helmfile', '-e', 'eqiad', '--selector', 'name=canary', 'apply']' returned non-zero exit status 1.
Deployment of mw-api-ext-canary failed: Command '['helmfile', '-e', 'eqiad', '--selector', 'name=canary', 'apply']' returned non-zero exit status 1.
Deployment of mw-parsoid-canary failed: Command '['helmfile', '-e', 'eqiad', '--selector', 'name=canary', 'apply']' returned non-zero exit status 1.
21:35:39 Rolling back to prior state...
```
I tried to sync anyway and got the following errors in my terminal while running `scap backport` on deploy2002.codfw.wmnet:
```
- "/bin/sh"
- "-c"
- "sleep 8"
- name: mediawiki-main-php-fpm-exporter
image: docker-registry.discovery.wmnet/prometheus-php-fpm-exporter:0.0.3
imagePullPolicy: IfNotPresent
args: ["--endpoint=http://127.0.0.1:9181/fpm-status", "--addr=0.0.0.0:9118"]
ports:
- name: fpm-metrics
containerPort: 9118
livenessProbe:
tcpSocket:
port: 9118
resources:
requests:
cpu: 100m
memory: 50Mi
limits:
cpu: 500m
memory: 200Mi
lifecycle:
preStop:
exec:
command:
- "/bin/sh"
- "-c"
- "sleep 8"
# TODO: understand how to make mcrouter use the
# application CA when connecting to memcached via TLS
- name: mediawiki-main-mcrouter
image: docker-registry.discovery.wmnet/mcrouter:0.41.0-4-20231022
imagePullPolicy: IfNotPresent
env:
- name: PORT
value: "11213"
- name: CONFIG
value: "file:/etc/mcrouter/config.json"
- name: ROUTE_PREFIX
value: "eqiad/mw"
- name: CROSS_REGION_TO
value: "250"
- name: CROSS_CLUSTER_TO
value: "100"
- name: NUM_PROXIES
value: "5"
- name: PROBE_TIMEOUT
value: "60000"
- name: TIMEOUTS_UNTIL_TKO
value: "3"
# We don't want to listen to TLS here.
# TODO: check if it can connect with TLS without the TLS settings.
- name: USE_SSL
value: "no"
ports:
# Please note: this port is not exposed outside of the pod.
- name: mcrouter
containerPort: 11213
livenessProbe:
tcpSocket:
port: mcrouter
readinessProbe:
exec:
command:
- /bin/healthz
lifecycle:
preStop:
exec:
command:
- "/bin/sh"
- "-c"
- "sleep 8"
volumeMounts:
- name: mediawiki-main-mcrouter
mountPath: /etc/mcrouter
resources:
requests:
cpu: 200m
memory: 100Mi
limits:
cpu: 700m
memory: 200Mi
- name: mediawiki-main-mcrouter-exporter
image: docker-registry.discovery.wmnet/prometheus-mcrouter-exporter:0.0.1-2-20211010
imagePullPolicy: IfNotPresent
args: ["--mcrouter.address", "127.0.0.1:11213", "-mcrouter.server_metrics", "-web.listen-address", ":9151" ]
ports:
# Port names are limited to 15 characters.
- name: mcr-metrics
containerPort: 9151
livenessProbe:
tcpSocket:
port: mcr-metrics
resources:
requests:
cpu: 100m
memory: 50Mi
limits:
cpu: 500m
memory: 200Mi
lifecycle:
preStop:
exec:
command:
- "/bin/sh"
- "-c"
- "sleep 8"
- name: mediawiki-main-tls-proxy
image: docker-registry.discovery.wmnet/envoy:1.23.10-2-s4-20231203
imagePullPolicy: IfNotPresent
env:
- name: SERVICE_NAME
value: main
- name: SERVICE_ZONE
value: "default"
- name: CONCURRENCY
value: "12"
- name: ADMIN_PORT
value: "1666"
- name: DRAIN_TIME_S
value: "600"
- name: DRAIN_STRATEGY
value: gradual
ports:
- containerPort: 4447
readinessProbe:
httpGet:
path: /healthz
port: 9361
volumeMounts:
- name: envoy-config-volume
mountPath: /etc/envoy/
readOnly: true
- name: tls-certs-volume
mountPath: /etc/envoy/ssl
readOnly: true
lifecycle:
preStop:
exec:
command:
- "/bin/sh"
- "-c"
- "sleep 7"
resources:
limits:
memory: 350Mi
requests:
cpu: 200m
memory: 100Mi
- name: mediawiki-main-rsyslog
image: docker-registry.discovery.wmnet/rsyslog:8.2102.0-3
imagePullPolicy: IfNotPresent
env:
- name: KUBERNETES_NAMESPACE
valueFrom:
fieldRef:
fieldPath: metadata.namespace
- name: KUBERNETES_NODE
valueFrom:
fieldRef:
fieldPath: spec.nodeName
- name: KUBERNETES_POD_NAME
valueFrom:
fieldRef:
fieldPath: metadata.name
- name: KUBERNETES_RELEASE
valueFrom:
fieldRef:
fieldPath: metadata.labels['release']
- name: KUBERNETES_DEPLOYMENT
valueFrom:
fieldRef:
fieldPath: metadata.labels['deployment']
lifecycle:
preStop:
exec:
command:
- "/bin/sh"
- "-c"
- "sleep 8"
resources:
requests:
cpu: 100m
memory: 200Mi
limits:
cpu: 1
memory: 300Mi
volumeMounts:
# Mount the shared socket volume
- name: mediawiki-main-rsyslog-config
mountPath: /etc/rsyslog.d
- name: php-logging
mountPath: /var/log/php-fpm
# We need to read files created by www-data and with mode
# 0600
securityContext:
runAsUser: 33
- name: statsd-exporter
image: docker-registry.discovery.wmnet/prometheus-statsd-exporter:0.0.10-20231022
imagePullPolicy: IfNotPresent
lifecycle:
preStop:
exec:
command:
- "/bin/sh"
- "-c"
- "sleep 8"
resources:
requests:
cpu: 100m
memory: 100M
limits:
cpu: 200m
memory: 200M
volumeMounts:
- name: statsd-config
mountPath: /etc/monitoring
readOnly: true
ports:
- name: statsd-metrics
containerPort: 9102
livenessProbe:
tcpSocket:
port: statsd-metrics
volumes:
# Apache sites
- name: mediawiki-main-httpd-sites
configMap:
name: mediawiki-main-httpd-sites-config
# Additional httpd debug configuration
- name: mediawiki-main-httpd-early
configMap:
name: mediawiki-main-httpd-early-config
# Datacenter
- name: mediawiki-main-wikimedia-cluster
configMap:
name: mediawiki-main-wikimedia-cluster-config
# sendmail configuration
- name: mediawiki-main-mail
configMap:
name: mediawiki-main-mail-config
# TLS configurations
- name: envoy-config-volume
configMap:
name: mediawiki-main-envoy-config-volume
- name: tls-certs-volume
secret:
secretName: mediawiki-main-tls-proxy-certs
# Shared unix socket for php apps
- name: shared-socket
emptyDir: {}
- name: mediawiki-main-wmerrors
configMap:
name: mediawiki-main-wmerrors
# Mcrouter configuration if any
# Mcrouter configuration
- name: mediawiki-main-mcrouter
configMap:
name: mediawiki-main-mcrouter-config
- name: mediawiki-main-rsyslog-config
configMap:
name: mediawiki-main-rsyslog-config
- name: php-logging
emptyDir: {}
# GeoIP data
- name: mediawiki-main-geoip
hostPath:
path: /usr/share/GeoIP
- name: mediawiki-main-geoipinfo
hostPath:
path: /usr/share/GeoIPInfo
- name: statsd-config
configMap:
name: mediawiki-main-statsd-configmap
skipping missing values file matching "values-main.yaml"
Upgrading release=main, chart=wmf-stable/mediawiki
FAILED RELEASES:
NAME
main
helmfile.yaml: basePath=.
in ./helmfile.yaml: failed processing release main: command "/usr/bin/helm3" exited with non-zero status:
PATH:
/usr/bin/helm3
ARGS:
0: helm3 (5 bytes)
1: upgrade (7 bytes)
2: --install (9 bytes)
3: --reset-values (14 bytes)
4: main (4 bytes)
5: wmf-stable/mediawiki (20 bytes)
6: --timeout (9 bytes)
7: 600s (4 bytes)
8: --atomic (8 bytes)
9: --namespace (11 bytes)
10: mw-api-ext (10 bytes)
11: --values (8 bytes)
12: /tmp/values568600304 (20 bytes)
13: --values (8 bytes)
14: /tmp/values225672591 (20 bytes)
15: --values (8 bytes)
16: /tmp/values217764258 (20 bytes)
17: --values (8 bytes)
18: /tmp/values300505753 (20 bytes)
19: --values (8 bytes)
20: /tmp/values705690404 (20 bytes)
21: --values (8 bytes)
22: /tmp/values972405299 (20 bytes)
23: --values (8 bytes)
24: /tmp/values533020150 (20 bytes)
25: --values (8 bytes)
26: /tmp/values099846365 (20 bytes)
27: --values (8 bytes)
28: /tmp/values399121048 (20 bytes)
29: --values (8 bytes)
30: /tmp/values643515671 (20 bytes)
31: --values (8 bytes)
32: /tmp/values480956810 (20 bytes)
33: --set (5 bytes)
34: php.servergroup=kube-mw-api-ext (31 bytes)
35: --history-max (13 bytes)
36: 10 (2 bytes)
37: --kubeconfig (12 bytes)
38: /etc/kubernetes/mw-api-ext-deploy-eqiad.config (46 bytes)
ERROR:
exit status 1
EXIT STATUS
1
STDERR:
WARNING: Kubernetes configuration file is group-readable. This is insecure. Location: /etc/kubernetes/mw-api-ext-deploy-eqiad.config
Error: UPGRADE FAILED: release main failed, and has been rolled back due to atomic being set: timed out waiting for the condition
COMBINED OUTPUT:
WARNING: Kubernetes configuration file is group-readable. This is insecure. Location: /etc/kubernetes/mw-api-ext-deploy-eqiad.config
Error: UPGRADE FAILED: release main failed, and has been rolled back due to atomic being set: timed out waiting for the condition
21:46:40 Finished Running helmfile -e eqiad --selector name=main apply in /srv/deployment-charts/helmfile.d/services/mw-api-ext (duration: 10m 14s)
21:46:41 Command '['helmfile', '-e', 'eqiad', '--selector', 'name=main', 'apply']' returned non-zero exit status 1.
21:46:41 Stdout/stderr follows:
21:46:41 helmfile.yaml: basePath=.
skipping missing values file matching "/etc/helmfile-defaults/private/main_services/mw-parsoid/eqiad.yaml"
skipping missing values file matching "values-main.yaml"
Comparing release=main, chart=wmf-stable/mediawiki
mw-parsoid, mw-parsoid.eqiad.main, Deployment (apps) has changed:
# Source: mediawiki/templates/deployment.yaml.tpl
apiVersion: apps/v1
kind: Deployment
metadata:
name: mw-parsoid.eqiad.main
labels:
app: mediawiki
chart: mediawiki-0.6.20
release: main
heritage: Helm
spec:
selector:
matchLabels:
app: mediawiki
release: main
replicas: 33
strategy:
rollingUpdate:
maxSurge: 10%
maxUnavailable: 10%
type: RollingUpdate
template:
metadata:
labels:
app: mediawiki
release: main
deployment: mw-parsoid
routed_via: main
annotations:
checksum/sites: 797f6a05f681fdc18612dd35d27a01257cd6ca744bd52dffd1ef2925469359d2
checksum/rsyslog: 7d7443ef8588edd1f9f2f08939c56be288d0b687d759209056c89cdcb1d5936d
prometheus.io/scrape_by_name: "true"
checksum/prometheus-statsd: 1b90efc073a70aa3d97b1f2280969873e2dbc1a35c735da94580288291fb7214
checksum/tls-config: 03480fc8ed9a428202856c734f64eb185080b05a86c056c8bb258dfff704de87
envoyproxy.io/scrape: "true"
envoyproxy.io/port: "9361"
spec:
# TODO: add affinity rules to ensure even distribution across rows
nodeSelector:
node.kubernetes.io/disk-type: ssd
terminationGracePeriodSeconds: 10
containers:
### The apache httpd container
# TODO: set up logging. See T265876
# TODO: fix virtualhosts in puppet so that the port is set to APACHE_RUN_PORT
- name: mediawiki-main-httpd
- image: docker-registry.discovery.wmnet/restricted/mediawiki-webserver:2024-02-29-215143-webserver
+ image: docker-registry.discovery.wmnet/restricted/mediawiki-webserver:2024-03-04-211219-webserver
imagePullPolicy: IfNotPresent
env:
- name: FCGI_MODE
value: FCGI_UNIX
- name: SERVERGROUP
value: kube-mw-parsoid
- name: APACHE_RUN_PORT
value: "8080"
# Set the pod name as the value of the Server: header.
- name: SERVER_SIGNATURE
valueFrom:
fieldRef:
fieldPath: metadata.name
# Log in ecs format
- name: LOG_FORMAT
value: ecs_rsyslog
# Do not log monitoring requests
- name: LOG_SKIP_SYSTEM
value: "1"
ports:
- name: httpd
containerPort: 8080
# PHP monitoring port
- name: php-metrics
containerPort: 9181
livenessProbe:
tcpSocket:
port: httpd
readinessProbe:
httpGet:
# this is the simplest php script you can think of - it just returns OK.
# This way, we're just testing that apache + php-fpm are ready.
# mcrouter, if enabled, should have its own readiness probe probably.
path: /healthz
port: php-metrics
lifecycle:
preStop:
exec:
command:
- "/bin/sh"
- "-c"
- "sleep 8"
resources:
requests:
cpu: 200m
memory: 200Mi
limits:
cpu: 500m
memory: 400Mi
volumeMounts:
# Mount the shared socket volume
- name: shared-socket
mountPath: /run/shared
# Note: we use subpaths here. Given subpaths are implemented with bind mounts,
# they won't be updated when the configmap is updated.
# This is ok because we're re-deploying the pods when that happens.
- name: mediawiki-main-httpd-sites
mountPath: /etc/apache2/sites-enabled/02-redirects.conf
subPath: 02-redirects.conf
- name: mediawiki-main-httpd-sites
mountPath: /etc/apache2/sites-enabled/03-main.conf
subPath: 03-main.conf
- name: mediawiki-main-httpd-sites
mountPath: /etc/apache2/sites-enabled/04-remnant.conf
subPath: 04-remnant.conf
- name: mediawiki-main-httpd-sites
mountPath: /etc/apache2/sites-enabled/06-secure.wikimedia.conf
subPath: 06-secure.wikimedia.conf
- name: mediawiki-main-httpd-sites
mountPath: /etc/apache2/sites-enabled/07-wikimania.conf
subPath: 07-wikimania.conf
- name: mediawiki-main-httpd-sites
mountPath: /etc/apache2/sites-enabled/08-foundation.conf
subPath: 08-foundation.conf
- name: mediawiki-main-httpd-sites
mountPath: /etc/apache2/sites-enabled/09-wikimedia.conf
subPath: 09-wikimedia.conf
- name: mediawiki-main-httpd-sites
mountPath: /etc/apache2/sites-enabled/10-www.wikimedia.org.conf
subPath: 10-www.wikimedia.org.conf
- name: mediawiki-main-httpd-sites
mountPath: /etc/apache2/sites-enabled/00-nonexistent.conf
subPath: 00-nonexistent.conf
- name: mediawiki-main-httpd-sites
mountPath: /etc/apache2/sites-enabled/01-wwwportals.conf
subPath: 01-wwwportals.conf
# Allow us to inject configurations *before* everything else is evaluated
# To this end we also pick a non-descriptive name that OTOH guarantees
# the configuration will be loaded soon.
# See apache.conf in the mediawiki-httpd image to see precisely when this is loaded.
- name: mediawiki-main-httpd-early
mountPath: /etc/apache2/conf-enabled/00-aaa.conf
subPath: 00-aaa.conf
### The MediaWiki container
- name: mediawiki-main-app
- image: docker-registry.discovery.wmnet/restricted/mediawiki-multiversion:2024-02-29-215132-publish
+ image: docker-registry.discovery.wmnet/restricted/mediawiki-multiversion:2024-03-04-211208-publish
imagePullPolicy: IfNotPresent
securityContext:
capabilities:
add: ["SYS_PTRACE"] # This is needed to produce a slow log
lifecycle:
preStop:
exec:
command:
- "/bin/sh"
- "-c"
- "sleep 8"
env:
- name: SERVERGROUP
value: kube-mw-parsoid
- name: FCGI_MODE
value: FCGI_UNIX
- name: FCGI_URL
value: "unix:///run/shared/fpm-www.sock"
- name: PHP__opcache__memory_consumption
value: "500"
- name: PHP__opcache__max_accelerated_files
value: "32531"
- name: PHP__opcache__interned_strings_buffer
value: "50"
- name: PHP__auto_prepend_file
value: "/srv/mediawiki/wmf-config/PhpAutoPrepend.php"
- name: FPM__request_terminate_timeout
value: "60"
- name: PHP__apc__shm_size
value: 768M
- name: FPM__pm__max_children
value: "8"
- name: FPM__request_slowlog_timeout
value: "5"
- name: PHP__display_errors
value: "Off"
- name: PHP__error_reporting
value: "30719"
- name: PHP__pcre__backtrack_limit
value: "5000000"
- name: PHP__max_execution_time
value: "210"
- name: PHP__error_log
value: /var/log/php-fpm/error.log
- name: FCGI_ALLOW
value: "127.0.0.1"
- name: FPM__slowlog
value: /var/log/php-fpm/slowlog.log
# See T276908
livenessProbe:
exec:
command:
- /usr/bin/test
- -S
- /run/shared/fpm-www.sock
initialDelaySeconds: 1
periodSeconds: 5
resources:
requests:
# CPU calculation:
# Minimum 1 whole CPU
# Multiply the amount of cpu_per_worker (float, unit: cpu, ex: 0.5 is half a CPU per worker)
# by the number of configured workers + 1 (to take into account the main php-fpm process)
cpu: 4.5
# RAM calculation:
# Multiply 50% of the amount of memory_per_worker by the number of workers (ignoring the main php-fpm process)
# Add 50% of the opcache size and the apc size (close to the average real consumption)
memory: 2634Mi
limits:
# RAM calculation:
# Multiply the amount of memory_per_worker by the number of workers (ignoring the main php-fpm process)
# Add 50% of the opcache size and the apc size (close to the average real consumption)
memory: 4634Mi
volumeMounts:
# TODO: use an env variable for this.
- name: mediawiki-main-wikimedia-cluster
mountPath: /etc/wikimedia-cluster
subPath: wikimedia-cluster
# mount the volume as the www-data home. This allows us to change
# the configuration of this configmap without causing a mediawiki
# redeployment as well, which we'd need if we were using a subpath.
- name: mediawiki-main-mail
mountPath: /var/www
# Mount the shared socket volume
- name: shared-socket
mountPath: /run/shared
# php-wmerrors configuration
- name: mediawiki-main-wmerrors
mountPath: /etc/wmerrors
- name: php-logging
mountPath: /var/log/php-fpm
# GeoIP data
- name: mediawiki-main-geoip
mountPath: /usr/share/GeoIP/
readOnly: true
- name: mediawiki-main-geoipinfo
mountPath: /usr/share/GeoIPInfo/
readOnly: true
# Add the following exporters:
# php-fpm exporter
# apache exporter on port 9117
- name: mediawiki-main-httpd-exporter
image: docker-registry.discovery.wmnet/prometheus-apache-exporter:0.0.3
imagePullPolicy: IfNotPresent
args: ["-scrape_uri", "http://127.0.0.1:9181/server-status?auto"]
ports:
- name: httpd-metrics
containerPort: 9117
livenessProbe:
tcpSocket:
port: 9117
resources:
requests:
cpu: 100m
memory: 50Mi
limits:
cpu: 500m
memory: 200Mi
lifecycle:
preStop:
exec:
command:
- "/bin/sh"
- "-c"
- "sleep 8"
- name: mediawiki-main-php-fpm-exporter
image: docker-registry.discovery.wmnet/prometheus-php-fpm-exporter:0.0.3
imagePullPolicy: IfNotPresent
args: ["--endpoint=http://127.0.0.1:9181/fpm-status", "--addr=0.0.0.0:9118"]
ports:
- name: fpm-metrics
containerPort: 9118
livenessProbe:
tcpSocket:
port: 9118
resources:
requests:
cpu: 100m
memory: 50Mi
limits:
cpu: 500m
memory: 200Mi
lifecycle:
preStop:
exec:
command:
- "/bin/sh"
- "-c"
- "sleep 8"
# TODO: understand how to make mcrouter use the
# application CA when connecting to memcached via TLS
- name: mediawiki-main-mcrouter
image: docker-registry.discovery.wmnet/mcrouter:0.41.0-4-20231022
imagePullPolicy: IfNotPresent
env:
- name: PORT
value: "11213"
- name: CONFIG
value: "file:/etc/mcrouter/config.json"
- name: ROUTE_PREFIX
value: "eqiad/mw"
- name: CROSS_REGION_TO
value: "250"
- name: CROSS_CLUSTER_TO
value: "100"
- name: NUM_PROXIES
value: "5"
- name: PROBE_TIMEOUT
value: "60000"
- name: TIMEOUTS_UNTIL_TKO
value: "3"
# We don't want to listen to TLS here.
# TODO: check if it can connect with TLS without the TLS settings.
- name: USE_SSL
value: "no"
ports:
# Please note: this port is not exposed outside of the pod.
- name: mcrouter
containerPort: 11213
livenessProbe:
tcpSocket:
port: mcrouter
readinessProbe:
exec:
command:
- /bin/healthz
lifecycle:
preStop:
exec:
command:
- "/bin/sh"
- "-c"
- "sleep 8"
volumeMounts:
- name: mediawiki-main-mcrouter
mountPath: /etc/mcrouter
resources:
requests:
cpu: 200m
memory: 100Mi
limits:
cpu: 700m
memory: 200Mi
- name: mediawiki-main-mcrouter-exporter
image: docker-registry.discovery.wmnet/prometheus-mcrouter-exporter:0.0.1-2-20211010
imagePullPolicy: IfNotPresent
args: ["--mcrouter.address", "127.0.0.1:11213", "-mcrouter.server_metrics", "-web.listen-address", ":9151" ]
ports:
# Port names are limited to 15 characters.
- name: mcr-metrics
containerPort: 9151
livenessProbe:
tcpSocket:
port: mcr-metrics
resources:
requests:
cpu: 100m
memory: 50Mi
limits:
cpu: 500m
memory: 200Mi
lifecycle:
preStop:
exec:
command:
- "/bin/sh"
- "-c"
- "sleep 8"
- name: mediawiki-main-tls-proxy
image: docker-registry.discovery.wmnet/envoy:1.23.10-2-s4-20231203
imagePullPolicy: IfNotPresent
env:
- name: SERVICE_NAME
value: main
- name: SERVICE_ZONE
value: "default"
- name: CONCURRENCY
value: "12"
- name: ADMIN_PORT
value: "1666"
- name: DRAIN_TIME_S
value: "600"
- name: DRAIN_STRATEGY
value: gradual
ports:
- containerPort: 4452
readinessProbe:
httpGet:
path: /healthz
port: 9361
volumeMounts:
- name: envoy-config-volume
mountPath: /etc/envoy/
readOnly: true
- name: tls-certs-volume
mountPath: /etc/envoy/ssl
readOnly: true
lifecycle:
preStop:
exec:
command:
- "/bin/sh"
- "-c"
- "sleep 7"
resources:
limits:
memory: 350Mi
requests:
cpu: 200m
memory: 100Mi
- name: mediawiki-main-rsyslog
image: docker-registry.discovery.wmnet/rsyslog:8.2102.0-3
imagePullPolicy: IfNotPresent
env:
- name: KUBERNETES_NAMESPACE
valueFrom:
fieldRef:
fieldPath: metadata.namespace
- name: KUBERNETES_NODE
valueFrom:
fieldRef:
fieldPath: spec.nodeName
- name: KUBERNETES_POD_NAME
valueFrom:
fieldRef:
fieldPath: metadata.name
- name: KUBERNETES_RELEASE
valueFrom:
fieldRef:
fieldPath: metadata.labels['release']
- name: KUBERNETES_DEPLOYMENT
valueFrom:
fieldRef:
fieldPath: metadata.labels['deployment']
lifecycle:
preStop:
exec:
command:
- "/bin/sh"
- "-c"
- "sleep 8"
resources:
requests:
cpu: 100m
memory: 200Mi
limits:
cpu: 1
memory: 300Mi
volumeMounts:
# Mount the shared socket volume
- name: mediawiki-main-rsyslog-config
mountPath: /etc/rsyslog.d
- name: php-logging
mountPath: /var/log/php-fpm
# We need to read files created by www-data and with mode
# 0600
securityContext:
runAsUser: 33
- name: statsd-exporter
image: docker-registry.discovery.wmnet/prometheus-statsd-exporter:0.0.10-20231022
imagePullPolicy: IfNotPresent
lifecycle:
preStop:
exec:
command:
- "/bin/sh"
- "-c"
- "sleep 8"
resources:
requests:
cpu: 100m
memory: 100M
limits:
cpu: 200m
memory: 200M
volumeMounts:
- name: statsd-config
mountPath: /etc/monitoring
readOnly: true
ports:
- name: statsd-metrics
containerPort: 9102
livenessProbe:
tcpSocket:
port: statsd-metrics
volumes:
# Apache sites
- name: mediawiki-main-httpd-sites
configMap:
name: mediawiki-main-httpd-sites-config
# Additional httpd debug configuration
- name: mediawiki-main-httpd-early
configMap:
name: mediawiki-main-httpd-early-config
# Datacenter
- name: mediawiki-main-wikimedia-cluster
configMap:
name: mediawiki-main-wikimedia-cluster-config
# sendmail configuration
- name: mediawiki-main-mail
configMap:
name: mediawiki-main-mail-config
# TLS configurations
- name: envoy-config-volume
configMap:
name: mediawiki-main-envoy-config-volume
- name: tls-certs-volume
secret:
secretName: mediawiki-main-tls-proxy-certs
# Shared unix socket for php apps
- name: shared-socket
emptyDir: {}
- name: mediawiki-main-wmerrors
configMap:
name: mediawiki-main-wmerrors
# Mcrouter configuration if any
# Mcrouter configuration
- name: mediawiki-main-mcrouter
configMap:
name: mediawiki-main-mcrouter-config
- name: mediawiki-main-rsyslog-config
configMap:
name: mediawiki-main-rsyslog-config
- name: php-logging
emptyDir: {}
# GeoIP data
- name: mediawiki-main-geoip
hostPath:
path: /usr/share/GeoIP
- name: mediawiki-main-geoipinfo
hostPath:
path: /usr/share/GeoIPInfo
- name: statsd-config
configMap:
name: mediawiki-main-statsd-configmap
skipping missing values file matching "/etc/helmfile-defaults/private/main_services/mw-parsoid/eqiad.yaml"
skipping missing values file matching "values-main.yaml"
Upgrading release=main, chart=wmf-stable/mediawiki
FAILED RELEASES:
NAME
main
helmfile.yaml: basePath=.
in ./helmfile.yaml: failed processing release main: command "/usr/bin/helm3" exited with non-zero status:
PATH:
/usr/bin/helm3
ARGS:
0: helm3 (5 bytes)
1: upgrade (7 bytes)
2: --install (9 bytes)
3: --reset-values (14 bytes)
4: main (4 bytes)
5: wmf-stable/mediawiki (20 bytes)
6: --timeout (9 bytes)
7: 600s (4 bytes)
8: --atomic (8 bytes)
9: --namespace (11 bytes)
10: mw-parsoid (10 bytes)
11: --values (8 bytes)
12: /tmp/values618884843 (20 bytes)
13: --values (8 bytes)
14: /tmp/values401253198 (20 bytes)
15: --values (8 bytes)
16: /tmp/values183007317 (20 bytes)
17: --values (8 bytes)
18: /tmp/values921299376 (20 bytes)
19: --values (8 bytes)
20: /tmp/values029269839 (20 bytes)
21: --values (8 bytes)
22: /tmp/values952661090 (20 bytes)
23: --values (8 bytes)
24: /tmp/values893283417 (20 bytes)
25: --values (8 bytes)
26: /tmp/values685514724 (20 bytes)
27: --values (8 bytes)
28: /tmp/values397940723 (20 bytes)
29: --values (8 bytes)
30: /tmp/values223566006 (20 bytes)
31: --set (5 bytes)
32: php.servergroup=kube-mw-parsoid (31 bytes)
33: --history-max (13 bytes)
34: 10 (2 bytes)
35: --kubeconfig (12 bytes)
36: /etc/kubernetes/mw-parsoid-deploy-eqiad.config (46 bytes)
ERROR:
exit status 1
EXIT STATUS
1
STDERR:
WARNING: Kubernetes configuration file is group-readable. This is insecure. Location: /etc/kubernetes/mw-parsoid-deploy-eqiad.config
Error: UPGRADE FAILED: release main failed, and has been rolled back due to atomic being set: timed out waiting for the condition
COMBINED OUTPUT:
WARNING: Kubernetes configuration file is group-readable. This is insecure. Location: /etc/kubernetes/mw-parsoid-deploy-eqiad.config
Error: UPGRADE FAILED: release main failed, and has been rolled back due to atomic being set: timed out waiting for the condition
21:46:41 Finished Running helmfile -e eqiad --selector name=main apply in /srv/deployment-charts/helmfile.d/services/mw-parsoid (duration: 10m 15s)
21:46:41 K8s deployment to stage production failed: K8s deployment had the following errors:
codfw: Deployment of mw-jobrunner-main failed: Command '['helmfile', '-e', 'codfw', '--selector', 'name=main', 'apply']' returned non-zero exit status 1.
Deployment of mw-api-int-main failed: Command '['helmfile', '-e', 'codfw', '--selector', 'name=main', 'apply']' returned non-zero exit status 1.
Deployment of mw-parsoid-main failed: Command '['helmfile', '-e', 'codfw', '--selector', 'name=main', 'apply']' returned non-zero exit status 1.
Deployment of mw-wikifunctions-main failed: Command '['helmfile', '-e', 'codfw', '--selector', 'name=main', 'apply']' returned non-zero exit status 1.
Deployment of mw-api-ext-main failed: Command '['helmfile', '-e', 'codfw', '--selector', 'name=main', 'apply']' returned non-zero exit status 1.
Deployment of mw-web-main failed: Command '['helmfile', '-e', 'codfw', '--selector', 'name=main', 'apply']' returned non-zero exit status 1.
eqiad: Deployment of mw-jobrunner-main failed: Command '['helmfile', '-e', 'eqiad', '--selector', 'name=main', 'apply']' returned non-zero exit status 1.
Deployment of mw-wikifunctions-main failed: Command '['helmfile', '-e', 'eqiad', '--selector', 'name=main', 'apply']' returned non-zero exit status 1.
Deployment of mw-api-int-main failed: Command '['helmfile', '-e', 'eqiad', '--selector', 'name=main', 'apply']' returned non-zero exit status 1.
Deployment of mw-web-main failed: Command '['helmfile', '-e', 'eqiad', '--selector', 'name=main', 'apply']' returned non-zero exit status 1.
Deployment of mw-api-ext-main failed: Command '['helmfile', '-e', 'eqiad', '--selector', 'name=main', 'apply']' returned non-zero exit status 1.
Deployment of mw-parsoid-main failed: Command '['helmfile', '-e', 'eqiad', '--selector', 'name=main', 'apply']' returned non-zero exit status 1.
21:46:41 Rolling back to prior state...
21:46:41 Started Running helmfile -e codfw --selector name=main apply in /srv/deployment-charts/helmfile.d/services/mw-jobrunner
21:46:41 Started Running helmfile -e codfw --selector name=main apply in /srv/deployment-charts/helmfile.d/services/mw-web
21:46:41 Started Running helmfile -e codfw --selector name=main apply in /srv/deployment-charts/helmfile.d/services/mw-wikifunctions
21:46:42 Started Running helmfile -e codfw --selector name=main apply in /srv/deployment-charts/helmfile.d/services/mw-parsoid
21:46:42 Started Running helmfile -e codfw --selector name=main apply in /srv/deployment-charts/helmfile.d/services/mw-api-ext
21:46:42 Started Running helmfile -e eqiad --selector name=main apply in /srv/deployment-charts/helmfile.d/services/mw-jobrunner
21:46:42 Started Running helmfile -e codfw --selector name=main apply in /srv/deployment-charts/helmfile.d/services/mw-api-int
21:46:42 Started Running helmfile -e eqiad --selector name=main apply in /srv/deployment-charts/helmfile.d/services/mw-api-int
21:46:42 Started Running helmfile -e eqiad --selector name=main apply in /srv/deployment-charts/helmfile.d/services/mw-api-ext
21:46:42 Started Running helmfile -e eqiad --selector name=main apply in /srv/deployment-charts/helmfile.d/services/mw-parsoid
21:46:42 Started Running helmfile -e eqiad --selector name=main apply in /srv/deployment-charts/helmfile.d/services/mw-wikifunctions
21:46:42 Started Running helmfile -e eqiad --selector name=main apply in /srv/deployment-charts/helmfile.d/services/mw-web
21:46:44 Finished Running helmfile -e codfw --selector name=main apply in /srv/deployment-charts/helmfile.d/services/mw-wikifunctions (duration: 00m 02s)
21:46:44 Finished Running helmfile -e codfw --selector name=main apply in /srv/deployment-charts/helmfile.d/services/mw-jobrunner (duration: 00m 02s)
21:46:44 Finished Running helmfile -e codfw --selector name=main apply in /srv/deployment-charts/helmfile.d/services/mw-api-ext (duration: 00m 02s)
21:46:44 Finished Running helmfile -e codfw --selector name=main apply in /srv/deployment-charts/helmfile.d/services/mw-api-int (duration: 00m 02s)
21:46:44 Finished Running helmfile -e codfw --selector name=main apply in /srv/deployment-charts/helmfile.d/services/mw-web (duration: 00m 02s)
21:46:44 Finished Running helmfile -e codfw --selector name=main apply in /srv/deployment-charts/helmfile.d/services/mw-parsoid (duration: 00m 02s)
21:46:44 Finished Running helmfile -e eqiad --selector name=main apply in /srv/deployment-charts/helmfile.d/services/mw-jobrunner (duration: 00m 02s)
21:46:45 Finished Running helmfile -e eqiad --selector name=main apply in /srv/deployment-charts/helmfile.d/services/mw-api-int (duration: 00m 02s)
21:46:45 Finished Running helmfile -e eqiad --selector name=main apply in /srv/deployment-charts/helmfile.d/services/mw-parsoid (duration: 00m 02s)
21:46:45 Finished Running helmfile -e eqiad --selector name=main apply in /srv/deployment-charts/helmfile.d/services/mw-wikifunctions (duration: 00m 02s)
21:46:45 Finished Running helmfile -e eqiad --selector name=main apply in /srv/deployment-charts/helmfile.d/services/mw-api-ext (duration: 00m 03s)
21:46:45 Finished Running helmfile -e eqiad --selector name=main apply in /srv/deployment-charts/helmfile.d/services/mw-web (duration: 00m 02s)
21:46:45 Rollback completed
21:46:45 Finished sync-prod-k8s (duration: 10m 20s)
21:46:45 Started sync-proxies
21:46:52 sync-proxies: 100% (in-flight: 0; ok: 8; fail: 0; left: 0)
21:46:52 Per-host sync duration: average 5.3s, median 5.3s
21:46:52 rsync transfer: average 381,975 bytes/host, total 3,055,800 bytes
21:46:52 Finished sync-proxies (duration: 00m 06s)
21:46:52 Started sync-apaches
21:47:27 sync-apaches: 100% (in-flight: 0; ok: 209; fail: 0; left: 0)
21:47:27 Per-host sync duration: average 5.7s, median 4.9s
21:47:27 rsync transfer: average 381,975 bytes/host, total 79,832,775 bytes
21:47:27 Finished sync-apaches (duration: 00m 35s)
21:47:27 Started scap-cdb-rebuild
21:47:31 scap-cdb-rebuild: 100% (in-flight: 0; ok: 225; fail: 0; left: 0)
21:47:31 Finished scap-cdb-rebuild (duration: 00m 03s)
21:47:31 Started sync_wikiversions
21:47:34 sync_wikiversions: 100% (in-flight: 0; ok: 225; fail: 0; left: 0)
21:47:34 Finished sync_wikiversions (duration: 00m 03s)
21:47:34 Started php-fpm-restarts
21:47:34 Running '/usr/local/sbin/restart-php-fpm-all php7.4-fpm 9223372036854775807' on 183 host(s)
21:50:37 php-fpm-restart: 100% (in-flight: 0; ok: 183; fail: 0; left: 0)
21:50:37 Finished php-fpm-restarts (duration: 03m 02s)
21:50:37 Running /usr/local/bin/mwscript purgeMessageBlobStore.php
21:50:38 Finished scap: Backport for [[gerrit:994831|InitialiseSettings: Set wgSignatureValidation to disallow [enwiki] (T355462)]] (duration: 38m 34s)
21:50:38 backport failed: <CalledProcessError> Command '['/usr/bin/scap', 'sync-world', '--pause-after-testserver-sync', '--notify-user=houseblaster', 'Backport for [[gerrit:994831|InitialiseSettings: Set wgSignatureValidation to disallow [enwiki] (T355462)]]']' returned non-zero exit status 1.
```
I ended up aborting the backport window because of the errors above, which occurred while trying to deploy the first, simple config change in the queue.
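
For triage, the "Deployment of ... failed" lines in the output above can be condensed into one list of failed releases per datacenter. The following is a minimal Python sketch, not part of scap: the function name `summarize_failures` and the regular expression are assumptions based only on the line format visible in the logs above, and the sample text is a shortened excerpt of those lines.

```python
import re
from collections import defaultdict

# Assumed line format, taken from the scap output in this report:
#   Deployment of mw-web-canary failed: Command '['helmfile', '-e', 'eqiad', ...
FAILED_DEPLOYMENT = re.compile(
    r"Deployment of (?P<release>\S+) failed: "
    r"Command '\['helmfile', '-e', '(?P<dc>\w+)'"
)


def summarize_failures(log_text):
    """Return {datacenter: sorted list of failed release names}."""
    failures = defaultdict(set)
    for match in FAILED_DEPLOYMENT.finditer(log_text):
        # The datacenter is read from the helmfile '-e' argument, because only
        # the first line of each group carries the "codfw:" / "eqiad:" prefix.
        failures[match.group("dc")].add(match.group("release"))
    return {dc: sorted(releases) for dc, releases in failures.items()}


# Shortened excerpt of the failure lines shown above.
sample = """\
codfw: Deployment of mw-jobrunner-canary failed: Command '['helmfile', '-e', 'codfw', '--selector', 'name=canary', 'apply']' returned non-zero exit status 1.
Deployment of mw-web-canary failed: Command '['helmfile', '-e', 'codfw', '--selector', 'name=canary', 'apply']' returned non-zero exit status 1.
eqiad: Deployment of mw-parsoid-canary failed: Command '['helmfile', '-e', 'eqiad', '--selector', 'name=canary', 'apply']' returned non-zero exit status 1.
"""

for dc, releases in sorted(summarize_failures(sample).items()):
    print(f"{dc}: {', '.join(releases)}")
```

Run against the full terminal output, this should make it easier to see at a glance whether every release failed in both datacenters, as appears to be the case here.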