User Details
- User Since
- Oct 15 2019, 4:02 PM (339 w, 3 d)
- Availability
- Available
- IRC Nick
- rzl
- LDAP User
- RLazarus
- MediaWiki User
- RLazarus (WMF) [ Global Accounts ]
Wed, Apr 15
On the prerequisites:
- Double-checked, and both mcrouters have all the routes except /local/wf, which isn't on mw-mcrouter. That's fine, because...
- Per @Jdforrester-WMF, we don't need to keep the /local/wf default. Nothing in mw-* namespaces, including mw-wikifunctions, uses it. (The orchestrator, running in the wikifunctions namespace, does, but that's out of scope here.)
So, the story is that
- https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1269038 configures the shared mcrouter daemonset (mw-mcrouter), which is accessible at $_SERVER['MCROUTER_SERVER'] (i.e. 10.64.72.12:4442 in eqiad, 10.192.72.12:4442 in codfw), but
- it isn't having the desired effect, because https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1266290 sets $eqiadDCWFMC and $codfwDCWFMC to talk to 127.0.0.1:11213 (in-pod mcrouter) rather than using the shared mcrouter, and
- that won't work because nothing is listening on that port: outside of mw-wikifunctions, most mediawiki pods don't have an in-pod mcrouter. So when you try to connect to 127.0.0.1:11213, nobody is home.
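To make that concrete, a minimal sketch of the mismatch as seen from inside a plain mw-* pod (mw-web is just an example namespace, the pod name is hypothetical, and this assumes nc is available in the image):
rzl@deploy2002:~$ kube-env mw-web eqiad
# Expect a nonzero exit: no in-pod mcrouter outside mw-wikifunctions
rzl@deploy2002:~$ kubectl exec <some-mw-web-pod> -- nc -z -w1 127.0.0.1 11213
# Expect success: the shared mw-mcrouter daemonset address from MCROUTER_SERVER
rzl@deploy2002:~$ kubectl exec <some-mw-web-pod> -- nc -z -w1 10.64.72.12 4442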
Wed, Apr 8
Weirdly, Envoy failed to start after the change, with this in the logs:
Wed, Mar 25
We were actually just talking about this in the SLOs group last week (adding @Vgutierrez and @CDanis).
Tue, Mar 24
In the serviceops meeting today we decided to go ahead with this.
Mar 17 2026
(Service Ops triage here: moving this to Needs Info for discussion at our team meeting.)
One more axis to consider: as a matter of best practice for alerting on Kubernetes platforms, there's a distinction between the control plane (the apiserver, scheduler, kubelets, and other cluster machinery) and the data plane (the workloads actually serving traffic).
Mar 13 2026
Resolving; the remaining hosts will go straight to 1.35.9 in T419637 instead.
Mar 12 2026
Unsurprisingly, /var/log/kern.log on kubestage1006 (hosting the above example pod) is full of lines like:
The actual profiles are at modules/profile/files/kubernetes/node/wikifunctions-evaluator and .../wikifunctions-orchestrator (which is the same except s/evaluator/orchestrator/g). The staging and prod containers have the same apparmor annotation[1], so they should have the same signals policy, and anyway it looks correct at a glance.
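(For reference, once kube-env is pointed at staging as in the transcript below, the annotation itself can be checked directly; a sketch, assuming the standard container.apparmor.security.beta.kubernetes.io/<container> key, with dots escaped for jsonpath:)
rzl@deploy2002:~$ kubectl get pod function-evaluator-javascript-evaluator-58c586f4c5-zgzvp \
    -o jsonpath='{.metadata.annotations.container\.apparmor\.security\.beta\.kubernetes\.io/function-evaluator-javascript-evaluator}'
# Expect something like: localhost/wikifunctions-evaluator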
rzl@deploy2002:~$ kube-env wikifunctions staging
rzl@deploy2002:~$ kubectl describe pod function-evaluator-javascript-evaluator-58c586f4c5-zgzvp
[ output trimmed to just the relevant lines: ]
Containers:
  function-evaluator-javascript-evaluator:
    Container ID:  containerd://31d8b176b3d2c8055e2a6c0d353eb9cf3964a4deb8e322c22316098b6adc6eb9
    State:         Running
  function-evaluator-javascript-evaluator-tls-proxy:
    Container ID:  containerd://ea12e56b08b979b586254f08b8ac7fb9011d9adc3e4370c5d8fd0d0237be2ac3
    State:         Terminated
      Reason:      Completed
      Exit Code:   0
Events:
  Type     Reason         Age                     From     Message
  ----     ------         ----                    ----     -------
  Warning  FailedMount    54m (x303 over 10h)     kubelet  MountVolume.SetUp failed for volume "tls-certs-volume" : object "wikifunctions"/"function-evaluator-javascript-evaluator-tls-proxy-certs" not registered
  Warning  FailedMount    44m (x308 over 10h)     kubelet  MountVolume.SetUp failed for volume "envoy-config-volume" : object "wikifunctions"/"function-evaluator-javascript-evaluator-envoy-config-volume" not registered
  Normal   Killing        14m (x895 over 10h)     kubelet  Stopping container function-evaluator-javascript-evaluator
  Warning  FailedKillPod  4m48s (x908 over 10h)   kubelet  error killing pod: [failed to "KillContainer" for "function-evaluator-javascript-evaluator" with KillContainerError: "rpc error: code = Unknown desc = failed to kill container \"31d8b176b3d2c8055e2a6c0d353eb9cf3964a4deb8e322c22316098b6adc6eb9\": unknown error after kill: runc did not terminate successfully: exit status 1: unable to signal init: permission denied\n: unknown", failed to "KillPodSandbox" for "fb689e76-3315-4fa2-8157-772ff7a8a45d" with KillPodSandboxError: "rpc error: code = Unknown desc = failed to stop container \"31d8b176b3d2c8055e2a6c0d353eb9cf3964a4deb8e322c22316098b6adc6eb9\": failed to kill container \"31d8b176b3d2c8055e2a6c0d353eb9cf3964a4deb8e322c22316098b6adc6eb9\": unknown error after kill: runc did not terminate successfully: exit status 1: unable to signal init: permission denied\n: unknown"]
Mar 11 2026
Service Ops triage here: Agreed there's nothing for us to do, thanks @ayounsi - untagging us.
Mar 10 2026
This makes it difficult to track responses, and in some recent situations we have had to move to #wikimedia-sre in order to communicate properly. If we wish to pursue this as an official pattern, it needs to be documented and recorded.
(ServiceOps bug triage here.)
Mar 2 2026
(Copying from my Gerrit comment:)
Feb 26 2026
Not offhand, but your reading sounds believable to me. I'd also be interested in whether the change to X-Forwarded-For (Envoy wouldn't append the remote address to it anymore) would cause any problems for anyone depending on that, but I wouldn't imagine so.
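Illustrative header values only (documentation-range client IP, made-up internal hop):
# Before: Envoy appends the downstream remote address on the way through
X-Forwarded-For: 203.0.113.7, 10.64.0.5
# After: the header is passed along as received
X-Forwarded-For: 203.0.113.7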
From the serviceops triage meeting:
Feb 19 2026
rzl@deploy2002:/srv/deployment-charts/helmfile.d/ml-services$ charlie --services_dir . list
_example_: ml-serve-eqiad, ml-serve-codfw
article-descriptions: ml-staging-codfw, ml-serve-eqiad, ml-serve-codfw
article-models: ml-staging-codfw, ml-serve-eqiad, ml-serve-codfw
articletopic-outlink: ml-staging-codfw, ml-serve-eqiad, ml-serve-codfw
edit-check: ml-staging-codfw, ml-serve-eqiad, ml-serve-codfw
experimental: ml-staging-codfw, ml-serve-eqiad, ml-serve-codfw
llm: ml-staging-codfw, ml-serve-eqiad, ml-serve-codfw
logo-detection: ml-staging-codfw, ml-serve-eqiad, ml-serve-codfw
ores-legacy: ml-staging-codfw, ml-serve-eqiad, ml-serve-codfw
readability: ml-staging-codfw, ml-serve-eqiad, ml-serve-codfw
recommendation-api-ng: ml-staging-codfw, ml-serve-eqiad, ml-serve-codfw
revertrisk: ml-staging-codfw, ml-serve-eqiad, ml-serve-codfw
revise-tone-task-generator: ml-staging-codfw, ml-serve-eqiad, ml-serve-codfw
revision-models: ml-staging-codfw, ml-serve-eqiad, ml-serve-codfw
revscoring-articlequality: ml-staging-codfw, ml-serve-eqiad, ml-serve-codfw
revscoring-articletopic: ml-staging-codfw, ml-serve-eqiad, ml-serve-codfw
revscoring-draftquality: ml-staging-codfw, ml-serve-eqiad, ml-serve-codfw
revscoring-drafttopic: ml-staging-codfw, ml-serve-eqiad, ml-serve-codfw
revscoring-editquality-damaging: ml-staging-codfw, ml-serve-eqiad, ml-serve-codfw
revscoring-editquality-goodfaith: ml-staging-codfw, ml-serve-eqiad, ml-serve-codfw
revscoring-editquality-reverted: ml-staging-codfw, ml-serve-eqiad, ml-serve-codfw
Feb 14 2026
Yeah, I'm fine closing this. I'd still like to address the flapping alerts from intermittent failures someday, maybe with smarter logic that only alerts after we fail N in a row, but we don't need to keep this task open to track that. Since the appserver config isn't in Puppet anymore, the situation in T289202#7812998 isn't a factor.
Feb 12 2026
Hi Editing Team -- is this still on your plate for the current work cycle? Let us know how we can help.
We're very close. There are some bare-metal hosts still left to upgrade; I've updated the description (see also Debmonitor).
ServiceOps triage -- closing this as obsolete, as it was fixed at https://gerrit.wikimedia.org/r/1080583.
Triaging this in 2026: We should evaluate if this 503 rate is still high, and go from there.
Triaging this in 2026: We should evaluate if the error rate is still high, and go from there. At a glance, it seems like things might have improved.
Feb 5 2026
Got it, sounds like a fine workaround. :) Closing as above in that case, but I still hope we can make this better all around and I'll let you know how things progress.
Feb 4 2026
Thanks @MLechvien-WMF, sorry @daniel for not responding sooner.
Feb 3 2026
Before the next upgrade, we may want to give charlie the ability to optionally exclude mediawiki services, so that they can be sequenced independently (e.g., via SKIP_DIRS).
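Hypothetical usage, since the knob doesn't exist yet (the glob and paths are illustrative):
# Proposed: skip the mediawiki service dirs so they can be sequenced separately
rzl@deploy2002:/srv/deployment-charts/helmfile.d/services$ SKIP_DIRS="mw-*" charlie --services_dir . list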
Jan 6 2026
Happy new year! The core functionality here is long since complete and in widespread active use. I'm going to resolve this task, which should no longer be used for general mwscript-k8s feedback. There's still work to do, including both UX polish like T387268 and added functionality like T379675, and those tasks remain open to track that work.
Dec 27 2025
Proposing https://gerrit.wikimedia.org/r/1221195 as an interim solution over the break, and once we're back we can either keep it (and resolve this task) or revert it (and abandon this task) or some third thing.
Dec 19 2025
Helm timed out again when I tried to deploy machinetranslation for the next round of envoy upgrades. I'll retitle this task, as the ContainerStatusUnknown pods aren't the cause of the problem, but we still can't run helmfile apply and we should fix that.
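(For the record, a quick way to spot the stuck pods; a sketch only, with the cluster name assumed:)
rzl@deploy2002:~$ kube-env machinetranslation eqiad
rzl@deploy2002:~$ kubectl get pods | grep ContainerStatusUnknown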
Still some hosts remaining to upgrade to 1.35 in T410975, but we don't need this umbrella task open to track the multi-stage upgrade anymore.
Dec 12 2025
@dr0ptp4kt (And cc @herron) Good question!
Dec 3 2025
Envoy 1.35.7 is about to come out, with security fixes: https://groups.google.com/g/envoy-announce/c/zr2OzwmJFqY
Dec 1 2025
The conversation in #wikimedia-serviceops when this was raised:
Nov 26 2025
(Clinic duty here! Apparently a milestone tag, like SRE Observability (FY2025/2026-Q3), is mutually exclusive with the project tag, like SRE Observability, and that means the task shows up on the clinic duty dashboard as "needs triage." I'm adding Observability-Metrics at a guess, because that also takes it off the triage list, but if you'll be using those milestone tags going forward, we may want to adjust the clinic duty dashboard query.)
Nov 25 2025
Added to nda:
rzl@ldap-maint1001:~$ ldapsearch -x cn=nda | grep chandra-wmde
member: uid=chandra-wmde,ou=people,dc=wikimedia,dc=org
Oh, and: on top of L3, which you've already read, please ensure you're also familiar with https://wikitech.wikimedia.org/wiki/Data_Platform/Data_access#User_responsibilities and reach out if you have any questions. Thanks!