User Details
- User Since
- Oct 15 2019, 4:02 PM (347 w, 1 d)
- Availability
- Available
- IRC Nick
- rzl
- LDAP User
- RLazarus
- MediaWiki User
- RLazarus (WMF)
Today
Great!
My fault, sorry about that!
Yesterday
This is allowed, but you need to use deploy (read-write) credentials. Try it after
Done! Please wait up to 30 minutes for that to propagate to all servers, then you should be all set. I'll resolve this but feel free to reopen it if you have any trouble using your new access.
Chatted with @Scott_French about this today. The cause is Sophroid's HTTP probe in service.yaml, which is routed to a port that handles only gRPC, not HTTP.
Mon, Jun 8
This is done! Wait up to 30 minutes for it to propagate to all hosts, and you'll be all set. Resolving the task, feel free to reopen it if you have any trouble.
Hi @APDube-WMF! I see you provided an SSH key on the task, but if Superset access is all you need, we won't actually need it. I'll set you up without SSH access for now, but if you also need access to anything listed at level 2 or higher, like the stat servers, let me know, and we can always add your SSH key later.
Never mind! Imagine my surprise to find that key already there. :) This was done in https://gerrit.wikimedia.org/r/1298282, thanks @ssingh!
Verified out of band, updating.
Hi @mahmoud.abdelsattar.wmde! I see you already have restricted access (using the SSH key you included), so we should be able to additionally grant you analytics-privatedata-users. (For anyone following along, that effectively means level 2 access, even though only level 1 is strictly required for Superset.)
All done! This will take up to 30 minutes to roll out everywhere, then make sure to follow the instructions in T428262#11990183.
SSH public key (must be a separate key from Wikimedia cloud SSH access): N/A (already in modules/admin/data/data.yaml)
I'm always happy to get involved if you need me, but my colleague @CDanis from the SLO working group is your best contact for the followup here. :)
Thu, Jun 4
Moving to Pending until the app-side work is done on the orchestrator and evaluator, then we can wrap this up. In order:
Word from the Abstract Wikipedia folks is that you should go ahead and reimage the mc-wf hosts without any prework -- just, one at a time please.
Tue, Jun 2
I've reserved port 4974 for this, the next available service port after 4970 (evaluator) and 4971 (orchestrator main port).
Mon, Jun 1
Thu, May 28
From the logs:
Fri, May 22
Let's leave it open until we move it from "draft" to "approved." :)
Thu, May 21
Built and published:
Mon, May 18
I chatted with @MoritzMuehlenhoff about this today (thanks Moritz).
Thu, May 14
May 11 2026
Relevant log lines from this one:
We actually added a test for that case already (in T387549, for this incident):
May 9 2026
I think listing it in that incident was a mistake, actually -- there weren't any releases in state failed in that event, so this feature wouldn't have affected things at all. (I think the incident author wanted a one-line "roll back mediawiki without having to touch the charts repo" command, and thought from the task title that's what this task is. I'm not sure if I agree that feature would be a good idea, but it's not the same thing being discussed here.)
May 8 2026
Transient failure, followed by a successful run, resolving.
Transient failure, followed by a successful run, resolving.
May 6 2026
Posting as serviceops triage: Adding Releng for awareness (who don't own the schedule-deployment tool but do manage other deployment calendar automation).
May 4 2026
Apr 24 2026
Thanks for digging into it!
Suppose version A is running, and we're deploying version B.
Apr 21 2026
Apr 20 2026
I agree that we should take the explode call out of the hot path, but reworking the existing env variables is probably more than we want to tackle -- we could orchestrate the config change for ourselves, but it'd also be a visible change for non-WMF users of MediaWiki with memcache. Not impossible but a lot of work, especially if we don't end up keeping MemcachedWrapper in the long run.
Apr 15 2026
On the prerequisites:
- Double-checked, and mw-mcrouter has all the routes except /local/wf. That's fine, because...
- Per @Jdforrester-WMF, we don't need to keep the /local/wf default. Nothing in mw-* namespaces, including mw-wikifunctions, uses it. (The orchestrator, running in the wikifunctions namespace, does, but that's out of scope here.)
So, the story is that
- https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1269038 configures the shared mcrouter daemonset (mw-mcrouter), which is accessible at $_SERVER['MCROUTER_SERVER'] (i.e. 10.64.72.12:4442 in eqiad, 10.192.72.12:4442 in codfw), but
- it isn't having the desired effect, because https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1266290 sets $eqiadDCWFMC and $codfwDCWFMC to talk to 127.0.0.1:11213 (in-pod mcrouter) rather than using the shared mcrouter, and
- that won't work because nothing is listening on that port: outside of mw-wikifunctions, most mediawiki pods don't have an in-pod mcrouter. So when you try to connect to 127.0.0.1:11213, nobody is home.
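For anyone who wants to confirm the "nobody is home" part from inside a pod, here is a minimal sketch of a TCP connect check. The helper name is_listening and the one-second timeout are my own choices for illustration, not part of any existing tooling; the address 127.0.0.1:11213 is the in-pod mcrouter address from the config change above.

```python
import socket


def is_listening(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within timeout.

    Hypothetical helper for illustration only. A refused connection or a
    timeout both surface as OSError, so either one yields False.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # 127.0.0.1:11213 is the in-pod mcrouter address mentioned above.
    print(is_listening("127.0.0.1", 11213))
```

A False result only says that nothing accepted a TCP connection on that port; it doesn't tell you anything about the shared mcrouter at $_SERVER['MCROUTER_SERVER'], which would need its own check.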
Apr 10 2026
Apr 8 2026
Weirdly, Envoy failed to start after the change, with this in the logs:
Mar 25 2026
We were actually just talking about this in the SLOs group last week (adding @Vgutierrez and @CDanis).
Mar 24 2026
In the serviceops meeting today we decided to go ahead with this.
Mar 23 2026
Mar 20 2026
Mar 17 2026
(Service Ops triage here: moving this to Needs Info for discussion at our team meeting.)
One more axis to consider: Best-practices-wise, for alerting on Kubernetes platforms, there's a distinction between control plane and data plane.
Mar 13 2026
Resolving; the remaining hosts will go straight to 1.35.9 in T419637 instead.
Mar 12 2026
Unsurprisingly /var/log/kern.log on kubestage1006 (hosting the above example pod) is full of lines like:
The actual profiles are at modules/profile/files/kubernetes/node/wikifunctions-evaluator and .../wikifunctions-orchestrator (which is the same except s/evaluator/orchestrator/g). The staging and prod containers have the same AppArmor annotation[1], so they should have the same signals policy, and anyway it looks correct at a glance.
rzl@deploy2002:~$ kube-env wikifunctions staging
rzl@deploy2002:~$ kubectl describe pod function-evaluator-javascript-evaluator-58c586f4c5-zgzvp
[ output trimmed to just the relevant lines: ]
Containers:
  function-evaluator-javascript-evaluator:
    Container ID: containerd://31d8b176b3d2c8055e2a6c0d353eb9cf3964a4deb8e322c22316098b6adc6eb9
    State: Running
  function-evaluator-javascript-evaluator-tls-proxy:
    Container ID: containerd://ea12e56b08b979b586254f08b8ac7fb9011d9adc3e4370c5d8fd0d0237be2ac3
    State: Terminated
      Reason: Completed
      Exit Code: 0
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedMount 54m (x303 over 10h) kubelet MountVolume.SetUp failed for volume "tls-certs-volume" : object "wikifunctions"/"function-evaluator-javascript-evaluator-tls-proxy-certs" not registered
Warning FailedMount 44m (x308 over 10h) kubelet MountVolume.SetUp failed for volume "envoy-config-volume" : object "wikifunctions"/"function-evaluator-javascript-evaluator-envoy-config-volume" not registered
Normal Killing 14m (x895 over 10h) kubelet Stopping container function-evaluator-javascript-evaluator
Warning FailedKillPod 4m48s (x908 over 10h) kubelet error killing pod: [failed to "KillContainer" for "function-evaluator-javascript-evaluator" with KillContainerError: "rpc error: code = Unknown desc = failed to kill container \"31d8b176b3d2c8055e2a6c0d353eb9cf3964a4deb8e322c22316098b6adc6eb9\": unknown error after kill: runc did not terminate successfully: exit status 1: unable to signal init: permission denied\n: unknown", failed to "KillPodSandbox" for "fb689e76-3315-4fa2-8157-772ff7a8a45d" with KillPodSandboxError: "rpc error: code = Unknown desc = failed to stop container \"31d8b176b3d2c8055e2a6c0d353eb9cf3964a4deb8e322c22316098b6adc6eb9\": failed to kill container \"31d8b176b3d2c8055e2a6c0d353eb9cf3964a4deb8e322c22316098b6adc6eb9\": unknown error after kill: runc did not terminate successfully: exit status 1: unable to signal init: permission denied\n: unknown"]
Mar 11 2026
Service Ops triage here: Agreed there's nothing for us to do, thanks @ayounsi - untagging us.
Mar 10 2026
This makes tracking response difficult, and in some situations recently we have had to move to #wikimedia-sre in order to communicate properly. If we wish to pursue this as an official pattern, it needs to be documented and recorded.