User Details
- User Since: Oct 3 2014, 8:40 AM (579 w, 5 d)
- Roles: Administrator
- Availability: Available
- IRC Nick: akosiaris
- LDAP User: Alexandros Kosiaris
- MediaWiki User: AKosiaris (WMF) [ Global Accounts ]
Yesterday
Tue, Nov 4
Thu, Oct 23
And this is the yamldiff of the same upgrade
Wed, Oct 22
Does "we" include every SRE in the SRE department? Or is that only the "tools-infra" team? Or is it "tools-platform" + "tools-infra" teams?
In T407296#11274680, @Andrew wrote:
"After building a pilot bare-metal toolforge replacement, how hard was it, and does it seem like something that would be easy to maintain, or hard to maintain?"
Oct 1 2025
Sep 29 2025
Sep 23 2025
Sep 22 2025
I've already replied in T394982#11201112, but I find it improbable that SRE will implement such behavior to accommodate the change in the node.js fetch() API. The HTTP Host header is pretty important across the infrastructure; rewriting other HTTP headers into it might make debugging and reasoning more difficult than needed.
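To illustrate the point, a minimal sketch with hypothetical hostnames and a documentation IP (not our actual setup): at a shared edge or ingress address, the Host header alone selects the backing service, so rewriting other headers into it would silently change routing.
  # Two requests to the same IP; only the Host header differs, and that alone
  # decides which vhost/service answers (hostnames and IP are placeholders).
  curl -sv -o /dev/null -H 'Host: service-a.example.org' https://198.51.100.10/
  curl -sv -o /dev/null -H 'Host: service-b.example.org' https://198.51.100.10/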
Sep 19 2025
Thanks for this writeup! A couple of inline replies below.
Sep 18 2025
I still have the same concerns as voiced in T359067#9602091, but I also have to be pragmatic. I don't see us solving the bigger registry problems in the next 6 months
Sep 16 2025
Setting to stalled while we figure out the exact details of this one.
Sep 5 2025
Sep 4 2025
I've gone ahead and shaped up some older notes I had on Wikitech and posted a guide for capacity planning at https://wikitech.wikimedia.org/wiki/Kubernetes/Capacity_Planning_of_a_Service
Sep 3 2025
Mentioning this here as well as in T403094: Request to increase function-orchestrator memory to 10GiB. I've gone through the Logstash entries and the related kernel logs and events; there are only 2 instances where the kernel OOM killer showed up and killed a container, and even in those cases it was the mcrouter container. I think the issues in this task are a combination of
The orchestrator is crashing very frequently now due to OOM
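For reference, a hedged sketch of how one might count actual kernel OOM kills versus container-level OOM terminations (the namespace placeholder and time range are illustrative, not taken from the task):
  # Kernel-side: how often the OOM killer actually fired on a worker node.
  sudo journalctl -k --since '2 months ago' | grep -icE 'oom-kill|out of memory'
  # Kubernetes-side: containers whose last termination reason was OOMKilled.
  kubectl get pods -n <namespace> -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}{end}' | grep OOMKilled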
Aug 28 2025
Aug 21 2025
Aug 14 2025
Thanks @claime. Stupid typo on my side.
What does "backend" refer to here?
Re-reading my first response, I realize I might have been a bit unclear. Indeed my focus was to respond to number 2, namely "should it contain the shell username of whoever made the wiki?", not to question sending the email in the first place. I agree that this should still be sent. It's a notification as you say, not an auditing mechanism.
Aug 13 2025
Patches deployed; you should be good to retry, @cmassaro.
Those are warnings (note the W prefix). They wouldn't stop the deployment from happening.
BTW, on the technical side, mw-script does indeed keep the username in the labels of the job and the pod, e.g.
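As an illustration only (the exact label key isn't reproduced here, so treat the key and namespace below as assumptions):
  # List mw-script Jobs with the label recording who launched them
  # ("username" and the "mw-script" namespace are hypothetical placeholders).
  kubectl get jobs -n mw-script -L username
  # Or dump all labels of a given job to find the actual key.
  kubectl get job <job-name> -n mw-script -o jsonpath='{.metadata.labels}'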
This functionality was added 10 years ago in https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/extensions/WikimediaMaintenance/+/a17c2ef30e0e85ced460f304cf481cdb7d924486%5E%21
Jul 18 2025
Jul 17 2025
This was probably due to https://sal.toolforge.org/log/huohd5cB8tZ8Ohr03499
There you go.
Jul 16 2025
We don't have strict requirements around the intra DC availability zones.
Thanks for tagging me in this one. This is more Infrastructure-Foundations territory these days, so I am adding the relevant people as well for their information.
Jul 15 2025
Jul 11 2025
Jul 8 2025
No new reports, so I'll resolve; feel free to reopen.
Jul 7 2025
All Kubernetes clusters are now configured to use MTU 1460. This will take some time (weeks) to fully propagate, as it requires a pod restart. Deployments, node maintenance, evictions and other events that end up restarting or rescheduling pods will trigger it. In a few weeks we should be in a position to look at the few stragglers left and manually restart those; a sketch of how to spot them is below.
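A hedged way to spot pods still carrying the old MTU, assuming the container has iproute2 available (namespace and pod name are placeholders):
  # Pods created before the change still report the old MTU on eth0.
  kubectl -n <namespace> exec <pod> -- ip link show eth0 | grep -o 'mtu [0-9]*'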
wikikube workers repooled.
Jun 30 2025
The 2 patches have been merged and will go out with today's deployments. Hopefully we'll be able to successfully resolve this task next week.
Jun 27 2025
I'll file a patch, though, to increase the maximum bucket.
Jun 26 2025
I have https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1164269/2 and https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1164270/2 lined up for deployment on Monday; they should resolve what is being witnessed by deployers.
Jun 24 2025
Response from the redfish API, using my gerrit change above to print out the response
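For context, the kind of query involved looks roughly like this (a sketch only; the BMC host and credentials are placeholders):
  # Fetch the standard Redfish Systems collection from the BMC; the gerrit
  # change above is what prints out the response body for debugging.
  curl -sk -u "root:${BMC_PASSWORD}" "https://<bmc-host>/redfish/v1/Systems/" | jq .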
Jun 23 2025
Our first arm64 server just got racked. We'll need to figure out how to incorporate it into our tooling, see T397653, but we are finally moving.
Copy-pasting from the commit message of the change just merged and deployed:
Judging from the lack of comments in the last 2 weeks and repeated tests by yours truly, there is a chance the approach worked on the mwdebug hosts (which are slated for removal anyway), so I'm moving forward this week with mwdebug on k8s as well.
Jun 19 2025
Jun 18 2025
Jun 16 2025
Jun 13 2025
Thanks, this helped. After manually creating the registry-restricted bucket with s3cmd (docker-registry will return 503 if it doesn't exist), and with some temporary hacks to prove this works, I managed to partially push an image.
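For the record, the manual bucket creation amounts to something like this (a sketch; endpoint and credentials come from the local s3cmd configuration):
  # Create the bucket the docker-registry expects; without it the registry
  # answers 503, as noted above.
  s3cmd mb s3://registry-restricted
  s3cmd ls   # verify it shows up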
Jun 11 2025
I've gone ahead and switched all of aux-k8s to MTU 1460. This time around, I went for a more hands-off approach, namely:
Jun 10 2025
Tried deploying this (see https://sal.toolforge.org/log/drexRZcBvg159pQr5-r8) and the logs have the following:
Indeed, I commented in the wrong task. My mistake, thanks.
I've run a battery of tests over the previous days against mwdebug2001 and mwdebug2002, with 2 different configurations regarding the retry_on policy: mwdebug2001 had connect-failure and mwdebug2002 had 5xx. The latter includes the former, for what it's worth. The tests were just invocations of the command in T380958#10887212. mwdebug2001 had the occasional failed transaction in a number of those tests, while mwdebug2002 consistently did not return an error. Thus https://gerrit.wikimedia.org/r/1155117, which I just merged. Let it soak for the week. Should this work OK, it needs to be backported to the configuration of mwdebug on k8s as well.
Jun 6 2025
The logs have the following:
E20250606 14:46:30.010773 1 Server-inl.h:593] mcrouter error (router name '11213', flavor 'unknown', service 'mcrouter'): Failed to configure, initial error 'Failed to reconfigure: Unknown RouteHandle: KeyModifyRoute line: 81', from backup 'Failed to reconfigure: Unknown RouteHandle: KeyModifyRoute line: 81'
Thanks for this! I've also landed a round of updates today in https://wikitech.wikimedia.org/w/index.php?title=SLO/Template_instructions/Dashboards_and_alerts&diff=prev&oldid=2309464.
Switching to stalled while waiting for the release.
Jun 5 2025
Close to 100k requests to Main_Page of spcom.wikimedia.org later, 1 error. This is just below 99.999% btw. Siege rounds up to 100%, but actual availability is 99.998%, which is very very very (did I say very enough?) good.
I'll be hammering mwdebug2001 with siege for the next 30 minutes in an effort to reproduce this. The command is run from deploy1003 and it's the following:
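Purely as an illustration (the host header, URL and numbers below are placeholders, not the actual command), such a siege run has this shape:
  # -b: benchmark mode (no delay), -c: concurrent users, -t: run time.
  siege -b -c 50 -t 30M --header="Host: test.wikipedia.org" "https://mwdebug2001.codfw.wmnet/wiki/Main_Page"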
Jun 4 2025
This is surfacing once every couple of days or so, at least per Logstash, which counts 23 instances in the last 2 months. I was looking at one that @Marostegui ran into today.
May 28 2025
May 26 2025
In the interest of stopping the non-actioned-upon alerts, I've gone ahead and excluded linkrecommendation from this alert. The patch is https://gerrit.wikimedia.org/r/c/operations/alerts/+/1150726. I've also moved it from serviceops to serviceops-radar. Feel free to undo this when the time comes that someone works on it.
If you want to do some testing, I could set you up with a test account on apus.
1 single bucket, at least at the beginning. Reading https://distribution.github.io/distribution/about/configuration/, I don't think the software can use more than 1 bucket anyway. In the future, assuming the PoC ends up being successful, we could discuss splitting strategies to have >1 bucket. But even in that case, since that requires running an extra instance of the software, per my current understanding at least, we are limited to a small number, probably in the single digits.
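A minimal sketch of the relevant storage section, assuming the S3 driver described in the configuration reference above (endpoint, region and file path are placeholders; the bucket name reuses the one mentioned earlier). The point is that there is a single bucket key per registry instance:
  # Write a minimal registry config using the S3 storage driver; note the
  # single "bucket" key, which is why one instance maps to exactly one bucket.
  cat > config.yml <<'EOF'
  storage:
    s3:
      regionendpoint: https://object-store.example.wmnet
      region: placeholder-region
      bucket: registry-restricted
  EOF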
May 23 2025
I've spent a good deal of time today doing what I assumed would be easy, that is, performing the above in our ingressgateways. I have failed up to now. I'll need to revisit this with a fresh mind, because right now I even have doubts that the ISTIO_METAJSON_STATS trick works.
After having to deal a bit with a staging-eqiad calico upgrade yesterday, I did find 1 thing that will break. This is a bit complex:
May 22 2025
I did run the simple one
May 21 2025
staging-eqiad is now at an MTU of 1460 as well.
May 20 2025
nobody@wmfdebug:/$ ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: eth0@if69: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1460 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether ae:16:b8:c3:6d:75 brd ff:ff:ff:ff:ff:ff link-netnsid 0
Unfortunately the 2 patches above didn't work. For ml-staging-codfw, simply because it's still locked to 0.2.10 by virtue of helmfile.d/admin_ng/values/common.yaml. It did not work for staging-codfw either, because while the upstream manifests do indeed have support, that support is implemented by having the CNI config managed by Calico, something we did not import in https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1112058, sticking with CALICO_MANAGE_CNI: false. We manage this via Puppet instead. It should be noted that our Puppet implementation does allow differentiating per cluster, same as the chart approach. There is no real functional difference between the 2 ways.
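A hedged way to check which mode a given cluster is in (the daemonset name and namespace follow the upstream Calico defaults, which may differ from our deployment):
  # Print the CALICO_MANAGE_CNI setting of the calico-node container.
  kubectl -n kube-system get ds calico-node -o jsonpath='{.spec.template.spec.containers[?(@.name=="calico-node")].env[?(@.name=="CALICO_MANAGE_CNI")].value}'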

