User Details
- User Since: Oct 3 2014, 8:40 AM (602 w, 5 d)
- Availability: Available
- IRC Nick: akosiaris
- LDAP User: Alexandros Kosiaris
- MediaWiki User: Akosiaris [ Global Accounts ]
Feb 5 2026
Close to 2 years later, and with T353464: Migrate wikikube control planes to hardware nodes done, I don't think we've seen a recurrence. I'll boldly resolve.
I've merged, in Puppet, the switch to listening on IPv6 as well by default. In Kubernetes land, all charts are long past mesh.configuration 1.7.
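For reference, a minimal sketch of what dual-stack listening means at the socket level; this is illustrative only, not the actual Puppet or mesh configuration:

```python
import socket

# Illustrative only: a dual-stack listener. Binding an AF_INET6 socket to
# "::" with IPV6_V6ONLY disabled accepts both IPv4 and IPv6 clients on a
# single socket; IPv4 peers appear as mapped addresses like ::ffff:a.b.c.d.
sock = socket.socket(socket.AF_INET6, socket.SOCK_STREAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
sock.setsockopt(socket.IPPROTO_IPV6, socket.IPV6_V6ONLY, 0)
sock.bind(("::", 8080))
sock.listen()
conn, addr = sock.accept()  # addr is ('::ffff:a.b.c.d', ...) for IPv4 peers
```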
Feb 4 2026
I'll just close this as declined. It's close to 6 years old, and I won't have the time to work on this.
Jan 28 2026
Summarizing from notes during an informal SRE summit session.
Jan 9 2026
Thanks for your work on backporting LilyPond to bookworm-backports; this is useful information.
Dec 23 2025
I've gone ahead and added an HTML version to the fc-list tool on Toolforge. It's available at https://fc-list.toolforge.org/fc-list.html
Dec 22 2025
Lowering from UBN to High given the latest update.
Dec 18 2025
Thanks for the amendments!
Thanks for this, Ben.
Dec 17 2025
Change deployed. The file under noc.wikimedia.org is now a simple informational file. I have purposely avoided an HTTP redirect or an HTML refresh; users should update their bookmarks/habits. I'd like to eventually remove this file from this repo, as it was always a hack.
Dec 16 2025
Lowering to High while the analysis and recommendation is being discussed in T405461: Embedded function calls getting stuck showing "Function being called..." instead of result, due to (?) split-brain cache problem.
Dec 15 2025
Feasibility analysis, pros/cons of the proposals above.
Dec 13 2025
With @ssastry, @cscott and @Krinkle, we managed to secure some time on Thursday to look more into this. Aside from @gengh's reproduction above, which involves an edit, we managed to reproduce this using one of the pages that Denny listed above by just issuing simple GET requests (no jobs involved, no edits, etc). In the process of trying to figure out why the active-DC parse wasn't sufficient to avoid the issue appearing in the secondary DC, we believe we've uncovered a design flaw in the ParserCache. Either @cscott or I will split this into a different task once we are back from the offsite, but the TL;DR is that a condition exists where we can end up with stale data in the ParserCache, forcing a reparse in the secondary DC. None of this is probably related to Wikifunctions specifically, but it exists nevertheless. If the ParserCache had functioned as I thought it would, it would have been hiding the architectural issue of Wikifunctions not being designed for Multi-DC.
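To make the failure mode concrete, here is a toy model of the condition, with hypothetical names; this is not MediaWiki's actual ParserCache code, just an illustration of how a stale per-DC cache entry forces a local reparse whose output can diverge:

```python
# Toy model only: NOT MediaWiki code. Shows how a stale cache entry in the
# secondary DC forces a local reparse that can diverge from the primary's
# output when parsing is not deterministic (e.g. embedded function calls).

class ToyParserCache:
    def __init__(self):
        self.store = {}  # page -> (revision_id, rendered_html)

    def get(self, page, current_rev):
        entry = self.store.get(page)
        if entry and entry[0] == current_rev:
            return entry[1]  # fresh hit
        return None          # miss, or a stale entry for an older revision

    def put(self, page, rev, html):
        self.store[page] = (rev, html)

def parse(page, rev, dc):
    # Stand-in for a real parse; output depends on where it runs.
    return f"<p>{page} rev {rev}, rendered in {dc}</p>"

primary, secondary = ToyParserCache(), ToyParserCache()

# An edit lands: the primary parses revision 2 and caches the result,
# but the secondary still holds the entry for revision 1.
primary.put("SomePage", 2, parse("SomePage", 2, "primary"))
secondary.put("SomePage", 1, parse("SomePage", 1, "secondary"))

# A plain GET routed to the secondary sees only a stale entry, so it
# reparses locally; the two DCs can now serve different content.
if secondary.get("SomePage", current_rev=2) is None:
    secondary.put("SomePage", 2, parse("SomePage", 2, "secondary"))

print(primary.get("SomePage", 2))    # rendered in primary
print(secondary.get("SomePage", 2))  # rendered in secondary
```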
Dec 12 2025
@effie and @Jgiannelos have an idea to reuse mw-experimental for this. I've submitted a Cross Team engineering proposal for it.
Dec 11 2025
FYI, I've continued posting updates in T405461.
@Jdforrester-WMF @gengh @DSantamaria @DVrandecic I have a product-related question: what are the expectations regarding consistency of what viewers see across the world? I.e., how OK is it that someone in Europe/Africa sees older/newer/different content than someone in, say, the US?
Dec 9 2025
FYI, I implemented the idea from T280718#8332540 in a Toolforge-hosted tool, fc-list. The tool is at https://fc-list.toolforge.org/fc-list.txt. The data gets updated once a day, and the image used is the one that is deployed at the time the data gets updated. Code is at https://gitlab.wikimedia.org/toolforge-repos/fc-list, and I've posted a change to make https://noc.wikimedia.org/conf/fc-list inform users of that.
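For the curious, the general shape of the daily job is roughly the following. This is a minimal sketch, not the actual tool (the real code is in the GitLab repo above; the output path is a hypothetical placeholder):

```python
import subprocess
from pathlib import Path

# Hypothetical output path; the real tool publishes under its Toolforge
# web directory.
OUTPUT = Path("fc-list.txt")

def regenerate() -> None:
    # fc-list (from fontconfig) prints one line per installed font face.
    # Running it inside the currently deployed image means the listing
    # reflects whatever fonts that image ships.
    result = subprocess.run(
        ["fc-list"], capture_output=True, text=True, check=True
    )
    # Sort for a stable, diff-friendly listing before publishing.
    OUTPUT.write_text("\n".join(sorted(result.stdout.splitlines())) + "\n")

if __name__ == "__main__":
    regenerate()
```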
Dec 1 2025
Thanks for the nice discussion, everyone. Overall, I think that with the suggestion of building images on a dedicated ML machine and with the precautions discussed, we are OK with moving forward and unblocking this.
Nov 27 2025
Resolving per the last comment. It's a 2-year-old task anyway.
Nov 24 2025
Turnilo, for the Telegram logo (the first hit in what @Ladsgroup says), shows Google Proxy as the ISP in a staggering 85% of the cases. However, it sends those requests with no referrer.
Nov 21 2025
Update: SRE started looking into this. Unfortunately, some unexpected issues prevented us from diving deeper; we want to look into the problem in some more detail next week. For now, we are somewhat baffled by the use of the mc-wf hosts for storing the async fragments; this was unexpected.
Nov 17 2025
The 172.16.* hosts are from WMCS. Most likely we need to investigate further what kind of traffic we're seeing from these hosts.
Nov 12 2025
Summarizing a bit from Slack and IRC.
Oct 23 2025
And this is the yamldiff of the same upgrade.
Oct 22 2025
Does "we" include every SRE in the SRE department? Or is that only the "tools-infra" team? Or is it "tools-platform" + "tools-infra" teams?
In T407296#11274680, @Andrew wrote:
"After building a pilot bare-metal toolforge replacement, how hard was it, and does it seem like something that would be easy to maintain, or hard to maintain?"
Sep 22 2025
I've already replied in T394982#11201112, but I find it improbable that SRE will implement such a behavior to accommodate the change in the node.js fetch() API. The HTTP Host header is pretty important across the infrastructure; rewriting other HTTP headers into it might make debugging and reasoning more difficult than needed.
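As a self-contained illustration of why Host is so load-bearing (a toy example with hypothetical hostnames, not our actual routing layer): a single listener serves whichever "site" the client names in Host, and every cache and routing layer in between keys off the same value.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

class VHostHandler(BaseHTTPRequestHandler):
    # The "virtual host" is chosen purely from the Host header.
    def do_GET(self):
        body = f"served vhost: {self.headers.get('Host')}\n".encode()
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep the demo quiet
        pass

server = HTTPServer(("127.0.0.1", 0), VHostHandler)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

# Same TCP endpoint, two Host values -> two different "sites".
for host in ("en.example.org", "de.example.org"):
    req = urllib.request.Request(
        f"http://127.0.0.1:{port}/", headers={"Host": host}
    )
    print(urllib.request.urlopen(req).read().decode(), end="")

server.shutdown()
```

Rewriting some other header into Host at the edge would mean all of those layers see a value the client never sent, which is exactly what makes debugging harder.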
Sep 19 2025
Thanks for this writeup! A couple of inline replies follow.
Sep 18 2025
I still have the same concerns as voiced in T359067#9602091, but I also have to be pragmatic. I don't see us solving the bigger registry problems in the next 6 months.
Sep 16 2025
Setting to stalled, while we figure out the exact details of this one.
Sep 4 2025
I've gone ahead and shaped up some older notes I had on Wikitech and posted a guide for capacity planning at https://wikitech.wikimedia.org/wiki/Kubernetes/Capacity_Planning_of_a_Service
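One common back-of-the-envelope for this kind of sizing is simple arithmetic over measured per-pod throughput plus headroom. A minimal sketch (all numbers are hypothetical placeholders; the linked guide is the authoritative methodology):

```python
import math

# Hypothetical inputs: measure these for your own service.
target_rps = 1200    # expected peak requests per second
per_pod_rps = 150    # sustainable throughput of a single pod
headroom = 0.30      # keep 30% spare capacity for spikes and failover

# Count each pod at only 70% of its capacity, then round up.
replicas = math.ceil(target_rps / (per_pod_rps * (1 - headroom)))
print(f"replicas needed: {replicas}")  # -> 12
```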
Sep 3 2025
Mentioning this here as well as in T403094: Request to increase function-orchestrator memory to 10GiB. I've gone through the logstash entries and the related kernel logs and events; there are only 2 instances where the kernel OOM killer showed up and killed a container, and even in those cases it was the mcrouter container. I think the issues in this task are a combination of
The orchestrator is crashing very frequently now due to OOM
Aug 14 2025
Thanks @claime. Stupid typo on my side.
What does backend refer to here?
Re-reading my first response, I realize I might have been a bit unclear. Indeed, my focus was to respond to number 2, namely "should it contain the shell username of whoever made the wiki?", not to question sending the email in the first place. I agree that it should still be sent. It's a notification, as you say, not an auditing mechanism.
Aug 13 2025
Patches deployed; you should be good to retry, @cmassaro.
Those are warnings (note the W prefix); they wouldn't stop the deployment from happening.
BTW, on the technical side, mw-script does indeed keep the username in the labels of the job and the pod, e.g.
This functionality was added 10 years ago in https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/extensions/WikimediaMaintenance/+/a17c2ef30e0e85ced460f304cf481cdb7d924486%5E%21
Jul 17 2025
This was probably due to https://sal.toolforge.org/log/huohd5cB8tZ8Ohr03499
There you go.
Jul 16 2025
We don't have strict requirements around the intra-DC availability zones.
Thanks for tagging me in this one. This is more Infrastructure-Foundations territory these days, so I am adding the relevant people as well for their information.
Jul 8 2025
No new reports, so I'll resolve; feel free to reopen.
Jul 7 2025
All Kubernetes clusters are now configured to use MTU 1460. This will take some time (weeks) to fully propagate, as it requires a pod restart. Deployments, node maintenance, evictions and other events that end up restarting or rescheduling pods will trigger it. In a few weeks we should be in a position to look at the few pods left hanging and manually restart those.
wikikube workers repooled.
Jun 30 2025
The 2 patches have been merged and will go out with today's deployments. Hopefully we'll be able to successfully resolve this task next week.
Jun 27 2025
I'll file a patch, though, to increase the maximum bucket.
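Assuming this refers to a Prometheus-style histogram (the metric name and boundaries below are hypothetical, not the production ones), "increasing the maximum bucket" means extending the upper bucket boundaries so that slow observations don't all collapse into the open-ended +Inf bucket:

```python
from prometheus_client import Histogram

# Hypothetical metric; real names and boundaries live in the actual patch.
# Before: anything above 10s lands in +Inf and is indistinguishable.
latency_old = Histogram(
    "request_duration_seconds_old", "Request latency (old buckets)",
    buckets=(0.1, 0.5, 1, 2.5, 5, 10),
)

# After: extra upper boundaries keep resolution for slow requests.
latency_new = Histogram(
    "request_duration_seconds_new", "Request latency (new buckets)",
    buckets=(0.1, 0.5, 1, 2.5, 5, 10, 30, 60),
)

latency_new.observe(42.0)  # now falls in the 60s bucket, not just +Inf
```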
