
akosiaris (Alexandros Kosiaris)
Site Reliability Engineer

Projects (28)

User Details

User Since
Oct 3 2014, 8:40 AM (602 w, 5 d)
Availability
Available
IRC Nick
akosiaris
LDAP User
Alexandros Kosiaris
MediaWiki User
Akosiaris [ Global Accounts ]

Blurb

Recent Activity

Mar 15 2026

taavi removed the administrator role from akosiaris.
Mar 15 2026, 4:41 PM

Feb 5 2026

akosiaris closed T358936: Kubernetes apiserver probe failures on restart as Resolved.

Close to 2 years later, and with T353464: Migrate wikikube control planes to hardware nodes done, I don't think we've seen a recurrence. I'll boldly resolve.

Feb 5 2026, 12:57 PM · ServiceOps new, Prod-Kubernetes, SRE
akosiaris closed T255568: Envoy should listen on ipv6 and ipv4 as Resolved.

I've merged the switch to listening on IPv6 as well by default in Puppet. In Kubernetes land, all charts are long past mesh.configuration 1.7.

Feb 5 2026, 12:45 PM · Patch-For-Review, envoy, observability, serviceops-deprecated

Feb 4 2026

akosiaris closed T226237: Investigate outgoing discarded packets in the codfw kubernetes cluster as Declined.

I'll just close this as declined. It's been close to 6 years, and I won't have the time to work on this.

Feb 4 2026, 4:39 PM · serviceops-deprecated
akosiaris placed T390251: docker-registry.wikimedia.org keeps serving bad blobs up for grabs.
Feb 4 2026, 4:38 PM · ServiceOps new
akosiaris placed T395451: Make the JobQueue compatible with the MediaWiki Single version HTTP routing system up for grabs.
Feb 4 2026, 4:38 PM · ChangeProp, WMF-JobQueue, serviceops-radar, OKR-Work

Jan 28 2026

akosiaris added a comment to T352956: Handling inbound IPIP traffic on low traffic LVS k8s based realservers.

Summarizing from notes during an informal SRE summit session.

Jan 28 2026, 3:06 PM · ServiceOps new, Patch-For-Review, Prod-Kubernetes, Kubernetes, Traffic

Jan 14 2026

akosiaris updated the task description for T414460: Socket leaking on some dse-k8s row C & D hosts.
Jan 14 2026, 5:30 PM · Data-Platform-SRE (2026-03-06 - 2026-03-27), Essential-Work, SRE, netops, Infrastructure-Foundations

Jan 12 2026

akosiaris added a comment to T405461: Embedded function calls getting stuck showing "Function being called..." instead of result, due to (?) split-brain cache problem.

As far as I understand it, there's some push back on "fixing" T412742: MultiWriteBagOStuff surprising behavior when secondary is unreplicated but primary is replicated at all, with the idea that the ultimate solution is to split the primary as well as the secondary and have a fully-split parser cache. The only work needed to do that, as I understand it, is to generalize the ParsoidCachePrewarmJob to work with the legacy parser as well, so as to maintain a sufficient "warmness" in the non-primary data center to ensure seamless DC migration.

Jan 12 2026, 5:02 PM · Abstract Wikipedia team (26Q4 (Apr–Jun)), Wikifunctions, Essential-Work
akosiaris added a comment to T405461: Embedded function calls getting stuck showing "Function being called..." instead of result, due to (?) split-brain cache problem.

Thanks for the write-up, Alexandros — it sounds like there was a fair bit of discussion in Lisbon; sorry we missed it. We've given our quick thoughts inline. It feels like there's a lot of confusion here. Perhaps we should discuss this in the Performance working group every second Monday?

  1. Force Wikifunctions to be single DC

[…]

One caveat: It will ONLY solve the problem for www.wikifunctions.org. Any project that includes functions will remain vulnerable to the problems outlined in this task. But it should unblock the team from proceeding with their plans involving www.wikifunctions.org. Any work/goals/hypotheses that involve other projects will, of course, not benefit from this.

You're right, we agree that this fix would only help the issue for the client mode operating on Wikifunctions.org, whereas this split-brain problem arises on all of the >150 prod wikis that currently have access to embedded Wikifunctions calls (see the dblist), which is why we halted the roll-outs. We worry that making this change will just be another piece of technical debt we'll later have to undo or work around.

Jan 12 2026, 4:53 PM · Abstract Wikipedia team (26Q4 (Apr–Jun)), Wikifunctions, Essential-Work

Jan 9 2026

akosiaris placed T385404: Deploy LilyPond 2.24 with Cairo support to shellbox containers up for grabs.

Thanks for your work on backporting LilyPond to bookworm-backports, this is useful information.

Jan 9 2026, 9:29 AM · ServiceOps-Upgrades-Hardware, ServiceOps-Services-Oids, Shellbox, ServiceOps new, Upstream, Wikimedia-SVG-rendering, MediaWiki-extensions-Score

Dec 23 2025

akosiaris added a comment to T280718: Re-evaluate whether keeping around https://noc.wikimedia.org/conf/fc-list is a good practice.

I've gone ahead and coded an HTML version into the fc-list tool in Toolforge. It's available at https://fc-list.toolforge.org/fc-list.html

Dec 23 2025, 7:57 AM · noc.wikimedia.org, serviceops-radar, Wikimedia-SVG-rendering

Dec 22 2025

akosiaris lowered the priority of T405461: Embedded function calls getting stuck showing "Function being called..." instead of result, due to (?) split-brain cache problem from Unbreak Now! to High.

Lowering from UBN to High given the latest update.

Dec 22 2025, 3:50 PM · Abstract Wikipedia team (26Q4 (Apr–Jun)), Wikifunctions, Essential-Work

Dec 18 2025

akosiaris added a comment to T413080: Design and build the next generation of container-registry service for the WMF production realm.

Thanks for amendments!

Dec 18 2025, 3:17 PM · ServiceOps new, Epic, Ceph, Kubernetes, Infrastructure-Foundations, Data-Platform-SRE, Machine-Learning-Team
akosiaris added a comment to T413080: Design and build the next generation of container-registry service for the WMF production realm.

There is also an opportunity here to try to consolidate the technology used for container registries around the foundation; for example, there is also the Toolforge one. If in the long term we end up using the same stack for both, it should simplify their operations and maintenance and allow us to have more people familiar with the stack.

Dec 18 2025, 2:29 PM · ServiceOps new, Epic, Ceph, Kubernetes, Infrastructure-Foundations, Data-Platform-SRE, Machine-Learning-Team
akosiaris added a comment to T413080: Design and build the next generation of container-registry service for the WMF production realm.

Thanks for this Ben.

Dec 18 2025, 2:16 PM · ServiceOps new, Epic, Ceph, Kubernetes, Infrastructure-Foundations, Data-Platform-SRE, Machine-Learning-Team

Dec 17 2025

akosiaris closed T280718: Re-evaluate whether keeping around https://noc.wikimedia.org/conf/fc-list is a good practice as Resolved.

Change deployed. The file under noc.wikimedia.org is now a simple informational file. I have on purpose avoided an HTTP redirect or an HTML refresh. Users should update their bookmarks/habits. I'd like to eventually get rid of this file from this repo; it was always a hack.

Dec 17 2025, 8:15 AM · noc.wikimedia.org, serviceops-radar, Wikimedia-SVG-rendering
akosiaris renamed T280718: Re-evaluate whether keeping around https://noc.wikimedia.org/conf/fc-list is a good practice from Re-evaluate whether keeping around https://noc.wikimedia.org/conf/fc-list is a good practive to Re-evaluate whether keeping around https://noc.wikimedia.org/conf/fc-list is a good practice.
Dec 17 2025, 8:05 AM · noc.wikimedia.org, serviceops-radar, Wikimedia-SVG-rendering

Dec 16 2025

akosiaris added a comment to T390251: docker-registry.wikimedia.org keeps serving bad blobs.

@akosiaris do you think that the idea of forming a dedicated working group for the next couple of quarters could be feasible? I can take care of kicking it off and finding volunteers (sounds like me and Scott are already in :D).

Dec 16 2025, 2:52 PM · ServiceOps new
akosiaris lowered the priority of T411807: WF memcached service is dc-local but used for dc-global content from Unbreak Now! to High.

Lowering to high while the analysis and recommendation is being discussed in T405461: Embedded function calls getting stuck showing "Function being called..." instead of result, due to (?) split-brain cache problem.

Dec 16 2025, 10:22 AM · Abstract Wikipedia, ServiceOps new, ServiceOps-Datastores, Patch-For-Review, Abstract Wikipedia team (26Q3 (Jan–Mar)), Wikifunctions, Essential-Work

Dec 15 2025

akosiaris updated subscribers of T405461: Embedded function calls getting stuck showing "Function being called..." instead of result, due to (?) split-brain cache problem.

Feasibility analysis, pros/cons of the proposals above.

Dec 15 2025, 11:16 AM · Abstract Wikipedia team (26Q4 (Apr–Jun)), Wikifunctions, Essential-Work

Dec 13 2025

akosiaris updated subscribers of T405461: Embedded function calls getting stuck showing "Function being called..." instead of result, due to (?) split-brain cache problem.

With @ssastry, @cscott and @Krinkle we managed to secure some time on Thursday to look more into this. Aside from @gengh's reproduction above that involves an edit, we managed to reproduce this using one of the pages that Denny has listed above by just issuing simple GET requests (no jobs involved, no edits, etc). In the process of trying to figure out why the active DC parse wasn't sufficient to avoid the issue appearing in the secondary DC, we believe we've uncovered a design flaw in the ParserCache. Either @cscott or I will split this out into a different task once we are back from the offsite, but the TL;DR is that a condition exists where we can end up with stale data in the ParserCache, forcing a reparse in the secondary DC. None of this is probably related to wikifunctions specifically, but it exists nevertheless. If this functioned as I thought it would, it would have been hiding the architectural issue of wikifunctions not being designed for Multi-DC.

Dec 13 2025, 11:10 AM · Abstract Wikipedia team (26Q4 (Apr–Jun)), Wikifunctions, Essential-Work

Dec 12 2025

akosiaris updated subscribers of T386246: Migrate parsoidtest functionality to kubernetes (mw-parsoid).

@effie and @Jgiannelos have an idea to re-utilize mw-experimental for this. I 've submitted a Cross Team engineering proposal for this.

Dec 12 2025, 4:46 PM · User-jijiki, ServiceOps-Services-Oids, ServiceOps new, Content-Transform-Team, OKR-Work

Dec 11 2025

akosiaris added a comment to T411807: WF memcached service is dc-local but used for dc-global content.

FYI, I 've continued posting updates in T405461

Dec 11 2025, 10:32 AM · Abstract Wikipedia, ServiceOps new, ServiceOps-Datastores, Patch-For-Review, Abstract Wikipedia team (26Q3 (Jan–Mar)), Wikifunctions, Essential-Work
akosiaris added a comment to T405461: Embedded function calls getting stuck showing "Function being called..." instead of result, due to (?) split-brain cache problem.

@Jdforrester-WMF @gengh @DSantamaria @DVrandecic I have a product-related question. What are the expectations regarding consistency of what viewers see across the world? I.e., how OK is it that someone in Europe/Africa sees older/newer/different content than someone in, say, the US?

Dec 11 2025, 9:29 AM · Abstract Wikipedia team (26Q4 (Apr–Jun)), Wikifunctions, Essential-Work

Dec 10 2025

akosiaris updated subscribers of T405461: Embedded function calls getting stuck showing "Function being called..." instead of result, due to (?) split-brain cache problem.

Summarizing some discussions we had with @ssastry and @mszabo. We are going to have a deeper look tomorrow, but here's a summary of our current understanding (we focused on Geno's GIF posted above):

Dec 10 2025, 11:47 PM · Abstract Wikipedia team (26Q4 (Apr–Jun)), Wikifunctions, Essential-Work

Dec 9 2025

akosiaris added a comment to T280718: Re-evaluate whether keeping around https://noc.wikimedia.org/conf/fc-list is a good practice.

FYI, I implemented the idea in T280718#8332540 in a Toolforge hosted tool, fc-list. The tool is at https://fc-list.toolforge.org/fc-list.txt . The data gets updated once a day and the image used is the one that is deployed at the time the data gets updated. Code is at https://gitlab.wikimedia.org/toolforge-repos/fc-list. I've posted a change to make https://noc.wikimedia.org/conf/fc-list inform users of that.

Dec 9 2025, 3:13 PM · noc.wikimedia.org, serviceops-radar, Wikimedia-SVG-rendering

Dec 1 2025

akosiaris added a comment to T394778: Build and push images to the docker registry from ml-lab.

Thanks for the nice discussion everyone. Overall, I think with the suggestion of building images on a dedicated ML machine and with the precautions discussed, we are OK with moving forward and unblocking this.

Dec 1 2025, 12:01 PM · Machine-Learning-Team
akosiaris added a comment to T410198: Determine the source of internal requests going through the API gateway..

It depends on your end goal. Do you want to remove them so that they don't mess with your data in a statistical manner? Or to be absolutely certain?

Ideally I should be certain enough to exempt certain requests from rate limiting. It doesn't have to be 100%, but it shouldn't be exploitable from the outside.

Dec 1 2025, 9:01 AM · ServiceOps-SharedInfra, ServiceOps new, MediaWiki-Platform-Team (Q3 Kanban Board), Content-Transform-Team (Work In Progress), Essential-Work, PageViewInfo, Growth-Team, OKR-Work

Nov 28 2025

akosiaris added a comment to T410198: Determine the source of internal requests going through the API gateway..

None of these IPs belong to either 10.192.72.0/24 or 10.194.128.0/17, so they are unrelated.

However, the IP addresses listed previously are indeed wikikube-worker IPs, so node IPs. That is indeed consistent with health checks (which originate on purpose from the IP address of the host of the pods).

Thank you for clarifying!

Is there a way to distinguish the following (by address or headers):

  • requests from workers
Nov 28 2025, 2:22 PM · ServiceOps-SharedInfra, ServiceOps new, MediaWiki-Platform-Team (Q3 Kanban Board), Content-Transform-Team (Work In Progress), Essential-Work, PageViewInfo, Growth-Team, OKR-Work
akosiaris added a comment to T410198: Determine the source of internal requests going through the API gateway..

All of the 10.192 addresses are wikikube workers, which would correspond to either health checks or something else in wikikube making direct calls (which we actively discourage). The 172.16.* hosts are from WMCS. We most likely need to investigate further what kind of traffic we're seeing from these hosts.

@Clement_Goubert shared the following snippet with me:

service_cluster_cidr:
  v4: "10.192.72.0/24"
  v6: "2620:0:860:306::/116"
cluster_cidr:
  v4: "10.194.128.0/17"
  v6: "2620:0:860:cabe::/64"

If I understand correctly, "service_cluster_cidr" means "pods", and "cluster_cidr" means "nodes".
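
As a rough illustration of the membership check discussed in this thread, the sketch below tests whether an address falls inside either of the two IPv4 ranges from the snippet above. The function name and the sample addresses are hypothetical and chosen only for illustration; the labels are just the key names from the snippet and make no claim about what each range is used for.

```python
import ipaddress

# IPv4 ranges copied from the snippet above.
SERVICE_CLUSTER_CIDR_V4 = ipaddress.ip_network("10.192.72.0/24")
CLUSTER_CIDR_V4 = ipaddress.ip_network("10.194.128.0/17")


def classify(addr: str) -> str:
    """Return the name of the range containing addr, or "other" (hypothetical helper)."""
    ip = ipaddress.ip_address(addr)
    if ip in SERVICE_CLUSTER_CIDR_V4:
        return "service_cluster_cidr"
    if ip in CLUSTER_CIDR_V4:
        return "cluster_cidr"
    return "other"


if __name__ == "__main__":
    # Sample addresses are made up for illustration, not taken from real logs.
    for sample in ("10.192.72.5", "10.194.130.1", "10.192.16.20", "172.16.1.1"):
        print(sample, classify(sample))
```

An address that lands in "other" would be consistent with the later observation that node IPs and WMCS hosts sit outside both ranges.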

Nov 28 2025, 12:25 PM · ServiceOps-SharedInfra, ServiceOps new, MediaWiki-Platform-Team (Q3 Kanban Board), Content-Transform-Team (Work In Progress), Essential-Work, PageViewInfo, Growth-Team, OKR-Work

Nov 27 2025

akosiaris closed T345738: etcd in codfw burned all latency SLO error budget as Resolved.

Resolving per last comment. 2 year old task anyway.

Nov 27 2025, 3:39 PM · SRE, Infrastructure-Foundations, serviceops-deprecated

Nov 25 2025

akosiaris added a comment to T405461: Embedded function calls getting stuck showing "Function being called..." instead of result, due to (?) split-brain cache problem.

we are somewhat baffled by the use of the mc-wf hosts for storing the async fragments, this was unexpected.

Storing the fragments resulting from function calls was the original design objective for having access to the production memcached cluster. It's shown in this late-2020 architectural sketch with this role (implicitly, but hence the 'write' from the orchestrator which we ended up not doing):

https://commons.wikimedia.org/wiki/File:Wikifunctions_-_Top-level_architectural_model.svg

Nov 25 2025, 2:50 PM · Abstract Wikipedia team (26Q4 (Apr–Jun)), Wikifunctions, Essential-Work

Nov 24 2025

akosiaris added a comment to T410304: Measure request frequency of thumbnail sizes.

Turnilo for the Telegram Logo (first hit in what @Ladsgroup ) says: Google Proxy as the ISP, in a staggering 85% of the cases. However, it sends those requests with no referrer.

Nov 24 2025, 5:53 PM · Page-Previews, MediaViewer, Data-Persistence, Thumbor, SRE-swift-storage, Traffic

Nov 21 2025

akosiaris added a comment to T408538: Create a Revise Tone Task Generator in LiftWing.

Notes on connection issues discovered during development.

This is our first service deployed on the LiftWing cluster that requires pod-to-pod communication. This is because part of our workflow is filtering by topics of interest, which requires us to first obtain the article topics from the article topic model.

Thus, our setup requires connection between 2 services:

  1. revise-tone-task-generator service deployed in revise-tone-task-generator namespace.
  2. outlink-topic-model service deployed in articletopic-outlink namespace.
Nov 21 2025, 3:48 PM · Patch-For-Review, Machine-Learning-Team
akosiaris added a comment to T405461: Embedded function calls getting stuck showing "Function being called..." instead of result, due to (?) split-brain cache problem.

Update: SRE started looking into this. Unfortunately some unexpected issues prevented us from diving deeper. We want to look into the problem in some more detail next week. For now, we are somewhat baffled by the use of the mc-wf hosts for storing the async fragments, this was unexpected.

Nov 21 2025, 3:10 PM · Abstract Wikipedia team (26Q4 (Apr–Jun)), Wikifunctions, Essential-Work

Nov 17 2025

akosiaris added a comment to T410198: Determine the source of internal requests going through the API gateway..

The 172.16.* hosts are from WMCS. We most likely need to investigate further what kind of traffic we're seeing from these hosts.

Nov 17 2025, 3:42 PM · ServiceOps-SharedInfra, ServiceOps new, MediaWiki-Platform-Team (Q3 Kanban Board), Content-Transform-Team (Work In Progress), Essential-Work, PageViewInfo, Growth-Team, OKR-Work

Nov 14 2025

akosiaris closed T172480: Add a jobrunner server to the Scap canary pool as Resolved.

I assume Scap still has the concept of applying the next image to a canary pool in mw-on-k8s first, waiting some time for a potential Logstash error rate increase, and then deciding whether to proceed.

Unless a canary pool was introduced for mw-jobrunner since then, this is presumably still limited to the mw-web and mw-api server groups, and thus still an issue.

Nov 14 2025, 9:11 AM · serviceops-deprecated, MW-Interfaces-Team, Release-Engineering-Team (Seen), Sustainability (Incident Followup), WMF-JobQueue, Scap

Nov 12 2025

akosiaris added a comment to T405461: Embedded function calls getting stuck showing "Function being called..." instead of result, due to (?) split-brain cache problem.

Summarizing a bit from Slack and IRC

Nov 12 2025, 12:08 PM · Abstract Wikipedia team (26Q4 (Apr–Jun)), Wikifunctions, Essential-Work
akosiaris added a member for serviceops-radar: MLechvien-WMF.
Nov 12 2025, 9:52 AM
akosiaris added a member for serviceops-radar: Blake.
Nov 12 2025, 9:49 AM
akosiaris added members for serviceops-radar: Raine, Scott_French, jasmine_.
Nov 12 2025, 9:49 AM

Nov 11 2025

akosiaris added a comment to T385404: Deploy LilyPond 2.24 with Cairo support to shellbox containers.

@akosiaris Do you have any suggestions for getting this task un-stuck?

Nov 11 2025, 9:23 AM · ServiceOps-Upgrades-Hardware, ServiceOps-Services-Oids, Shellbox, ServiceOps new, Upstream, Wikimedia-SVG-rendering, MediaWiki-extensions-Score

Nov 4 2025

akosiaris added a comment to T352956: Handling inbound IPIP traffic on low traffic LVS k8s based realservers.

Hi @akosiaris: Following up on this after a discussion during Traffic's planning with @Vgutierrez, and on behalf of the team.

We were curious to know when you would be able to take this, with the understanding that things are busy and we don't expect it to happen immediately. From Traffic's end, we have decided to triage this for Q3 for the rollout and not Q2, given that it is a short quarter and we are unlikely to roll out such a big change affecting the core sites before December. (Liberica is already running on all PoPs.) So that's at least our position.

Does that seem fine to you and for the ServiceOps planning? Do note that as per @Vgutierrez's last comment, we already have a check in place in the migration cookbook so you do not have to worry about that.

Thanks!

Nov 4 2025, 3:38 PM · ServiceOps new, Patch-For-Review, Prod-Kubernetes, Kubernetes, Traffic
akosiaris added a comment to T408704: offline rackspace wikitech-static, online aws wikitech-static.
Nov 4 2025, 2:57 PM · Infrastructure-Foundations

Oct 23 2025

akosiaris added a comment to P84268 YAMLdiff example.

And this is the yamldiff of the same upgrade

Oct 23 2025, 9:15 AM
akosiaris created P84268 YAMLdiff example.
Oct 23 2025, 9:14 AM

Oct 22 2025

akosiaris added a comment to T407296: Toolforge on bare metal POC.

Does "we" include every SRE in the SRE department? Or is that only the "tools-infra" team? Or is it "tools-platform" + "tools-infra" teams?

Oct 22 2025, 3:45 PM · cloud-services-team (FY2025/2026-Q3-Q4), Toolforge
akosiaris added a comment to T407296: Toolforge on bare metal POC.

In T407296#11274680, @Andrew wrote:
"After building a pilot bare-metal toolforge replacement, how hard was it, and does it seem like something that would be easy to maintain, or hard to maintain?"

Oct 22 2025, 11:15 AM · cloud-services-team (FY2025/2026-Q3-Q4), Toolforge

Oct 1 2025

akosiaris edited projects for T401895: Block traffic to RESTBase /page/talk endpoint and sunset it, added: serviceops-radar; removed serviceops-deprecated, Traffic.

@akosiaris per T392491#11167986 and the message sent to Wikitech-l earlier today, this is ready to go.

Oct 1 2025, 8:04 AM · serviceops-radar, Page Content Service
akosiaris updated the task description for T401895: Block traffic to RESTBase /page/talk endpoint and sunset it.
Oct 1 2025, 7:54 AM · serviceops-radar, Page Content Service

Sep 29 2025

akosiaris edited Description on service-utils.
Sep 29 2025, 1:23 PM

Sep 23 2025

akosiaris added a comment to T404726: [tools,infra,k8s] scale up the cluster, specifically CPU.
Sep 23 2025, 11:17 AM · Toolforge (Toolforge iteration 25)

Sep 22 2025

akosiaris added a comment to T404726: [tools,infra,k8s] scale up the cluster, specifically CPU.

Hi @JJMC89, thanks for your comments.

If you are going to change the CPU/memory defaults or what the CPU and memory values translate to in k8s requests/limits, please communicate it substantially in advance of deployment.

Can you elaborate on this? Specifically, how does it affect your workloads? (There are some changes that are more time sensitive than others and might be needed without much advance notice, for example the cpu requests/limit defaults, which are already hitting the cluster allocation space.)

I do have some jobs that are time sensitive, but my concern is maintainers not being available to adjust job resources before you deploy this.

Sep 22 2025, 6:12 PM · Toolforge (Toolforge iteration 25)
akosiaris added a comment to T404726: [tools,infra,k8s] scale up the cluster, specifically CPU.

An issue here is that a lot of our workloads are cronjobs, so cronjobs being unable to trigger means most of the users don't get their workloads running, so I think that not being able to schedule new workload is a critical enough situation to grant a page (at least in the current situation, where we can expand the cluster at will to palliate it in the short term). We can discuss in the team meeting also.

In this scenario, are we envisioning a situation where cpu requests > allocatable space with or without the cronjobs? If with, then I argue that what we will be seeing is a delay in starting workloads, not an inability to have the workloads run. Which again, isn't paging worthy (but it definitely needs an alert, probably critical). If without, I have to ask what are the scenarios that would trigger this. Lost so many nodes that we are out of capacity? Somehow we scheduled so many non transient tools that

I think there's some part of the phrase missing :)

Sep 22 2025, 5:58 PM · Toolforge (Toolforge iteration 25)
akosiaris added a comment to T404726: [tools,infra,k8s] scale up the cluster, specifically CPU.

Thanks @akosiaris, this is very helpful :)

So I propose then for limits/requests to do:

  • If the user specifies --cpu/--memory, use that as both request and limit (as the expectation is to use that, no more, no less).
  • If not, then:
    • cpu/request: use a default that is the average of the cluster (currently ~7%, that'd be 70m; we can round to 100m for now)
    • cpu/limit: use a big number, smaller than a node, we can do something like 4000m (4 cores, as the workers have 8 cores right now)
    • mem/request: use a small, but not too small number, currently it's 256Mi (limit is 512Mi, we use half of it), though we are using ~60% of the actual memory of the cluster, and the requests are around 80% full, we could make it smaller but I think it's kinda ok already
    • mem/limit: this one is trickier, as we know that toolforge workloads are usually very spiky here, with sudden memory usage and then quiet for a long period. Setting this to the same as the request means that we will never overcommit memory-wise any node, but also that most of the time the memory will not be used and any user having bursty workloads will have to specify the maximum memory used manually. Currently we set it to 512Mi (double the request, well, the request to half the limit).
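
The proposal above can be sketched as a small function. This is only an illustration of the rule as stated in this comment, not Toolforge's actual code: the function name and the dict layout are hypothetical, and the default values are the ones quoted in the list.

```python
from typing import Optional

# Defaults quoted in the proposal above.
DEFAULT_CPU_REQUEST = "100m"   # ~7% cluster average (70m), rounded up
DEFAULT_CPU_LIMIT = "4000m"    # 4 cores, on workers with 8 cores
DEFAULT_MEM_REQUEST = "256Mi"  # half of the default limit
DEFAULT_MEM_LIMIT = "512Mi"


def resources(cpu: Optional[str] = None, memory: Optional[str] = None) -> dict:
    """Hypothetical helper: build k8s-style requests/limits per the proposal.

    A user-specified value is used as both request and limit; otherwise
    the defaults above apply.
    """
    return {
        "requests": {
            "cpu": cpu or DEFAULT_CPU_REQUEST,
            "memory": memory or DEFAULT_MEM_REQUEST,
        },
        "limits": {
            "cpu": cpu or DEFAULT_CPU_LIMIT,
            "memory": memory or DEFAULT_MEM_LIMIT,
        },
    }


if __name__ == "__main__":
    print(resources())                          # all defaults
    print(resources(cpu="500m"))                # explicit CPU, default memory
    print(resources(cpu="1", memory="1Gi"))     # both explicit
```

With no flags the pod can burst well above its request, while an explicit --cpu/--memory pins request and limit to the same value.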

Then for the alerts:

If the cpu requests are bigger than the allocatable space, then users will not be able to get their pods scheduled anywhere, so they will stop running -> page?

Most workloads will continue churning along. It's new ones that won't be able to be scheduled. I'd argue that is not a page. It's a degraded experience, sure, but a critical alert is good enough, IMHO.

An issue here is that a lot of our workloads are cronjobs, so cronjobs being unable to trigger means most of the users don't get their workloads running, so I think that not being able to schedule new workload is a critical enough situation to grant a page (at least in the current situation, where we can expand the cluster at will to palliate it in the short term). We can discuss in the team meeting also.

Sep 22 2025, 4:09 PM · Toolforge (Toolforge iteration 25)
akosiaris added a comment to T404291: Allow proxy server to accept another valid http header instead of 'HOST'.

I've already replied in T394982#11201112, but I find it improbable that SRE will be implementing such a behavior to accommodate the change in the node.js fetch() API. The HTTP Host header is pretty important across the infrastructure. Rewriting other HTTP headers to it might make debugging and reasoning more difficult than needed.

Sep 22 2025, 11:41 AM · Language and Product Localization, SRE, CXServer, envoy
akosiaris added a comment to T394982: Migrate cxserver in production to node22.

Node 22 stabilizes the fetch API. It is now feature compatible with the browsers' fetch API. This is generally good, but it also adds more restrictions on what a valid HTTP request can be. The header field we are setting to pass the wikipedia domain to that wiki proxy is HOST (see the configuration). This is problematic because HOST is a forbidden header.

  1. https://developer.mozilla.org/en-US/docs/Glossary/Forbidden_request_header
  2. https://fetch.spec.whatwg.org/#forbidden-request-header

So, Node's fetch API won't accept the HOST header. The wiki proxy will receive the request without the HOST header and will end up returning a 404 response.
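
To make the behavior described in this comment concrete, here is a minimal sketch of a forbidden-header filter. It is an illustration only, not Node's implementation: the helper names are hypothetical, and the name set is a deliberately partial subset of the list in the Fetch specification linked above.

```python
# Partial, illustrative subset of the forbidden request-header names from the
# Fetch specification; the full list in the spec is longer.
FORBIDDEN_NAMES = {"host", "connection", "content-length", "cookie", "origin", "referer"}
# Names starting with these prefixes are also forbidden by the spec.
FORBIDDEN_PREFIXES = ("proxy-", "sec-")


def is_forbidden_request_header(name: str) -> bool:
    """Hypothetical helper: case-insensitive check against the subset above."""
    lowered = name.lower()
    return lowered in FORBIDDEN_NAMES or lowered.startswith(FORBIDDEN_PREFIXES)


def strip_forbidden(headers: dict) -> dict:
    """Return a copy of headers without forbidden names (hypothetical helper)."""
    return {k: v for k, v in headers.items() if not is_forbidden_request_header(k)}


if __name__ == "__main__":
    requested = {"HOST": "en.wikipedia.org", "Accept": "application/json"}
    # The HOST entry is dropped, so the proxy never sees the intended domain.
    print(strip_forbidden(requested))
```

If the client behaves this way, the proxy has no domain to route on, which matches the 404 described in the comment.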

Sep 22 2025, 11:38 AM · CXServer, LPL Projects (Other), LPL Essential (2025 Jul-Oct)

Sep 19 2025

akosiaris added a comment to T404742: Increase RAM and nginx tmpfs on docker registry hosts.

I still have the same concerns as voiced in T359067#9602091, but I also have to be pragmatic. I don't see us solving the bigger registry problems in the next 6 months

Furthermore, Dragonfly helps mitigate some of the concerns. Some more operational ones I have are:

  • Experience has shown that VMs with a lot of memory are more difficult to migrate around in Ganeti, some to the point of stalling. However, I think there shouldn't be any stalling here (we've seen that with VMs with high continuous memory churn, and this use case isn't one of these), just longer migrations.

Yeah, historically true. I'm hopeful (perhaps naively) that it's better nowadays:

  • 10Gbit is almost everywhere, as Moritz says
  • most Ganeti metal is 128GB RAM, so the schedulability of these large VMs is also easier
  • in addition to the lack of RAM churn given their use case, these VMs are also totally fine to shut down one at a time and cold-migrate
Sep 19 2025, 2:56 PM · Infrastructure-Foundations, serviceops-deprecated
akosiaris added a comment to T404726: [tools,infra,k8s] scale up the cluster, specifically CPU.

Thanks for this writeup! Couple of inline replies

Sep 19 2025, 2:40 PM · Toolforge (Toolforge iteration 25)

Sep 18 2025

akosiaris added a comment to T404742: Increase RAM and nginx tmpfs on docker registry hosts.

I still have the same concerns as voiced in T359067#9602091, but I also have to be pragmatic. I don't see us solving the bigger registry problems in the next 6 months

Sep 18 2025, 4:16 PM · Infrastructure-Foundations, serviceops-deprecated
akosiaris added a watcher for Toolforge: akosiaris.
Sep 18 2025, 3:32 PM

Sep 16 2025

akosiaris changed the status of T401295: Decide how to use the new clouddb hosts (clouddb102[2-5]) from Open to Stalled.

Setting to stalled, while we figure out the exact details of this one.

Sep 16 2025, 2:01 PM · Data-Platform-SRE (2025.10.17 - 2025.11.07), cloud-services-team (FY2025/2026-Q1-Q2), Data-Services, Data-Persistence

Sep 5 2025

akosiaris closed T390438: Frequent HTTP 503 errors from MediaWiki API every 1 or 2 minutes as Resolved.

No, the bot just hits the categorymembers API.

We already reopened this once. Maybe we wait a bit to monitor before closing?

Sep 5 2025, 3:55 PM · SRE, Wikimedia-production-error, MediaWiki-Action-API, MW-Interfaces-Team

Sep 4 2025

akosiaris added a comment to T394917: If we understand capacity planning, we can create an architectural plan how to prevent service outage/throttling, to be executed in subsequent quarters.

I've gone ahead and shaped up some older notes I had in wikitech and posted a guide for capacity planning at https://wikitech.wikimedia.org/wiki/Kubernetes/Capacity_Planning_of_a_Service

Sep 4 2025, 5:23 PM · Abstract Wikipedia team (26Q1 (Jul–Sep)), Epic, Essential-Work, function-schemata, function-evaluator, function-orchestrator

Sep 3 2025

akosiaris added a comment to T398106: fix crashing service.

Mentioning this here as well as in T403094: Request to increase function-orchestrator memory to 10GiB. I 've gone through the logstash entries and the related kernel logs and events and there are only 2 instances where the kernel OOMKiller showed up and killed a container and even in that case it was the mcrouter container. I think the issues in this task are a combination of

Sep 3 2025, 6:18 PM · Essential-Work, Patch-For-Review, Abstract Wikipedia team (26Q1 (Jul–Sep)), function-orchestrator
akosiaris updated subscribers of T403094: Request to increase function-orchestrator memory to 10GiB.

The orchestrator is crashing very frequently now due to OOM

Sep 3 2025, 5:55 PM · Abstract Wikipedia team (26Q2 (Oct–Dec)), Essential-Work, function-orchestrator

Aug 28 2025

akosiaris updated the task description for T395451: Make the JobQueue compatible with the MediaWiki Single version HTTP routing system.
Aug 28 2025, 11:49 AM · ChangeProp, WMF-JobQueue, serviceops-radar, OKR-Work

Aug 21 2025

akosiaris added a comment to T399348: Wikifunctions function orchestrator and evaluator test suites failing on GitLab CI with OOM errors.

We could possibly discuss with releng if Digital Ocean runners should still be default or maybe the other way around, default to wmcs and people can opt-in to Digital Ocean runners if they want. (Given that we had a few reports where the suggested solution ended up being to switch to our own infra.)

The WMCS runners have less functionality than the DO runners (T397888, T396924), so we would probably want to at least make these differences well documented and warn everyone of CI failures to be on the lookout for if the default is changed.

Aug 21 2025, 12:58 PM · Abstract Wikipedia team, GitLab (CI & Job Runners), Essential-Work, collaboration-services, Release-Engineering-Team, Patch-For-Review, function-orchestrator, function-evaluator

Aug 14 2025

akosiaris added a comment to T401269: Increase request size limit in backend.
Aug 14 2025, 2:04 PM · OKR-Work, Abstract Wikipedia team (26Q1 (Jul–Sep)), function-evaluator
akosiaris added a comment to T352956: Handling inbound IPIP traffic on low traffic LVS k8s based realservers.

Yup, scheduling it for the weeks of either August 11th or August 18th.

gentle ping, do you need something from my side?

Aug 14 2025, 1:48 PM · ServiceOps new, Patch-For-Review, Prod-Kubernetes, Kubernetes, Traffic
akosiaris added a comment to T401833: Cannot deploy function-orchestrator in staging environment due to insufficient quotas.

Thanks @claime. Stupid typo on my side.

Aug 14 2025, 11:22 AM · Abstract Wikipedia team (26Q1 (Jul–Sep)), Essential-Work, serviceops-deprecated
akosiaris added a comment to T401269: Increase request size limit in backend.

What does backend refer to here?

Aug 14 2025, 11:18 AM · OKR-Work, Abstract Wikipedia team (26Q1 (Jul–Sep)), function-evaluator
akosiaris added a comment to T390251: docker-registry.wikimedia.org keeps serving bad blobs.

@akosiaris looks like you had a lot of changes for making a new registry, is that registry ready? Or is it still in the testing phase?

Aug 14 2025, 8:45 AM · ServiceOps new
akosiaris added a comment to T401803: mwscript-k8s does not include an environment variable with the username of the executing user.

Re-reading my first response, I realize I might have been a bit unclear. My focus was to respond to number 2, namely "should it contain the shell username of whoever made the wiki?", not to question sending the email in the first place. I agree that this should still be sent. It's a notification, as you say, not an auditing mechanism.

Aug 14 2025, 8:31 AM · ServiceOps new, MW-on-K8s, MediaWiki-extensions-WikimediaMaintenance

Aug 13 2025

akosiaris added a comment to T401833: Cannot deploy function-orchestrator in staging environment due to insufficient quotas.

Patches deployed, you should be good to retry @cmassaro

Aug 13 2025, 3:30 PM · Abstract Wikipedia team (26Q1 (Jul–Sep)), Essential-Work, serviceops-deprecated
akosiaris added a comment to T401833: Cannot deploy function-orchestrator in staging environment due to insufficient quotas.

Those are warnings (note the W prefix). They wouldn't stop the deployment from happening.

The actual reason is this https://logstash.wikimedia.org/app/discover#/doc/logstash-*/logstash-k8s-1-7.0.0-1-2025.08.13?id=8Q7Uo5gBgiE0yhV9mhEm

Pasting for convenience

(combined from similar events): Error creating: pods "function-orchestrator-main-orchestrator-df5fdb7c9-dmrg5" is forbidden: exceeded quota: quota-compute-resources, requested: limits.memory=4172Mi, used: limits.memory=7220Mi, limited: limits.memory=10Gi

Simply put, the namespace in staging isn't provisioned for pods that are this large. Resources in staging are scarce. We can probably bump the quotas up a bit, but only just enough to allow this to run.
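To make the numbers in that error concrete, here is a minimal arithmetic sketch using only the three values from the pasted message. The helper name `to_mib` is hypothetical, and it handles only the `Mi` and `Gi` suffixes that appear above, not the full Kubernetes quantity syntax.

```python
# Values are copied from the quota error message pasted above.
# to_mib is a hypothetical helper for illustration; it only understands
# the Mi and Gi suffixes seen in that message.
UNITS_IN_MIB = {"Mi": 1, "Gi": 1024}


def to_mib(quantity: str) -> int:
    """Convert a memory quantity such as '4172Mi' or '10Gi' to MiB."""
    for suffix, factor in UNITS_IN_MIB.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    raise ValueError(f"unsupported unit in {quantity!r}")


requested = to_mib("4172Mi")  # limits.memory requested by the new pod
used = to_mib("7220Mi")       # limits.memory already counted in the namespace
limit = to_mib("10Gi")        # quota-compute-resources ceiling

total = used + requested      # what the namespace would need
overshoot = total - limit     # how far past the quota that is

print(f"needed={total}Mi limit={limit}Mi overshoot={overshoot}Mi")
```

In other words, admitting the pod would take the namespace a bit over 1Gi past its 10Gi memory limit, which is why pod creation is refused.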

Aha, we can pull the memory for staging back down.

Aug 13 2025, 3:30 PM · Abstract Wikipedia team (26Q1 (Jul–Sep)), Essential-Work, serviceops-deprecated
akosiaris renamed T401833: Cannot deploy function-orchestrator in staging environment due to insufficient quotas from Cannot deploy function-orchestrator due to deprecated appArmor field to Cannot deploy function-orchestrator in staging environment due to insufficient quotas.
Aug 13 2025, 2:54 PM · Abstract Wikipedia team (26Q1 (Jul–Sep)), Essential-Work, serviceops-deprecated
akosiaris added a comment to T401833: Cannot deploy function-orchestrator in staging environment due to insufficient quotas.

Those are warnings (note the W prefix). They wouldn't stop the deployment from happening.

Aug 13 2025, 2:49 PM · Abstract Wikipedia team (26Q1 (Jul–Sep)), Essential-Work, serviceops-deprecated
akosiaris added a comment to T401803: mwscript-k8s does not include an environment variable with the username of the executing user.

BTW, on the technical side, mw-script does indeed keep the username in the labels of the job and the pod, e.g.

Aug 13 2025, 2:02 PM · ServiceOps new, MW-on-K8s, MediaWiki-extensions-WikimediaMaintenance
akosiaris added a comment to T401803: mwscript-k8s does not include an environment variable with the username of the executing user.

This functionality was added 10 years ago in https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/extensions/WikimediaMaintenance/+/a17c2ef30e0e85ced460f304cf481cdb7d924486%5E%21

Aug 13 2025, 1:25 PM · ServiceOps new, MW-on-K8s, MediaWiki-extensions-WikimediaMaintenance

Jul 18 2025

akosiaris added a comment to T352956: Handling inbound IPIP traffic on low traffic LVS k8s based realservers.

@akosiaris I think we could start considering enabling inbound IPIP traffic on the staging environment, deploying IPIP interfaces (assuming you'll be using the regular kernel networking stack and not some eBPF "magic") shouldn't affect the ability to handle non-encapsulated traffic.

As soon as IPIP encapsulated traffic is handled we can validate that it's working as expected without impacting the traffic coming from load balancers, we used sre.loadbalancer.migrate-service-ipip cookbook to perform this validation for T373020: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/cookbooks/+/refs/heads/master/cookbooks/sre/loadbalancer/migrate-service-ipip.py#182 and we could use a similar one here

Jul 18 2025, 9:08 AM · ServiceOps new, Patch-For-Review, Prod-Kubernetes, Kubernetes, Traffic

Jul 17 2025

akosiaris added a comment to T399681: MediaWiki periodic job update-special-pages-s5 failed.

Hmm… what's interesting about this task & the other two most-recent MediaWiki-Special-pages cron-job-failure tasks (T396454, T396977) is that all three of these failures have been due to (what seem like) database connection errors.

Jul 17 2025, 1:42 PM · serviceops-deprecated, Wikimedia-production-error, MediaWiki-Special-pages
akosiaris added a comment to T396977: MediaWiki periodic job update-special-pages-s8 failed.

This was probably due to https://sal.toolforge.org/log/huohd5cB8tZ8Ohr03499

Jul 17 2025, 1:38 PM · Wikimedia-production-error, MediaWiki-Special-pages
akosiaris added a comment to T399681: MediaWiki periodic job update-special-pages-s5 failed.

There you go.

Jul 17 2025, 1:15 PM · serviceops-deprecated, Wikimedia-production-error, MediaWiki-Special-pages

Jul 16 2025

akosiaris added a comment to T390087: eqiad: VMs requested for Data Persistence automation and testbeds.

We don't have strict requirements around the intra DC availability zones.

Jul 16 2025, 3:06 PM · Infrastructure-Foundations, vm-requests
akosiaris updated the task description for T390087: eqiad: VMs requested for Data Persistence automation and testbeds.
Jul 16 2025, 2:59 PM · Infrastructure-Foundations, vm-requests
akosiaris renamed T380807: Provide a dedicated for Abstract Wikipedia Rust image from Have SRE provide a production-ready Rust image upstream to Provide a dedicated for Abstract Wikipedia Rust image.
Jul 16 2025, 1:55 PM · Abstract Wikipedia team, Essential-Work, serviceops-deprecated, function-evaluator
akosiaris updated subscribers of T390087: eqiad: VMs requested for Data Persistence automation and testbeds.

Thanks for tagging me in this one. This is more Infrastructure-Foundations territory these days, so I am adding the relevant people as well for their information.

Jul 16 2025, 1:53 PM · Infrastructure-Foundations, vm-requests

Jul 15 2025

akosiaris added a comment to T361768: Migrate and re-deploy eventgate using new service-utils.

This includes upgrade to Nodejs 20.

Hi! Has this happened? Looking at the images currently deployed per the deployment-charts repo,


$ podman run --rm -it --entrypoint /bin/sh docker-registry.wikimedia.org/repos/data-engineering/eventgate-wikimedia:v1.11.0 -c "nodejs -v"
v20.5.1

and

$ podman run --rm -it --entrypoint /bin/sh docker-registry.wikimedia.org/repos/data-engineering/eventgate-wikimedia:v1.14.0 -c "nodejs -v"
v20.5.1

says yes, but I guess it doesn't hurt to double check that we are all on the same page.

Yes, this ended up happening separately with T383814

Jul 15 2025, 7:56 AM · Event-Platform, Data-Engineering (Q1 FY25/26 July 1st - September 30th), service-utils

Jul 11 2025

akosiaris added a comment to T361768: Migrate and re-deploy eventgate using new service-utils.

This includes upgrade to Nodejs 20.

Jul 11 2025, 1:21 PM · Event-Platform, Data-Engineering (Q1 FY25/26 July 1st - September 30th), service-utils

Jul 8 2025

akosiaris closed T380958: httpb sometimes fails upon deployment with a HTTP 503 as Resolved.
Jul 8 2025, 2:22 PM · Release-Engineering-Team (Radar), Deployments, serviceops-deprecated, Wikimedia-production-error
akosiaris claimed T380958: httpb sometimes fails upon deployment with a HTTP 503.

No new reports, so I'll resolve. Feel free to reopen.

Jul 8 2025, 2:22 PM · Release-Engineering-Team (Radar), Deployments, serviceops-deprecated, Wikimedia-production-error

Jul 7 2025

akosiaris added a comment to T352956: Handling inbound IPIP traffic on low traffic LVS k8s based realservers.

All Kubernetes clusters are now configured to use MTU 1460. This will take some time (weeks) to fully propagate, as it requires a pod restart. Deployments, node maintenance, evictions and other events that end up restarting or rescheduling pods will trigger it. In a few weeks we should be in a position to look at the few remaining stragglers and manually restart those.

Jul 7 2025, 1:48 PM · ServiceOps new, Patch-For-Review, Prod-Kubernetes, Kubernetes, Traffic
akosiaris added a comment to T398433: lsw1-a8-codfw: fpc0 PFE Statistics received unknown trigger (type Semaphore, id 0).

wikikube workers repooled.

Jul 7 2025, 12:45 PM · serviceops-deprecated, Infrastructure-Foundations, netops
akosiaris added a project to T398433: lsw1-a8-codfw: fpc0 PFE Statistics received unknown trigger (type Semaphore, id 0): serviceops-deprecated.
Jul 7 2025, 12:04 PM · serviceops-deprecated, Infrastructure-Foundations, netops
akosiaris added a comment to T398433: lsw1-a8-codfw: fpc0 PFE Statistics received unknown trigger (type Semaphore, id 0).

Sweet, what about 12:00 UTC on Monday 7th?

Jul 7 2025, 12:03 PM · serviceops-deprecated, Infrastructure-Foundations, netops

Jun 30 2025

akosiaris added a comment to T380958: httpb sometimes fails upon deployment with a HTTP 503.

The 2 patches have been merged and will roll out with today's deployments. Hopefully we'll be able to resolve this task next week.

Jun 30 2025, 10:25 AM · Release-Engineering-Team (Radar), Deployments, serviceops-deprecated, Wikimedia-production-error

Jun 27 2025

akosiaris updated subscribers of T380544: Temporarily run more refreshLinks jobs on Commons.

I'll file a patch, though, to increase the maximum bucket.

Jun 27 2025, 11:16 AM · MW-Interfaces-Team, Commons, serviceops-deprecated, WMF-JobQueue
akosiaris added a comment to T380544: Temporarily run more refreshLinks jobs on Commons.

Okay, would there be a problem with running more refreshLinks jobs across all wikis? 😇

Jun 27 2025, 10:22 AM · MW-Interfaces-Team, Commons, serviceops-deprecated, WMF-JobQueue