
akosiaris (Alexandros Kosiaris)
Site Reliability Engineer · Administrator

User Details

User Since
Oct 3 2014, 8:40 AM (579 w, 5 d)
Roles
Administrator
Availability
Available
IRC Nick
akosiaris
LDAP User
Alexandros Kosiaris
MediaWiki User
AKosiaris (WMF) [ Global Accounts ]

Recent Activity

Yesterday

akosiaris added a comment to T385404: Deploy Lilypond 2.24 with cairo support to shellbox containers.

@akosiaris Do you have any suggestions for getting this task un-stuck?

Tue, Nov 11, 9:23 AM · serviceops, Upstream, Wikimedia-SVG-rendering, MediaWiki-extensions-Score

Tue, Nov 4

akosiaris added a comment to T352956: Handling inbound IPIP traffic on low traffic LVS k8s based realservers.

Hi @akosiaris: Following up on this after a discussion during Traffic's planning with @Vgutierrez, and on behalf of the team.

We were curious to know when you would be able to take this, with the understanding that things are busy and we don't expect it to happen immediately. From Traffic's end, we have decided to triage this for Q3 for the rollout and not Q2, given that it is a short quarter and we are unlikely to roll out such a big change affecting the core sites before December. (Liberica is already running on all PoPs.) So that's at least our position.

Does that seem fine to you and the planning for Serviceops? Do note that, as per @Vgutierrez's last comment, we already have a check in place in the migration cookbook, so you do not have to worry about that.

Thanks!

Tue, Nov 4, 3:38 PM · Patch-For-Review, Prod-Kubernetes, Kubernetes, serviceops, Traffic
akosiaris added a comment to T408704: offline rackspace wikitech-static, online aws wikitech-static.
Tue, Nov 4, 2:57 PM · SRE

Thu, Oct 23

akosiaris added a comment to P84268 YAMLdiff example.

And this is the yamldiff of the same upgrade

Thu, Oct 23, 9:15 AM
akosiaris created P84268 YAMLdiff example.
Thu, Oct 23, 9:14 AM

Wed, Oct 22

akosiaris added a comment to T407296: Toolforge on bare metal POC.

Does "we" include every SRE in the SRE department? Or is that only the "tools-infra" team? Or is it "tools-platform" + "tools-infra" teams?

Wed, Oct 22, 3:45 PM · Toolforge, cloud-services-team
akosiaris added a comment to T407296: Toolforge on bare metal POC.

In T407296#11274680, @Andrew wrote:
"After building a pilot bare-metal toolforge replacement, how hard was it, and does it seem like something that would be easy to maintain, or hard to maintain?"

Wed, Oct 22, 11:15 AM · Toolforge, cloud-services-team

Oct 1 2025

akosiaris edited projects for T401895: Block traffic to RESTBase /page/talk endpoint and sunset it, added: serviceops-radar; removed serviceops, Traffic.

@akosiaris per T392491#11167986 and message sent to Wikitech-l earlier Today, this is ready to go.

Oct 1 2025, 8:04 AM · serviceops-radar, Page Content Service
akosiaris updated the task description for T401895: Block traffic to RESTBase /page/talk endpoint and sunset it.
Oct 1 2025, 7:54 AM · serviceops-radar, Page Content Service

Sep 29 2025

akosiaris edited Description on service-utils.
Sep 29 2025, 1:23 PM

Sep 23 2025

akosiaris added a comment to T404726: [tools,infra,k8s] scale up the cluster, specifically CPU.
Sep 23 2025, 11:17 AM · Toolforge (Toolforge iteration 25)

Sep 22 2025

akosiaris added a comment to T404726: [tools,infra,k8s] scale up the cluster, specifically CPU.

Hi @JJMC89, thanks for your comments.

If you are going to change the CPU/memory defaults or what the CPU and memory values translate to in k8s requests/limits, please communicate it substantially in advance of deployment.

Can you elaborate on this? Specifically, how does it affect your workloads? (There are some changes that are more time-sensitive than others and might be needed without much advance notice, for example the CPU requests/limits defaults that are already hitting the cluster allocation space.)

I do have some jobs that are time sensitive, but my concern is maintainers not being available to adjust job resources before you deploy this.

Sep 22 2025, 6:12 PM · Toolforge (Toolforge iteration 25)
akosiaris added a comment to T404726: [tools,infra,k8s] scale up the cluster, specifically CPU.

An issue here is that a lot of our workloads are cronjobs, so cronjobs being unable to trigger means most of the users don't get their workloads running; I think that not being able to schedule new workloads is a critical enough situation to warrant a page (at least in the current situation, where we can expand the cluster at will to palliate it in the short term). We can discuss in the team meeting also.

In this scenario, are we envisioning a situation where cpu requests > allocatable space with or without the cronjobs? If with, then I argue that what we will be seeing is a delay in starting workloads, not an inability to have the workloads run. Which, again, isn't paging-worthy (but it definitely needs an alert, probably critical). If without, I have to ask what the scenarios are that would trigger this. Lost so many nodes that we are out of capacity? Somehow we scheduled so many non-transient tools that

I think there's some part of the phrase missing :)

Sep 22 2025, 5:58 PM · Toolforge (Toolforge iteration 25)
akosiaris added a comment to T404726: [tools,infra,k8s] scale up the cluster, specifically CPU.

Thanks @akosiaris, this is very helpful :)

So I propose then for limits/requests to do the following (see the sketch after this list):

  • If the user specifies --cpu/--memory, use that as both request and limit (as the expectation is to use that, no more, no less).
  • If not, then:
    • cpu/request: use a default that is the average of the cluster (currently ~7%, that'd be 70m; we can round to 100m for now)
    • cpu/limit: use a big number smaller than a node; we can do something like 4000m (4 cores, as the workers have 8 cores right now)
    • mem/request: use a small, but not too small, number; currently it's 256Mi (the limit is 512Mi, and we use half of it). Though we are using ~60% of the actual memory of the cluster and the requests are around 80% full, we could make it smaller, but I think it's kinda OK already.
    • mem/limit: this one is trickier, as we know that Toolforge workloads are usually very spiky here, with sudden memory usage and then quiet for a long period. Setting this to the same as the request means that we will never overcommit any node memory-wise, but also that most of the time the memory will not be used, and any user with bursty workloads will have to specify the maximum memory used manually. Currently we set it to 512Mi (double the request; or rather, the request is set to half the limit).
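
For illustration, the proposed defaults would translate to a container resources stanza roughly like the following (a sketch using standard Kubernetes fields; the concrete numbers are the proposal above, not deployed Toolforge configuration):

# Sketch only: the proposed defaults expressed as a Kubernetes
# container "resources" stanza, not actual Toolforge config.
resources:
  requests:
    cpu: 100m      # ~cluster average (~70m), rounded up
    memory: 256Mi  # small, but not too small
  limits:
    cpu: 4000m     # 4 cores, well under an 8-core worker
    memory: 512Mi  # double the request, allowing some burst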

Then for the alerts:

If the cpu requests are bigger than the allocatable space, then users will not be able to get their pods scheduled anywhere, so they will stop running -> page?

Most workloads will continue churning along; it's new ones that won't be able to be scheduled. I'd argue that is not a page. It's a degraded experience, sure, but a critical alert is good enough, IMHO.

An issue here is that a lot of our workloads are cronjobs, so cronjobs being unable to trigger means most of the users don't get their workloads running; I think that not being able to schedule new workloads is a critical enough situation to warrant a page (at least in the current situation, where we can expand the cluster at will to palliate it in the short term). We can discuss in the team meeting also.
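
For a sense of the kind of alert being debated, here is a minimal sketch of a Prometheus alerting rule using standard kube-state-metrics series; the rule name, threshold, and severity are hypothetical, not an existing Toolforge alert:

# Hypothetical rule: fire when summed CPU requests approach the
# cluster's allocatable CPU, i.e. new pods (including cronjob pods)
# may soon fail to schedule.
groups:
  - name: toolforge-capacity
    rules:
      - alert: CPURequestsNearAllocatable
        expr: |
          sum(kube_pod_container_resource_requests{resource="cpu"})
            / sum(kube_node_status_allocatable{resource="cpu"}) > 0.95
        for: 15m
        labels:
          severity: critical  # page vs. critical is the open question above
        annotations:
          summary: "CPU requests are close to the cluster's allocatable capacity"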

Sep 22 2025, 4:09 PM · Toolforge (Toolforge iteration 25)
akosiaris added a comment to T404291: Allow proxy server to accept another valid http header instead of 'HOST'.

I've already replied in T394982#11201112, but I find it improbable that SRE will implement such a behavior to accommodate the change in node.js's fetch() API. The HTTP Host header is pretty important across the infrastructure. Rewriting other HTTP headers into it might make debugging and reasoning more difficult than needed.

Sep 22 2025, 11:41 AM · Language and Product Localization, SRE, CXServer, envoy
akosiaris added a comment to T394982: Migrate cxserver in production to node22.

Node 22 stabilizes the fetch API. It is now feature-compatible with browsers' fetch API. This is generally good, but it also adds more restrictions on what a valid HTTP request can be. The header field we are setting to pass the Wikipedia domain to the wiki proxy is HOST (see the configuration). This is problematic because HOST is a forbidden header.

  1. https://developer.mozilla.org/en-US/docs/Glossary/Forbidden_request_header
  2. https://fetch.spec.whatwg.org/#forbidden-request-header

So Node's fetch API won't accept the HOST header. The wiki proxy will receive the request without the HOST header and will end up returning a 404 response.

Sep 22 2025, 11:38 AM · CXServer, LPL Projects (Other), LPL Essential (2025 Jul-Oct)

Sep 19 2025

akosiaris added a comment to T404742: Increase RAM and nginx tmpfs on docker registry hosts.

I still have the same concerns as voiced in T359067#9602091, but I also have to be pragmatic. I don't see us solving the bigger registry problems in the next 6 months

Furthermore, Dragonfly helps mitigate some of the concerns. Some more operational ones I have are:

  • Experience has shown that VMs with a lot of memory are more difficult to migrate around in Ganeti, some to the point of stalling. However, I'm thinking there shouldn't be any stalling here (we've seen that with VMs with high continuous memory churn, and this use case isn't one of those), just longer migration times.

Yeah, historically true. I'm hopeful (perhaps naively) that it's better nowadays:

  • 10Gbit is almost everywhere, as Moritz says
  • most Ganeti metal is 128GB RAM, so the schedulability of these large VMs is also easier
  • in addition to the lack of RAM churn given their use case, these VMs are also totally fine to shut down one at a time and cold-migrate
Sep 19 2025, 2:56 PM · Infrastructure-Foundations, serviceops
akosiaris added a comment to T404726: [tools,infra,k8s] scale up the cluster, specifically CPU.

Thanks for this writeup! A couple of inline replies:

Sep 19 2025, 2:40 PM · Toolforge (Toolforge iteration 25)

Sep 18 2025

akosiaris added a comment to T404742: Increase RAM and nginx tmpfs on docker registry hosts.

I still have the same concerns as voiced in T359067#9602091, but I also have to be pragmatic. I don't see us solving the bigger registry problems in the next 6 months

Sep 18 2025, 4:16 PM · Infrastructure-Foundations, serviceops
akosiaris added a watcher for Toolforge: akosiaris.
Sep 18 2025, 3:32 PM

Sep 16 2025

akosiaris changed the status of T401295: Decide how to use the new clouddb hosts (clouddb102[2-5]) from Open to Stalled.

Setting to stalled, while we figure out the exact details of this one.

Sep 16 2025, 2:01 PM · Data-Platform-SRE (2025.10.17 - 2025.11.07), cloud-services-team (FY2025/26-Q1), Data-Services, Data-Persistence

Sep 5 2025

akosiaris closed T390438: Frequent HTTP 503 errors from MediaWiki API every 1 or 2 minutes as Resolved.

No, the bot just hits the categorymembers API.

We already reopened this once. Maybe we wait a bit to monitor before closing?

Sep 5 2025, 3:55 PM · SRE, Wikimedia-production-error, MediaWiki-Action-API, MW-Interfaces-Team

Sep 4 2025

akosiaris added a comment to T394917: If we understand capacity planning, we can create an architectural plan how to prevent service outage/throttling, to be executed in subsequent quarters.

I've gone ahead and shaped up some older notes I had on Wikitech and posted a guide for capacity planning at https://wikitech.wikimedia.org/wiki/Kubernetes/Capacity_Planning_of_a_Service

Sep 4 2025, 5:23 PM · Abstract Wikipedia team (26Q1 (Jul–Sep)), Epic, Essential-Work, function-schemata, function-evaluator, function-orchestrator

Sep 3 2025

akosiaris added a comment to T398106: fix crashing service.

Mentioning this here as well as in T403094: Request to increase function-orchestrator memory to 10GiB. I've gone through the logstash entries and the related kernel logs and events, and there are only 2 instances where the kernel OOMKiller showed up and killed a container, and even in those cases it was the mcrouter container. I think the issues in this task are a combination of

Sep 3 2025, 6:18 PM · Essential-Work, Patch-For-Review, Abstract Wikipedia team (26Q1 (Jul–Sep)), function-orchestrator
akosiaris updated subscribers of T403094: Request to increase function-orchestrator memory to 10GiB.

The orchestrator is crashing very frequently now due to OOM

Sep 3 2025, 5:55 PM · Abstract Wikipedia team (26Q2 (Oct–Dec)), Essential-Work, function-orchestrator

Aug 28 2025

akosiaris updated the task description for T395451: Make the JobQueue compatible with the MediaWiki Single version HTTP routing system.
Aug 28 2025, 11:49 AM · ChangeProp, WMF-JobQueue, serviceops-radar, OKR-Work

Aug 21 2025

akosiaris added a comment to T399348: Wikifunctions function orchestrator and evaluator test suites failing on GitLab CI with OOM errors.

We could possibly discuss with releng if Digital Ocean runners should still be default or maybe the other way around, default to wmcs and people can opt-in to Digital Ocean runners if they want. (Given that we had a few reports where the suggested solution ended up being to switch to our own infra.)

The WMCS runners have less functionality than the DO runners (T397888, T396924), so we would probably want to at least make these differences well documented and warn everyone of CI failures to be on the lookout for if the default is changed.

Aug 21 2025, 12:58 PM · Abstract Wikipedia team (26Q2 (Oct–Dec)), GitLab (CI & Job Runners), Essential-Work, collaboration-services, Release-Engineering-Team, Patch-For-Review, function-orchestrator, function-evaluator

Aug 14 2025

akosiaris added a comment to T401269: Increase request size limit in backend.
Aug 14 2025, 2:04 PM · OKR-Work, Abstract Wikipedia team (26Q1 (Jul–Sep)), function-evaluator
akosiaris added a comment to T352956: Handling inbound IPIP traffic on low traffic LVS k8s based realservers.

Yup, scheduling it for the weeks of either August 11th or August 18th.

gentle ping, do you need something from my side?

Aug 14 2025, 1:48 PM · Patch-For-Review, Prod-Kubernetes, Kubernetes, serviceops, Traffic
akosiaris added a comment to T401833: Cannot deploy function-orchestrator in staging environment due to insufficient quotas.

Thanks @claime. Stupid typo on my side.

Aug 14 2025, 11:22 AM · Abstract Wikipedia team (26Q1 (Jul–Sep)), Essential-Work, serviceops
akosiaris added a comment to T401269: Increase request size limit in backend.

What does backend refer to here?

Aug 14 2025, 11:18 AM · OKR-Work, Abstract Wikipedia team (26Q1 (Jul–Sep)), function-evaluator
akosiaris added a comment to T390251: docker-registry.wikimedia.org keeps serving bad blobs.

@akosiaris looks like you had a lot of changes for making a new registry, is that registry ready? Or is it still in the testing phase?

Aug 14 2025, 8:45 AM · Patch-For-Review, serviceops
akosiaris added a comment to T401803: mwscript-k8s does not include an environment variable with the username of the executing user.

Re-reading my first response, I realize I might have been a bit unclear. Indeed, my focus was to respond to number 2, namely "should it contain the shell username of whoever made the wiki?", not to question sending the email in the first place. I agree that this should still be sent. It's a notification, as you say, not an auditing mechanism.

Aug 14 2025, 8:31 AM · serviceops, MW-on-K8s, MediaWiki-extensions-WikimediaMaintenance

Aug 13 2025

akosiaris added a comment to T401833: Cannot deploy function-orchestrator in staging environment due to insufficient quotas.

Patches deployed, you should be good to retry @cmassaro

Aug 13 2025, 3:30 PM · Abstract Wikipedia team (26Q1 (Jul–Sep)), Essential-Work, serviceops
akosiaris added a comment to T401833: Cannot deploy function-orchestrator in staging environment due to insufficient quotas.

Those are warnings (note the W prefix). They wouldn't stop the deployment from happening.

The actual reason is this https://logstash.wikimedia.org/app/discover#/doc/logstash-*/logstash-k8s-1-7.0.0-1-2025.08.13?id=8Q7Uo5gBgiE0yhV9mhEm

Pasting for convenience

(combined from similar events): Error creating: pods "function-orchestrator-main-orchestrator-df5fdb7c9-dmrg5" is forbidden: exceeded quota: quota-compute-resources, requested: limits.memory=4172Mi, used: limits.memory=7220Mi, limited: limits.memory=10Gi

Simply put, the namespace in staging isn't provisioned for pods that are this large. Resources in staging are scarce. We can probably bump the quotas up a bit, but probably just enough to allow this to run.
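
For context, the "quota-compute-resources" object in the error is a standard Kubernetes ResourceQuota; bumping it would mean editing something shaped roughly like this (a sketch; the namespace and values are illustrative, with 10Gi matching the "limited:" value in the log above):

# Sketch of the ResourceQuota the error message refers to.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: quota-compute-resources
  namespace: function-orchestrator  # illustrative namespace
spec:
  hard:
    limits.memory: 10Gi  # too small for a 4172Mi pod on top of 7220Mi already used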

Aha, we can pull the memory for staging back down.

Aug 13 2025, 3:30 PM · Abstract Wikipedia team (26Q1 (Jul–Sep)), Essential-Work, serviceops
akosiaris renamed T401833: Cannot deploy function-orchestrator in staging environment due to insufficient quotas from Cannot deploy function-orchestrator due to deprecated appArmor field to Cannot deploy function-orchestrator in staging environment due to insufficient quotas.
Aug 13 2025, 2:54 PM · Abstract Wikipedia team (26Q1 (Jul–Sep)), Essential-Work, serviceops
akosiaris added a comment to T401833: Cannot deploy function-orchestrator in staging environment due to insufficient quotas.

Those are warnings (note the W prefix). They wouldn't stop the deployment from happening.

Aug 13 2025, 2:49 PM · Abstract Wikipedia team (26Q1 (Jul–Sep)), Essential-Work, serviceops
akosiaris added a comment to T401803: mwscript-k8s does not include an environment variable with the username of the executing user.

BTW, on the technical side, mw-script does indeed keep the username in the labels of the job and the pod, e.g.
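
A minimal sketch of what that looks like on the Job object (the label key shown here is hypothetical; the comment above only states that the username is kept in the labels of the job and the pod):

# Illustrative only: the real label keys used by mwscript-k8s may differ.
apiVersion: batch/v1
kind: Job
metadata:
  labels:
    username: jdoe  # hypothetical key/value recording the executing user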

Aug 13 2025, 2:02 PM · serviceops, MW-on-K8s, MediaWiki-extensions-WikimediaMaintenance
akosiaris added a comment to T401803: mwscript-k8s does not include an environment variable with the username of the executing user.

This functionality was added 10 years ago in https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/extensions/WikimediaMaintenance/+/a17c2ef30e0e85ced460f304cf481cdb7d924486%5E%21

Aug 13 2025, 1:25 PM · serviceops, MW-on-K8s, MediaWiki-extensions-WikimediaMaintenance

Jul 18 2025

akosiaris added a comment to T352956: Handling inbound IPIP traffic on low traffic LVS k8s based realservers.

@akosiaris I think we could start considering enabling inbound IPIP traffic on the staging environment, deploying IPIP interfaces (assuming you'll be using the regular kernel networking stack and not some eBPF "magic") shouldn't affect the ability to handle non-encapsulated traffic.

As soon as IPIP encapsulated traffic is handled we can validate that it's working as expected without impacting the traffic coming from load balancers, we used sre.loadbalancer.migrate-service-ipip cookbook to perform this validation for T373020: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/cookbooks/+/refs/heads/master/cookbooks/sre/loadbalancer/migrate-service-ipip.py#182 and we could use a similar one here

Jul 18 2025, 9:08 AM · Patch-For-Review, Prod-Kubernetes, Kubernetes, serviceops, Traffic

Jul 17 2025

akosiaris added a comment to T399681: MediaWiki periodic job update-special-pages-s5 failed.

Hmm… what's interesting about this task & the other two most-recent MediaWiki-Special-pages cron-job-failure tasks (T396454, T396977) is that all three of these failures have been due to (what seem like) database connection errors.

Jul 17 2025, 1:42 PM · serviceops, Wikimedia-production-error, MediaWiki-Special-pages
akosiaris added a comment to T396977: MediaWiki periodic job update-special-pages-s8 failed.

This was probably due to https://sal.toolforge.org/log/huohd5cB8tZ8Ohr03499

Jul 17 2025, 1:38 PM · Wikimedia-production-error, MediaWiki-Special-pages
akosiaris added a comment to T399681: MediaWiki periodic job update-special-pages-s5 failed.

There you go.

Jul 17 2025, 1:15 PM · serviceops, Wikimedia-production-error, MediaWiki-Special-pages

Jul 16 2025

akosiaris added a comment to T390087: eqiad: VMs requested for Data Persistence automation and testbeds.

We don't have strict requirements around the intra DC availability zones.

Jul 16 2025, 3:06 PM · Infrastructure-Foundations, vm-requests
akosiaris updated the task description for T390087: eqiad: VMs requested for Data Persistence automation and testbeds.
Jul 16 2025, 2:59 PM · Infrastructure-Foundations, vm-requests
akosiaris renamed T380807: Provide a dedicated for Abstract Wikipedia Rust image from Have SRE provide a production-ready Rust image upstream to Provide a dedicated for Abstract Wikipedia Rust image.
Jul 16 2025, 1:55 PM · Abstract Wikipedia team, Essential-Work, serviceops, function-evaluator
akosiaris updated subscribers of T390087: eqiad: VMs requested for Data Persistence automation and testbeds.

Thanks for tagging me in this one. This is more Infrastructure-Foundations territory these days, so I am adding the relevant people as well for their information.

Jul 16 2025, 1:53 PM · Infrastructure-Foundations, vm-requests

Jul 15 2025

akosiaris added a comment to T361768: Migrate and re-deploy eventgate using new service-utils.

This includes upgrade to Nodejs 20.

Hi! Has this happened? Looking at the images currently deployed per deployments-charts repo


$ podman run --rm -it --entrypoint /bin/sh docker-registry.wikimedia.org/repos/data-engineering/eventgate-wikimedia:v1.11.0 -c "nodejs -v"
v20.5.1

and

$ podman run --rm -it --entrypoint /bin/sh docker-registry.wikimedia.org/repos/data-engineering/eventgate-wikimedia:v1.14.0 -c "nodejs -v"
v20.5.1

says yes, but I guess it doesn't hurt to double check that we are all on the same page.

Yes, this ended up happening separately with T383814

Jul 15 2025, 7:56 AM · Event-Platform, Data-Engineering (Q1 FY25/26 July 1st - September 30th), service-utils

Jul 11 2025

akosiaris added a comment to T361768: Migrate and re-deploy eventgate using new service-utils.

This includes upgrade to Nodejs 20.

Jul 11 2025, 1:21 PM · Event-Platform, Data-Engineering (Q1 FY25/26 July 1st - September 30th), service-utils

Jul 8 2025

akosiaris closed T380958: httpb sometimes fails upon deployment with a HTTP 503 as Resolved.
Jul 8 2025, 2:22 PM · Release-Engineering-Team (Radar), Deployments, serviceops, Wikimedia-production-error
akosiaris claimed T380958: httpb sometimes fails upon deployment with a HTTP 503.

No new reports; I'll resolve. Feel free to reopen.

Jul 8 2025, 2:22 PM · Release-Engineering-Team (Radar), Deployments, serviceops, Wikimedia-production-error

Jul 7 2025

akosiaris added a comment to T352956: Handling inbound IPIP traffic on low traffic LVS k8s based realservers.

All kubernetes clusters are now configured to use MTU 1460. This will take some time (weeks) to fully propagate, as it requires a pod restart. Deployments, node maintenance, evictions, and other events that end up restarting or rescheduling pods will trigger it. In a few weeks we should be in a position to look at the few leftover pods and manually restart those.

Jul 7 2025, 1:48 PM · Patch-For-Review, Prod-Kubernetes, Kubernetes, serviceops, Traffic
akosiaris added a comment to T398433: lsw1-a8-codfw: fpc0 PFE Statistics received unknown trigger (type Semaphore, id 0).

wikikube workers repooled.

Jul 7 2025, 12:45 PM · serviceops, Infrastructure-Foundations, netops
akosiaris added a project to T398433: lsw1-a8-codfw: fpc0 PFE Statistics received unknown trigger (type Semaphore, id 0): serviceops.
Jul 7 2025, 12:04 PM · serviceops, Infrastructure-Foundations, netops
akosiaris added a comment to T398433: lsw1-a8-codfw: fpc0 PFE Statistics received unknown trigger (type Semaphore, id 0).

Sweet, what about 12:00 UTC on Monday the 7th?

Jul 7 2025, 12:03 PM · serviceops, Infrastructure-Foundations, netops

Jun 30 2025

akosiaris added a comment to T380958: httpb sometimes fails upon deployment with a HTTP 503.

The 2 patches have been merged and will ride out with today's deployments. Hopefully we'll be able to successfully resolve this task next week.

Jun 30 2025, 10:25 AM · Release-Engineering-Team (Radar), Deployments, serviceops, Wikimedia-production-error

Jun 27 2025

akosiaris updated subscribers of T380544: Temporarily run more refreshLinks jobs on Commons.

I'll file a patch, though, to increase the maximum bucket.

Jun 27 2025, 11:16 AM · MW-Interfaces-Team, Commons, serviceops, WMF-JobQueue
akosiaris added a comment to T380544: Temporarily run more refreshLinks jobs on Commons.

Okay, would there be a problem with running more refreshLinks jobs across all wikis? 😇

Jun 27 2025, 10:22 AM · MW-Interfaces-Team, Commons, serviceops, WMF-JobQueue

Jun 26 2025

akosiaris added a comment to T380958: httpb sometimes fails upon deployment with a HTTP 503.

I have https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1164269/2 and https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1164270/2 lined up for deployment on Monday; they should resolve what is being witnessed by deployers.

Jun 26 2025, 4:51 PM · Release-Engineering-Team (Radar), Deployments, serviceops, Wikimedia-production-error

Jun 24 2025

akosiaris added a comment to T397653: Incorporate new arm64 host in our tooling.

Response from the Redfish API, using my Gerrit change above to print it out:

Jun 24 2025, 1:38 PM · Patch-For-Review, ARM support, Infrastructure-Foundations, serviceops
akosiaris added a project to T397653: Incorporate new arm64 host in our tooling: ARM support.
Jun 24 2025, 10:27 AM · Patch-For-Review, ARM support, Infrastructure-Foundations, serviceops
akosiaris added a comment to T380544: Temporarily run more refreshLinks jobs on Commons.

I guess the most likely explanation is that the processing of the updates for MariaDB and OpenSearch happens separately, and the MariaDB updates to the templatelinks table are progressing rather more slowly. Between T380544#10346771 (79.1M) and T380544#10924765 (51.5M), almost seven months passed; based on this, we can extrapolate that the remaining 51.5M rows would take another year or so to process. (Or, if we assume the CirrusSearch updates are complete and the 16.9M pages there still have genuine uses of Template:SDC statement has value, then there are 34.6M rows left which would take between 8 and 9 months at the current rate.) That’s not as bad as the “ten years” estimate, and certainly not bad enough for High priority, but I still feel like a somewhat higher rate of these jobs wouldn’t hurt.

Jun 24 2025, 9:16 AM · MW-Interfaces-Team, Commons, serviceops, WMF-JobQueue

Jun 23 2025

akosiaris added a comment to T320811: Adoption of aarch64 (aka arm64) in WMF production? (SRE Summit 2022 Session).

Our first arm64 server just got racked. We'll need to figure out how to incorporate it into our tooling (see T397653), but we are finally moving.

Jun 23 2025, 5:40 PM · Infrastructure-Foundations, serviceops, ARM support, SRE
akosiaris created T397653: Incorporate new arm64 host in our tooling.
Jun 23 2025, 5:38 PM · Patch-For-Review, ARM support, Infrastructure-Foundations, serviceops
akosiaris added a project to T393015: Q4:rack/setup/install build2003.codfw.wmnet: Infrastructure-Foundations.

This server isn't getting a clean provisioning run:

Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/spicerack/_menu.py", line 265, in _run
  raw_ret = runner.run()
            ^^^^^^^^^^^^
File "/srv/deployment/spicerack/cookbooks/sre/hosts/provision.py", line 294, in run
  self._config_host()
File "/srv/deployment/spicerack/cookbooks/sre/hosts/provision.py", line 433, in _config_host
  bios_attributes = self._get_bios_settings()
                    ^^^^^^^^^^^^^^^^^^^^^^^^^
File "/srv/deployment/spicerack/cookbooks/sre/hosts/provision.py", line 369, in _get_bios_settings
  return bios_settings["Attributes"]
         ~~~~~~~~~~~~~^^^^^^^^^^^^^^
KeyError: 'Attributes'
Jun 23 2025, 5:24 PM · Infrastructure-Foundations, Patch-For-Review, SRE, serviceops, ops-codfw, DC-Ops
akosiaris closed T397341: Wikifunctions orchestrator service in staging k8s cannot make network calls, gets getaddrinfo EAI_AGAIN / no healthy upstream as Resolved.

Copy-pasting from the commit message of the change just merged and deployed

Jun 23 2025, 2:14 PM · Wikimedia-production-error, serviceops, Essential-Work, Abstract Wikipedia team
akosiaris added a comment to T380958: httpb sometimes fails upon deployment with a HTTP 503.

Judging from the lack of comments in the last 2 weeks and repeated tests by yours truly, there is a chance the approach worked on the mwdebug hosts (which are slated for removal anyway), so I am moving forward this week with mwdebug on k8s as well.

Jun 23 2025, 1:37 PM · Release-Engineering-Team (Radar), Deployments, serviceops, Wikimedia-production-error

Jun 19 2025

akosiaris created T397415: aux-k8s-worker UEFI enabling cookbook failure with HTTP 400 when talking to redfish API.
Jun 19 2025, 8:38 AM · Infrastructure-Foundations

Jun 18 2025

akosiaris closed T380544: Temporarily run more refreshLinks jobs on Commons, a subtask of T343131: Commons database is growing way too fast, as Resolved.
Jun 18 2025, 2:52 PM · MW-1.44-notes (1.44.0-wmf.3; 2024-11-12), Patch-For-Review, MW-1.42-notes (1.42.0-wmf.23; 2024-03-19), MediaWiki-Platform-Team (Radar), Data-Persistence (work done), Commons
akosiaris closed T380544: Temporarily run more refreshLinks jobs on Commons as Resolved.

Overall, I’m inclined to say that by now it’s safe to just close this task and assume the job queue is doing its job reasonably well.

Jun 18 2025, 2:51 PM · MW-Interfaces-Team, Commons, serviceops, WMF-JobQueue

Jun 16 2025

akosiaris added a comment to T393557: Block external traffic to RESTBase /page/data-parsoid endpoint and investigate internal usage.

Note: I actually do not know how these are generated, so it's plausible that this is expected as long as the endpoint still exists, even if it's returning 403 - but https://en.wikipedia.org/api/rest_v1/#/Page%20content is still documenting the existence of the endpoint.

Jun 16 2025, 1:52 PM · Essential-Work, Content-Transform-Team (Work In Progress), serviceops, Traffic, RESTBase, RESTBase Sunsetting

Jun 13 2025

akosiaris added a comment to T394476: Onboard the Docker Registry to apus.

Thanks, this helped. After manually creating the registry-restricted bucket with s3cmd (docker-registry will return a 503 if it doesn't exist), and with some temporary hacks to prove this works, I managed to partially push an image.

Jun 13 2025, 2:53 PM · Ceph, SRE-swift-storage, Data-Persistence, serviceops
akosiaris created P77953 s3cmd of the test registry.
Jun 13 2025, 2:42 PM

Jun 11 2025

akosiaris added a comment to T393557: Block external traffic to RESTBase /page/data-parsoid endpoint and investigate internal usage.

@akosiaris this is ready for traffic blocking. Please let us know if that makes sense or if you need any other information.

Jun 11 2025, 1:09 PM · Essential-Work, Content-Transform-Team (Work In Progress), serviceops, Traffic, RESTBase, RESTBase Sunsetting
akosiaris added a comment to T352956: Handling inbound IPIP traffic on low traffic LVS k8s based realservers.

I've gone ahead and switched all of aux-k8s to MTU 1460. This time around, I went for a more hands-off approach, namely:

Jun 11 2025, 7:33 AM · Patch-For-Review, Prod-Kubernetes, Kubernetes, serviceops, Traffic

Jun 10 2025

akosiaris added a comment to T352956: Handling inbound IPIP traffic on low traffic LVS k8s based realservers.

@akosiaris a quick question about this:

meaning that ICMP traffic to e.g. coredns gets dropped

In terms of pmtud, that means that if coredns sends large UDP packets - which get dropped elsewhere - it won't get the ICMP "packet too big" messages back. But that is not really a worry. The CoreDNS pods have a lower MTU than pretty much everything on the network; they are not going to send packets that are too large for anything else.

Jun 10 2025, 3:43 PM · Patch-For-Review, Prod-Kubernetes, Kubernetes, serviceops, Traffic
akosiaris added a comment to T396074: Unable to deploy wikifunctions services in production: Helm timeout for prod push of memcached access.

Tried deploying this, see https://sal.toolforge.org/log/drexRZcBvg159pQr5-r8 and logs have the following

Jun 10 2025, 8:15 AM · Essential-Work, Abstract Wikipedia team, serviceops, function-orchestrator
akosiaris added a comment to T396033: Unable to deploy wikifunctions services in production: Pool wf-codfw has no failover servers list, route /local/wf.

Indeed, I commented in the wrong task. My mistake, thanks.

Jun 10 2025, 8:14 AM · function-evaluator, function-orchestrator, Abstract Wikipedia team, Essential-Work, serviceops, Wikifunctions
akosiaris added a comment to T380958: httpb sometimes fails upon deployment with a HTTP 503.

I've run a battery of tests over the previous days against mwdebug2001 and mwdebug2002, with 2 different configurations of the retry_on policy: mwdebug2001 had connect-failure and mwdebug2002 had 5xx (the latter includes the former, for what it's worth). The tests were just invocations of the command in T380958#10887212. mwdebug2001 had the occasional failed transaction in a number of those tests; mwdebug2002 has consistently not returned an error. Thus https://gerrit.wikimedia.org/r/1155117, which I just merged. Let it soak for the week. Should this work OK, it needs to be backported to the configuration of mwdebug on k8s as well.
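
For reference, the two test configurations differ only in the retry_on value of the Envoy route retry policy; schematically (a sketch of the relevant stanza, not the actual deployment-charts values):

# Sketch of an Envoy route retry policy. retry_on "5xx" subsumes
# "connect-failure": it retries on connect errors and on 5xx responses.
retry_policy:
  retry_on: "5xx"   # mwdebug2002; mwdebug2001 used "connect-failure"
  num_retries: 1    # illustrative value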

Jun 10 2025, 8:09 AM · Release-Engineering-Team (Radar), Deployments, serviceops, Wikimedia-production-error

Jun 6 2025

akosiaris added a comment to T396033: Unable to deploy wikifunctions services in production: Pool wf-codfw has no failover servers list, route /local/wf.

Logs have the following

E20250606 14:46:30.010773     1 Server-inl.h:593] mcrouter error (router name '11213', flavor 'unknown', service 'mcrouter'): Failed to configure, initial error 'Failed to reconfigure: Unknown RouteHandle: KeyModifyRoute line: 81', from backup 'Failed to reconfigure: Unknown RouteHandle: KeyModifyRoute line: 81'
Jun 6 2025, 2:59 PM · function-evaluator, function-orchestrator, Abstract Wikipedia team, Essential-Work, serviceops, Wikifunctions
akosiaris added a comment to T395920: Add a section to the SLO template that explains SLO windows, and Pyrra's dashboards and alerts.

Thanks for this! I've also landed a round of updates today in https://wikitech.wikimedia.org/w/index.php?title=SLO/Template_instructions/Dashboards_and_alerts&diff=prev&oldid=2309464.

Jun 6 2025, 1:13 PM · SRE-SLO
akosiaris changed the status of T393173: Publish Wikimedia trixie base Docker image from Open to Stalled.

Switching to stalled while waiting for the release.

Jun 6 2025, 11:37 AM · serviceops

Jun 5 2025

akosiaris added a comment to T380958: httpb sometimes fails upon deployment with a HTTP 503.

Close to 100k requests later to the Main_Page of spcom.wikimedia.org, 1 error. This is just below, and very close to, 99.999%, btw. Siege rounds up to 100%, but actual availability is 99.998%, which is very very very (did I say very enough?) good.

Jun 5 2025, 2:38 PM · Release-Engineering-Team (Radar), Deployments, serviceops, Wikimedia-production-error
akosiaris added a comment to T380958: httpb sometimes fails upon deployment with a HTTP 503.

I'll be hammering mwdebug2001 with siege for the next 30 minutes in an effort to reproduce this. The command is run from deploy1003 and it's

Jun 5 2025, 10:51 AM · Release-Engineering-Team (Radar), Deployments, serviceops, Wikimedia-production-error
akosiaris added a comment to T380958: httpb sometimes fails upon deployment with a HTTP 503.

encountered this deploying today - https://spiderpig.wikimedia.org/jobs/154

Screenshot 2025-06-04 at 2.37.28 PM.png (592×2 px, 184 KB)

not sure if it was a blip in the matrix or related, but retrying the testserver checks seemed to resolve it

Jun 5 2025, 10:37 AM · Release-Engineering-Team (Radar), Deployments, serviceops, Wikimedia-production-error

Jun 4 2025

akosiaris updated subscribers of T380958: httpb sometimes fails upon deployment with a HTTP 503.

This is surfacing once every couple of days or so, at least per Logstash, which counts 23 instances in the last 2 months. I was looking at one that @Marostegui hit today.

Jun 4 2025, 8:30 AM · Release-Engineering-Team (Radar), Deployments, serviceops, Wikimedia-production-error

May 28 2025

akosiaris added a comment to T390753: Work out why function-orchestrator's and function-evaluator's OTel telemetry isn't showing up.

Thanks @ecarg. I'll defer to @mszabo for the review; he is currently definitely better equipped than I am to review it.

May 28 2025, 2:03 PM · OKR-Work, Abstract Wikipedia team (25Q4 (Apr–Jun)), function-orchestrator, function-evaluator
akosiaris created T395451: Make the JobQueue compatible with the MediaWiki Single version HTTP routing system.
May 28 2025, 1:56 PM · ChangeProp, WMF-JobQueue, serviceops-radar, OKR-Work

May 26 2025

akosiaris edited projects for T357122: linkrecommendation-internal regularly uses more than 95% of its memory limit, added: serviceops-radar; removed serviceops.

In the interest of stopping the non-actioned-upon alerts, I've gone ahead and excluded linkrecommendation from this alert. The patch is https://gerrit.wikimedia.org/r/c/operations/alerts/+/1150726. I've also moved it from serviceops to serviceops-radar. Feel free to undo when the time comes that someone works on it.

May 26 2025, 4:13 PM · serviceops-radar, Observability-Tracing, Patch-For-Review, Growth-Team, Add-Link-Structured-Task, Prod-Kubernetes, Kubernetes
akosiaris added a comment to T390517: Remove recommendation-api from the REST API offerings.

Per https://w.wiki/DeQh, we are well within our estimations of the daily amount of requests still reaching this service. It's in the order of ~1900, and we had estimated ~2k.

May 26 2025, 3:37 PM · API Platform (RESTBase Deprecation Roadmap), serviceops
akosiaris added a comment to T391333: Revisit default envoy histogram buckets.

I've spent a good deal of time today doing what I assumed would be easy, that is, performing the above in our ingressgateways. I have failed so far. I'll need to revisit this with a fresh mind, because right now I even have doubts that the ISTIO_METAJSON_STATS trick works.

@akosiaris I added some thoughts to T392886, lemme know what you find. My understanding is that there is a new annotation that we could use in recent versions of Istio (1.24 has it, but it cannot easily be backported to our version afaics). I am very open to new roads to check!

May 26 2025, 2:37 PM · Patch-For-Review, envoy, serviceops, SRE Observability (FY2024/2025-Q4), Observability-Metrics
akosiaris added a comment to T394476: Onboard the Docker Registry to apus.

If you want to do some testing, I could set you up with a test account on apus.

May 26 2025, 9:43 AM · Ceph, SRE-swift-storage, Data-Persistence, serviceops
akosiaris added a comment to T394476: Onboard the Docker Registry to apus.

1 single bucket, at least at the beginning. Reading https://distribution.github.io/distribution/about/configuration/, I don't think the software can use more than 1 bucket anyway. We could in the future, assuming the PoC ends up being successful, discuss splitting strategies to have >1 bucket. But even in that case, since that requires running an extra instance of the software (per my current understanding at least), we are limited to a small number, probably in the single digits.
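
Concretely, the storage section of the registry's configuration takes a single bucket key, roughly like this (a sketch based on the distribution docs linked above; the endpoint is illustrative, and the bucket name is the one from T394476):

# Sketch of the docker/distribution s3 storage config: note the single
# "bucket" key, hence one bucket per registry instance.
storage:
  s3:
    region: default                       # radosgw doesn't use real AWS regions
    regionendpoint: https://apus.example  # illustrative endpoint
    bucket: registry-restricted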

May 26 2025, 9:42 AM · Ceph, SRE-swift-storage, Data-Persistence, serviceops

May 23 2025

akosiaris added a comment to T391333: Revisit default envoy histogram buckets.

I've spent a good deal of time today doing what I assumed would be easy, that is, performing the above in our ingressgateways. I have failed so far. I'll need to revisit this with a fresh mind, because right now I even have doubts that the ISTIO_METAJSON_STATS trick works.

May 23 2025, 2:05 PM · Patch-For-Review, envoy, serviceops, SRE Observability (FY2024/2025-Q4), Observability-Metrics
akosiaris added a comment to T352956: Handling inbound IPIP traffic on low traffic LVS k8s based realservers.

After having to deal a bit with a staging-eqiad calico upgrade yesterday, I did find 1 thing that will break. This is a bit complex:

May 23 2025, 8:47 AM · Patch-For-Review, Prod-Kubernetes, Kubernetes, serviceops, Traffic

May 22 2025

akosiaris added a comment to T352956: Handling inbound IPIP traffic on low traffic LVS k8s based realservers.

I did run the simple one

May 22 2025, 3:25 PM · Patch-For-Review, Prod-Kubernetes, Kubernetes, serviceops, Traffic

May 21 2025

akosiaris added a comment to T352956: Handling inbound IPIP traffic on low traffic LVS k8s based realservers.

staging-eqiad is on MTU 1460 as well.

May 21 2025, 9:08 AM · Patch-For-Review, Prod-Kubernetes, Kubernetes, serviceops, Traffic
akosiaris added a comment to T352956: Handling inbound IPIP traffic on low traffic LVS k8s based realservers.
Long-term

We probably want to minimize our diff from upstream manifests in order to allow easier upgrades in the future. We can gradually move away from managing CNI in puppet now that upstream has support for this, making our lives easier in the long term.

May 21 2025, 7:45 AM · Patch-For-Review, Prod-Kubernetes, Kubernetes, serviceops, Traffic

May 20 2025

akosiaris added a comment to T352956: Handling inbound IPIP traffic on low traffic LVS k8s based realservers.
nobody@wmfdebug:/$ ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: eth0@if69: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1460 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether ae:16:b8:c3:6d:75 brd ff:ff:ff:ff:ff:ff link-netnsid 0
May 20 2025, 1:59 PM · Patch-For-Review, Prod-Kubernetes, Kubernetes, serviceops, Traffic
akosiaris added a comment to T352956: Handling inbound IPIP traffic on low traffic LVS k8s based realservers.

Unfortunately the 2 patches above didn't work. For ml-staging-codfw, simply because it's still locked to 0.2.10, by virtue of helmfile.d/admin_ng/values/common.yaml. It did not work for staging-codfw either because, while the upstream manifests do indeed have support, that support is implemented by having the CNI config managed by calico, a part we did not import in https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1112058, sticking with CALICO_MANAGE_CNI: false instead. We manage this via puppet. It should be noted that our puppet implementation does allow differentiating per cluster, same as the chart approach; there is no real functional difference between the two.
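
Schematically, the distinction is whether the calico container gets CALICO_MANAGE_CNI set to "true" (chart-managed CNI config, as upstream now supports) or "false" (CNI config left to puppet). A sketch of the kind of values override involved, with the surrounding structure illustrative rather than the exact chart schema:

# Illustrative values snippet: with CALICO_MANAGE_CNI "false", calico
# does not write the CNI config and puppet keeps managing it.
calico:
  env:
    CALICO_MANAGE_CNI: "false"  # a per-cluster override could flip this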

May 20 2025, 12:56 PM · Patch-For-Review, Prod-Kubernetes, Kubernetes, serviceops, Traffic