elukey (Luca Toscano)
Site Reliability Engineer - Infrastructure Foundations

User Details

User Since
Jan 5 2016, 9:54 PM (535 w, 4 d)
Availability
Available
LDAP User
Unknown
MediaWiki User
LToscano (WMF) [ Global Accounts ]

Recent Activity

Fri, Apr 10

elukey changed the status of T392886: Revisit default Istio histogram buckets from Stalled to Open.

I filed https://gerrit.wikimedia.org/r/1269998 as a proposal for a conservative first step, to be applied to the ML clusters first (they produce a ton more time series than the others via Istio).

Fri, Apr 10, 10:41 AM · ServiceOps new, SRE Observability (FY2025/2026-Q1), Patch-For-Review, Observability-Metrics
elukey changed the status of T392886: Revisit default Istio histogram buckets, a subtask of T387350: liftwing SLO performance issues, from Stalled to Open.
Fri, Apr 10, 10:41 AM · SRE Observability (FY2024/2025-Q4), SRE-SLO, Observability-Metrics
elukey closed T414486: Upgrade AUX clusters to kubernetes 1.31, a subtask of T341984: Update Kubernetes clusters to 1.31, as Resolved.
Fri, Apr 10, 9:28 AM · Data-Platform-SRE (2026.01.05 - 2026.01.23), Epic, ServiceOps new, Patch-For-Review, collaboration-services, Kubernetes, Prod-Kubernetes
elukey closed T414486: Upgrade AUX clusters to kubernetes 1.31 as Resolved.
Fri, Apr 10, 9:28 AM · ServiceOps new, Infrastructure-Foundations, Kubernetes, Prod-Kubernetes

Thu, Apr 9

elukey added a comment to T421903: Investigate enabling gRPC in LiftWing model servers.

Hey! Adding a few notes/thoughts:

Thu, Apr 9, 3:18 PM · Machine-Learning-Team (Q4 FY2025-26), Lift-Wing
elukey added a comment to T414486: Upgrade AUX clusters to kubernetes 1.31.

The only weird thing that happened is that the Istio Gateway and the Jaeger gRPC collector failed to talk to each other via gRPC due to TLS validation failures on both sides. I "fixed" it via the following hot-fix in Jaeger's Destination Rule:

Thu, Apr 9, 2:29 PM · ServiceOps new, Infrastructure-Foundations, Kubernetes, Prod-Kubernetes
elukey added a comment to T422819: ProbeDown (os-reports.wikimedia.org).

Something to follow up on - os-reports seems to be a CNAME for the aux ingress rw endpoint, only available in eqiad. Should we move it to the -ro one, that is active / active?

Thu, Apr 9, 1:12 PM · collaboration-services
elukey added a comment to T414486: Upgrade AUX clusters to kubernetes 1.31.

After the upgrade:

Thu, Apr 9, 1:08 PM · ServiceOps new, Infrastructure-Foundations, Kubernetes, Prod-Kubernetes
elukey added a comment to T414486: Upgrade AUX clusters to kubernetes 1.31.

Before the aux-eqiad upgrade:

Thu, Apr 9, 12:44 PM · ServiceOps new, Infrastructure-Foundations, Kubernetes, Prod-Kubernetes
elukey added a comment to T422261: sre.hosts.reboot-single cookbook removes any and all downtimes after reboot.

@BCornwall is there a specific downtime that you have in mind for the LVS servers? That would give us more context. As Riccardo mentioned, the Icinga "API" is not great; any chance that the downtime could become an Alertmanager one?

Thu, Apr 9, 8:19 AM · Infrastructure-Foundations, Traffic

Wed, Apr 8

elukey added a comment to T420993: Rotate discovery intermediate certificate (expires 2026-05-03).

Since the debmonitor intermediate expires before the discovery one, I'd propose to:

Wed, Apr 8, 2:25 PM · ServiceOps new, Infrastructure-Foundations, Patch-For-Review
elukey added a comment to T420993: Rotate discovery intermediate certificate (expires 2026-05-03).

I tested in the kafka upgrade pontoon environment the following:

Wed, Apr 8, 1:59 PM · ServiceOps new, Infrastructure-Foundations, Patch-For-Review
elukey reopened T419722: Experiment with new kserve version on ml-staging-codfw as "Open".

There is another thing to test imho before proceeding with prod, namely https://github.com/kserve/kserve/pull/3316

Wed, Apr 8, 9:06 AM · Machine-Learning-Team (Q4 FY2025-26), Patch-For-Review
elukey added a comment to T414486: Upgrade AUX clusters to kubernetes 1.31.

Post upgrade:

Wed, Apr 8, 8:41 AM · ServiceOps new, Infrastructure-Foundations, Kubernetes, Prod-Kubernetes
elukey added a comment to T414486: Upgrade AUX clusters to kubernetes 1.31.

Before the aux-codfw upgrade:

Wed, Apr 8, 7:38 AM · ServiceOps new, Infrastructure-Foundations, Kubernetes, Prod-Kubernetes
elukey added a comment to T420223: High (relatively) number of memcached errors in eqiad.

@jijiki shall we deploy mcrouter 2023.07.17.00-2 and test the keep alive options? I have the feeling that the TKOs will go down after it.

Wed, Apr 8, 7:25 AM · Infrastructure-Foundations, ServiceOps new, ServiceOps-Datastores

Tue, Apr 7

elukey created T422509: Cloud init and unattended upgrades while bootstrapping Trixie VMs.
Tue, Apr 7, 2:46 PM · Patch-For-Review, Cloud-VPS, cloud-services-team
elukey added a comment to T420223: High (relatively) number of memcached errors in eqiad.

It is interesting how the TKOs started to reduce from around the third/fourth of April.

Tue, Apr 7, 2:22 PM · Infrastructure-Foundations, ServiceOps new, ServiceOps-Datastores

Fri, Apr 3

elukey added a comment to T420475: Rework linters and checkers of I/F Python repositories for automation.

Filed a couple of changes to introduce ruff to spicerack and rework how linting/testing/docs run. I got down to ~60s local time and ~3m CI time (with tox creating venvs and installing deps), and ~19s while running locally with venvs already installed.

Fri, Apr 3, 8:57 AM · Patch-For-Review, Infrastructure-Foundations
elukey added a comment to T390215: Logstash is overwhelmed.

The istio-system namespace is logging ~980 events/sec. Many are just istio-ingressgateway for authority:page-analytics.discovery.wmnet (~833 events/sec).

Fri, Apr 3, 8:00 AM · SRE Observability (FY2025/2026-Q1), Patch-For-Review, Observability-Logging

Thu, Apr 2

elukey added a comment to T420223: High (relatively) number of memcached errors in eqiad.

@cmooney I added some info in T420223#11753137, where I tested jitter seen by MTR on a worker in row A/B vs a worker in C/D: the former doesn't show it. I also tried on another couple of nodes, but I don't have anything definitive from a statistics point of view. I can collect more info if you want!

Thu, Apr 2, 8:55 AM · Infrastructure-Foundations, ServiceOps new, ServiceOps-Datastores
elukey added a comment to T420993: Rotate discovery intermediate certificate (expires 2026-05-03).

Forgive the drive-by comment, but would it be possible to provision a net-new intermediate and have the service owners migrate to that intermediate? Not sure if it would be easier, just thought I'd throw it out as a suggestion.

Thu, Apr 2, 8:42 AM · ServiceOps new, Infrastructure-Foundations, Patch-For-Review

Wed, Apr 1

elukey added a comment to T418929: Q4:rack/setup/install kafka-logging100[6-8].

The workaround in the last patch needs a spicerack change for ipmi, since we assume the root user:

Wed, Apr 1, 1:45 PM · Patch-For-Review, observability, SRE, ops-eqiad, DC-Ops

Tue, Mar 31

elukey updated subscribers of T414486: Upgrade AUX clusters to kubernetes 1.31.

Other services:

Tue, Mar 31, 2:08 PM · ServiceOps new, Infrastructure-Foundations, Kubernetes, Prod-Kubernetes
elukey updated subscribers of T414486: Upgrade AUX clusters to kubernetes 1.31.

Checked the services currently deployed (removed the ones that are not service-related):

Tue, Mar 31, 2:04 PM · ServiceOps new, Infrastructure-Foundations, Kubernetes, Prod-Kubernetes
elukey updated subscribers of T420993: Rotate discovery intermediate certificate (expires 2026-05-03).

High level, this is what I have in mind:

Tue, Mar 31, 1:16 PM · ServiceOps new, Infrastructure-Foundations, Patch-For-Review
elukey added a comment to T420993: Rotate discovery intermediate certificate (expires 2026-05-03).

The X509v3 Subject Key Identifier changes between old and new intermediate certs (and it makes sense, new private key), so the current leaf discovery certs are not compatible with the new intermediate. I need to verify if https://wikitech.wikimedia.org/wiki/PKI/CA_Operations#Renewing_a_new_intermediate assumes that puppet's cfssl class will request a new cert when the .chain.pem file changes. If so, the key rotation should be handled transparently by puppet runs and restarts, if not we need to figure out how to fix this :D
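
To illustrate why a new private key changes the Subject Key Identifier, here is a minimal sketch. It assumes the SKI is derived as the SHA-1 hash of the public key bits (one of the common methods described in RFC 5280); whether the CA in question uses exactly this derivation is an assumption. The byte strings below are placeholders standing in for real public keys, and the helper names are hypothetical.

```python
import hashlib


def subject_key_id(public_key_bytes: bytes) -> str:
    """SHA-1 of the public key bits, hex-encoded (assumed derivation)."""
    return hashlib.sha1(public_key_bytes).hexdigest()


def leaf_points_to(leaf_aki: str, intermediate_ski: str) -> bool:
    """A leaf's Authority Key Identifier is copied from its issuer's SKI."""
    return leaf_aki == intermediate_ski


# Placeholder key material (assumption, not real keys).
old_intermediate_key = b"old-intermediate-public-key"
new_intermediate_key = b"new-intermediate-public-key"

old_ski = subject_key_id(old_intermediate_key)
new_ski = subject_key_id(new_intermediate_key)

# A leaf issued by the old intermediate carries the old SKI as its AKI.
leaf_aki = old_ski

print(leaf_points_to(leaf_aki, old_ski))  # True
print(leaf_points_to(leaf_aki, new_ski))  # False
```

The identifier mismatch is only the visible symptom: the leaf's signature was produced with the old private key, so it would not verify against the new intermediate's public key either.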

Tue, Mar 31, 8:59 AM · ServiceOps new, Infrastructure-Foundations, Patch-For-Review

Mon, Mar 30

elukey added a comment to T420993: Rotate discovery intermediate certificate (expires 2026-05-03).

This is becoming a little more complicated than what the wikitech page suggests: https://wikitech.wikimedia.org/wiki/PKI/CA_Operations#Renewing_a_new_intermediate

Mon, Mar 30, 4:39 PM · ServiceOps new, Infrastructure-Foundations, Patch-For-Review
elukey added a comment to T421386: cadvisor-reported Istio network usage is way too high.

In Prod we are working on reducing the buckets in T392886, now that we run a relatively recent version of Istio.

Mon, Mar 30, 4:04 PM · Toolforge, cloud-services-team (FY2025/2026-Q3-Q4)
elukey triaged T421348: Add tox-uv support to the tox-v{3,4} Docker images as Medium priority.
Mon, Mar 30, 2:52 PM · Release-Engineering-Team, Infrastructure-Foundations
elukey added a comment to T420993: Rotate discovery intermediate certificate (expires 2026-05-03).

Just to confirm:

Mon, Mar 30, 2:00 PM · ServiceOps new, Infrastructure-Foundations, Patch-For-Review
elukey added a comment to T421348: Add tox-uv support to the tox-v{3,4} Docker images.

@hashar Hi! Any thoughts about this? Could we work on it during the next quarter as Releng/SRE collaboration?

Mon, Mar 30, 1:51 PM · Release-Engineering-Team, Infrastructure-Foundations
elukey triaged T420993: Rotate discovery intermediate certificate (expires 2026-05-03) as High priority.
Mon, Mar 30, 1:50 PM · ServiceOps new, Infrastructure-Foundations, Patch-For-Review
elukey added a comment to T421461: Add 'iommu=pt' kernel parameter on MI300x nodes for direct GPU-to-GPU communication (PCIe P2P).

I did some reading and my understanding is that with iommu=pt the kernel offers no protection against unauthorized access to memory that doesn't belong to a process, so in theory an attacker could:

  • compromise a model server running on a mi300x-enabled host (via vLLM etc...)
  • exploit a bug in the amdgpu kernel driver and basically get full access to memory on the host (completely compromised)
Mon, Mar 30, 7:42 AM · Machine-Learning-Team (Q4 FY2025-26), OKR-Work

Fri, Mar 27

elukey added a comment to T418929: Q4:rack/setup/install kafka-logging100[6-8].

I got another issue after my patch, namely the root user creation (in the BMC) returns a plain HTTP 400. I tried this from the spicerack shell:

Fri, Mar 27, 2:51 PM · Patch-For-Review, observability, SRE, ops-eqiad, DC-Ops
elukey added a comment to T420223: High (relatively) number of memcached errors in eqiad.

@cmooney yes Effie depooled it IIRC!

Fri, Mar 27, 2:02 PM · Infrastructure-Foundations, ServiceOps new, ServiceOps-Datastores
elukey closed T416674: Test and upgrade Kafka clusters to Openjdk 17, a subtask of T416669: Upgrade Kafka to version 3.x, as Declined.
Fri, Mar 27, 11:20 AM · ServiceOps-Datastores, ServiceOps new, Infrastructure-Foundations, SRE
elukey closed T416674: Test and upgrade Kafka clusters to Openjdk 17 as Declined.
Fri, Mar 27, 11:20 AM · Infrastructure-Foundations, SRE
elukey closed T417035: Create a cookbook to execute Kafka rolling upgrades, a subtask of T416669: Upgrade Kafka to version 3.x, as Resolved.
Fri, Mar 27, 11:20 AM · ServiceOps-Datastores, ServiceOps new, Infrastructure-Foundations, SRE
elukey closed T417035: Create a cookbook to execute Kafka rolling upgrades as Resolved.
Fri, Mar 27, 11:20 AM · Infrastructure-Foundations, SRE
elukey closed T421226: Kartotherian dashboard links don't work as Resolved.

Updated the links, they now work :)

Fri, Mar 27, 9:10 AM · SRE, Maps, Sustainability (Incident Followup)
elukey renamed T419722: Experiment with new kserve version on ml-staging-codfw from Experiment with new kserve version on stagin to Experiment with new kserve version on ml-staging-codfw.
Fri, Mar 27, 8:51 AM · Machine-Learning-Team (Q4 FY2025-26), Patch-For-Review
elukey added a comment to T419722: Experiment with new kserve version on ml-staging-codfw.

@DPogorzelski-WMF as you prefer, but the difference between istio/kserve and knative is not really huge and I would personally upgrade first rather than keeping the old version. Also remember to check the new version of the knative chart too.

Fri, Mar 27, 7:59 AM · Machine-Learning-Team (Q4 FY2025-26), Patch-For-Review

Thu, Mar 26

elukey closed T421350: The maps1012 postgres replica is broken as Resolved.

Repooled after maintenance. Doc added to https://wikitech.wikimedia.org/wiki/Maps/v2/Common_tasks#Fix_a_Broken_postgres_replica

Thu, Mar 26, 3:44 PM · Content-Transform-Team, Maps, Infrastructure-Foundations
elukey added a comment to T420223: High (relatively) number of memcached errors in eqiad.

These nodes should be depooled to test if TKOs decrease considerably for a stable amount of time:

Thu, Mar 26, 2:06 PM · Infrastructure-Foundations, ServiceOps new, ServiceOps-Datastores
elukey added a comment to T420223: High (relatively) number of memcached errors in eqiad.

I retried the above experiment with an eqiad memcached shard:

Thu, Mar 26, 11:45 AM · Infrastructure-Foundations, ServiceOps new, ServiceOps-Datastores
elukey added a comment to T421348: Add tox-uv support to the tox-v{3,4} Docker images.

My understanding was that tox-uv needed to be selected in tox.ini, but after a chat with Riccardo I realized I was wrong, it is picked up automatically. We may want to create a new image for tox-v4-uv at this point, and opt-in from integration config?

Thu, Mar 26, 11:06 AM · Release-Engineering-Team, Infrastructure-Foundations
elukey created T421350: The maps1012 postgres replica is broken.
Thu, Mar 26, 10:59 AM · Content-Transform-Team, Maps, Infrastructure-Foundations
elukey created T421348: Add tox-uv support to the tox-v{3,4} Docker images.
Thu, Mar 26, 10:54 AM · Release-Engineering-Team, Infrastructure-Foundations
elukey added a comment to T419722: Experiment with new kserve version on ml-staging-codfw.

@DPogorzelski-WMF my understanding from reading upstream's commits is that they bump the k8s version every now and then, and they set it to the version that was tested in their testing env:

Thu, Mar 26, 10:37 AM · Machine-Learning-Team (Q4 FY2025-26), Patch-For-Review
elukey added a comment to T420223: High (relatively) number of memcached errors in eqiad.

The top mcrouter pod for TKOs is mcrouter-main-d7czx, running on wikikube-worker1070.eqiad.wmnet with IP 10.67.223.126.

Thu, Mar 26, 10:23 AM · Infrastructure-Foundations, ServiceOps new, ServiceOps-Datastores
elukey added a comment to T420223: High (relatively) number of memcached errors in eqiad.

I reworked a little the graph that shows TKOs registered by mcrouter pods, and I added a column to sort by total occurrences. The first 14 pods have their k8s workers in the following racks:

Thu, Mar 26, 9:53 AM · Infrastructure-Foundations, ServiceOps new, ServiceOps-Datastores
elukey added a comment to T420223: High (relatively) number of memcached errors in eqiad.

It is very interesting what happened directly after the depool of codfw for the MW Switchover: the total number of mcrouter rps in eqiad jumped to 1.3M/s (!!) and the TKOs rose more or less proportionally, peaking now at 7-8k rps. We are still at around 0.5% of requests ending up in TKO, so I am more and more convinced that connection recycling may be causing this issue. Why the same doesn't happen in codfw is not clear to me yet, but my proposal is to test the settings that I mentioned above as the next step:

Thu, Mar 26, 9:28 AM · Infrastructure-Foundations, ServiceOps new, ServiceOps-Datastores

Wed, Mar 25

elukey added a comment to T420039: In the function-evaluator and function-orchestrator, when we emit a HTTP 504, ensure we log this for follow-up.

@ecarg I am trying to get the new 504 logs via https://logstash.wikimedia.org/goto/82a1feda83be4ad00e9c24b95268c329, is it what we are expecting? I don't have a lot of context, so I cannot judge whether there is a clear reason why the 504 happened, so my question is: are the logs now actionable? For example, from those would you be able to understand why the 504 happened? I am asking since we'll need to get to the root cause of those 504s, and these logs should hopefully tell us how. If that is not the case: is there anything that we can do to improve them to give us more info?

Wed, Mar 25, 4:56 PM · Abstract Wikipedia team (26Q3 (Jan–Mar)), Essential-Work, function-orchestrator, function-evaluator
elukey closed T358189: aux-k8s cluster prometheus setup is incomplete, a subtask of T321211: distributed tracing v1: tech debt blockers, as Resolved.
Wed, Mar 25, 3:26 PM · Observability-Tracing, Epic
elukey closed T358189: aux-k8s cluster prometheus setup is incomplete as Resolved.

Should be good now!

Wed, Mar 25, 3:26 PM · Infrastructure-Foundations, Observability-Tracing
elukey added a comment to T415067: Wire up Abstract Wiki's second SLO indicator: Integration combined latency-availability.

@Jdforrester-WMF Before closing let's verify your assumption, ideally we should have some way of checking what's happening in dashboards or similar. I filed https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1260731, IIUC you are saying that Wikifunctions being slow and downsized may play a role, or is it just the MediaWiki part that you are concerned with? Or both? :D

Wed, Mar 25, 3:22 PM · Abstract Wikipedia team (26Q3 (Jan–Mar)), Essential-Work
elukey added a comment to T420475: Rework linters and checkers of I/F Python repositories for automation.

After a chat with Riccardo I learned that prospector is used in wmflib only for pyroma and vulture, and the latter is disabled for cookbooks, so now I better understand Federico's intent.

Wed, Mar 25, 3:03 PM · Patch-For-Review, Infrastructure-Foundations
elukey added a comment to T420475: Rework linters and checkers of I/F Python repositories for automation.

I tried to compare what we did with wmflib with Federico's proposal, to try to summarize a possible standard approach. The differences are:

Wed, Mar 25, 2:47 PM · Patch-For-Review, Infrastructure-Foundations
elukey closed T393053: Q4:rack/setup/install aux-k8s-worker100[6-9].eqiad.wmnet as Resolved.
Wed, Mar 25, 9:10 AM · Patch-For-Review, SRE, ops-eqiad, DC-Ops

Tue, Mar 24

elukey added a comment to T419734: RfC: Use of gRPC as Lambda interface for linked artifact caching.

I'm not sure I understand the comment about needing an http replica, but frankly, what stops you from running your "service" (which is just a lambda) with a sidecar of the linked artifact cache to provide the http interface?

Tue, Mar 24, 10:27 AM · User-Eevans, Data-Persistence
elukey triaged T420978: Move the Docker Registry's /ml prefix to S3/apus as Medium priority.
Tue, Mar 24, 10:10 AM · Ceph, SRE-swift-storage, Infrastructure-Foundations, Machine-Learning-Team
elukey added a comment to T393053: Q4:rack/setup/install aux-k8s-worker100[6-9].eqiad.wmnet.

To keep archives happy - I used the following workaround in provisioning and it worked:

Tue, Mar 24, 10:06 AM · Patch-For-Review, SRE, ops-eqiad, DC-Ops

Mon, Mar 23

elukey added a subtask for T390251: docker-registry.wikimedia.org keeps serving bad blobs: T420978: Move the Docker Registry's /ml prefix to S3/apus.
Mon, Mar 23, 5:44 PM · ServiceOps new
elukey added a parent task for T420978: Move the Docker Registry's /ml prefix to S3/apus: T390251: docker-registry.wikimedia.org keeps serving bad blobs.
Mon, Mar 23, 5:44 PM · Ceph, SRE-swift-storage, Infrastructure-Foundations, Machine-Learning-Team
elukey created T420978: Move the Docker Registry's /ml prefix to S3/apus.
Mon, Mar 23, 5:44 PM · Ceph, SRE-swift-storage, Infrastructure-Foundations, Machine-Learning-Team
elukey added a comment to T414216: Q3:rack/setup/install dse-k8s-worker10[20-23].

@Jclark-ctr all hosts provisioned! The new cookbook is not merged, but I thought to unblock you :)

Mon, Mar 23, 4:10 PM · Data-Platform-SRE (2026-03-06 - 2026-03-27), SRE, ops-eqiad, DC-Ops
elukey added a comment to T420223: High (relatively) number of memcached errors in eqiad.

Something interesting while checking mcrouter stats (the full list, not the ones exposed via the exporter):

Mon, Mar 23, 3:39 PM · Infrastructure-Foundations, ServiceOps new, ServiceOps-Datastores
elukey triaged T420439: Migrate AUX k8s apiserver and services to IPIP as Medium priority.
Mon, Mar 23, 2:54 PM · Infrastructure-Foundations, Prod-Kubernetes, Kubernetes, Liberica, Traffic
elukey reopened T393053: Q4:rack/setup/install aux-k8s-worker100[6-9].eqiad.wmnet as "Open".

Re-opening this one since something weird happens when running provisioning:

Mon, Mar 23, 11:34 AM · Patch-For-Review, SRE, ops-eqiad, DC-Ops
elukey added a comment to T420223: High (relatively) number of memcached errors in eqiad.

The gutter pool in eqiad matches the December pattern, due to tkos: dashboard

Mon, Mar 23, 9:41 AM · Infrastructure-Foundations, ServiceOps new, ServiceOps-Datastores
elukey added a comment to T420223: High (relatively) number of memcached errors in eqiad.

I've slightly reworked https://grafana-rw.wikimedia.org/d/ltSHWhHIk/mw-mcrouter to use irate() everywhere rather than a mix of increase and irate, to make panels easier to compare. In eqiad we roughly handle 400k rps and we have a background of 1 or 2k requests/s TKO-ed, that is around 0.5%. I think that this is a great example of a use case where an SLO would be perfect to understand the urgency of this task, since we are talking about a 99.5+ SLO target for availability. I am not suggesting that we not work on this task, but its priority may need to vary since the impact of the problem is not big at the moment.
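
The ratio quoted above can be checked with a few lines of Python. This is only a sketch: the 400,000 rps total is an assumption (it is the total implied by 2k TKO-ed requests/s being about 0.5%), the 99.5% target is the hypothetical SLO mentioned in the comment, and the function names are illustrative.

```python
def error_ratio(failed_rps: float, total_rps: float) -> float:
    """Fraction of requests that failed (here: ended up in TKO)."""
    return failed_rps / total_rps


def error_budget(slo_target: float) -> float:
    """Fraction of requests allowed to fail under a given SLO target."""
    return 1.0 - slo_target


total_rps = 400_000  # assumed total mcrouter traffic in eqiad
tko_rps = 2_000      # upper end of the "1 or 2k requests/s" background

ratio = error_ratio(tko_rps, total_rps)
budget = error_budget(0.995)  # hypothetical 99.5% availability target

print(f"TKO ratio:       {ratio:.2%}")
print(f"Error budget:    {budget:.2%}")
print(f"Budget consumed: {ratio / budget:.0%}")
```

With these numbers the background TKO rate sits right at the edge of a 99.5% target, which is the kind of signal an SLO would turn into an explicit priority.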

Mon, Mar 23, 9:25 AM · Infrastructure-Foundations, ServiceOps new, ServiceOps-Datastores

Fri, Mar 20

elukey added a comment to T420223: High (relatively) number of memcached errors in eqiad.
Fri, Mar 20, 10:59 AM · Infrastructure-Foundations, ServiceOps new, ServiceOps-Datastores
elukey added a comment to T420223: High (relatively) number of memcached errors in eqiad.

@cmooney one thing worth looking at is whether we added QOS or similar changes over the network in response to the traffic attacks that happened in December. I found T412785 but the timeline doesn't match 100% the timings provided by the metrics (started around 16th Dec ~13:50 more or less). mcrouter is configured to SET certain keys to both eqiad and codfw, waiting for both before returning success. We also have a 250ms timeout for cross-region calls like mcrouter in eqiad reaching codfw, so anything that slows down may trigger some timeout errors. There is nothing that lines up in puppet or sal that I can see, or even k8s deployments.

Fri, Mar 20, 9:21 AM · Infrastructure-Foundations, ServiceOps new, ServiceOps-Datastores

Thu, Mar 19

elukey added a comment to T420223: High (relatively) number of memcached errors in eqiad.

I thought the same, but if you compute 1661 / 22 you get ~75 errors per hour, which is not ideal but not a ton either. Ideally it shouldn't be there, but I do see some TCP RST sent periodically by the memcached shards to the mcrouter pods (via tcpdump), so that may correlate.

Thu, Mar 19, 3:15 PM · Infrastructure-Foundations, ServiceOps new, ServiceOps-Datastores
elukey added a comment to T420223: High (relatively) number of memcached errors in eqiad.

After some back and forth with Cathal on IRC, we didn't find anything that could point to accept() being problematic.

Thu, Mar 19, 10:52 AM · Infrastructure-Foundations, ServiceOps new, ServiceOps-Datastores
elukey added a comment to T420223: High (relatively) number of memcached errors in eqiad.

In general this worries me a little:

Thu, Mar 19, 9:48 AM · Infrastructure-Foundations, ServiceOps new, ServiceOps-Datastores
elukey added a comment to T420223: High (relatively) number of memcached errors in eqiad.

@jijiki I checked the thread, and I think it is a separate issue. Yesterday I found occurrences of accept4(): Resource temporarily unavailable, meanwhile the thread mentions people stracing memcached and seeing read() EAGAIN. The former means that memcached wasn't able to create a new socket via accept, the latter that no data was read (that is expected in async code). This is why I am worried - accept() not being able to work is usually a sign of distress of the network stack, there is probably a bottleneck somewhere.

Thu, Mar 19, 9:18 AM · Infrastructure-Foundations, ServiceOps new, ServiceOps-Datastores
elukey added a comment to T420507: MI300 machines need startup tweaks.

Worth remembering: amd-smi is currently relying on a manual fix in the Python files due to https://github.com/ROCm/amdsmi/pull/136, since we were waiting for the rocm 7.2.0 release (afaics it should contain the fix). Without it any reimage would bring up the wrong version of the libs, and partitioning becomes impossible.

Thu, Mar 19, 8:17 AM · Machine-Learning-Team, Patch-For-Review, Essential-Work

Wed, Mar 18

elukey created T420475: Rework linters and checkers of I/F Python repositories for automation.
Wed, Mar 18, 2:49 PM · Patch-For-Review, Infrastructure-Foundations
elukey added a comment to T420223: High (relatively) number of memcached errors in eqiad.

@cmooney Check my update above, there are a couple of IPs that we could use to verify if anything is weird in their path. It is difficult from the mcrouter logs to understand what IP is related to a timeout, but I do see RSTs when trying to connect from eqiad to codfw via TLS. Memcached also shows a high number of failed TLS transactions, we don't have it as metric so it may be something unrelated, but it smells like something problematic.

Wed, Mar 18, 2:21 PM · Infrastructure-Foundations, ServiceOps new, ServiceOps-Datastores
elukey added a comment to T420223: High (relatively) number of memcached errors in eqiad.
sudo lsof -p 553977 | wc -l
4842
Wed, Mar 18, 10:42 AM · Infrastructure-Foundations, ServiceOps new, ServiceOps-Datastores
elukey added a comment to T420223: High (relatively) number of memcached errors in eqiad.

I tried to isolate a connection between a single mcrouter pod in eqiad (10.67.189.81 on wikikube-worker1270.eqiad.wmnet) and mc2039 (10.192.0.22) in codfw:

Wed, Mar 18, 10:28 AM · Infrastructure-Foundations, ServiceOps new, ServiceOps-Datastores

Tue, Mar 17

elukey added a comment to T420223: High (relatively) number of memcached errors in eqiad.

Saw the task passing by and got nerd-sniped :)

Tue, Mar 17, 5:09 PM · Infrastructure-Foundations, ServiceOps new, ServiceOps-Datastores
elukey added a comment to T414216: Q3:rack/setup/install dse-k8s-worker10[20-23].

@Jclark-ctr I provisioned dse-k8s-worker1020 with an experimental provisioning cookbook, when you have a moment could you please tell me if everything looks good?

Tue, Mar 17, 4:53 PM · Data-Platform-SRE (2026-03-06 - 2026-03-27), SRE, ops-eqiad, DC-Ops

Mon, Mar 16

elukey closed T420034: deployment-kafka-logging01 is down for maintenance because Trixie is not yet well supported as Resolved.
Mon, Mar 16, 4:58 PM · Beta-Cluster-Infrastructure
elukey closed T420083: Create Java 21 security config in puppet, a subtask of T420034: deployment-kafka-logging01 is down for maintenance because Trixie is not yet well supported, as Resolved.
Mon, Mar 16, 4:27 PM · Beta-Cluster-Infrastructure
elukey closed T420083: Create Java 21 security config in puppet as Resolved.
Mon, Mar 16, 4:27 PM · Infrastructure-Foundations
elukey added a comment to T420034: deployment-kafka-logging01 is down for maintenance because Trixie is not yet well supported.

Kafka 3.7 is running on deployment-kafka-logging01 with Debian Trixie, first one of its kind!

Mon, Mar 16, 4:08 PM · Beta-Cluster-Infrastructure
elukey added a comment to T414216: Q3:rack/setup/install dse-k8s-worker10[20-23].

This requires more work, since those models are X13 of a very new generation that no longer accept BIOS updates to /redfish/v1/Systems/1/Bios, which has become a read-only API. The new endpoint is /redfish/v1/Systems/1/Bios/SD and it uses a series of keys that are different from the previous endpoint. We'll have to add everything to the provision cookbook, like we did when we introduced the new dells with idrac10. It will likely take one week :(
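
As a rough illustration of the endpoint switch described above, the sketch below builds (but does not send) a Redfish PATCH request against either path. The two paths come from the comment; the helper name, the BMC hostname, the attribute name and the `{"Attributes": ...}` body shape (taken from the generic Redfish Bios schema) are assumptions, not what the provision cookbook actually does.

```python
import json
import urllib.request

# Paths quoted in the comment above.
LEGACY_BIOS_PATH = "/redfish/v1/Systems/1/Bios"
SETTINGS_BIOS_PATH = "/redfish/v1/Systems/1/Bios/SD"


def bios_update_request(bmc_host, attributes, bios_is_read_only):
    """Build a PATCH request for BIOS attributes (hypothetical helper).

    bios_is_read_only: True for newer models where the Bios resource
    no longer accepts updates and the /SD endpoint must be used instead.
    """
    path = SETTINGS_BIOS_PATH if bios_is_read_only else LEGACY_BIOS_PATH
    body = json.dumps({"Attributes": attributes}).encode("utf-8")
    return urllib.request.Request(
        url=f"https://{bmc_host}{path}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="PATCH",
    )


# Hypothetical host and attribute name, for illustration only.
req = bios_update_request(
    "bmc.example.org", {"HypotheticalSetting": "Enabled"}, bios_is_read_only=True
)
print(req.get_method(), req.full_url)
```

Since the keys differ between the two endpoints, a real implementation would also need a per-generation mapping of attribute names, which is not attempted here.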

Mon, Mar 16, 3:12 PM · Data-Platform-SRE (2026-03-06 - 2026-03-27), SRE, ops-eqiad, DC-Ops
elukey triaged T419967: Add --min-uptime to cookbooks as Low priority.
Mon, Mar 16, 2:22 PM · SRE-tools, serviceops-radar, Infrastructure-Foundations
elukey triaged T420083: Create Java 21 security config in puppet as Medium priority.
Mon, Mar 16, 2:21 PM · Infrastructure-Foundations
elukey closed T419040: kserve helm status is broken across ml clusters as Resolved.
root@deploy2002:/srv/deployment-charts/helmfile.d/admin_ng# kube-env admin ml-serve-codfw
root@deploy2002:/srv/deployment-charts/helmfile.d/admin_ng# helm3 -n kserve history kserve
REVISION  UPDATED                   STATUS      CHART         APP VERSION  DESCRIPTION
1         Mon Feb 23 17:00:58 2026  superseded  kserve-0.2.9  0.11.2       Install complete
2         Tue Mar  3 08:31:38 2026  superseded  kserve-0.2.9  0.11.2       Upgrade "kserve" failed: cannot patch "inferenceservices.serving.kserve.io" with kind CustomResourceDefinition: CustomResourceDefinition.apiextensions.k8s.io "inferenceservices.serving.kserve.io" is invalid: spec.conversion.webhookClientConfig.caBundle: Invalid value: []byte{0xa}: unable to load root certificates: unable to parse bytes as PEM block
3         Tue Mar  3 08:31:43 2026  failed      kserve-0.2.9  0.11.2       Rollback "kserve" failed: cannot patch "inferenceservices.serving.kserve.io" with kind CustomResourceDefinition: CustomResourceDefinition.apiextensions.k8s.io "inferenceservices.serving.kserve.io" is invalid: spec.conversion.webhookClientConfig.caBundle: Invalid value: []byte{0xa}: unable to load root certificates: unable to parse bytes as PEM block
4         Mon Mar 16 09:51:44 2026  deployed    kserve-0.3.0  0.11.2       Upgrade complete
Mon, Mar 16, 9:52 AM · Patch-For-Review, Machine-Learning-Team
elukey added a comment to T282545: Sensible updates of java.security properties.

I remember that Traffic suggested the use of hardened TLS settings to allow the use case of mTLS when pushing webrequest data to Kafka Jumbo, and we probably wanted to apply it everywhere if mTLS was needed elsewhere (or maybe I misremember).

Mon, Mar 16, 9:49 AM · Infrastructure Security, User-MoritzMuehlenhoff, SRE
elukey added a comment to T419040: kserve helm status is broken across ml clusters.

Deployed on ml-serve-eqiad:

Mon, Mar 16, 9:35 AM · Patch-For-Review, Machine-Learning-Team
elukey added a comment to T418160: wikifunctions-backend-combined-v1 SLI error budget has been rapidly dropping over Feb 2026.

I followed up again today and created https://wikitech.wikimedia.org/wiki/User:LToscano_(WMF)/AbstractWikipedia#Dashboards to collect useful debugging tips and dashboard links in one place. The Istio logs seem to indicate that the Orchestrator returns HTTP 504s because its proxied cluster, the evaluators, return 504 as well (no indication of timeouts happening etc..).

Mon, Mar 16, 9:31 AM · Essential-Work, ServiceOps new, Abstract Wikipedia team, SRE-SLO
elukey added a comment to T420039: In the function-evaluator and function-orchestrator, when we emit a HTTP 504, ensure we log this for follow-up.

Logstash filters for:

Mon, Mar 16, 9:28 AM · Abstract Wikipedia team (26Q3 (Jan–Mar)), Essential-Work, function-orchestrator, function-evaluator
elukey added a comment to T282545: Sensible updates of java.security properties.

Very interesting, from the link it seems that we could create an override file by simply adding something like -Djava.security.properties=/etc/sysconfig/jvm.java.security to the target service's JAVA_OPTS. The only downside that I see is that we'll need to explicitly inject those options in every specific Java deployment, otherwise we'll get the standard security settings shipped with Debian. Maybe it is fine, I am not sure what our policy is (do we need to apply the custom security settings to all JVM deployments or just to, say, CAS/Kafka/etc.?)
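
A minimal sketch of what "injecting the option" could look like for a single service, assuming JAVA_OPTS is a plain shell-style string. The helper name is hypothetical and the path is the example one from the comment above; how each deployment actually assembles its JVM flags is not covered.

```python
import shlex

SECURITY_FLAG = "-Djava.security.properties="


def with_security_override(java_opts: str, override_path: str) -> str:
    """Append the java.security override flag unless one is already set."""
    opts = shlex.split(java_opts)
    if any(opt.startswith(SECURITY_FLAG) for opt in opts):
        # Respect an existing override rather than adding a second one.
        return java_opts
    return shlex.join(opts + [SECURITY_FLAG + override_path])


print(with_security_override("-Xmx1g", "/etc/sysconfig/jvm.java.security"))
```

With a single `=` the JVM layers the file on top of its default java.security, provided the default file keeps security.overridePropertiesFile=true; a double `==` replaces the defaults entirely.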

Mon, Mar 16, 8:32 AM · Infrastructure Security, User-MoritzMuehlenhoff, SRE

Sat, Mar 14

elukey added a comment to T420041: db1253 depooled following host crash.

Nothing in racadm getsal, nothing in mariadb's journalctl and dmesg, so I am inclined to mark this as a hardware-related stall (when I tried to ssh via serial console the console com2 was a blank screen).

Sat, Mar 14, 6:10 PM · DBA
elukey updated subscribers of T420034: deployment-kafka-logging01 is down for maintenance because Trixie is not yet well supported.

Next steps:

  • Add the new java-21 security config to puppet - T420083
  • Roll out @taavi's patches to fix facts in kafka classes to be ready for Puppet 8.
  • Fix remaining issues with Kafka on Trixie/Java 21
Sat, Mar 14, 9:00 AM · Beta-Cluster-Infrastructure