
elukey (Luca Toscano)
Site Reliability Engineer - Infrastructure Foundations

User Details

User Since
Jan 5 2016, 9:54 PM (517 w, 3 d)
Availability
Available
LDAP User
Unknown
MediaWiki User
LToscano (WMF) [ Global Accounts ]

Recent Activity

Yesterday

elukey closed T411857: Staging kartotherian fails with "upstream request timeout" as Resolved.

Perfect :)

Fri, Dec 5, 2:38 PM · serviceops, Maps
elukey added a comment to T390251: docker-registry.wikimedia.org keeps serving bad blobs.

@akosiaris do you think that the idea of forming a dedicated working group for the next couple of quarters could be feasible? I can take care of kicking it off and finding volunteers (sounds like me and Scott are already in :D).

Fri, Dec 5, 2:16 PM · Patch-For-Review, serviceops
elukey added a comment to T411857: Staging kartotherian fails with "upstream request timeout".
elukey@deploy2002:~$ curl -k "https://10.64.70.93:6543/img/osm-intl,12,31.807,34.673,355x390.png?lang=he&domain=he.wikipedia.org&title=%D7%9E%D7%92%D7%93%D7%9C%D7%99_K&revid=40604675&groups=_8afac0e44a6b6826ba8fab43e1d72cab6b566ebd"
Warning: Binary output can mess up your terminal. Use "--output -" to tell 
Warning: curl to output it to your terminal anyway, or consider "--output 
Warning: <FILE>" to save to a file.
Fri, Dec 5, 1:43 PM · serviceops, Maps
elukey added a comment to T411857: Staging kartotherian fails with "upstream request timeout".

I see a lot of {"level":"error","timestamp":"2025-12-05T09:35:50.078Z","message":"error marshalling tile: ERROR: permission denied for table planet_osm_polygon_landuse_gen_z6 in tegola's logs, we probably have a db password problem, lemme check.

Fri, Dec 5, 9:58 AM · serviceops, Maps
elukey added a comment to T411774: Requesting a new group allowing shell access to kafka-jumbo servers - with membership for JavierMonton.

Hi @elukey!
We don't need this often to be honest, maybe it's more about being able to help the DP SRE team with small tasks rather than giving them more work, but I totally understand the concerns.

In this case I was working on a simple ticket to increase partitions on a topic we use in Kafka Jumbo.
Some time ago I wanted to help with topicmapr but the SRE team is on it already.

I usually work on our "Event Platform" tickets and I'm not sure if we'll need similar operations often. I can imagine small tasks about changing configs, like retention, compression, partitions... I already have a way of connecting to the Kafka Jumbo to explore topics, consumer groups, messages and so on, but altering a topic is denied.
At some point we'd like to work on a way of configuring topics too.

Maybe adding some ACLs to the cluster could be another approach? It won't require ssh access to the brokers. I could explore that if it helps.

Fri, Dec 5, 9:55 AM · Data-Platform-SRE, Infrastructure-Foundations, Patch-For-Review

Thu, Dec 4

elukey added a comment to T390251: docker-registry.wikimedia.org keeps serving bad blobs.

My understanding from the last updates is that we are not actively pushing anything to the new registry with apus as backend; we only tested it quickly some months ago. Is that right?

Thu, Dec 4, 4:28 PM · Patch-For-Review, serviceops
elukey added a comment to T406392: failed to push docker-registry.discovery.wmnet/repos/data-engineering/airflow-dags:airflow-2.10.5-py3.11-2025-10-03-192132-3003d4328df66a0086a350fdd2ba1dbd80a235c5: unknown: blob upload invalid.

It's a long road to migrating the registry from Swift to apus Ceph as the long-term solution for T390251, and even that is currently only focused on the /restricted prefix of the namespace (though that could change).

Thu, Dec 4, 4:25 PM · serviceops, GitLab (CI & Job Runners)
elukey added a comment to T411774: Requesting a new group allowing shell access to kafka-jumbo servers - with membership for JavierMonton.

@JMonton-WMF Hi! I have used the kafka tools like topic mapper in the past and if not handled correctly (like throttling etc..) they can have nasty side effects, like causing bandwidth usage problems. If there is any maintenance in progress from SRE and those tools are used, we can have other unwanted side effects too.

Thu, Dec 4, 3:24 PM · Data-Platform-SRE, Infrastructure-Foundations, Patch-For-Review
elukey added a comment to T403697: Experiment with amd-smi and the new AMD GPUs MI300x.

ml-serve1013 has been added as a k8s worker with the necessary taints to prevent regular pods from running on it by accident.

Thu, Dec 4, 2:40 PM · Machine-Learning-Team
elukey added a comment to T408884: Kartographer map labels for place names with ZWNJ character (U+200C) are rendered as white rectangular boxes.

@Pikne wow good to know thanks! I am planning to create an endpoint for testing (likely maps-staging.wikimedia.org) that the community will be able to use to test changes like this one before they hit production. So I hope that in the future we'll be able to use the staging stack to carefully review a change before it hits real traffic (fingers crossed). I'll keep you in the loop when/if my patch is scheduled on staging.

Thu, Dec 4, 10:29 AM · Essential-Work, Patch-For-Review, Content-Transform-Team, Maps (Kartotherian)

Wed, Dec 3

elukey added a comment to T345627: Seperate SLO for requests made from Citoid Extension, possible wmf deployed extension only, vs bots etc. .

@Mvolz ahhh ok thanks for the explanation! I rechecked the graph and it shows a neat recovery towards 0%, it is still in the negative but I am confident that it will steadily progress towards a healthy/green error budget as the time passes. Let's keep it open and re-eval before the holiday break!

Wed, Dec 3, 5:08 PM · Editing-team (Tracking), SRE-SLO, VisualEditor, Citoid
elukey added a comment to T410835: ErrorBudgetBurn.

In the Pyrra dashboard I see a big hole during November, including the 23rd when the alert fired:

Wed, Dec 3, 3:54 PM · Test Kitchen (Experiment Platform Sprint 16)

Tue, Dec 2

elukey added a comment to T403599: Setup & experiments for MI300x GPUs used for LiftWing.

mmm but is allocatable something that varies dynamically? Probably not, if so everything seems working fine. Or am I missing anything?

Tue, Dec 2, 3:47 PM · Machine-Learning-Team
elukey added a comment to T403599: Setup & experiments for MI300x GPUs used for LiftWing.

I checked the Allocated resources for ml-serve1009, where we run the revise-tone-task pod on a GPU, and I see the following:

Tue, Dec 2, 3:45 PM · Machine-Learning-Team
elukey added a comment to T410075: Discovery of Cassandra cluster nodes.

@Eevans I am reasoning out loud, so no need to be sorry, thanks for the follow ups :) The external service should be basically a sort of software LB for hosts outside k8s, so you can contact them from k8s simply starting from a common name. I've read a bit of code and discovered stuff like https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/953283, my worry was that we had to duplicate the cassandra instance hostnames in the egress rules and in the initial hostnames list (for discovery), but it seems that we have a workaround in place.

Tue, Dec 2, 3:34 PM · Data-Persistence, SRE, Cassandra
elukey added a comment to T403599: Setup & experiments for MI300x GPUs used for LiftWing.

My bad, the GPU is there:

Tue, Dec 2, 8:39 AM · Machine-Learning-Team
elukey added a comment to T403599: Setup & experiments for MI300x GPUs used for LiftWing.

yeah I think amd.com/gpu: 1 wasn't added when deploying aya, only tolerations, that would explain the result..

Tue, Dec 2, 8:35 AM · Machine-Learning-Team
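
The difference between tolerations and the amd.com/gpu: 1 resource limit mentioned in the comment above can be checked with a small sketch. It assumes a pod spec already loaded as a plain Python dict; the helper name gpu_count, the container name and the taint key are hypothetical, only the amd.com/gpu resource name comes from the comment. A toleration only allows scheduling onto a tainted node, while the extended-resource limit is what actually assigns a device to the container.

```python
def gpu_count(pod_spec: dict, resource: str = "amd.com/gpu") -> int:
    """Sum the extended-resource limits declared by all containers in a pod spec."""
    total = 0
    for container in pod_spec.get("containers", []):
        limits = container.get("resources", {}).get("limits", {})
        total += int(limits.get(resource, 0))
    return total


# Hypothetical spec with tolerations only (the situation suspected for aya).
tolerations_only = {
    "tolerations": [{"key": "example-gpu-taint", "operator": "Exists"}],
    "containers": [
        {"name": "example-container", "resources": {"limits": {"cpu": "4"}}}
    ],
}

# Same hypothetical spec with the GPU limit added.
with_gpu = {
    "tolerations": tolerations_only["tolerations"],
    "containers": [
        {
            "name": "example-container",
            "resources": {"limits": {"cpu": "4", "amd.com/gpu": 1}},
        }
    ],
}

print(gpu_count(tolerations_only))  # no GPU requested
print(gpu_count(with_gpu))          # one GPU requested
```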

Mon, Dec 1

elukey triaged T410944: Reboot cookbook workflow leaves Puppet disabled as Medium priority.
Mon, Dec 1, 3:31 PM · Traffic, Infrastructure-Foundations, SRE-tools, SRE
elukey added a comment to T395939: Request additional access for Dcops group .

@MoritzMuehlenhoff what we may need to do is to move all disk/partition/raid/etc.. commands from datacenter-ops to ops-limited, what do you think?

Mon, Dec 1, 2:53 PM · SRE, Infrastructure-Foundations
elukey added a comment to T393948: Q4:rack/setup/install ml-serve101[2345].

To keep archives happy - the ml-serve1012 and 1013 hosts have been removed from the analytics vlan.

Mon, Dec 1, 10:06 AM · SRE, Machine-Learning-Team, ops-eqiad, DC-Ops
elukey added a comment to T407503: Verify that the Pyrra dashboard is measuring what we think it is, and what it should be measuring.

High level summary: while reviewing Pyrra's availability graphs with the Abstract Wikipedia team we noticed several things that didn't make sense, like short severe drops that also affected the downstream error budget calculations. After an investigation with Observability, it seems that Thanos, the system that Pyrra uses to create efficient and more compact time series / recording rules from the SLI metrics, has some consistency issues with its internal caching and sometimes ends up storing the wrong datapoints/values in its long term storage.
The Wikifunctions SLI metrics seem to be the most affected ones; we are going to investigate the issue and report back with a more permanent solution. Sadly the old time series, already in the Thanos long term storage, cannot be easily refreshed/overwritten, so history may need to remain as it is.

Mon, Dec 1, 8:55 AM · Patch-For-Review, Essential-Work, Abstract Wikipedia team (26Q2 (Oct–Dec))
elukey added a comment to T403599: Setup & experiments for MI300x GPUs used for LiftWing.

This is a great milestone! Thanks a lot for the work Kevin :)

Mon, Dec 1, 8:30 AM · Machine-Learning-Team

Thu, Nov 27

elukey added a comment to T408884: Kartographer map labels for place names with ZWNJ character (U+200C) are rendered as white rectangular boxes.

After a chat with Yiannis, the following commit was highlighted https://github.com/wikimedia/osm-bright.tm2/commit/e5b7a05b692199238ce54eb8e2e879331256d7ae. We added it while transitioning to the new kartotherian/mapnik versions, but we lost the context on what it was meant to resolve. The fonts are brought in by the osm-bright.tm2 nodejs package, so https://gerrit.wikimedia.org/r/c/mediawiki/services/kartotherian/+/1201020/ is not worth pursuing anymore. It may be worth testing a new kartotherian version with e5b7a05b692199238ce54eb8e2e879331256d7ae reverted; we now have a fully working staging environment that could be used for it.

Thu, Nov 27, 3:30 PM · Essential-Work, Patch-For-Review, Content-Transform-Team, Maps (Kartotherian)
elukey added a comment to T394778: Build and push images to the docker registry from ml-lab.

+1 on all, seems a good plan, not sure if a higher level approach for blocking registries compared to iptables is available, but that is something that can be investigated later on.

Thu, Nov 27, 2:57 PM · Patch-For-Review, Machine-Learning-Team
elukey added a comment to T410075: Discovery of Cassandra cluster nodes.

Well, like I mentioned above, (among other things) it avoids a significant number of network hops by not relying on randomly selected nodes to forward to one in the replica set. It's definitely doing things that would be unrealistic to expect of a load balancer.

Again, it's probably possible to configure each client (or perhaps in some cases, to implement a custom routing/load-balancing policy) to eschew this behavior, but a) we'd be giving up those optimizations, b) we'd have to touch/test a bunch of different code in different projects using different library implementations, and c) we'd have to support a solution that isn't idiomatic, in perpetuity. I can be skeptical about the benefits of (a), but I'm quite confident that (b) and (c) would be really painful. :)

Thu, Nov 27, 11:05 AM · Data-Persistence, SRE, Cassandra
elukey added a comment to T409528: Setup a maps staging DB.

Next steps:

Thu, Nov 27, 10:09 AM · Patch-For-Review, SRE-Unowned, Maps, SRE
elukey added a comment to T394778: Build and push images to the docker registry from ml-lab.

@DPogorzelski-WMF I think the plan is good, I have only a few further questions:

Thu, Nov 27, 10:08 AM · Patch-For-Review, Machine-Learning-Team

Wed, Nov 26

elukey added a comment to T410075: Discovery of Cassandra cluster nodes.

At this point another alternative for the k8s world could be to have an externalservice configured, so that clients will use it to connect to a random host and discover the full list. We'll still need to have egress policies for all nodes, but at least clients won't need to specify cassandra instance hostnames (but only the externalservice endpoint).

Wed, Nov 26, 3:05 PM · Data-Persistence, SRE, Cassandra
elukey added a comment to T410075: Discovery of Cassandra cluster nodes.

@Eevans thanks for the explanation, I kinda assumed that a query to any of the cassandra nodes would have worked as-is, routing the request to the right node (if needed) behind the scenes. IIUC you are saying that whatever endpoint is provided to connect to a cassandra cluster, it is used to retrieve the list of nodes and then the client picks the right IP address based on $routing policy. If this is true I am a little puzzled, it seems a really big over-complication to avoid a load balancer, but it defeats the purpose of an LVS endpoint yes.

Wed, Nov 26, 2:59 PM · Data-Persistence, SRE, Cassandra
elukey added a comment to T408538: Create a Revise Tone Task Generator in LiftWing.

Very weird. In theory http://outlink-topic-model-predictor.articletopic-outlink.svc.cluster.local/v1/models/outlink-topic-model:predict should set the Host header to outlink-topic-model-predictor.articletopic-outlink.svc.cluster.local behind the scenes (at least any HTTP client should do it). Maybe something specific in the Python code? We can try to inspect istio-proxy logs and see what is the reason for the 502 (usually there is a note about it), but I don't want to slow down the task if it is already working. It would be nice to figure out what's happening :D

Wed, Nov 26, 2:43 PM · Patch-For-Review, Machine-Learning-Team
elukey added a comment to T408538: Create a Revise Tone Task Generator in LiftWing.

@elukey @klausman @akosiaris

Thank you for all of your help investigating and finding the solution to enable the pod-to-pod communication!
I'm very happy to confirm that the solution Luca suggested works and is already integrated in our production service. We use a combination of http://outlink-topic-model.articletopic-outlink/v1/models/outlink-topic-model:predict as URL and outlink-topic-model-predictor.articletopic-outlink.svc.cluster.local as Host header to communicate with the service.

This is amazing as it's the first instance of cross-service communication on LiftWing cluster and it enables us to efficiently query other models within the cluster now! 🎉

Wed, Nov 26, 2:26 PM · Patch-For-Review, Machine-Learning-Team
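
The URL plus Host header combination described in the comment above can be sketched with the Python standard library. The URL and the Host value are taken from the comment; the helper name build_predict_request and the JSON payload fields are hypothetical and only illustrate the shape of a request. The request is built but not sent, since the hostname only resolves inside the cluster.

```python
import json
import urllib.request

# Both values are quoted from the comment above.
URL = "http://outlink-topic-model.articletopic-outlink/v1/models/outlink-topic-model:predict"
HOST = "outlink-topic-model-predictor.articletopic-outlink.svc.cluster.local"


def build_predict_request(payload: dict) -> urllib.request.Request:
    """Build a POST request that targets URL but overrides the Host header."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        URL,
        data=body,
        # An explicit Host header takes precedence over the one derived from the URL.
        headers={"Host": HOST, "Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical payload; the real input schema is defined by the model server.
req = build_predict_request({"page_title": "Example", "lang": "en"})
print(req.get_method(), req.full_url)
print(req.get_header("Host"))

# Inside the cluster the request could then be sent with:
# with urllib.request.urlopen(req, timeout=10) as resp:
#     print(resp.status, resp.read())
```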
elukey added a comment to T409312: Sloth: adapt default month view to quarter view (pilot).

@herron this task should be good in my opinion for the pilot's goals, we may need to tune it a little further if we decide to use Sloth but I wouldn't spend a ton of time on it in Q2. Lemme know!

Wed, Nov 26, 2:17 PM · SRE-SLO
elukey added a comment to T411082: Remove old GPUs from ml-serve1001.

Next steps:

Wed, Nov 26, 2:13 PM · SRE, DC-Ops, ops-eqiad, Machine-Learning-Team
elukey added a comment to T411082: Remove old GPUs from ml-serve1001.

The host is depooled:

Wed, Nov 26, 2:09 PM · SRE, DC-Ops, ops-eqiad, Machine-Learning-Team
elukey added a comment to T345627: Seperate SLO for requests made from Citoid Extension, possible wmf deployed extension only, vs bots etc. .

@Mvolz all merged, the new dashboard is available here, but it seems that we are seeing a lot of errors anyway :(

Wed, Nov 26, 2:07 PM · Editing-team (Tracking), SRE-SLO, VisualEditor, Citoid
elukey added a comment to T411082: Remove old GPUs from ml-serve1001.

Additionally, this will make four Radeon PRO WX 9100 GPUs in storage. Should we consider selling them if they’re no longer well supported?

Wed, Nov 26, 1:58 PM · SRE, DC-Ops, ops-eqiad, Machine-Learning-Team
elukey created T411082: Remove old GPUs from ml-serve1001.
Wed, Nov 26, 10:34 AM · SRE, DC-Ops, ops-eqiad, Machine-Learning-Team
elukey added a comment to T410075: Discovery of Cassandra cluster nodes.

Hi folks!

Wed, Nov 26, 8:19 AM · Data-Persistence, SRE, Cassandra

Tue, Nov 25

elukey added a comment to T396584: Create a new bucket for Tegola's tile cache and duplicate its data.

Updated deletion list:

Tue, Nov 25, 4:51 PM · Patch-For-Review, SRE-swift-storage, Data-Persistence, SRE-Unowned, Maps, SRE
elukey placed T351731: Turnilo: invalid transforms on wmf_netflow dashboard up for grabs.
Tue, Nov 25, 4:36 PM · Data-Platform-SRE
elukey added a comment to T407503: Verify that the Pyrra dashboard is measuring what we think it is, and what it should be measuring.

Today I tried to take a look at spikes like the following, shown in Grafana by Pyrra metrics:

Tue, Nov 25, 2:47 PM · Patch-For-Review, Essential-Work, Abstract Wikipedia team (26Q2 (Oct–Dec))
elukey added a comment to T357756: Cookbook sre.hardware.upgrade-firmware fails to get firmwares from Dell's website.

@jcrespo sadly the upstream website changed and the way that we used to get the latest firmware doesn't work anymore. The only supported/working way is to stage the firmwares manually on the cumin nodes and use those :(

Tue, Nov 25, 2:36 PM · User-Elukey, Infrastructure-Foundations, DC-Ops, SRE-tools
elukey updated subscribers of T410075: Discovery of Cassandra cluster nodes.

Looping in also @BTullis and @brouberol for a quick high level discussion, since AQS will be probably the first cluster to target :)

Tue, Nov 25, 10:01 AM · Data-Persistence, SRE, Cassandra

Mon, Nov 24

elukey added a comment to T407503: Verify that the Pyrra dashboard is measuring what we think it is, and what it should be measuring.

We've now deployed with the slightly reduced timeout. I hope to see the SLO number at 100% in the coming days, but let's see.

@Jdforrester-WMF @cmassaro one thing that I am wondering - what is the HTTP response code that Wikifunctions returns when a request hits the 10s timeout?

I believe it's a 504.

Mon, Nov 24, 5:00 PM · Patch-For-Review, Essential-Work, Abstract Wikipedia team (26Q2 (Oct–Dec))
elukey added a project to T408884: Kartographer map labels for place names with ZWNJ character (U+200C) are rendered as white rectangular boxes: Content-Transform-Team.
Mon, Nov 24, 4:42 PM · Essential-Work, Patch-For-Review, Content-Transform-Team, Maps (Kartotherian)
elukey added projects to T410075: Discovery of Cassandra cluster nodes: SRE, Data-Persistence.
Mon, Nov 24, 1:47 PM · Data-Persistence, SRE, Cassandra
elukey added a comment to T410075: Discovery of Cassandra cluster nodes.

@Eevans my proposal would be to evaluate the possibility of having an LVS endpoint in front of all clusters. This would allow the following:

Mon, Nov 24, 11:06 AM · Data-Persistence, SRE, Cassandra
elukey added a comment to T408538: Create a Revise Tone Task Generator in LiftWing.

I always forget that istioctl is always a good friend :)

Mon, Nov 24, 10:25 AM · Patch-For-Review, Machine-Learning-Team
elukey added a comment to T408538: Create a Revise Tone Task Generator in LiftWing.

IIUC Knative just sits behind the kube svcs for the inference services to provide autoscaling-like services/buffering etc.. It shouldn't influence the routing, in theory :D

Mon, Nov 24, 10:09 AM · Patch-For-Review, Machine-Learning-Team
elukey added a comment to T408538: Create a Revise Tone Task Generator in LiftWing.

IIUC Knative just sits behind the kube svcs for the inference services to provide autoscaling-like services/buffering etc.. It shouldn't influence the routing, in theory :D

Mon, Nov 24, 9:54 AM · Patch-For-Review, Machine-Learning-Team
elukey closed T409271: Airflow image on DSE failing to get inspected by Debmonitor as Resolved.

All good thanks!

Mon, Nov 24, 9:13 AM · Essential-Work, Data-Platform-SRE (2025.11.07 - 2025.11.28)
elukey added a comment to T409271: Airflow image on DSE failing to get inspected by Debmonitor.

@brouberol I think that the image is still running in a few places:

Mon, Nov 24, 8:55 AM · Essential-Work, Data-Platform-SRE (2025.11.07 - 2025.11.28)

Thu, Nov 13

elukey updated subscribers of T345627: Seperate SLO for requests made from Citoid Extension, possible wmf deployed extension only, vs bots etc. .

Really nice! I'll be afk next week for holidays, but @RLazarus may be able to follow up in the meantime :) Otherwise I'll pick it up when I am back!

Thu, Nov 13, 4:53 PM · Editing-team (Tracking), SRE-SLO, VisualEditor, Citoid
elukey reopened T398869: Create Pyrra SLOs for xLab, a subtask of T398229: FY25-26 SDS2.1.3 Reliability - Production Monitoring, as Open.
Thu, Nov 13, 4:44 PM · Test Kitchen (Experiment Platform Sprint 16), Epic
elukey reopened T398869: Create Pyrra SLOs for xLab, a subtask of T382153: Define Product-Level Service Level Objectives (SLOs) for Experimentation Lab, as Open.
Thu, Nov 13, 4:44 PM · OKR-Work, Data-Platform-SRE (2025.10.17 - 2025.11.07), Test Kitchen (Experiment Platform Sprint 14)
elukey reopened T398869: Create Pyrra SLOs for xLab as "Open".

Let's keep it open until the alerts are up :)

Thu, Nov 13, 4:44 PM · SRE-SLO, Test Kitchen (Experiment Platform Sprint 14), OKR-Work
elukey added a comment to T407503: Verify that the Pyrra dashboard is measuring what we think it is, and what it should be measuring.

@Jdforrester-WMF @cmassaro one thing that I am wondering - what is the HTTP response code that Wikifunctions returns when a request hits the 10s timeout?

Thu, Nov 13, 4:30 PM · Patch-For-Review, Essential-Work, Abstract Wikipedia team (26Q2 (Oct–Dec))
elukey added a comment to T409528: Setup a maps staging DB.

After a chat with Moritz we realized that the better path is probably to create another account for staging, and create the new container in there. In this way we fully disentangle prod from staging, and we don't risk messing up prod tiles when working in staging.

Thu, Nov 13, 4:12 PM · Patch-For-Review, SRE-Unowned, Maps, SRE
elukey created P85318 aya failures on ml-serve-eqiad.
Thu, Nov 13, 3:50 PM
elukey added a comment to T409528: Setup a maps staging DB.

Created a new bucket with swift post and the Tegola AUTH credentials on thanos-fe1004:

Thu, Nov 13, 2:55 PM · Patch-For-Review, SRE-Unowned, Maps, SRE
elukey added a comment to T409414: Configure Lift Wing isvc Integration with Cassandra.

[ ... ]

@Eevans Hi! Is there a load balancing endpoint in front of the cassandra nodes, or should we randomly pick one to connect to?

There is not, and if you choose just one random node, you risk connection failures on a restart (when there is a failure, or when the node gets refreshed, etc). What we do elsewhere (in more than one place), is use the full list of nodes, try really hard to keep them up to date, fail, but (somehow) narrowly avoid problems by virtue of it being a long list. Obviously this is terrible. 😢

Thu, Nov 13, 11:00 AM · Machine-Learning-Team

Wed, Nov 12

elukey added a comment to T407503: Verify that the Pyrra dashboard is measuring what we think it is, and what it should be measuring.

@elukey No, we do not have a number in mind. Our approach is going to be to iterate on those metrics (not just that one), but @Jdforrester-WMF can correct me if I am mistaken.

Wed, Nov 12, 3:25 PM · Patch-For-Review, Essential-Work, Abstract Wikipedia team (26Q2 (Oct–Dec))
elukey added a comment to T407503: Verify that the Pyrra dashboard is measuring what we think it is, and what it should be measuring.

Exactly yes, it is here. We also have an extra envoy timeout set to 15s IIUC as extra fence, but the one that counts is the orchestrator's one.

So, this is still very unclear to me. If I understand correctly, there are two options:

  • we reduce the timeout you've linked here to 9700ms, OR
  • somehow, mediawiki_WikiLambda_mw_to_orchestrator_api_call_seconds_bucket gets changed and the 10s bucket becomes 10.2s or something.

The former is very easy for us to handle on our side. Is that the ideal path forward, or is it possible (and desirable) to do the latter?

Wed, Nov 12, 3:23 PM · Patch-For-Review, Essential-Work, Abstract Wikipedia team (26Q2 (Oct–Dec))
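
The two options discussed above come down to which histogram bucket a timed-out request lands in. A minimal sketch, assuming cumulative buckets with inclusive upper bounds: only the 10s bucket and the 9700ms value come from the discussion, while the other bucket bounds, the helper bucket_for and the 50ms measurement overhead are hypothetical.

```python
import bisect

# Hypothetical upper bounds in seconds; only the 10s bucket is from the discussion.
BUCKETS = [1.0, 5.0, 10.0, float("inf")]


def bucket_for(duration_s: float) -> float:
    """Return the smallest bucket upper bound that contains the observation."""
    return BUCKETS[bisect.bisect_left(BUCKETS, duration_s)]


# Assumed gap between the timeout firing and the caller recording the duration.
OVERHEAD_S = 0.05

# With a 10000ms timeout the recorded duration overshoots the 10s bucket.
print(bucket_for(10.0 + OVERHEAD_S))

# With a 9700ms timeout the recorded duration still fits in the 10s bucket.
print(bucket_for(9.7 + OVERHEAD_S))
```

Under these assumptions, lowering the timeout keeps timed-out requests inside the existing bucket, whereas the alternative requires changing the bucket boundaries of the metric itself.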
elukey added a comment to T409657: Revertrisk multilingual predictor returning 500s.

The client seems to be a single one from the Istio Ingress logs, but the URI seemed to be /v1/models/revertrisk-multilingual:predict so it may make sense that they tried to query the service with a JSON payload that triggered the problem.

Wed, Nov 12, 2:29 PM · Essential-Work, Machine-Learning-Team
elukey added a comment to T409657: Revertrisk multilingual predictor returning 500s.

@gkyziridis I am still puzzled by the main exception:

Wed, Nov 12, 2:14 PM · Essential-Work, Machine-Learning-Team
elukey updated subscribers of T407503: Verify that the Pyrra dashboard is measuring what we think it is, and what it should be measuring.

@DSantamaria while we decide the best approach, it would be great to also discuss the path that we'll take after this first round of configurations. IIUC from our last discussion the 10s bucket is only the first step, to then find a more suitable/realistic target for the SLO (that wouldn't always report an error budget of 100%). Does AW have a number in mind? What steps are we going to take to find the right value?

Wed, Nov 12, 9:49 AM · Patch-For-Review, Essential-Work, Abstract Wikipedia team (26Q2 (Oct–Dec))
elukey added a comment to T407503: Verify that the Pyrra dashboard is measuring what we think it is, and what it should be measuring.

Exactly yes, it is here. We also have an extra envoy timeout set to 15s IIUC as extra fence, but the one that counts is the orchestrator's one.

Wed, Nov 12, 9:45 AM · Patch-For-Review, Essential-Work, Abstract Wikipedia team (26Q2 (Oct–Dec))
elukey updated subscribers of T408884: Kartographer map labels for place names with ZWNJ character (U+200C) are rendered as white rectangular boxes.

@Nikki great investigation and datapoints, thanks a lot!

Wed, Nov 12, 9:18 AM · Essential-Work, Patch-For-Review, Content-Transform-Team, Maps (Kartotherian)

Tue, Nov 11

elukey added a comment to T409310: Sloth: onboard subset of existing SLOs to pilot.

@herron Hi! Could you please backfill slo:period_error_budget_remaining:ratio too? I see that the time series start from Oct 27th, this is the rolling window metric and I'd like to see how it looks over a quarter.

Tue, Nov 11, 5:16 PM · SRE-SLO
elukey added a comment to T407503: Verify that the Pyrra dashboard is measuring what we think it is, and what it should be measuring.

@cmassaro I think it is probably something in the docker image / WF service itself, I haven't found a k8s configuration that triggers the 10s timeout yet.

Tue, Nov 11, 4:46 PM · Patch-For-Review, Essential-Work, Abstract Wikipedia team (26Q2 (Oct–Dec))
elukey added a comment to T408884: Kartographer map labels for place names with ZWNJ character (U+200C) are rendered as white rectangular boxes.

I tried to look at the code to understand how one would go about fixing this and got hopelessly lost in a maze of abandoned repositories without finding anything.

Tue, Nov 11, 4:00 PM · Essential-Work, Patch-For-Review, Content-Transform-Team, Maps (Kartotherian)
elukey added a comment to T409411: Review Druid changelogs before upgrade from 0.19 to latest.

Thanks both for your help with this.

It should be https://github.com/apache/druid/blob/druid-27.0.0/distribution/pom.xml#L119

<profiles>
    <profile>
        <id>dist-hadoop2</id>

But it seems gone from 29.0 onwards. So we may need to re-add it manually and see if it works with more recent versions as well.

I think it might be good to stick to version 28.0 for this upgrade, while we still need Hadoop 2 support.

Tue, Nov 11, 11:13 AM · Data-Platform-SRE (2025.11.07 - 2025.11.28), Essential-Work

Mon, Nov 10

elukey added a comment to T407503: Verify that the Pyrra dashboard is measuring what we think it is, and what it should be measuring.

Had a chat with David and other folks from the AW team, and the rock-solid-always-100% target is what they are aiming for at first, to verify that everything works as they expect and then iterate on more precise values.

Mon, Nov 10, 5:54 PM · Patch-For-Review, Essential-Work, Abstract Wikipedia team (26Q2 (Oct–Dec))
elukey added a comment to T407503: Verify that the Pyrra dashboard is measuring what we think it is, and what it should be measuring.

After a chat with the AW team last week I tried to follow up again on https://gerrit.wikimedia.org/r/c/operations/puppet/+/1192609, and I may have got what James is trying to do. I'll write down my understanding:

Mon, Nov 10, 4:33 PM · Patch-For-Review, Essential-Work, Abstract Wikipedia team (26Q2 (Oct–Dec))
elukey added a comment to T409657: Revertrisk multilingual predictor returning 500s.

@gkyziridis I am not 100% sure if the rev-id in the task's description is the problematic one, I thought it was when checking the logs but you may need to review /home/elukey/T409657 on deploy2002 to get other testing samples :(

Mon, Nov 10, 2:33 PM · Essential-Work, Machine-Learning-Team
elukey added a comment to T408632: VRTS is spammed with bounce e-mails and is going to break.

Looks like we are back in acceptable ranges again! Please let me know if anything is missing.

Mon, Nov 10, 2:32 PM · collaboration-services, Infrastructure-Foundations, SRE, vrts, Znuny
elukey added a comment to T408632: VRTS is spammed with bounce e-mails and is going to break.

Judging from the metrics it seems to me that the queues stopped growing, and they are slowly getting processed. Let's wait a bit more to see if the mitigation worked as expected.

Mon, Nov 10, 1:28 PM · collaboration-services, Infrastructure-Foundations, SRE, vrts, Znuny
elukey added a comment to T408632: VRTS is spammed with bounce e-mails and is going to break.

I am really really ignorant about postfix so please bear with me :)

Mon, Nov 10, 1:02 PM · collaboration-services, Infrastructure-Foundations, SRE, vrts, Znuny
elukey updated subscribers of T409657: Revertrisk multilingual predictor returning 500s.
Mon, Nov 10, 10:08 AM · Essential-Work, Machine-Learning-Team

Sun, Nov 9

elukey updated the task description for T409657: Revertrisk multilingual predictor returning 500s.
Sun, Nov 9, 10:17 AM · Essential-Work, Machine-Learning-Team
elukey created T409657: Revertrisk multilingual predictor returning 500s.
Sun, Nov 9, 10:15 AM · Essential-Work, Machine-Learning-Team

Fri, Nov 7

elukey added a comment to T409469: Enable ChangeProp to consume mediawiki.page_content_change.v1.

Option A would require some talk with SRE but given the size of the topic and the current /srv usage in main-eqiad / codfw I don't see any big opposition in having the stream hosted there (especially if we advertise that ML will not need to query the mediawiki API as direct consequence for the use case). It would probably be the most clean and reliable option in my opinion.

Fri, Nov 7, 4:50 PM · Data-Engineering, serviceops, Machine-Learning-Team
elukey updated subscribers of T409312: Sloth: adapt default month view to quarter view (pilot).

Me and @tappof spent quite a bit of time today trying to debug the above problem, namely that the graph showed only some days in September and nothing more. The issue seemed to be the sum_over_time applied to:

Fri, Nov 7, 4:45 PM · SRE-SLO
elukey added a comment to T409312: Sloth: adapt default month view to quarter view (pilot).

I had to use sum without(recorder) since the backfill process for edit-check caused another label to be added, ending up in errors while evaluating the group_left() (many-to-many relationship).

Fri, Nov 7, 1:27 PM · SRE-SLO
elukey updated subscribers of T409310: Sloth: onboard subset of existing SLOs to pilot.

editcheck's metrics seem to lead to:

Fri, Nov 7, 7:35 AM · SRE-SLO

Thu, Nov 6

elukey added a comment to T409312: Sloth: adapt default month view to quarter view (pilot).

One thing that I cannot solve is that vector(${__to:date:seconds}) returns a unix ts for Mon Dec 1 12:59:59 AM CET 2025 and 12 when selecting the month, while the time picker in grafana is set from Sep 1st to Nov 30th (the same happens in Grafana Explore). I have no idea if I am missing something stupid or not.
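
A small Python sketch for inspecting such a timestamp, assuming the instant is the one quoted above as rendered in CET. It prints the calendar month of the same instant in UTC and in CET. Whether timezone handling is actually related to the behaviour described is not established here; the fixed +01:00 offset is a simplification that only holds for winter time.

```python
from datetime import datetime, timedelta, timezone

# Fixed +01:00 offset standing in for CET (assumption: winter time only).
CET = timezone(timedelta(hours=1), "CET")


def describe(ts: int) -> dict:
    """Return the same instant rendered in UTC and CET, with each month number."""
    utc = datetime.fromtimestamp(ts, tz=timezone.utc)
    cet = utc.astimezone(CET)
    return {
        "utc": utc.isoformat(),
        "cet": cet.isoformat(),
        "month_utc": utc.month,
        "month_cet": cet.month,
    }


# The instant quoted in the comment: Mon Dec 1 12:59:59 AM CET 2025.
ts = int(datetime(2025, 12, 1, 0, 59, 59, tzinfo=CET).timestamp())
info = describe(ts)
print(ts, info)
```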

Thu, Nov 6, 4:47 PM · SRE-SLO

Nov 6 2025

elukey added a comment to T409312: Sloth: adapt default month view to quarter view (pilot).

New version of the two queries for the quarterly sloth panel:

Nov 6 2025, 4:42 PM · SRE-SLO
elukey added a comment to T408702: Promote dpogorzelski from ops-limited to ops.

To keep archives happy: I added the uid to the ops ldap group as well!

Nov 6 2025, 4:28 PM · SRE, SRE-Access-Requests, Machine-Learning-Team
elukey added a comment to T409414: Configure Lift Wing isvc Integration with Cassandra.

On the Lift Wing side, we need to configure two things:

Nov 6 2025, 4:04 PM · Machine-Learning-Team
elukey added a comment to T278056: Upgrade Druid to latest upstream (> 0.20.1).

Today I spent some time going through the Druid changelogs, more info in T409411. The major problem seems to be that they removed the support for Hadoop 2, but there may be a hack/workaround that we could do to make it work again (but it needs to be tested).

Nov 6 2025, 10:56 AM · Data-Platform-SRE, Essential-Work
elukey added a comment to T409411: Review Druid changelogs before upgrade from 0.19 to latest.

Java 11, which we currently have on Bullseye, seems to be deprecated but still supported on recent Druid versions.

Nov 6 2025, 10:53 AM · Data-Platform-SRE (2025.11.07 - 2025.11.28), Essential-Work
elukey added a comment to T409411: Review Druid changelogs before upgrade from 0.19 to latest.

It should be https://github.com/apache/druid/blob/druid-27.0.0/distribution/pom.xml#L119

Nov 6 2025, 10:52 AM · Data-Platform-SRE (2025.11.07 - 2025.11.28), Essential-Work
elukey added a comment to T409411: Review Druid changelogs before upgrade from 0.19 to latest.

The worst update in my opinion is related to dropping the Hadoop 2 support, so we'll need to investigate what this means:

Nov 6 2025, 10:44 AM · Data-Platform-SRE (2025.11.07 - 2025.11.28), Essential-Work
elukey added a comment to T409411: Review Druid changelogs before upgrade from 0.19 to latest.

Version 0.20 is interesting since it seems that a Query cache issue gets solved: https://github.com/apache/druid/releases/tag/druid-0.20.0#20-result-caching

Nov 6 2025, 10:38 AM · Data-Platform-SRE (2025.11.07 - 2025.11.28), Essential-Work
elukey created T409411: Review Druid changelogs before upgrade from 0.19 to latest.
Nov 6 2025, 10:10 AM · Data-Platform-SRE (2025.11.07 - 2025.11.28), Essential-Work
elukey added a comment to T389380: Upgrade Cumin hosts to Bookworm.

New spicerack release deployed, cumin1002 is not needed anymore by Data Platform folks.

Nov 6 2025, 9:33 AM · Infrastructure-Foundations
elukey closed T390860: Elasticsearch dependency upgrade in spicerack as Resolved.

Spicerack deployed, thanks all for the work!

Nov 6 2025, 9:32 AM · Data-Platform-SRE (2025.10.17 - 2025.11.07), Essential-Work, Data-Engineering-Radar, Discovery-Search, Data-Engineering, Infrastructure-Foundations
elukey closed T390860: Elasticsearch dependency upgrade in spicerack, a subtask of T389380: Upgrade Cumin hosts to Bookworm, as Resolved.
Nov 6 2025, 9:32 AM · Infrastructure-Foundations

Nov 5 2025

elukey added a comment to T409312: Sloth: adapt default month view to quarter view (pilot).

This is the current query for a month:

Nov 5 2025, 4:47 PM · SRE-SLO
elukey reopened T398869: Create Pyrra SLOs for xLab, a subtask of T382153: Define Product-Level Service Level Objectives (SLOs) for Experimentation Lab, as Open.
Nov 5 2025, 3:04 PM · OKR-Work, Data-Platform-SRE (2025.10.17 - 2025.11.07), Test Kitchen (Experiment Platform Sprint 14)