User Details
- User Since
- Jan 5 2016, 9:54 PM (517 w, 3 d)
- Availability
- Available
- LDAP User
- Unknown
- MediaWiki User
- LToscano (WMF) [ Global Accounts ]
Yesterday
Perfect :)
@akosiaris do you think that the idea of forming a dedicated working group for the next couple of quarters could be feasible? I can take care of kicking it off and finding volunteers (sounds like Scott and I are already in :D).
elukey@deploy2002:~$ curl -k "https://10.64.70.93:6543/img/osm-intl,12,31.807,34.673,355x390.png?lang=he&domain=he.wikipedia.org&title=%D7%9E%D7%92%D7%93%D7%9C%D7%99_K&revid=40604675&groups=_8afac0e44a6b6826ba8fab43e1d72cab6b566ebd"
Warning: Binary output can mess up your terminal. Use "--output -" to tell
Warning: curl to output it to your terminal anyway, or consider "--output
Warning: <FILE>" to save to a file.
I see a lot of {"level":"error","timestamp":"2025-12-05T09:35:50.078Z","message":"error marshalling tile: ERROR: permission denied for table planet_osm_polygon_landuse_gen_z6 in tegola's logs; we probably have a db password problem, lemme check.
Thu, Dec 4
My understanding from the last updates is that we are not actively pushing anything to the new registry with apus as backend, we just ran a quick test some months ago. Is that the right understanding?
@JMonton-WMF Hi! I have used kafka tools like the topic mapper in the past, and if not handled carefully (throttling, etc.) they can have nasty side effects, like causing bandwidth usage problems. If there is any maintenance in progress from SRE while those tools are used, we can have other unwanted side effects too.
ml-serve1013 has been added as a k8s worker with the necessary taints to prevent regular pods from running on it by accident.
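Roughly, the taint is applied along these lines (the key/value here are illustrative, not necessarily the exact ones used on the node):

    # hypothetical taint key/value, for illustration only
    kubectl taint nodes ml-serve1013 dedicated=gpu:NoSchedule
    # pods that are meant to land on this node need a matching toleration
    # (key: dedicated, value: gpu, effect: NoSchedule) in their spec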
@Pikne wow good to know, thanks! I am planning to create an endpoint for testing (likely maps-staging.wikimedia.org) that the community will be able to use to test changes like this one before they hit production. So I hope that in the future we'll be able to use the staging stack to carefully review a change before it hits real traffic (fingers crossed). I'll keep you in the loop when/if my patch gets scheduled on staging.
Wed, Dec 3
@Mvolz ahhh ok, thanks for the explanation! I rechecked the graph and it shows a neat recovery towards 0%; it is still in the negative, but I am confident that it will steadily progress towards a healthy/green error budget as time passes. Let's keep it open and re-eval before the holiday break!
In the Pyrra dashboard I see a big hole during November, including the 23rd when the alert fired:
Tue, Dec 2
mmm but is allocatable something that varies dynamically? Probably not; if so, everything seems to be working fine. Or am I missing anything?
I checked the Allocated resources for ml-serve1009, where we run the revise-tone-task pod on a GPU, and I see the following:
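(For the record, this is the kind of check meant here, a sketch assuming kubectl access to the cluster; output omitted:)

    kubectl describe node ml-serve1009 | grep -A 15 'Allocated resources'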
@Eevans I am reasoning out loud, so no need to be sorry, thanks for the follow-ups :) The external service should basically be a sort of software LB for hosts outside k8s, so you can contact them from k8s simply starting from a common name. I've read a bit of code and discovered stuff like https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/953283; my worry was that we'd have to duplicate the cassandra instance hostnames in the egress rules and in the initial hostnames list (for discovery), but it seems that we have a workaround in place.
My bad, the GPU is there:
yeah I think amd.com/gpu: 1 wasn't added when deploying aya, only the tolerations were; that would explain the result..
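A quick way to double check (sketch; pod name and namespace are placeholders):

    kubectl -n <namespace> get pod <aya-pod> \
      -o jsonpath='{.spec.containers[*].resources.limits}'
    # if amd.com/gpu: 1 is missing here, the pod never requested the GPU,
    # which would match what we are seeing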
Mon, Dec 1
@MoritzMuehlenhoff what we may need to do is to move all disk/partition/raid/etc.. commands from datacenter-ops to ops-limited, what do you think?
To keep archives happy - the ml-serve1012 and 1013 hosts have been removed from the analytics vlan.
High level summary: while reviewing the Pyrra's availability graphs with the Abstract Wikipedia team we noticed several things that didn't make sense, like short severe drops affecting the downstream error budget calculations as well. After an investigation with Observability, it seems that Thanos, the system that Pyrra uses to create efficient and more compact time series / recording rules from the SLI metrics, has some consistency issues with its internal caching and sometimes it ends up storing the wrong datapoints/values in its long term storage.
The Wikifunctions SLI metrics seem to be the most affected ones; we are going to investigate the issue and report back with a more permanent solution. Sadly the old time series, already in the Thanos long term storage, cannot be easily refreshed/overwritten, so history may need to remain as it is.
This is a great milestone! Thanks a lot for the work Kevin :)
Thu, Nov 27
After a chat with Yiannis, the following commit was highlighted: https://github.com/wikimedia/osm-bright.tm2/commit/e5b7a05b692199238ce54eb8e2e879331256d7ae. We added it while transitioning to the new kartotherian/mapnik versions, but we lost the context on what it was meant to resolve. The fonts are brought in by the osm-bright.tm2 nodejs package, so https://gerrit.wikimedia.org/r/c/mediawiki/services/kartotherian/+/1201020/ is not worth pursuing anymore. It may be worth testing a new kartotherian version with e5b7a05b692199238ce54eb8e2e879331256d7ae reverted; we now have a fully working staging environment that could be used for it.
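If we go down that road, the test would roughly be (sketch, assuming the revert applies cleanly):

    # in a checkout of wikimedia/osm-bright.tm2
    git revert e5b7a05b692199238ce54eb8e2e879331256d7ae
    # then build a kartotherian version that pulls in the updated package
    # and deploy it to the staging environment for a visual check of the tiles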
+1 on all, seems like a good plan. Not sure if a higher-level approach to blocking registries (compared to iptables) is available, but that is something that can be investigated later on.
Next steps:
@DPogorzelski-WMF I think the plan is good, I have only a few further questions:
Wed, Nov 26
At this point another alternative for the k8s world could be to have an externalservice configured, so that clients use it to connect to a random host and discover the full list. We'll still need to have egress policies for all nodes, but at least clients won't need to specify cassandra instance hostnames (only the externalservice endpoint).
@Eevans thanks for the explanation, I kinda assumed that a query to any of the cassandra nodes would have worked as-is, routing the request to the right node (if needed) behind the scenes. IIUC you are saying that whatever endpoint is provided to connect to a cassandra cluster, it is used to retrieve the list of nodes, and then the client picks the right IP address based on $routing policy. If this is true I am a little puzzled; it seems like a really big over-complication to avoid a load balancer, but yes, it defeats the purpose of an LVS endpoint.
Very weird. In theory http://outlink-topic-model-predictor.articletopic-outlink.svc.cluster.local/v1/models/outlink-topic-model:predict should set the Host header to outlink-topic-model-predictor.articletopic-outlink.svc.cluster.local behind the scenes (at least any HTTP client should do it). Maybe it's something specific in the Python code? We can try to inspect the istio-proxy logs and see what the reason for the 502 is (usually there is a note about it), but I don't want to slow down the task if it is already working. It would be nice to figure out what's happening :D
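To rule out the client library, it may be worth testing from inside the cluster with an explicit Host header, something like the following (sketch; the input.json payload is a placeholder):

    curl -s \
      -H "Host: outlink-topic-model-predictor.articletopic-outlink.svc.cluster.local" \
      -H "Content-Type: application/json" \
      -d @input.json \
      http://outlink-topic-model-predictor.articletopic-outlink.svc.cluster.local/v1/models/outlink-topic-model:predict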
@herron this task should be good in my opinion for the pilot's goals; we may need to tune it a little further if we decide to use Sloth, but I wouldn't spend a ton of time on it in Q2. Lemme know!
Next steps:
The host is depooled:
Hi folks!
Tue, Nov 25
Updated deletion list:
Today I tried to take a look at spikes like the following, shown in Grafana by Pyrra metrics:
@jcrespo sadly the upstream website changed and the way that we used to get the latest firmware doesn't work anymore. The only supported/working way is to stage the firmwares manually on the cumin nodes and use those :(
Looping in also @BTullis and @brouberol for a quick high level discussion, since AQS will be probably the first cluster to target :)
Mon, Nov 24
@Eevans my proposal would be to evaluate the possibility of having an LVS endpoint in front of all clusters. This would allow the following:
I always forget that istioctl is a good friend :)
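For the archives, the kind of inspection meant here is along these lines (sketch; pod name and namespace are placeholders):

    istioctl proxy-status
    istioctl proxy-config route <pod-name> -n <namespace>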
IIUC Knative just sits behind the kube svcs for the inference services to provide autoscaling-like services/buffering etc.. It shouldn't influence the routing, in theory :D
All good thanks!
@brouberol I think that the image is still running in a few places:
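(This is the kind of check used to find them; the image name is a placeholder:)

    kubectl get pods -A \
      -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.spec.containers[*].image}{"\n"}{end}' \
      | grep '<image-name>'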
Thu, Nov 13
Really nice! I'll be afk next week for holidays, but @RLazarus may be able to follow up in the meantime :) Otherwise I'll pick it up when I am back!
Let's keep it open until the alerts are up :)
@Jdforrester-WMF @cmassaro one thing that I am wondering - what is the HTTP response code that Wikifunctions returns when a request hits the 10s timeout?
After a chat with Moritz we realized that the better path is probably to create another account for staging, and create the new container in there. In this way we fully disentangle prod from staging, and we don't risk messing up prod tiles when working in staging.
Created a new bucket with swift post and the Tegola AUTH credentials on thanos-fe1004:
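Roughly along these lines (sketch; the container name is illustrative, the credentials come from the Tegola auth config):

    # on thanos-fe1004, with ST_AUTH / ST_USER / ST_KEY exported from the Tegola credentials
    swift post tegola-staging
    swift stat tegola-staging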
Wed, Nov 12
Judging from the Istio Ingress logs the client seems to be a single one, but the URI seemed to be /v1/models/revertrisk-multilingual:predict, so it may make sense that they tried to query the service with a JSON payload that triggered the problem.
@gkyziridis I am still puzzled by the main exception:
@DSantamaria while we decide the best approach, it would be great to also discuss the path that we'll take after this first round of configurations. IIUC from our last discussion the 10s bucket is only the first step, and we'll then find a more suitable/realistic target for the SLO (one that wouldn't always report an error budget of 100%). Does AW have a number in mind? What steps are we going to take to find the right value?
Exactly yes, it is here. We also have an extra envoy timeout set to 15s IIUC as an extra fence, but the one that counts is the orchestrator's one.
@Nikki great investigation and datapoints, thanks a lot!
Tue, Nov 11
@herron Hi! Could you please backfill slo:period_error_budget_remaining:ratio too? I see that the time series starts from Oct 27th; this is the rolling window metric and I'd like to see how it looks over a quarter.
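(For reference, I think the backfill can be done with promtool's rule backfilling, something like the sketch below; the dates, Prometheus URL, and rules file are placeholders and the flags should be double checked:)

    promtool tsdb create-blocks-from rules \
      --start 2025-07-01T00:00:00Z --end 2025-10-27T00:00:00Z \
      --url http://localhost:9090 \
      --output-dir data/ \
      rules.yaml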
@cmassaro I think it is probably something in the docker image / WF service itself, I haven't found a k8s configuration that triggers the 10s timeout yet.
Mon, Nov 10
Had a chat with David and other folks from the AW team, and the rock-solid-always-100% target is what they aim for to verify that everything works as they expect, before iterating on more precise values.
After a chat with the AW team last week I tried to follow up again on https://gerrit.wikimedia.org/r/c/operations/puppet/+/1192609, and I may have understood what James is trying to do. I'll write down my understanding:
@gkyziridis I am not 100% sure if the rev-id in the task's description is the problematic one, I thought it was when checking the logs but you may need to review /home/elukey/T409657 on deploy2002 to get other testing samples :(
Looks like we are back in acceptable ranges again! Please let me know if anything is missing.
Judging from the metrics it seems to me that the queues stopped growing, and they are slowly getting processed. Let's wait a bit more to see if the mitigation worked as expected.
I am really really ignorant about postfix so please bear with me :)
Sun, Nov 9
Fri, Nov 7
Option A would require some talks with SRE, but given the size of the topic and the current /srv usage in main-eqiad / codfw I don't see any big opposition to having the stream hosted there (especially if we advertise that ML will not need to query the mediawiki API as a direct consequence of the use case). It would probably be the cleanest and most reliable option in my opinion.
@tappof and I spent quite a bit of time today trying to debug the above problem, namely that the graph showed only some days in September and nothing more. The issue seemed to be the sum_over_time applied to:
I had to use sum without(recorder) since the backfill process for edit-check caused another label to be added, ending up in errors while evaluating the group_left() (many-to-many relationship).
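For example (sketch only; the metric name and Prometheus URL are placeholders, not the real recording rule):

    # dropping the extra "recorder" label so the join is no longer many-to-many
    promtool query instant http://localhost:9090 \
      'sum without (recorder) (some_sli_errors_total)'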
editcheck's metrics seem to lead to:
Thu, Nov 6
One thing that I cannot solve is that vector(${__to:date:seconds}) returns a unix ts for Mon Dec 1 12:59:59 AM CET 2025 and 12 when selecting the month, while the time picker in grafana is set from Sep 1st to Nov 30th (the same happens in Grafana explore). I have no idea if I am missing something stupid or not..
Nov 6 2025
New version of the two queries for the quarterly sloth panel:
To keep archives happy: I added the uid to the ops ldap group as well!
On the Lift Wing side, we need to configure two things:
Today I spent some time going through the Druid changelogs, more info in T409411. The major problem seems to be that they removed the support for Hadoop 2, but there may be a hack/workaround that we could do to make it work again (but it needs to be tested).
The Java 11 version that we currently have on Bullseye seems deprecated but still supported on recent Druid versions.
The worst update in my opinion is related to dropping the Hadoop 2 support, so we'll need to investigate what this means:
Version 0.20 is interesting since it seems that a query cache issue got fixed: https://github.com/apache/druid/releases/tag/druid-0.20.0#20-result-caching
New spicerack release deployed; cumin1002 is not needed anymore by the Data Platform folks.
Spicerack deployed, thanks all for the work!
Nov 5 2025
This is the current query for a month: