User Details
- User Since: Jan 5 2016, 9:54 PM (535 w, 4 d)
- Availability: Available
- LDAP User: Unknown
- MediaWiki User: LToscano (WMF)
Fri, Apr 10
I filed https://gerrit.wikimedia.org/r/1269998 as a proposal for a conservative first step, to be applied to the ML clusters first (they produce a ton more time series than the others via Istio).
Thu, Apr 9
Hey! Adding a few notes/thoughts:
The only weird thing that happened is that the Istio Gateway and the Jaeger gRPC collector failed to talk to each other via gRPC due to TLS validation failures on both sides. I "fixed" it via the following hot-fix in Jaeger's DestinationRule:
Something to follow up on - os-reports seems to be a CNAME for the aux ingress rw endpoint, which is only available in eqiad. Should we move it to the -ro one, which is active/active?
After the upgrade:
Before the aux-eqiad upgrade:
@BCornwall is there a specific downtime that you have in mind for the LVS servers, so we can have more context? As Riccardo mentioned, the Icinga "API" is not great; any chance that the downtime could become an Alertmanager one?
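If an Alertmanager silence works for you, a rough sketch of how it could look with amtool (the matcher, duration and comment below are just placeholders, not a proposal for the actual values):

# hypothetical silence covering the LVS hosts for the maintenance window
amtool --alertmanager.url=http://localhost:9093 silence add \
    'instance=~"lvs.*"' --duration=2h --comment="LVS maintenance" --author="bcornwall"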
Wed, Apr 8
Since the debmonitor intermediate expires before the discovery one, I'd propose to:
I tested the following in the kafka upgrade pontoon environment:
There is another thing to test imho before proceeding with prod, namely https://github.com/kserve/kserve/pull/3316
Post upgrade:
Before the aux-codfw upgrade:
@jijiki shall we deploy mcrouter 2023.07.17.00-2 and test the keep-alive options? I have the feeling that the TKOs will go down after that.
Tue, Apr 7
It is interesting how the TKOs started to drop around the third/fourth of April.
Fri, Apr 3
Filed a couple of changes to introduce ruff to spicerack and rework how linting/testing/docs run. I got down to ~60s local time and ~3m CI time (with tox creating venvs and installing deps), and ~19s while running locally with venvs already installed.
Thu, Apr 2
@cmooney I added some info in T420223#11753137, where I tested the jitter seen by MTR on a worker in row A/B vs a worker in C/D: the former doesn't show it. I also tried on another couple of nodes, but I don't have anything definitive from a statistics point of view. I can collect more info if you want!
Wed, Apr 1
The workaround in the last patch needs a spicerack change for ipmi, since we assume the root user:
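(The actual snippet isn't reproduced here; to illustrate the assumption, spicerack effectively drives ipmitool with -U root hardcoded, so the workaround would need the username to become configurable. Host name and the alternative user below are made up:)

# today: root is assumed
ipmitool -I lanplus -H example1001.mgmt.eqiad.wmnet -U root -E chassis power status
# what the workaround needs: same call with a configurable user
ipmitool -I lanplus -H example1001.mgmt.eqiad.wmnet -U provision -E chassis power status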
Tue, Mar 31
Other services:
Checked the services currently deployed (removed the ones that are not service-related):
High level, this is what I have in mind:
The X509v3 Subject Key Identifier changes between the old and new intermediate certs (which makes sense: new private key), so the current leaf discovery certs are not compatible with the new intermediate. I need to verify whether https://wikitech.wikimedia.org/wiki/PKI/CA_Operations#Renewing_a_new_intermediate assumes that puppet's cfssl class will request a new cert when the .chain.pem file changes. If so, the key rotation should be handled transparently by puppet runs and restarts; if not, we need to figure out how to fix this :D
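For the record, a quick way to check the mismatch and whether a given leaf still verifies against the new intermediate (file names below are placeholders):

# the leaf's Authority Key Identifier must match the Subject Key Identifier
# of the intermediate that signed it
openssl x509 -in intermediate_old.pem -noout -text | grep -A1 'Subject Key Identifier'
openssl x509 -in intermediate_new.pem -noout -text | grep -A1 'Subject Key Identifier'
openssl x509 -in discovery_leaf.pem -noout -text | grep -A1 'Authority Key Identifier'
# full chain validation against the new intermediate
openssl verify -CAfile root_ca.pem -untrusted intermediate_new.pem discovery_leaf.pem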
Mon, Mar 30
This is becoming a little more complicated than what the wikitech page suggests: https://wikitech.wikimedia.org/wiki/PKI/CA_Operations#Renewing_a_new_intermediate
In Prod we are working on reducing the buckets in T392886, now that we run a relatively recent version of Istio.
Just to confirm:
@hashar Hi! Any thoughts about this? Could we work on it during the next quarter as Releng/SRE collaboration?
I did some reading and my understanding is that with iommu=pt there is no protection from the kernel about unauthorized memory access that doesn't belong to a process, so in theory an attacker could:
- compromise a model server running on a mi300x-enabled host (via vLLM etc...)
- exploit a bug in the amdgpu kernel driver and basically get full access to memory on the host (completely compromised)
Fri, Mar 27
I hit another issue after my patch, namely that the root user creation (in the BMC) returns a plain HTTP 400. I tried this from the spicerack shell:
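(The spicerack shell snippet isn't reproduced here; the equivalent raw Redfish request, with a placeholder BMC host and payload, would be roughly:)

# hypothetical equivalent of the failing call: POST a new account to the BMC's AccountService
curl -sk -u root:REDACTED -X POST \
    -H 'Content-Type: application/json' \
    -d '{"UserName": "root", "Password": "REDACTED", "RoleId": "Administrator", "Enabled": true}' \
    https://example1001.mgmt.eqiad.wmnet/redfish/v1/AccountService/Accounts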
@cmooney yes Effie depooled it IIRC!
Updated the links, they now work :)
@DPogorzelski-WMF as you prefer, but the difference between istio/kserve and knative is not really huge and I would personally upgrade first rather than keep the old version. Also remember to check the new version of the knative chart.
Thu, Mar 26
Repooled after maintenance. Doc added to https://wikitech.wikimedia.org/wiki/Maps/v2/Common_tasks#Fix_a_Broken_postgres_replica
These nodes should be depooled to test if TKOs decrease considerably for a stable amount of time:
I retried the above experiment with an eqiad memcached shard:
My understanding was that tox-uv needed to be selected in tox.ini, but after a chat with Riccardo I realized I was wrong: it is picked up automatically. We may want to create a new image for tox-v4-uv at this point, and opt in from integration config?
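For reference, my understanding of the auto-discovery (no tox.ini change needed, just the package being present next to tox):

# assumption: installing the plugin in the same venv as tox is enough for it to be picked up
pip install 'tox>=4' tox-uv
tox -e py311   # env creation and dependency installs now go through uv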
@DPogorzelski-WMF my understanding from reading upstream's commits is that they bump the k8s version every now and then, setting it to the version that was tested in their testing env:
The top mcrouter pod for TKOs is mcrouter-main-d7czx, running on wikikube-worker1070.eqiad.wmnet with IP 10.67.223.126.
I reworked the graph that shows TKOs registered by mcrouter pods a little, and added a column to sort by total occurrences. The first 14 pods have their k8s workers in the following racks:
It is very interesting what happened right after the depool of codfw for the MW switchover: the total number of mcrouter rps in eqiad jumped to 1.3M/s (!!) and the TKOs rose roughly proportionally, now peaking at 7/8k rps. We are still at around 0.5% of requests ending up in TKO, so I am more and more convinced that connection recycling may be the cause. Why the same doesn't happen in codfw is not clear to me yet, but my proposal is to test the settings that I mentioned above as the next step:
Wed, Mar 25
@ecarg I am trying to get the new 504 logs via https://logstash.wikimedia.org/goto/82a1feda83be4ad00e9c24b95268c329, is this what we are expecting? I don't have a lot of context, so I cannot judge whether there is a clear reason why the 504s happened; my question is: are the logs now actionable? For example, from those would you be able to understand why a 504 happened? I am asking since we'll need to get to the root cause of those 504s, and these logs should hopefully tell us how. If that's not the case: is there anything we can do to improve them to give us more info?
Should be good now!
@Jdforrester-WMF Before closing, let's verify your assumption; ideally we should have some way of checking what's happening in dashboards or similar. I filed https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1260731. IIUC you are saying that Wikifunctions being slow and downsized may play a role, or is it just the MediaWiki part that you are concerned with? Or both? :D
After a chat with Riccardo: prospector is used in wmflib only for pyroma and vulture, and the latter is disabled for cookbooks, so now I better understand Federico's intent.
I compared what we did with wmflib against Federico's proposal, to try to summarize a possible standard approach. The differences are:
Tue, Mar 24
I'm not sure I understand the comment about needing an HTTP replica but, frankly, what stops you from running your "service" (which is just a lambda) with the linked artifact cache as a sidecar to provide the HTTP interface?
To keep archives happy - I used the following workaround in provisioning and it worked:
Mon, Mar 23
@Jclark-ctr all hosts provisioned! The new cookbook is not merged, but I thought to unblock you :)
Something interesting while checking mcrouter stats (the full list, not the ones exposed via the exporter):
Re-opening this one since something weird happens when running provisioning:
The gutter pool in eqiad matches the December pattern, due to TKOs: dashboard
I've reworked https://grafana-rw.wikimedia.org/d/ltSHWhHIk/mw-mcrouter a little to use irate() everywhere rather than a mix of increase and irate, to more easily compare panels. In eqiad we roughly handle 400k rps and we have a background of 1 or 2k requests/s TKO'd, that is around 0.5%. I think this is a great example of a use case where an SLO would be perfect to understand the urgency of this task, since we are talking about a 99.5%+ availability target. I am not suggesting we don't work on this task, but its priority may need to vary since the impact of the problem is not big at the moment.
Fri, Mar 20
@cmooney one thing worth looking at is whether we added QoS or similar changes on the network in response to the traffic attacks that happened in December. I found T412785 but the timeline doesn't match 100% the timings provided by the metrics (it started around Dec 16th, ~13:50, more or less). mcrouter is configured to SET certain keys in both eqiad and codfw, waiting for both before returning success. We also have a 250ms timeout for cross-region calls like mcrouter in eqiad reaching codfw, so anything that slows things down may start causing timeout errors. There is nothing that lines up in puppet or SAL that I can see, or even in k8s deployments.
Thu, Mar 19
I thought the same, but if you compute 1661 / 22 you get ~75 errors per hour, which is not ideal but not a ton either. Ideally they shouldn't be there, but I do see some TCP RSTs sent periodically by the memcached shards to the mcrouter pods (via tcpdump), so that may correlate.
After some back and forth with Cathal on IRC, we didn't find anything that could point to accept() being problematic.
In general this worries me a little:
@jijiki I checked the thread, and I think it is a separate issue. Yesterday I found occurrences of accept4(): Resource temporarily unavailable, whereas the thread mentions people stracing memcached and seeing read() EAGAIN. The former means that memcached wasn't able to create a new socket via accept, the latter that no data was read (which is expected in async code). This is why I am worried: accept() not being able to work is usually a sign of distress in the network stack; there is probably a bottleneck somewhere.
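To make the distinction concrete, this is roughly how one can observe the two cases side by side (the PID selection is just an example):

# the former case (accept4) is what memcached logged as an error;
# the latter (read) is routine for non-blocking sockets
sudo strace -f -e trace=accept4,read -p "$(pgrep -o memcached)" 2>&1 | grep EAGAIN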
Worth remembering: amd-smi currently relies on a manual fix in the Python files due to https://github.com/ROCm/amdsmi/pull/136, since we were waiting for the rocm 7.2.0 release (afaics it should contain the fix). Without it, any reimage would bring up the wrong version of the libs, and partitioning becomes impossible.
Wed, Mar 18
@cmooney check my update above: there are a couple of IPs that we could use to verify if anything is weird in their path. It is difficult from the mcrouter logs to understand which IP is related to a timeout, but I do see RSTs when trying to connect from eqiad to codfw via TLS. Memcached also shows a high number of failed TLS transactions; we don't have it as a metric, so it may be something unrelated, but it smells problematic.
sudo lsof -p 553977 | wc -l
4842
I tried to isolate a connection between a single mcrouter pod in eqiad (10.67.189.81 on wikikube-worker1270.eqiad.wmnet) and mc2039 (10.192.0.22) in codfw:
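(The capture itself isn't reproduced here; something along these lines on the worker is enough to isolate that single flow:)

# run on wikikube-worker1270: capture only traffic between the mcrouter pod IP and mc2039,
# keeping SYN/RST/FIN to spot connection churn and resets
sudo tcpdump -ni any 'host 10.67.189.81 and host 10.192.0.22 and tcp[tcpflags] & (tcp-syn|tcp-rst|tcp-fin) != 0'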
Tue, Mar 17
Saw the task passing by and got nerd-sniped :)
@Jclark-ctr I provisioned dse-k8s-worker1020 with an experimental provisioning cookbook, when you have a moment could you please tell me if everything looks good?
Mon, Mar 16
Kafka 3.7 is running on deployment-kafka-logging01 with Debian Trixie, first one of its kind!
This requires more work, since those models are X13s of a very new generation that don't accept BIOS updates via /redfish/v1/Systems/1/Bios anymore: that endpoint has become a read-only API. The new endpoint is /redfish/v1/Systems/1/Bios/SD and it uses a set of keys that differ from the previous endpoint's. We'll have to add everything to the provision cookbook, like we did when we introduced the new Dells with iDRAC10. It will likely take a week :(
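For reference, a rough sketch of what the cookbook will have to do against the new endpoint (BMC host and attribute names are placeholders, and whether the PATCH goes to Bios/SD directly or to a pending-settings sub-resource still needs to be confirmed):

# read the attributes exposed by the new settings resource
curl -sk -u root:REDACTED https://example1001.mgmt.eqiad.wmnet/redfish/v1/Systems/1/Bios/SD
# stage a change (applied at the next reboot); the attribute key below is made up
curl -sk -u root:REDACTED -X PATCH -H 'Content-Type: application/json' \
    -d '{"Attributes": {"SomeNewStyleKey": "Enabled"}}' \
    https://example1001.mgmt.eqiad.wmnet/redfish/v1/Systems/1/Bios/SD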
root@deploy2002:/srv/deployment-charts/helmfile.d/admin_ng# kube-env admin ml-serve-codfw
root@deploy2002:/srv/deployment-charts/helmfile.d/admin_ng# helm3 -n kserve history kserve
REVISION UPDATED STATUS CHART APP VERSION DESCRIPTION
1 Mon Feb 23 17:00:58 2026 superseded kserve-0.2.9 0.11.2 Install complete
2 Tue Mar 3 08:31:38 2026 superseded kserve-0.2.9 0.11.2 Upgrade "kserve" failed: cannot patch "inferenceservices.serving.kserve.io" with kind CustomResourceDefinition: CustomResourceDefinition.apiextensions.k8s.io "inferenceservices.serving.kserve.io" is invalid: spec.conversion.webhookClientConfig.caBundle: Invalid value: []byte{0xa}: unable to load root certificates: unable to parse bytes as PEM block
3 Tue Mar 3 08:31:43 2026 failed kserve-0.2.9 0.11.2 Rollback "kserve" failed: cannot patch "inferenceservices.serving.kserve.io" with kind CustomResourceDefinition: CustomResourceDefinition.apiextensions.k8s.io "inferenceservices.serving.kserve.io" is invalid: spec.conversion.webhookClientConfig.caBundle: Invalid value: []byte{0xa}: unable to load root certificates: unable to parse bytes as PEM block
4 Mon Mar 16 09:51:44 2026 deployed kserve-0.3.0 0.11.2 Upgrade complete

I remember that Traffic suggested the use of hardened TLS settings to allow the use case of mTLS when pushing webrequest data to Kafka Jumbo, and we probably wanted to apply it everywhere if mTLS was needed elsewhere (or maybe I misremember).
Deployed on ml-serve-eqiad:
I followed up again today and created https://wikitech.wikimedia.org/wiki/User:LToscano_(WMF)/AbstractWikipedia#Dashboards to collect useful debugging tips and dashboard links in one place. The Istio logs seem to indicate that the Orchestrator returns HTTP 504s because its proxied cluster, the evaluators, return 504 as well (no indication of timeouts happening etc..).
Logstash filters for:
Very interesting: from the link it seems that we could create an override file by simply adding something like -Djava.security.properties=/etc/sysconfig/jvm.java.security to the target service's JAVA_OPTS. The only downside I see is that we'll need to explicitly inject those options into every specific Java deployment, otherwise we'll get the standard security settings shipped with Debian. Maybe that's fine, I am not sure what our policy is (do we need to apply the custom security settings to all JVM deployments or just to, say, CAS/Kafka/etc.?)
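A minimal sketch of the wiring, assuming a systemd-managed service that reads JAVA_OPTS from an environment file (the property shown is only an example of what could live in the override, not a proposed setting):

# /etc/sysconfig/jvm.java.security -- example override, merged on top of the JDK defaults:
#   jdk.tls.disabledAlgorithms=SSLv3, TLSv1, TLSv1.1, RC4, MD5withRSA, DH keySize < 2048
# in the service's environment file / systemd drop-in:
JAVA_OPTS="$JAVA_OPTS -Djava.security.properties=/etc/sysconfig/jvm.java.security"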
Sat, Mar 14
Nothing in racadm getsel, nothing in mariadb's journalctl and dmesg, so I am inclined to mark this as a hardware-related stall (when I tried to ssh via serial console, the com2 console was a blank screen).
Next steps: