User Details
- User Since
- Aug 31 2020, 9:52 AM (301 w, 3 d)
- Availability
- Available
- LDAP User
- Klausman
- MediaWiki User
- TKlausmann (WMF)
Tue, Jun 2
All of the needed tweaks have been done. We may need to change things up if and when we decide on a different partitioning scheme or the like, but that should be a new Phab task.
Wed, May 27
Tue, May 26
May 5 2026
This originated with our changes to kserve/knative-serving. It has long since stopped firing, and if it fires again, it's likely unrelated to the original cause of this one, so closing.
May 4 2026
Apr 24 2026
So far, the changes have not had the desired effect. I did some deeper digging and tried gating the start of the amd-devplugin unit on the presence of the kfd devices via udev, i.e.
Steps 1-4 are covered in https://gerrit.wikimedia.org/r/c/operations/puppet/+/1275814
Apr 21 2026
The host needed to have the amd device plugin service restarted, so all the GPUs would be visible to k8s. I've done so, and the service is now scheduling on 1012.
Apr 16 2026
The parameter has been added to ml-serve-1012 and 1013, and the hosts have been rebooted. The workload we were looking at is atm scheduled on 1012. We haven't tested performance yet, but I see the following messages in the console:
Apr 15 2026
Done & done.
Apr 14 2026
After a clarifying chat with Luca about the intricacies of gRPC, HTTP/2 etc, I now have a better picture of what will need building (probably :) )
To clarify a basic assumption I have: gRPC only works over HTTP/2, and HTTP/2 is always TLS-encrypted, i.e. there is no way to speak gRPC over a plaintext connection, or at least not with the standard libraries for gRPC. If that is not the case, things might be simpler (or way more complex ;) ).
I think we can run this machine on a single disk until its replacement arrives. Even if it dies entirely, we have enough serving capacity in eqiad to handle our current workload (especially with 1014 and 1015 about to be added). We should put in a long-term silence for the specific alert and then decom the machine once its replacement arrives or the other disk also dies. The failed disk is also part of an array we currently don't use, so it should all be good.
Apr 1 2026
The following is a write-up of what I see as possible stumbling blocks for services on LiftWing using gRPC in addition to our current HTTP(S)/REST mode of operation. Note that none of these concerns are deal-breakers or insurmountable, but rather are aspects that need addressing, and may prove to be more than just 30m of work, due to the complexities of Istio, kserve and k8s in general.
Mar 31 2026
I will close this task in favor of T380279, where related/additional work for the lab/build machines will be done.
Mar 27 2026
Adding @MoritzMuehlenhoff for security aspects.
Mar 18 2026
Mar 11 2026
One thing of note about using gRPC vs. REST/HTTP(S) based communication is that a lot of the infrastructure we have built around LiftWing assumes HTTP services (e.g. using Envoy in HTTP mode). If we set up a gRPC service that services on LiftWing call, or a gRPC service on LiftWing itself, we need to make sure that "the path is clear" in this sense, also regarding pod security policies and the like.
Nov 28 2025
Machine has been reimaged and is back in the cluster, closing.
Nov 26 2025
Folded into T367048
A concrete approach is being tracked in T401778.
Nov 21 2025
Nov 14 2025
Nov 4 2025
Nov 3 2025
Oct 27 2025
The issue on ml-lab1001 is (was, now) that docker did not use the big multi-TB filesystem as storage for images, but it does now. I copied the Dockerfile from Georgios' homedir to a subdir of mine and ran docker build. It seems to have worked fine:
---> 82a4c1b28e66
Successfully built 82a4c1b28e66
Successfully tagged hf:update_kserve
ml-lab1001 hf $ docker image ls
REPOSITORY TAG IMAGE ID CREATED SIZE
hf update_kserve 82a4c1b28e66 15 seconds ago 14GB
Oct 21 2025
Oct 15 2025
ml-cache1002 can be done anytime, it just needs an Icinga/Prometheus downtime.
Sep 30 2025
This has been rolled out to both eqiad and codfw GPU machines and I restarted our one prod pod that uses GPUs (editcheck). Everything looking good.
Sep 25 2025
Sep 24 2025
As for the discrepancy (~97% vs. ~99%), I just ran the equivalent of my query (using `increase` etc.) but instead of looking at the destination namespace of edit-check, used the selectors from your query (source_workload_namespace="istio-system", app="istio-ingressgateway", destination_service_namespace=~"edit-check", destination_service_name=~"edit-check-predictor.*"), and now my query agrees with the ~97% result from your initial query:
For latency, we'd use something like this:
I think this would work:
Today we found that amd-smi is not a drop-in replacement for rocm-smi when it comes to exporting metrics to Prometheus. We use our own Python wrapper to convert the output of rocm-smi to metrics that we then dump into node-exporter. The command-line parameters etc. have changed massively between the two tools, so we will have to adapt it.
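To illustrate the kind of conversion such a wrapper performs, here is a minimal sketch, not the actual wrapper. It assumes the tool's output has already been parsed into a dict of per-GPU readings; the input shape, the function name render_gpu_metrics, the amd_gpu prefix and the metric names are all hypothetical. It only renders Prometheus text exposition lines and makes no assumption about rocm-smi or amd-smi flags.

```python
from typing import Mapping


def render_gpu_metrics(
    readings: Mapping[str, Mapping[str, float]],
    prefix: str = "amd_gpu",  # hypothetical metric prefix
) -> str:
    """Render per-GPU readings as Prometheus text exposition lines.

    `readings` maps a GPU id to {metric_name: value}. This shape is an
    assumption for illustration, not the real output of either tool.
    """
    lines = []
    # Collect metric names across all GPUs so each TYPE line appears once.
    metric_names = sorted({name for values in readings.values() for name in values})
    for name in metric_names:
        full_name = f"{prefix}_{name}"
        lines.append(f"# TYPE {full_name} gauge")
        for gpu in sorted(readings):
            if name in readings[gpu]:
                lines.append(f'{full_name}{{gpu="{gpu}"}} {readings[gpu][name]}')
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    sample = {
        "0": {"temperature_celsius": 41.0, "power_watts": 120.5},
        "1": {"temperature_celsius": 39.0},
    }
    print(render_gpu_metrics(sample), end="")
```

Keeping the parsing step separate from the rendering step means that only the parser would need to change when switching tools; the rendered text could then be written to a file that node-exporter picks up (assuming its textfile collector is what is in use).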
Sep 19 2025
This has been deployed and confirmed working.
Sep 17 2025
Sep 11 2025
Seeing the partial successes above, I tried playing around a bit today, running rocm-smi and nvtop (the latter was originally nvidia-only, but current versions support AMD/ROCm as well). Unfortunately, both of them hang. Worse, the kernel is extremely unhappy about doing that: dmesg is full of breakage messages (log is attached). The only additional tool that I managed to get to work was this:
Sep 8 2025
ml-lab1002 fails the same way. I haven't tried 1001, but I suspect it would fail the same way as well.
Same happening with 1010. I'll put everything back in service and try the lab machines now.
On 1009, the cookbook fails with:
When trying to run the cookbook against ml-serve1008, I got this:
Sep 3 2025
Aug 27 2025
Aug 15 2025
Any time next week during my usual waking hours (0800-1800 UTC) should be doable. Just ping me on IRC.
Aug 11 2025
Jul 10 2025
Jul 9 2025
Jul 8 2025
This has been completed in the course of assorted other work like the containerd updates.
Jul 2 2025
I've written up my thoughts, and some of the things we discussed outside of this ticket regarding making vLLM images available for use with LiftWing workloads:
Jun 24 2025
Jun 17 2025
Found it: the secrets were not wired up for staging because I had a brain fart when setting that up. It's been fixed in the private repo with commit 7bc13c5d2 (https://gerrit.wikimedia.org/r/c/labs/private/+/1160032 on the pseudo-private one), and staging now shows the correct diff:
Jun 10 2025
SSDs have been enabled and 1002 is using Ceph homedirs.
Jun 3 2025
With the above patch (and the private repo stuff) merged, we can diff on the deployment server (I elided some unrelated changes):
Full error/backtrace:
# curl -H 'Host: api.wikimedia.org' 'https://mw-api-int-ro.discovery.wmnet:4446/w/api.php?action=query&formatversion=2&prop=revisions&revids=145456986&rvprop=ids%7Ccontent%7Ccomment%7Ctimestamp%7Csize%7Cuserid%7Ctags&rvslots=main&format=json'
{"batchcomplete":true,"query":{"badrevids":{"145456986":{"revid":145456986,"missing":true}}}}
May 27 2025
May 22 2025
For this to work, appropriate credentials need to be on ml-lab1002 (or 1001). The future-proof way to do this would be to either apply the relevant Puppet role(s) to it, or, if that adds too much functionality/infrastructure, extract the relevant bits from that role or make that role modular as needed.
May 15 2025
May 13 2025
May 7 2025
May 6 2025
Apr 15 2025
And the underlying CSV data for the above graph:
The full chart for running the benchmark as described by Kevin above, on the SMC-provided MI300X test machine (using one of the 8 GPUs).
