User Details
- User Since: Aug 31 2020, 9:52 AM
- Availability: Available
- LDAP User: Klausman
- MediaWiki User: TKlausmann (WMF)
Fri, Nov 28
Machine has been reimaged and is back in the cluster, closing.
Wed, Nov 26
Folded into T367048
A concrete approach is being tracked in T401778.
Fri, Nov 21
Fri, Nov 14
Nov 4 2025
Nov 3 2025
Oct 27 2025
The issue on ml-lab1001 was that docker did not use the big multi-TB filesystem as storage for images; it does now. I copied the Dockerfile from Georgios' homedir to a subdir of mine and ran docker build. It seems to have worked fine:
 ---> 82a4c1b28e66
Successfully built 82a4c1b28e66
Successfully tagged hf:update_kserve
ml-lab1001 hf $ docker image ls
REPOSITORY   TAG            IMAGE ID       CREATED          SIZE
hf           update_kserve  82a4c1b28e66   15 seconds ago   14GB
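For reference, pointing docker at a different storage location is usually done via the `data-root` key in `/etc/docker/daemon.json` (the path below is a placeholder; the actual mountpoint on ml-lab1001 may differ):

{
    "data-root": "/srv/docker"
}

Docker needs a restart after this change, and images already stored under the old root are not migrated automatically.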
Oct 21 2025
Oct 15 2025
ml-cache1002 can be done anytime, it just needs an Icinga/Prometheus downtime.
Sep 30 2025
This has been rolled out to both eqiad and codfw GPU machines and I restarted our one prod pod that uses GPUs (editcheck). Everything looking good.
Sep 25 2025
Sep 24 2025
As for the discrepancy (~97% vs. ~99%), I just ran the equivalent of my query (using `increase` etc.), but instead of looking at the destination namespace of edit-check, used the selectors from your query (source_workload_namespace="istio-system", app="istio-ingressgateway", destination_service_namespace=~"edit-check", destination_service_name=~"edit-check-predictor.*"), and now my query agrees with the ~97% result from your initial query:
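For reference, a success-rate query of the kind described, built on the selectors quoted above (the metric name, time window, and status-code matcher here are illustrative assumptions, not the exact query used):

sum(increase(istio_requests_total{source_workload_namespace="istio-system",
      app="istio-ingressgateway",
      destination_service_namespace=~"edit-check",
      destination_service_name=~"edit-check-predictor.*",
      response_code!~"5.."}[1h]))
/
sum(increase(istio_requests_total{source_workload_namespace="istio-system",
      app="istio-ingressgateway",
      destination_service_namespace=~"edit-check",
      destination_service_name=~"edit-check-predictor.*"}[1h]))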
For latency, we'd use something like this:
I think this would work:
Today we found that amd-smi is not a drop-in replacement for rocm-smi when it comes to exporting metrics to Prometheus. We use our own Python wrapper to convert the output of rocm-smi to metrics that we then dump into node-exporter. The command-line parameters etc. have changed massively between the two tools, so we will have to adapt it.
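A minimal sketch of the conversion step such a wrapper performs, assuming rocm-smi-style JSON input (the metric name and field label here are hypothetical, not necessarily what our actual wrapper emits):

```python
import json

def to_prom_lines(gpu_use: dict) -> list:
    """Turn a rocm-smi-style dict, e.g. {"card0": {"GPU use (%)": "17"}},
    into node-exporter textfile-collector lines."""
    lines = []
    for card, fields in sorted(gpu_use.items()):
        use = fields.get("GPU use (%)")
        if use is not None:
            # Hypothetical metric name; label syntax follows Prometheus conventions.
            lines.append(f'amd_gpu_usage_percent{{card="{card}"}} {float(use)}')
    return lines

# In production the input would come from the SMI tool's JSON output mode;
# the flag spelling differs between rocm-smi and amd-smi, which is exactly
# the adaptation problem described above.
sample = json.loads('{"card0": {"GPU use (%)": "17"}}')
print(to_prom_lines(sample))
```

The point of keeping the formatting step as a pure function is that only the invocation and parsing of the SMI tool has to change for amd-smi.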
Sep 19 2025
This has been deployed and confirmed working.
Sep 17 2025
Sep 11 2025
Seeing the partial successes above, I tried playing around a bit today, running rocm-smi and nvtop (the latter was originally nvidia-only, but current versions support AMD/ROCm as well). Unfortunately, both of them hang. Worse, the kernel is extremely unhappy about that: dmesg is full of breakage messages (log is attached). The only additional tool that I managed to get to work was this:
Sep 8 2025
ml-lab1002 fails the same way. I haven't tried 1001, but I suspect it would fail the same way as well.
Same happening with 1010. I'll put everything back in service and try the lab machines now.
On 1009, the cookbook fails with:
When trying to run the cookbook against ml-serve1008, I got this:
Sep 3 2025
Aug 27 2025
Aug 15 2025
Any time next week during my usual waking hours (0800-1800 UTC) should be doable. Just ping me on IRC.
Aug 11 2025
Jul 10 2025
Jul 9 2025
Jul 8 2025
This has been completed in the course of assorted other work like the containerd updates.
Jul 2 2025
I've written up my thoughts, and some of the things we discussed outside of this ticket regarding making vLLM images available for use with LiftWing workloads:
Jun 24 2025
Jun 17 2025
Found it: the secrets were not wired up for staging because I had a brain fart when setting that up. It's been fixed in the private repo with commit 7bc13c5d2 (https://gerrit.wikimedia.org/r/c/labs/private/+/1160032 on the pseudo-private one), and staging now shows the correct diff:
Jun 10 2025
SSDs have been enabled and 1002 is using Ceph homedirs.
Jun 3 2025
With the above patch (and the private repo stuff) merged, we can diff on the deployment server (I elided some unrelated changes):
Full error/backtrace:
# curl -H 'Host: api.wikimedia.org' 'https://mw-api-int-ro.discovery.wmnet:4446/w/api.php?action=query&formatversion=2&prop=revisions&revids=145456986&rvprop=ids%7Ccontent%7Ccomment%7Ctimestamp%7Csize%7Cuserid%7Ctags&rvslots=main&format=json'
{"batchcomplete":true,"query":{"badrevids":{"145456986":{"revid":145456986,"missing":true}}}}

May 27 2025
May 22 2025
For this to work, appropriate credentials need to be on ml-lab1002 (or 1001). The future-proof way to do this would be either to apply the relevant Puppet role(s) to it or, if that adds too much functionality/infrastructure, to extract the relevant bits from that role or make the role modular as needed.
May 15 2025
May 13 2025
May 7 2025
May 6 2025
Apr 15 2025
And the underlying CSV data for the above graph:
The full chart for running the benchmark as described by Kevin above, on the SMC-provided MI300X test machine (using one of the 8 GPUs).
Apr 11 2025
Apr 9 2025
Mar 31 2025
Mar 27 2025
Mar 25 2025
Mar 21 2025
Mar 17 2025
Mar 13 2025
Mar 4 2025
One additional note: we used to use our own Partman recipe (partman/custom/kubernetes-node-overlay-large-kubelet.cfg). Since the larger kubelet partition is already part of the containerd partman recipe (partman/custom/kubernetes-node-containerd.cfg), we don't need to have our own version of that.
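For reference, partman recipes are debian-installer preseed snippets; they have roughly this shape (a generic sketch, not the contents of the named WMF recipes):

# Illustrative d-i partman recipe fragment
d-i partman-auto/method string lvm
d-i partman-auto/expert_recipe string \
    custom :: \
        50000 50000 50000 ext4 \
            method{ format } format{ } \
            use_filesystem{ } filesystem{ ext4 } \
            mountpoint{ /var/lib/kubelet } \
        .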
