
klausman (Tobias Klausmann)
User

Projects (9)

User Details

User Since
Aug 31 2020, 9:52 AM (301 w, 3 d)
Availability
Available
LDAP User
Klausman
MediaWiki User
TKlausmann (WMF) [ Global Accounts ]

Recent Activity

Tue, Jun 2

klausman added a comment to T420507: MI300 machines need startup tweaks.

All of the needed tweaks have been done. We may need to change things up if-when we decide on a different partitioning scheme or the like, but that should be a new Phab task.

Tue, Jun 2, 1:10 PM · Machine-Learning-Team (Q4 FY2025-26), Patch-For-Review, Essential-Work

Wed, May 27

klausman added a comment to T420438: Migrate ML k8s apiserver and services to IPIP.

@klausman the first code changes are out for staging; after applying them we'll be able to see if anything weird pops up. The procedure is listed in https://wikitech.wikimedia.org/wiki/Kubernetes/Clusters/IPIP, do you have time to do it?

Wed, May 27, 9:11 AM · Patch-For-Review, Machine-Learning-Team, Prod-Kubernetes, Kubernetes, Liberica, Traffic

Tue, May 26

klausman created P93025 (An Untitled Masterwork).
Tue, May 26, 2:36 PM

May 5 2026

klausman closed T414971: Alert in need of triage: HelmfileAdminNGPendingChanges (instance deploy1003:9100) as Resolved.

This originated with our changes to kserve/knative-serving. It has long since stopped firing, and if it fires again, it's likely unrelated to the original cause of this one, so closing.

May 5 2026, 2:47 PM · Machine-Learning-Team (Q4 FY2025-26), Essential-Work, sre-alert-triage

May 4 2026

klausman added a comment to T421461: Add 'iommu=pt' kernel parameter on MI300x nodes for direct GPU-to-GPU communication (PCIe P2P).

@klausman I had a chat with my team and we are OK with having it deployed on all MI300X nodes. Let me know if you still want to do it or not, I'll prep the updates for provisioning etc.

May 4 2026, 2:38 PM · Machine-Learning-Team (Q4 FY2025-26), OKR-Work
klausman added a comment to T420438: Migrate ML k8s apiserver and services to IPIP.

@klausman @DPogorzelski-WMF Hi! Do you have a timeline for this work?

May 4 2026, 10:21 AM · Patch-For-Review, Machine-Learning-Team, Prod-Kubernetes, Kubernetes, Liberica, Traffic
klausman claimed T420438: Migrate ML k8s apiserver and services to IPIP.
May 4 2026, 10:21 AM · Patch-For-Review, Machine-Learning-Team, Prod-Kubernetes, Kubernetes, Liberica, Traffic

Apr 24 2026

klausman added a comment to T420507: MI300 machines need startup tweaks.

So far, the changes have not had the desired effect. I did some deeper digging and tried gating the start of the amd-devplugin unit on the presence of the kfd devices via udev, i.e.

Apr 24 2026, 12:02 PM · Machine-Learning-Team (Q4 FY2025-26), Patch-For-Review, Essential-Work
klausman edited P91449 (An Untitled Masterwork).
Apr 24 2026, 11:50 AM
klausman created P91449 (An Untitled Masterwork).
Apr 24 2026, 11:46 AM
klausman created P91425 (An Untitled Masterwork).
Apr 24 2026, 10:32 AM
klausman added a comment to T424318: Add ml-serve101[45] to production cluster.

Steps 1-4 are covered in https://gerrit.wikimedia.org/r/c/operations/puppet/+/1275814

Apr 24 2026, 8:36 AM · Machine-Learning-Team (Q4 FY2025-26)
klausman created T424318: Add ml-serve101[45] to production cluster.
Apr 24 2026, 8:35 AM · Machine-Learning-Team (Q4 FY2025-26)
klausman created P91389 (An Untitled Masterwork).
Apr 24 2026, 8:07 AM

Apr 21 2026

klausman created T424049: k8s changes needed to allow article topic (and other future isvcs) to use the kserve v2 inference protocol (and gRPC).
Apr 21 2026, 2:58 PM · ServiceOps new, Traffic, Machine-Learning-Team (Q4 FY2025-26)
klausman added a comment to T421461: Add 'iommu=pt' kernel parameter on MI300x nodes for direct GPU-to-GPU communication (PCIe P2P).

The host needed to have the amd device plugin service restarted, so all the GPUs would be visible to k8s. I've done so, and the service is now scheduling on 1012.

Apr 21 2026, 8:42 AM · Machine-Learning-Team (Q4 FY2025-26), OKR-Work

Apr 16 2026

klausman added a comment to T421461: Add 'iommu=pt' kernel parameter on MI300x nodes for direct GPU-to-GPU communication (PCIe P2P).

I checked ml-serve1012: we have 'IOMMU': 'Auto', which IIUC may not be what the kernel needs in your case. Setting it to Enabled may be a good test; I can set it via spicerack shell anytime if needed, so you can reboot and check. Lemme know if you want it @klausman

Apr 16 2026, 4:26 PM · Machine-Learning-Team (Q4 FY2025-26), OKR-Work
klausman added a comment to T421461: Add 'iommu=pt' kernel parameter on MI300x nodes for direct GPU-to-GPU communication (PCIe P2P).

The parameter has been added to ml-serve-1012 and 1013, and the hosts have been rebooted. The workload we were looking at is atm scheduled on 1012. We haven't tested performance yet, but I see the following messages in the console:

Apr 16 2026, 11:52 AM · Machine-Learning-Team (Q4 FY2025-26), OKR-Work

Apr 15 2026

klausman added a comment to T422382: Degraded RAID on ml-serve1001.

Done & done.

Apr 15 2026, 3:26 PM · Machine-Learning-Team, DC-Ops, SRE, ops-eqiad

Apr 14 2026

klausman added a comment to T421903: Investigate enabling gRPC in LiftWing model servers.

After a clarifying chat with Luca about the intricacies of gRPC, HTTP/2 etc, I now have a better picture of what will need building (probably :) )

Apr 14 2026, 3:29 PM · Machine-Learning-Team (Q4 FY2025-26), Lift-Wing
klausman added a comment to T421903: Investigate enabling gRPC in LiftWing model servers.

To clarify a basic assumption I have: gRPC only works over HTTP/2, and HTTP/2 is always TLS-encrypted, i.e. there is no way to speak gRPC over a plaintext connection, or at least not with the standard libraries for gRPC. If that is not the case, things might be simpler (or way more complex ;) ).

Apr 14 2026, 12:34 PM · Machine-Learning-Team (Q4 FY2025-26), Lift-Wing
klausman added a comment to T422382: Degraded RAID on ml-serve1001.

I think we can run this machine on a single disk until its replacement arrives. Even if it dies entirely, we have enough serving capacity in eqiad to handle our current workload (especially with 1014 and 1015 about to be added). We should put in a long-term silence for the specific alert and then decom the machine once its replacement arrives or the other disk also dies. The failed disk is also part of an array we currently don't use, so it should all be good.

Apr 14 2026, 12:01 PM · Machine-Learning-Team, DC-Ops, SRE, ops-eqiad

Apr 1 2026

klausman added a comment to T421903: Investigate enabling gRPC in LiftWing model servers.

The following is a write-up of what I see as possible stumbling blocks for services on LiftWing using gRPC in addition to our current HTTP(S)/REST mode of operation. Note that none of these concerns are deal-breakers or insurmountable, but rather are aspects that need addressing, and may prove to be more than just 30m of work, due to the complexities of Istio, kserve and k8s in general.

Apr 1 2026, 1:52 PM · Machine-Learning-Team (Q4 FY2025-26), Lift-Wing

Mar 31 2026

klausman closed T412357: Install AMD GPU + torch version of ML Labs machines as Resolved.

I will close this task in favor of T380279, where related/additional work for the lab/build machines will be done.

Mar 31 2026, 2:57 PM · Machine-Learning-Team

Mar 27 2026

klausman updated subscribers of T421461: Add 'iommu=pt' kernel parameter on MI300x nodes for direct GPU-to-GPU communication (PCIe P2P).

Adding @MoritzMuehlenhoff for security aspects.

Mar 27 2026, 9:48 AM · Machine-Learning-Team (Q4 FY2025-26), OKR-Work

Mar 18 2026

klausman created T420507: MI300 machines need startup tweaks.
Mar 18 2026, 6:04 PM · Machine-Learning-Team (Q4 FY2025-26), Patch-For-Review, Essential-Work

Mar 11 2026

klausman added a comment to T414112: Deploy instance of hoarde as linked-artifacts(?) in k8s.

One thing of note about using gRPC vs. REST/HTTP(S) based communication is that a lot of the infrastructure we have built around LiftWing assumes HTTP services (e.g. using Envoy in HTTP mode). If we set up a gRPC service that services on LW call, or a gRPC service on Liftwing, we need to make sure that "the path is clear" in this sense, also regarding pod security policies and the like.

Mar 11 2026, 2:18 PM · ServiceOps-Services-Oids, ServiceOps new, User-Eevans, Patch-For-Review, Data-Persistence

Nov 28 2025

klausman closed T411082: Remove old GPUs from ml-serve1001 as Resolved.

Machine has been reimaged and is back in the cluster, closing.

Nov 28 2025, 10:08 AM · SRE, DC-Ops, ops-eqiad, Machine-Learning-Team

Nov 26 2025

klausman closed T327241: Move the kserve custom helm chart to the upstream one as Resolved.

Folded into T367048

Nov 26 2025, 1:02 PM · Machine-Learning-Team
klausman closed T348155: Goal: Decide on an optional Lift Wing caching strategy for model servers as Resolved.

A concrete approach is being tracked in T401778.

Nov 26 2025, 1:01 PM · Goal, Machine-Learning-Team

Nov 21 2025

klausman added a comment to T408538: Create a Revise Tone Task Generator in LiftWing.

Thus, our setup requires connection between 2 services:

  1. revise-tone-task-generator service deployed in revise-tone-task-generator namespace.
  2. outlink-topic-model service deployed in articletopic-outlink namespace.

Which initiates the connection? 1 or 2? Or is it that both can initiate a connection to the other one?

Nov 21 2025, 4:20 PM · Patch-For-Review, Machine-Learning-Team

Nov 14 2025

klausman created P85326 (An Untitled Masterwork).
Nov 14 2025, 11:39 AM

Nov 4 2025

klausman updated the task description for T380279: Move Lab machines into analytics net for DL access and switch to homedirs on Ceph.
Nov 4 2025, 3:37 PM · Essential-Work, Data-Platform-SRE, Machine-Learning-Team

Nov 3 2025

klausman added a comment to T367048: Update kserve to 0.15.2.

Thank you so much for working on that one @klausman!

Since the ml-lab1001 now uses the big TB filesystem, would it be possible to enable the buildkit daemon in order to use blubber, so ml-lab could be the dedicated machine for testing our production images and blubbers without translating the blubber files into Dockerfiles?

Nov 3 2025, 1:02 PM · Essential-Work, Patch-For-Review, Machine-Learning-Team

Oct 27 2025

klausman added a comment to T405647: eqiad row C/D Machine Learning host migrations.

Please note this migration has shifted from Oct 15th start date to Nov 1 start date.

Oct 27 2025, 3:54 PM · Machine-Learning-Team, SRE, DC-Ops, ops-eqiad
klausman added a comment to T367048: Update kserve to 0.15.2.

And here is the complete build log of the run mentioned above.

Oct 27 2025, 9:23 AM · Essential-Work, Patch-For-Review, Machine-Learning-Team
klausman added a comment to T367048: Update kserve to 0.15.2.

The issue on ml-lab1001 is (was, now) that docker did not use the big multi-TB filesystem as storage for images, but it does now. I copied the Dockerfile from Georgios' homedir to a subdir of mine and ran docker build. It seems to have worked fine:

 ---> 82a4c1b28e66
Successfully built 82a4c1b28e66
Successfully tagged hf:update_kserve
ml-lab1001 hf $ docker image ls
REPOSITORY                                                                                       TAG              IMAGE ID       CREATED          SIZE
hf                                                                                               update_kserve    82a4c1b28e66   15 seconds ago   14GB
Oct 27 2025, 9:20 AM · Essential-Work, Patch-For-Review, Machine-Learning-Team

Oct 21 2025

klausman added a member for WMF-NDA: DPogorzelski-WMF.
Oct 21 2025, 9:32 AM
klausman added a member for acl*sre-team: DPogorzelski-WMF.
Oct 21 2025, 9:30 AM

Oct 15 2025

klausman added a comment to T405647: eqiad row C/D Machine Learning host migrations.

ml-cache1002 can be done anytime, it just needs an Icinga/Prometheus downtime.

Oct 15 2025, 8:06 AM · Machine-Learning-Team, SRE, DC-Ops, ops-eqiad

Sep 30 2025

klausman closed T398600: Upgrade the AMD GPU plugin for k8s to support MI300 GPUs as Resolved.

This has been rolled out to both eqiad and codfw GPU machines and I restarted our one prod pod that uses GPUs (editcheck). Everything looking good.

Sep 30 2025, 3:12 PM · Patch-For-Review, Essential-Work, Machine-Learning-Team
klausman closed T398600: Upgrade the AMD GPU plugin for k8s to support MI300 GPUs, a subtask of T398948: Q1 FY2025-26 Goal: Operational Excellence - LiftWing Platform Updates & Improvements, as Resolved.
Sep 30 2025, 3:12 PM · Goal, Machine-Learning-Team
klausman closed T398600: Upgrade the AMD GPU plugin for k8s to support MI300 GPUs, a subtask of T403599: Setup & experiments for MI300x GPUs used for LiftWing, as Resolved.
Sep 30 2025, 3:12 PM · Machine-Learning-Team

Sep 25 2025

klausman created P83466 (An Untitled Masterwork).
Sep 25 2025, 8:30 AM

Sep 24 2025

klausman added a comment to T405338: Calculate tone check model service metrics for fixed calendar window.

As for the discrepancy (~97% vs. ~99%), I just ran the equivalent of my query (using `increase` etc.) but instead of looking at the destination namespace of edit-check, used the selectors from your query (source_workload_namespace="istio-system", app="istio-ingressgateway", destination_service_namespace=~"edit-check", destination_service_name=~"edit-check-predictor.*"), and now my query agrees with the ~97% result from your initial query:

Sep 24 2025, 2:00 PM · Lift-Wing, Machine-Learning-Team
klausman added a comment to T405338: Calculate tone check model service metrics for fixed calendar window.

For latency, we'd use something like this:

Sep 24 2025, 12:26 PM · Lift-Wing, Machine-Learning-Team
klausman added a comment to T405338: Calculate tone check model service metrics for fixed calendar window.

I think this would work:

Sep 24 2025, 11:28 AM · Lift-Wing, Machine-Learning-Team
klausman added a comment to T403697: Experiment with amd-smi and the new AMD GPUs MI300x.

Today we found that amd-smi is not a drop-in replacement for rocm-smi when it comes to exporting metrics to Prometheus. We use our own Python wrapper to convert the output of rocm-smi to metrics that we then dump into node-exporter. The command-line parameters etc. have massively changed between the two tools, so we will have to adapt it.
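
As a rough illustration of the conversion step described above (not the actual wrapper), here is a minimal sketch that renders already-parsed per-GPU readings in the Prometheus text exposition format. The function name `to_prometheus_text`, the `amd_gpu` prefix, and the metric names in the sample are all hypothetical; parsing the real rocm-smi or amd-smi output is deliberately left out because their formats differ.

```python
def to_prometheus_text(gpu_stats, prefix="amd_gpu"):
    """Render {gpu_index: {metric_name: value}} as Prometheus text lines.

    The prefix and metric names are illustrative assumptions, not the
    names used by the real wrapper.
    """
    lines = []
    for gpu, stats in sorted(gpu_stats.items()):
        for name, value in sorted(stats.items()):
            # One sample per line: metric_name{label="value"} number
            lines.append(f'{prefix}_{name}{{gpu="{gpu}"}} {float(value)}')
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical readings, standing in for parsed tool output.
    sample = {0: {"temperature_celsius": 45, "power_watts": 120.5}}
    print(to_prometheus_text(sample), end="")
```

Keeping the rendering separate from the parsing means only the parser would need to change when switching tools. If the metrics are handed over through node-exporter's textfile collector (an assumption here), the rendered text would be written to a `.prom` file in the collector's directory.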

Sep 24 2025, 10:06 AM · Machine-Learning-Team

Sep 19 2025

klausman closed T403047: Enable alerts for outdated admin_ng charts for ml-team as Resolved.

This has been deployed and confirmed working.

Sep 19 2025, 1:30 PM · Essential-Work, Machine-Learning-Team
klausman edited P83440 (An Untitled Masterwork).
Sep 19 2025, 9:27 AM
klausman created P83440 (An Untitled Masterwork).
Sep 19 2025, 9:23 AM

Sep 17 2025

klausman claimed T398600: Upgrade the AMD GPU plugin for k8s to support MI300 GPUs.
Sep 17 2025, 11:50 AM · Patch-For-Review, Essential-Work, Machine-Learning-Team
klausman added a comment to T403697: Experiment with amd-smi and the new AMD GPUs MI300x.

@klausman ml-serve1012 is up and running with 6.16 from backports, and nvtop seems to work without horrors in the dmesg. Also please note that rocm-smi is now /opt/rocm-6.4.3/bin/amd-smi, please use that instead of the Debian one. I haven't tried to partition the GPU yet, so the horrors may come afterwards. If you want to do some extra checks before the partitioning lemme know so we can assess if everything works beforehand :)

Sep 17 2025, 11:43 AM · Machine-Learning-Team

Sep 11 2025

klausman added a comment to T403697: Experiment with amd-smi and the new AMD GPUs MI300x.

Seeing the partial successes above, I tried playing around a bit today, running rocm-smi and nvtop (the latter was originally nvidia-only, but current versions support AMD/ROCm as well). Unfortunately, both of them hang. Worse, the kernel is extremely unhappy about doing that: dmesg is full of breakage messages (log is attached). The only additional tool that I managed to get to work was this:

Sep 11 2025, 1:44 PM · Machine-Learning-Team

Sep 8 2025

klausman added a comment to T401964: PXE provision script needed for ml-lab and ml-serve hosts.

ml-lab1002 fails the same way. I haven't tried 1001, but I suspect it would fail the same way as well.

Sep 8 2025, 9:08 AM · SRE, ops-eqiad, DC-Ops
klausman added a comment to T401964: PXE provision script needed for ml-lab and ml-serve hosts.

Same happening with 1010. I'll put everything back in service and try the lab machines now.

Sep 8 2025, 9:04 AM · SRE, ops-eqiad, DC-Ops
klausman added a comment to T401964: PXE provision script needed for ml-lab and ml-serve hosts.

On 1009, the cookbook fails with:

Sep 8 2025, 8:58 AM · SRE, ops-eqiad, DC-Ops
klausman added a comment to T401964: PXE provision script needed for ml-lab and ml-serve hosts.

When trying to run the cookbook against ml-serve1008, I got this:

Sep 8 2025, 8:31 AM · SRE, ops-eqiad, DC-Ops
klausman added a comment to T401441: Check list of PXE miss-configs for eqiad.
Sep 8 2025, 8:28 AM · SRE, ops-eqiad, DC-Ops

Sep 3 2025

klausman added a parent task for T398600: Upgrade the AMD GPU plugin for k8s to support MI300 GPUs: T403599: Setup & experiments for MI300x GPUs used for LiftWing.
Sep 3 2025, 9:51 AM · Patch-For-Review, Essential-Work, Machine-Learning-Team
klausman added a subtask for T403599: Setup & experiments for MI300x GPUs used for LiftWing: T398600: Upgrade the AMD GPU plugin for k8s to support MI300 GPUs.
Sep 3 2025, 9:51 AM · Machine-Learning-Team
klausman created T403599: Setup & experiments for MI300x GPUs used for LiftWing.
Sep 3 2025, 9:50 AM · Machine-Learning-Team

Aug 27 2025

klausman created T403047: Enable alerts for outdated admin_ng charts for ml-team.
Aug 27 2025, 12:20 PM · Essential-Work, Machine-Learning-Team

Aug 15 2025

klausman added a comment to T401964: PXE provision script needed for ml-lab and ml-serve hosts.

Any time next week during my usual waking hours (0800-1800 UTC) should be doable. Just ping me on IRC.

Aug 15 2025, 7:14 AM · SRE, ops-eqiad, DC-Ops

Aug 11 2025

klausman added a comment to T386889: MinT: Deployment timeouts for eqiad.

@klausman While deploying T335491, I didn't see any timeout for eqiad. Should we close this task?

Should we go ahead?

Aug 11 2025, 8:05 AM · LPL Projects (MinT for Wikireaders – FY26 WE 3.1.5), LPL Essential (2025 Jul-Oct), Unplanned-Sprint-Work, MinT
klausman added a comment to T396717: Fix PXE miss-configurations.

@klausman hello, hope all is well. Is it possible to give us a day and time when you will be available to help us work on those servers? Thank you. It shouldn't take more than 10 minutes to fix each server.

Aug 11 2025, 8:05 AM · SRE, ops-eqiad, ops-codfw, DC-Ops

Jul 10 2025

klausman created P78864 (An Untitled Masterwork).
Jul 10 2025, 8:53 AM
klausman created P78862 (An Untitled Masterwork).
Jul 10 2025, 8:27 AM

Jul 9 2025

klausman added a comment to T393948: Q4:rack/setup/install ml-serve101[2345].

@klausman Will this be legacy or uefi? it is reachable

Jul 9 2025, 9:35 AM · SRE, Machine-Learning-Team, ops-eqiad, DC-Ops

Jul 8 2025

klausman closed T367875: Reimage all ml-serve machines with Bookworm as Resolved.

This has been completed in the course of assorted other work like the containerd updates.

Jul 8 2025, 2:31 PM · Machine-Learning-Team
klausman added a comment to T398533: Update knative's queue proxy image and the Swift/S3 accounts used on ml-serve clusters.

@klausman Could you assist with this? I plan to test in staging first.

Jul 8 2025, 1:40 PM · Essential-Work, Machine-Learning-Team
klausman renamed T380722: Update kserve to v0.15.2* on ML clusters from Update kserve to v0.13.0 on ML clusters to Update kserve to v0.15.2* on ML clusters.
Jul 8 2025, 12:00 PM · ServiceOps new, Essential-Work, Machine-Learning-Team, Kubernetes, Prod-Kubernetes
klausman added a comment to T380722: Update kserve to v0.15.2* on ML clusters.

@klausman Shall we rename this task and switch to a newer version? A candidate could be the latest version 0.15.2

Jul 8 2025, 11:57 AM · ServiceOps new, Essential-Work, Machine-Learning-Team, Kubernetes, Prod-Kubernetes

Jul 2 2025

klausman added a comment to T394778: Build and push images to the docker registry from ml-lab.

I've written up my thoughts, and some of the things we discussed outside of this ticket regarding making vLLM images available for use with LiftWing workloads:

Jul 2 2025, 1:13 PM · Machine-Learning-Team

Jun 24 2025

klausman edited P78668 (An Untitled Masterwork).
Jun 24 2025, 1:05 PM
klausman created P78668 (An Untitled Masterwork).
Jun 24 2025, 12:09 PM

Jun 17 2025

klausman added a comment to T335491: Provide better long-term storage for translation models.

Found it: the secrets were not wired up for staging because I had a brain fart when setting that up. It's been fixed in the private repo with commit 7bc13c5d2 (https://gerrit.wikimedia.org/r/c/labs/private/+/1160032 on the pseudo-private one), and staging now shows the correct diff:

Jun 17 2025, 8:45 AM · LPL Projects (MinT for Wikireaders – FY26 WE 3.1.5), LPL Essential (2025 Jul-Oct), SRE-swift-storage, MinT
klausman added a comment to T335491: Provide better long-term storage for translation models.

@klausman is there any reason why we can't see following in the diff in staging?

+ data:                                                                                                  
+   AWS_ACCESS_KEY_ID: '++++++++ # (18 bytes)'     
+   AWS_SECRET_ACCESS_KEY: '++++++++ # (16 bytes)'
Jun 17 2025, 8:35 AM · LPL Projects (MinT for Wikireaders – FY26 WE 3.1.5), LPL Essential (2025 Jul-Oct), SRE-swift-storage, MinT

Jun 10 2025

klausman closed T391465: Alert in need of triage: DiskSpace (instance ml-lab1001:9100) as Resolved.

SSDs have been enabled and 1002 is using Ceph homedirs.

Jun 10 2025, 2:53 PM · Machine-Learning-Team, sre-alert-triage

Jun 3 2025

klausman added a comment to T335491: Provide better long-term storage for translation models.

With the above patch (and the private repo stuff) merged, we can diff on the deployment server (I elided some unrelated changes):

Jun 3 2025, 3:40 PM · LPL Projects (MinT for Wikireaders – FY26 WE 3.1.5), LPL Essential (2025 Jul-Oct), SRE-swift-storage, MinT
klausman added a comment to P76947 (An Untitled Masterwork).

Full error/backtrace:

Jun 3 2025, 1:17 PM
klausman created P76947 (An Untitled Masterwork).
Jun 3 2025, 1:05 PM
klausman added a comment to P76928 (An Untitled Masterwork).
 # curl -H 'Host: api.wikimedia.org' 'https://mw-api-int-ro.discovery.wmnet:4446/w/api.php?action=query&formatversion=2&prop=revisions&revids=145456986&rvprop=ids%7Ccontent%7Ccomment%7Ctimestamp%7Csize%7Cuserid%7Ctags&rvslots=main&format=json'
{"batchcomplete":true,"query":{"badrevids":{"145456986":{"revid":145456986,"missing":true}}}}
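
The response above reports the requested revision under `badrevids` with `missing` set to true. A minimal sketch of how a client might detect that case, using only the JSON shown above; the helper name `missing_revids` is hypothetical:

```python
import json


def missing_revids(response_text):
    """Return the revision IDs that the API response marks as missing."""
    body = json.loads(response_text)
    bad = body.get("query", {}).get("badrevids", {})
    return sorted(int(entry["revid"]) for entry in bad.values() if entry.get("missing"))


if __name__ == "__main__":
    # The response body from the curl call above, copied verbatim.
    sample = (
        '{"batchcomplete":true,"query":{"badrevids":'
        '{"145456986":{"revid":145456986,"missing":true}}}}'
    )
    print(missing_revids(sample))  # [145456986]
```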
Jun 3 2025, 12:06 PM
klausman created P76928 (An Untitled Masterwork).
Jun 3 2025, 11:54 AM

May 27 2025

klausman updated the task description for T387854: Migrate all Lift Wing k8s workers to Bookworm and containerd.
May 27 2025, 4:22 PM · Machine-Learning-Team
klausman updated the task description for T387854: Migrate all Lift Wing k8s workers to Bookworm and containerd.
May 27 2025, 1:43 PM · Machine-Learning-Team
klausman updated the task description for T387854: Migrate all Lift Wing k8s workers to Bookworm and containerd.
May 27 2025, 10:29 AM · Machine-Learning-Team
klausman updated the task description for T387854: Migrate all Lift Wing k8s workers to Bookworm and containerd.
May 27 2025, 9:02 AM · Machine-Learning-Team

May 22 2025

klausman added a comment to T394778: Build and push images to the docker registry from ml-lab.

For this to work, appropriate credentials need to be on ml-lab1002 (or 1001). The future-proof way to do this would be to either apply the relevant Puppet role(s) to it, or, if that adds too much functionality/infrastructure, extract the relevant bits from that role or make that role modular as needed.

May 22 2025, 8:41 AM · Machine-Learning-Team

May 15 2025

klausman renamed T393948: Q4:rack/setup/install ml-serve101[2345] from Q4:rack/setup/install ml-serve101[23] to Q4:rack/setup/install ml-serve101[2345].
May 15 2025, 2:55 PM · SRE, Machine-Learning-Team, ops-eqiad, DC-Ops

May 13 2025

klausman placed T393948: Q4:rack/setup/install ml-serve101[2345] up for grabs.
May 13 2025, 7:43 AM · SRE, Machine-Learning-Team, ops-eqiad, DC-Ops

May 7 2025

klausman created T393566: Add the ML team to the POSIX group `docker` on the ML lab machines..
May 7 2025, 9:49 AM · Machine-Learning-Team

May 6 2025

klausman created T393475: ML Services causing log spam.
May 6 2025, 2:38 PM · Machine-Learning-Team

Apr 15 2025

klausman added a comment to T385173: Use rocm/vllm image on Lift Wing.

And the underlying CSV data for the above graph:

Apr 15 2025, 3:06 PM · Patch-For-Review, Lift-Wing, Machine-Learning-Team
klausman added a comment to T385173: Use rocm/vllm image on Lift Wing.

The full chart for running the benchmark as described by Kevin above, on the SMC-provided MI300X test machine (using one of the 8 GPUs).

Apr 15 2025, 3:04 PM · Patch-For-Review, Lift-Wing, Machine-Learning-Team
klausman added a comment to T385173: Use rocm/vllm image on Lift Wing.
Apr 15 2025, 1:09 PM · Patch-For-Review, Lift-Wing, Machine-Learning-Team
klausman added a comment to T385173: Use rocm/vllm image on Lift Wing.

Nice work Kevin!
@kevinbazira @klausman could we run the same benchmark on the MI300X?

Apr 15 2025, 8:32 AM · Patch-For-Review, Lift-Wing, Machine-Learning-Team

Apr 11 2025

klausman added a comment to T391465: Alert in need of triage: DiskSpace (instance ml-lab1001:9100).

I've deleted 30GB from my home directory.
@klausman are there any quick wins to clean up disk space for now?
I think purging the huggingface cache (aka just deleting the files under /src/hf-cache/hub/) would be ok imo

Apr 11 2025, 2:06 PM · Machine-Learning-Team, sre-alert-triage

Apr 9 2025

klausman edited P74818 (An Untitled Masterwork).
Apr 9 2025, 2:42 PM
klausman edited P74818 (An Untitled Masterwork).
Apr 9 2025, 2:41 PM