
klausman (Tobias Klausmann)
User

User Details

User Since
Aug 31 2020, 9:52 AM (275 w, 5 d)
Availability
Available
LDAP User
Klausman
MediaWiki User
TKlausmann (WMF) [ Global Accounts ]

Recent Activity

Fri, Nov 28

klausman closed T411082: Remove old GPUs from ml-serve1001 as Resolved.

Machine has been reimaged and is back in the cluster, closing.

Fri, Nov 28, 10:08 AM · SRE, DC-Ops, ops-eqiad, Machine-Learning-Team

Wed, Nov 26

klausman closed T327241: Move the kserve custom helm chart to the upstream one as Resolved.

Folded into T367048

Wed, Nov 26, 1:02 PM · Machine-Learning-Team
klausman closed T348155: Goal: Decide on an optional Lift Wing caching strategy for model servers as Resolved.

A concrete approach is being tracked in T401778.

Wed, Nov 26, 1:01 PM · Goal, Machine-Learning-Team

Fri, Nov 21

klausman added a comment to T408538: Create a Revise Tone Task Generator in LiftWing.

Thus, our setup requires connection between 2 services:

  1. revise-tone-task-generator service deployed in revise-tone-task-generator namespace.
  2. outlink-topic-model service deployed in articletopic-outlink namespace.

Which initiates the connection? 1 or 2? Or is it that both can initiate a connection to the other one?

Fri, Nov 21, 4:20 PM · Patch-For-Review, Machine-Learning-Team

Fri, Nov 14

klausman created P85326 (An Untitled Masterwork).
Fri, Nov 14, 11:39 AM

Nov 4 2025

klausman updated the task description for T380279: Move Lab machines into analytics net for DL access and switch to homedirs on Ceph.
Nov 4 2025, 3:37 PM · Essential-Work, Data-Platform-SRE, Machine-Learning-Team

Nov 3 2025

klausman added a comment to T367048: Update kserve to 0.15.2.

Thank you so much for working on that one @klausman!

Since the ml-lab1001 now uses the big TB filesystem, would it be possible to enable the buildkit daemon in order to use blubber, so ml-lab could be the dedicated machine for testing our production images and blubbers without translating the blubber files into Dockerfiles?

Nov 3 2025, 1:02 PM · Essential-Work, Patch-For-Review, Machine-Learning-Team

Oct 27 2025

klausman added a comment to T405647: eqiad row C/D Machine Learning host migrations.

Please note this migration has shifted from Oct 15th start date to Nov 1 start date.

Oct 27 2025, 3:54 PM · Machine-Learning-Team, SRE, DC-Ops, ops-eqiad
klausman added a comment to T367048: Update kserve to 0.15.2.

And here is the complete build log of the run mentioned above.

Oct 27 2025, 9:23 AM · Essential-Work, Patch-For-Review, Machine-Learning-Team
klausman added a comment to T367048: Update kserve to 0.15.2.

The issue on ml-lab1001 was that docker did not use the big multi-TB filesystem as storage for images; it does now. I copied the Dockerfile from Georgios' homedir to a subdir of mine and ran docker build. It seems to have worked fine:

 ---> 82a4c1b28e66
Successfully built 82a4c1b28e66
Successfully tagged hf:update_kserve
ml-lab1001 hf $ docker image ls
REPOSITORY                                                                                       TAG              IMAGE ID       CREATED          SIZE
hf                                                                                               update_kserve    82a4c1b28e66   15 seconds ago   14GB
Oct 27 2025, 9:20 AM · Essential-Work, Patch-For-Review, Machine-Learning-Team

Oct 21 2025

klausman added a member for WMF-NDA: DPogorzelski-WMF.
Oct 21 2025, 9:32 AM
klausman added a member for acl*sre-team: DPogorzelski-WMF.
Oct 21 2025, 9:30 AM

Oct 15 2025

klausman added a comment to T405647: eqiad row C/D Machine Learning host migrations.

ml-cache1002 can be done anytime, it just needs an Icinga/Prometheus downtime.

Oct 15 2025, 8:06 AM · Machine-Learning-Team, SRE, DC-Ops, ops-eqiad

Sep 30 2025

klausman closed T398600: Upgrade the AMD GPU plugin for k8s to support MI300 GPUs as Resolved.

This has been rolled out to both eqiad and codfw GPU machines and I restarted our one prod pod that uses GPUs (editcheck). Everything looking good.

Sep 30 2025, 3:12 PM · Patch-For-Review, Essential-Work, Machine-Learning-Team
klausman closed T398600: Upgrade the AMD GPU plugin for k8s to support MI300 GPUs, a subtask of T398948: Q1 FY2025-26 Goal: Operational Excellence - LiftWing Platform Updates & Improvements, as Resolved.
Sep 30 2025, 3:12 PM · Goal, Machine-Learning-Team
klausman closed T398600: Upgrade the AMD GPU plugin for k8s to support MI300 GPUs, a subtask of T403599: Setup & experiments for MI300x GPUs used for LiftWing, as Resolved.
Sep 30 2025, 3:12 PM · Machine-Learning-Team

Sep 25 2025

klausman created P83466 (An Untitled Masterwork).
Sep 25 2025, 8:30 AM

Sep 24 2025

klausman added a comment to T405338: Calculate tone check model service metrics for fixed calendar window.

As for the discrepancy (~97% vs. ~99%), I just ran the equivalent of my query (using `increase` etc.), but instead of looking at the destination namespace of edit-check, I used the selectors from your query (source_workload_namespace="istio-system", app="istio-ingressgateway", destination_service_namespace=~"edit-check", destination_service_name=~"edit-check-predictor.*"), and now my query agrees with the ~97% result from your initial query:
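For illustration only, and not the query from the ticket: a minimal Python sketch of how the selectors quoted above could be assembled into a PromQL success-ratio expression based on `increase`. The metric name `istio_requests_total`, the `response_code` label, the definition of success as "not a 5xx response", the `30d` window, and both helper names are assumptions made for this example.

```python
def build_selector(labels):
    """Render a PromQL label selector from {name: (operator, value)}."""
    parts = [f'{name}{op}"{value}"' for name, (op, value) in labels.items()]
    return "{" + ", ".join(parts) + "}"


def success_ratio_query(labels, window, metric="istio_requests_total"):
    """Build 'non-5xx requests / all requests' over a fixed window.

    Assumption: success means any response whose code is not 5xx.
    """
    ok_labels = dict(labels)
    ok_labels["response_code"] = ("!~", "5..")
    good = f"sum(increase({metric}{build_selector(ok_labels)}[{window}]))"
    total = f"sum(increase({metric}{build_selector(labels)}[{window}]))"
    return f"{good} / {total}"


# Selectors as quoted in the comment above.
SELECTORS = {
    "source_workload_namespace": ("=", "istio-system"),
    "app": ("=", "istio-ingressgateway"),
    "destination_service_namespace": ("=~", "edit-check"),
    "destination_service_name": ("=~", "edit-check-predictor.*"),
}

query = success_ratio_query(SELECTORS, "30d")
print(query)
```

Building both halves of the ratio from the same selector dictionary keeps numerator and denominator in agreement, which is the kind of mismatch the ~97% vs. ~99% discrepancy came down to.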

Sep 24 2025, 2:00 PM · Lift-Wing, Machine-Learning-Team
klausman added a comment to T405338: Calculate tone check model service metrics for fixed calendar window.

For latency, we'd use something like this:

Sep 24 2025, 12:26 PM · Lift-Wing, Machine-Learning-Team
klausman added a comment to T405338: Calculate tone check model service metrics for fixed calendar window.

I think this would work:

Sep 24 2025, 11:28 AM · Lift-Wing, Machine-Learning-Team
klausman added a comment to T403697: Experiment with amd-smi and the new AMD GPUs MI300x.

Today we found that amd-smi is not a drop-in replacement for rocm-smi when it comes to exporting metrics to Prometheus. We use our own Python wrapper to convert the output of rocm-smi to metrics that we then dump into node-exporter. The command-line parameters etc. have changed massively between the two tools, so we will have to adapt the wrapper.
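As a rough sketch of the output side of such a wrapper (this is not the team's actual script): once per-GPU values have been parsed from whichever tool is in use, they can be rendered in the Prometheus text exposition format. The function name, the `amd_gpu` prefix, the `gpu` label, and the sample values are all hypothetical, and the assumption that the result is handed to node-exporter via its textfile collector is mine. Parsing of rocm-smi or amd-smi output is deliberately left out.

```python
def render_metrics(samples, prefix="amd_gpu"):
    """Render {metric: {gpu_id: value}} as Prometheus text exposition lines.

    All metric names and the prefix are hypothetical placeholders.
    """
    lines = []
    for metric in sorted(samples):
        full_name = f"{prefix}_{metric}"
        lines.append(f"# TYPE {full_name} gauge")
        for gpu in sorted(samples[metric]):
            value = samples[metric][gpu]
            lines.append(f'{full_name}{{gpu="{gpu}"}} {value}')
    return "\n".join(lines) + "\n"


# Made-up sample values, standing in for parsed tool output.
example = {
    "temperature_celsius": {"0": 41.0, "1": 43.5},
    "power_watts": {"0": 120.0, "1": 118.0},
}
print(render_metrics(example), end="")
```

Keeping the rendering separate from the parsing means only the parsing half has to change when the underlying tool and its command-line parameters change.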

Sep 24 2025, 10:06 AM · Machine-Learning-Team

Sep 19 2025

klausman closed T403047: Enable alerts for outdated admin_ng charts for ml-team as Resolved.

This has been deployed and confirmed working.

Sep 19 2025, 1:30 PM · Essential-Work, Machine-Learning-Team
klausman edited P83440 (An Untitled Masterwork).
Sep 19 2025, 9:27 AM
klausman created P83440 (An Untitled Masterwork).
Sep 19 2025, 9:23 AM

Sep 17 2025

klausman claimed T398600: Upgrade the AMD GPU plugin for k8s to support MI300 GPUs.
Sep 17 2025, 11:50 AM · Patch-For-Review, Essential-Work, Machine-Learning-Team
klausman added a comment to T403697: Experiment with amd-smi and the new AMD GPUs MI300x.

@klausman ml-serve1012 is up and running with 6.16 from backports, and nvtop seems to work without horrors in the dmesg. Also please note that rocm-smi is now /opt/rocm-6.4.3/bin/amd-smi, please use that instead of the Debian one. I haven't tried to partition the GPU yet, so the horrors may come afterwards. If you want to do some extra checks before the partitioning lemme know so we can assess if everything works beforehand :)

Sep 17 2025, 11:43 AM · Machine-Learning-Team

Sep 11 2025

klausman added a comment to T403697: Experiment with amd-smi and the new AMD GPUs MI300x.

Seeing the partial successes above, I tried playing around a bit today, running rocm-smi and nvtop (the latter was originally nvidia-only, but current versions support AMD/ROCm as well). Unfortunately, both of them hang. Worse, the kernel is extremely unhappy about doing that: dmesg is full of breakage messages (log is attached). The only additional tool that I managed to get to work was this:

Sep 11 2025, 1:44 PM · Machine-Learning-Team

Sep 8 2025

klausman added a comment to T401964: PXE provision script needed for ml-lab and ml-serve hosts.

ml-lab1002 fails the same way. I haven't tried 1001, but I suspect it would fail the same way as well.

Sep 8 2025, 9:08 AM · SRE, ops-eqiad, DC-Ops
klausman added a comment to T401964: PXE provision script needed for ml-lab and ml-serve hosts.

Same happening with 1010. I'll put everything back in service and try the lab machines now.

Sep 8 2025, 9:04 AM · SRE, ops-eqiad, DC-Ops
klausman added a comment to T401964: PXE provision script needed for ml-lab and ml-serve hosts.

On 1009, the cookbook fails with:

Sep 8 2025, 8:58 AM · SRE, ops-eqiad, DC-Ops
klausman added a comment to T401964: PXE provision script needed for ml-lab and ml-serve hosts.

When trying to run the cookbook against ml-serve1008, I got this:

Sep 8 2025, 8:31 AM · SRE, ops-eqiad, DC-Ops
klausman added a comment to T401441: Check list of PXE miss-configs for eqiad.
Sep 8 2025, 8:28 AM · SRE, ops-eqiad, DC-Ops

Sep 3 2025

klausman added a parent task for T398600: Upgrade the AMD GPU plugin for k8s to support MI300 GPUs: T403599: Setup & experiments for MI300x GPUs used for LiftWing.
Sep 3 2025, 9:51 AM · Patch-For-Review, Essential-Work, Machine-Learning-Team
klausman added a subtask for T403599: Setup & experiments for MI300x GPUs used for LiftWing: T398600: Upgrade the AMD GPU plugin for k8s to support MI300 GPUs.
Sep 3 2025, 9:51 AM · Machine-Learning-Team
klausman created T403599: Setup & experiments for MI300x GPUs used for LiftWing.
Sep 3 2025, 9:50 AM · Machine-Learning-Team

Aug 27 2025

klausman created T403047: Enable alerts for outdated admin_ng charts for ml-team.
Aug 27 2025, 12:20 PM · Essential-Work, Machine-Learning-Team

Aug 15 2025

klausman added a comment to T401964: PXE provision script needed for ml-lab and ml-serve hosts.

Any time next week during my usual waking hours (0800-1800 UTC) should be doable. Just ping me on IRC.

Aug 15 2025, 7:14 AM · SRE, ops-eqiad, DC-Ops

Aug 11 2025

klausman added a comment to T386889: MinT: Deployment timeouts for eqiad.

@klausman While deploying T335491, I didn't see any timeout for eqiad. Should we close this task?

Should we go ahead?

Aug 11 2025, 8:05 AM · LPL Projects (MinT for Wikireaders – FY26 WE 3.1.5), LPL Essential (2025 Jul-Oct), Unplanned-Sprint-Work, MinT
klausman added a comment to T396717: Fix PXE miss-configurations.

@klausman hello hope all is well. Is it possible to give us a day and time when you will be available to help us work on those servers? Thank you. It shouldn't take more than 10 minutes to fix each server.

Aug 11 2025, 8:05 AM · SRE, ops-eqiad, DC-Ops, ops-codfw

Jul 10 2025

klausman created P78864 (An Untitled Masterwork).
Jul 10 2025, 8:53 AM
klausman created P78862 (An Untitled Masterwork).
Jul 10 2025, 8:27 AM

Jul 9 2025

klausman added a comment to T393948: Q4:rack/setup/install ml-serve101[2345].

@klausman Will this be legacy or UEFI? It is reachable.

Jul 9 2025, 9:35 AM · SRE, Machine-Learning-Team, ops-eqiad, DC-Ops

Jul 8 2025

klausman closed T367875: Reimage all ml-serve machines with Bookworm as Resolved.

This has been completed in the course of assorted other work like the containerd updates.

Jul 8 2025, 2:31 PM · Machine-Learning-Team
klausman added a comment to T398533: Update knative's queue proxy image and the Swift/S3 accounts used on ml-serve clusters.

@klausman Could you assist with this? I plan to test in staging first.

Jul 8 2025, 1:40 PM · Essential-Work, Machine-Learning-Team
klausman renamed T380722: Update kserve to v0.15.2* on ML clusters from Update kserve to v0.13.0 on ML clusters to Update kserve to v0.15.2* on ML clusters.
Jul 8 2025, 12:00 PM · Essential-Work, Machine-Learning-Team, Data-Platform-SRE, Kubernetes, Prod-Kubernetes, serviceops
klausman added a comment to T380722: Update kserve to v0.15.2* on ML clusters.

@klausman Shall we rename this task and switch to a newer version? A candidate could be the latest version 0.15.2

Jul 8 2025, 11:57 AM · Essential-Work, Machine-Learning-Team, Data-Platform-SRE, Kubernetes, Prod-Kubernetes, serviceops

Jul 2 2025

klausman added a comment to T394778: Build and push images to the docker registry from ml-lab.

I've written up my thoughts, and some of the things we discussed outside of this ticket regarding making vLLM images available for use with LiftWing workloads:

Jul 2 2025, 1:13 PM · Machine-Learning-Team

Jun 24 2025

klausman edited P78668 (An Untitled Masterwork).
Jun 24 2025, 1:05 PM
klausman created P78668 (An Untitled Masterwork).
Jun 24 2025, 12:09 PM

Jun 17 2025

klausman added a comment to T335491: Provide better long-term storage for translation models.

Found it: the secrets were not wired up for staging because I had a brain fart when setting that up. It's been fixed in the private repo with commit 7bc13c5d2 (https://gerrit.wikimedia.org/r/c/labs/private/+/1160032 on the pseudo-private one), and staging now shows the correct diff:

Jun 17 2025, 8:45 AM · LPL Projects (MinT for Wikireaders – FY26 WE 3.1.5), LPL Essential (2025 Jul-Oct), SRE-swift-storage, MinT
klausman added a comment to T335491: Provide better long-term storage for translation models.

@klausman is there any reason why we can't see the following in the diff in staging?

+ data:
+   AWS_ACCESS_KEY_ID: '++++++++ # (18 bytes)'
+   AWS_SECRET_ACCESS_KEY: '++++++++ # (16 bytes)'
Jun 17 2025, 8:35 AM · LPL Projects (MinT for Wikireaders – FY26 WE 3.1.5), LPL Essential (2025 Jul-Oct), SRE-swift-storage, MinT

Jun 10 2025

klausman closed T391465: Alert in need of triage: DiskSpace (instance ml-lab1001:9100) as Resolved.

SSDs have been enabled and 1002 is using Ceph homedirs.

Jun 10 2025, 2:53 PM · Machine-Learning-Team, sre-alert-triage

Jun 3 2025

klausman added a comment to T335491: Provide better long-term storage for translation models.

With the above patch (and the private repo stuff) merged, we can diff on the deployment server (I elided some unrelated changes):

Jun 3 2025, 3:40 PM · LPL Projects (MinT for Wikireaders – FY26 WE 3.1.5), LPL Essential (2025 Jul-Oct), SRE-swift-storage, MinT
klausman added a comment to P76947 (An Untitled Masterwork).

Full error/backtrace:

Jun 3 2025, 1:17 PM
klausman created P76947 (An Untitled Masterwork).
Jun 3 2025, 1:05 PM
klausman added a comment to P76928 (An Untitled Masterwork).
 # curl -H 'Host: api.wikimedia.org' 'https://mw-api-int-ro.discovery.wmnet:4446/w/api.php?action=query&formatversion=2&prop=revisions&revids=145456986&rvprop=ids%7Ccontent%7Ccomment%7Ctimestamp%7Csize%7Cuserid%7Ctags&rvslots=main&format=json'
{"batchcomplete":true,"query":{"badrevids":{"145456986":{"revid":145456986,"missing":true}}}}
Jun 3 2025, 12:06 PM
klausman created P76928 (An Untitled Masterwork).
Jun 3 2025, 11:54 AM

May 27 2025

klausman updated the task description for T387854: Migrate all Lift Wing k8s workers to Bookworm and containerd.
May 27 2025, 4:22 PM · Machine-Learning-Team
klausman updated the task description for T387854: Migrate all Lift Wing k8s workers to Bookworm and containerd.
May 27 2025, 1:43 PM · Machine-Learning-Team
klausman updated the task description for T387854: Migrate all Lift Wing k8s workers to Bookworm and containerd.
May 27 2025, 10:29 AM · Machine-Learning-Team
klausman updated the task description for T387854: Migrate all Lift Wing k8s workers to Bookworm and containerd.
May 27 2025, 9:02 AM · Machine-Learning-Team

May 22 2025

klausman added a comment to T394778: Build and push images to the docker registry from ml-lab.

For this to work, appropriate credentials need to be on ml-lab1002 (or 1001). The future-proof way to do this would be to either apply the relevant Puppet role(s) to it, or, if that adds too much functionality/infrastructure, extract the relevant bits from that role or make that role modular as needed.

May 22 2025, 8:41 AM · Machine-Learning-Team

May 15 2025

klausman renamed T393948: Q4:rack/setup/install ml-serve101[2345] from Q4:rack/setup/install ml-serve101[23] to Q4:rack/setup/install ml-serve101[2345].
May 15 2025, 2:55 PM · SRE, Machine-Learning-Team, ops-eqiad, DC-Ops

May 13 2025

klausman placed T393948: Q4:rack/setup/install ml-serve101[2345] up for grabs.
May 13 2025, 7:43 AM · SRE, Machine-Learning-Team, ops-eqiad, DC-Ops

May 7 2025

klausman created T393566: Add the ML team to the POSIX group `docker` on the ML lab machines..
May 7 2025, 9:49 AM · Machine-Learning-Team

May 6 2025

klausman created T393475: ML Services causing log spam.
May 6 2025, 2:38 PM · Machine-Learning-Team

Apr 15 2025

klausman added a comment to T385173: Use rocm/vllm image on Lift Wing.

And the underlying CSV data for the above graph:

Apr 15 2025, 3:06 PM · Patch-For-Review, Lift-Wing, Machine-Learning-Team
klausman added a comment to T385173: Use rocm/vllm image on Lift Wing.

The full chart for running the benchmark as described by Kevin above, on the SMC-provided MI300X test machine (using one of the 8 GPUs).

Apr 15 2025, 3:04 PM · Patch-For-Review, Lift-Wing, Machine-Learning-Team
klausman added a comment to T385173: Use rocm/vllm image on Lift Wing.
Apr 15 2025, 1:09 PM · Patch-For-Review, Lift-Wing, Machine-Learning-Team
klausman added a comment to T385173: Use rocm/vllm image on Lift Wing.

Nice work Kevin!
@kevinbazira @klausman could we run the same benchmark on the MI300X?

Apr 15 2025, 8:32 AM · Patch-For-Review, Lift-Wing, Machine-Learning-Team

Apr 11 2025

klausman added a comment to T391465: Alert in need of triage: DiskSpace (instance ml-lab1001:9100).

I've deleted 30GB from my home directory.
@klausman are there any quick wins to clean up disk space for now?
I think purging the huggingface cache (aka just deleting the files under /src/hf-cache/hub/) would be ok imo

Apr 11 2025, 2:06 PM · Machine-Learning-Team, sre-alert-triage

Apr 9 2025

klausman edited P74818 (An Untitled Masterwork).
Apr 9 2025, 2:42 PM
klausman edited P74818 (An Untitled Masterwork).
Apr 9 2025, 2:41 PM
klausman created P74818 (An Untitled Masterwork).
Apr 9 2025, 2:34 PM

Mar 31 2025

klausman created P74505 (An Untitled Masterwork).
Mar 31 2025, 2:19 PM
klausman created P74501 (An Untitled Masterwork).
Mar 31 2025, 10:15 AM
klausman created P74499 (An Untitled Masterwork).
Mar 31 2025, 9:09 AM

Mar 27 2025

klausman created P74458 (An Untitled Masterwork).
Mar 27 2025, 11:30 AM

Mar 25 2025

klausman updated the task description for T387854: Migrate all Lift Wing k8s workers to Bookworm and containerd.
Mar 25 2025, 5:10 PM · Machine-Learning-Team
klausman created P74417 (An Untitled Masterwork).
Mar 25 2025, 4:08 PM
klausman updated the task description for T387854: Migrate all Lift Wing k8s workers to Bookworm and containerd.
Mar 25 2025, 3:02 PM · Machine-Learning-Team
klausman updated the task description for T387854: Migrate all Lift Wing k8s workers to Bookworm and containerd.
Mar 25 2025, 3:01 PM · Machine-Learning-Team
klausman updated the task description for T387854: Migrate all Lift Wing k8s workers to Bookworm and containerd.
Mar 25 2025, 1:33 PM · Machine-Learning-Team
klausman updated the task description for T387854: Migrate all Lift Wing k8s workers to Bookworm and containerd.
Mar 25 2025, 1:32 PM · Machine-Learning-Team

Mar 21 2025

klausman closed T381394: Q2:install SSD (hot swap additions) to ml-lab100[12] as Resolved.

@klausman This has been completed and the drives have been added. Is there anything additional we may need to do on our end?

Mar 21 2025, 1:47 PM · SRE, Machine-Learning-Team, ops-eqiad, DC-Ops

Mar 17 2025

klausman updated the task description for T387854: Migrate all Lift Wing k8s workers to Bookworm and containerd.
Mar 17 2025, 3:13 PM · Machine-Learning-Team

Mar 13 2025

klausman created P74223 (An Untitled Masterwork).
Mar 13 2025, 3:25 PM

Mar 4 2025

klausman created P74063 (An Untitled Masterwork).
Mar 4 2025, 5:21 PM
klausman edited P74051 (An Untitled Masterwork).
Mar 4 2025, 3:56 PM
klausman created P74051 (An Untitled Masterwork).
Mar 4 2025, 3:28 PM
klausman added a comment to T387854: Migrate all Lift Wing k8s workers to Bookworm and containerd.

One additional note: we used to use our own Partman recipe (partman/custom/kubernetes-node-overlay-large-kubelet.cfg). Since the larger kubelet partition is already part of the containerd partman recipe (partman/custom/kubernetes-node-containerd.cfg), we don't need to have our own version of that.

Mar 4 2025, 1:40 PM · Machine-Learning-Team
klausman created T387854: Migrate all Lift Wing k8s workers to Bookworm and containerd.
Mar 4 2025, 1:32 PM · Machine-Learning-Team

Mar 3 2025

klausman added a comment to T369493: Migrate ml-staging/ml-serve clusters off of Pod Security Policies.

@klausman we can check which inference DC takes the majority of the traffic and then depool the other one for a couple of hours. It shouldn't be a big deal: capacity-wise, we are able to handle all traffic from one DC.

Mar 3 2025, 10:39 AM · Patch-For-Review, Machine-Learning-Team, Kubernetes

Feb 28 2025

klausman added a comment to T369493: Migrate ml-staging/ml-serve clusters off of Pod Security Policies.

@klausman @isarantopoulos @achou The only thing that I can think of is the following:

  1. depool eqiad or codfw from inference.discovery.wmnet
  2. manually change an isvc in the depooled DC, and verify the problem with more time (what happens, errors, etc..)
  3. restore and repool once done
Feb 28 2025, 11:20 AM · Patch-For-Review, Machine-Learning-Team, Kubernetes

Feb 25 2025

klausman added a comment to T386969: Upgrade Cassandra clusters to v4.1.8.

@klausman are we OK to upgrade the ml-cache cluster?

Feb 25 2025, 8:55 AM · SecTeam-Processed, Vuln-VulnComponent, Data-Persistence, Cassandra, Infrastructure Security, Security

Feb 24 2025

klausman created P73512 (An Untitled Masterwork).
Feb 24 2025, 2:59 PM
klausman edited P73511 (An Untitled Masterwork).
Feb 24 2025, 2:53 PM
klausman created P73511 (An Untitled Masterwork).
Feb 24 2025, 2:52 PM

Feb 21 2025

klausman created P73501 (An Untitled Masterwork).
Feb 21 2025, 8:48 AM
klausman created P73500 (An Untitled Masterwork).
Feb 21 2025, 8:34 AM