
DPogorzelski-WMF (Dawid Pogorzelski)
User

User Details

User Since
Oct 20 2025, 12:04 PM (34 w, 5 h)
Availability
Available
LDAP User
Unknown
MediaWiki User
DPogorzelski-WMF [ Global Accounts ]

Recent Activity

Thu, Jun 4

DPogorzelski-WMF added a comment to T422253: Transparent DNS Routing for LiftWing Services (eqiad vs Multi-DC).

i think the original question was whether we could use a single endpoint like inference.discovery.wmnet and be routed to the closest location if a service is present in both, or to the one specific location if the service is only present there, without having to hardcode location expectations upfront via inference.svc.eqiad.wmnet.
but now i realize this is simply geolocation-based name resolution: once you have resolved to a specific location, you go there and that's it.

Thu, Jun 4, 12:40 PM · Traffic, Machine-Learning-Team (Q4 FY2025-26), Essential-Work, Semantic Search, Lift-Wing
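
One way to check the behaviour described in the comment above from a client is to resolve both names and compare the address sets. This is a minimal Python sketch, not an existing tool: the helper names `resolve` and `same_target` are hypothetical, and it relies only on the standard library resolver of the host it runs on.

```python
import socket


def resolve(name: str) -> set[str]:
    """Return the set of IP addresses the local resolver gives for name."""
    infos = socket.getaddrinfo(name, None, proto=socket.IPPROTO_TCP)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return {info[4][0] for info in infos}


def same_target(name_a: str, name_b: str) -> bool:
    """True when both names currently resolve to at least one common address."""
    return bool(resolve(name_a) & resolve(name_b))
```

Calling `same_target("inference.discovery.wmnet", "inference.svc.eqiad.wmnet")` from a host in each site should indicate which site-specific record the discovery name maps to there. The answer depends on where the resolver sits, so it only speaks to the name resolution step, not to whether a given model exists at that site.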

Wed, Jun 3

DPogorzelski-WMF added a comment to T422253: Transparent DNS Routing for LiftWing Services (eqiad vs Multi-DC).

we are already in an active/active setup so it would be enough to have a curl command reference to confirm what the discovery looks like from a client perspective :)

Wed, Jun 3, 10:20 AM · Traffic, Machine-Learning-Team (Q4 FY2025-26), Essential-Work, Semantic Search, Lift-Wing
DPogorzelski-WMF added a comment to T422253: Transparent DNS Routing for LiftWing Services (eqiad vs Multi-DC).

but in that case we would see a good amount of 404 since some models don't exist in codfw, yet we don't, so what i'd like to check is what information is included in the service discovery response. i suspect it contains only the endpoints where the service actually exists

Wed, Jun 3, 8:48 AM · Traffic, Machine-Learning-Team (Q4 FY2025-26), Essential-Work, Semantic Search, Lift-Wing

Wed, May 27

DPogorzelski-WMF added a comment to T419722: Experiment with new kserve version on ml-staging-codfw.

sure thing

Wed, May 27, 9:08 AM · Machine-Learning-Team (Q4 FY2025-26), Patch-For-Review

Thu, May 21

DPogorzelski-WMF claimed T381883: Requesting write access to ml-serve-{eqiad,codfw} for ML team.
Thu, May 21, 1:51 PM · Essential-Work, Lift-Wing, Machine-Learning-Team

Wed, May 20

DPogorzelski-WMF updated the task description for T426823: Update prod kserve/knative.
Wed, May 20, 11:51 AM · ServiceOps new, Machine-Learning-Team (Q4 FY2025-26)
DPogorzelski-WMF created T426823: Update prod kserve/knative.
Wed, May 20, 8:40 AM · ServiceOps new, Machine-Learning-Team (Q4 FY2025-26)
DPogorzelski-WMF moved T419722: Experiment with new kserve version on ml-staging-codfw from Q4 FY2025-26 to 2025-2026 Q2 Done on the Machine-Learning-Team board.
Wed, May 20, 8:38 AM · Machine-Learning-Team (Q4 FY2025-26), Patch-For-Review
DPogorzelski-WMF closed T419722: Experiment with new kserve version on ml-staging-codfw as Resolved.
Wed, May 20, 8:37 AM · Machine-Learning-Team (Q4 FY2025-26), Patch-For-Review

Apr 28 2026

DPogorzelski-WMF closed T420507: MI300 machines need startup tweaks as Resolved.
Apr 28 2026, 1:27 PM · Machine-Learning-Team (Q4 FY2025-26), Patch-For-Review, Essential-Work
DPogorzelski-WMF closed T420507: MI300 machines need startup tweaks, a subtask of T424322: Setup MI300X nodes for LLM serving, as Resolved.
Apr 28 2026, 1:27 PM · Goal, Machine-Learning-Team (Q4 FY2025-26)
DPogorzelski-WMF moved T420507: MI300 machines need startup tweaks from Q4 FY2025-26 to 2025-2026 Q2 Done on the Machine-Learning-Team board.
Apr 28 2026, 1:26 PM · Machine-Learning-Team (Q4 FY2025-26), Patch-For-Review, Essential-Work

Apr 20 2026

DPogorzelski-WMF added a comment to T419722: Experiment with new kserve version on ml-staging-codfw.

tested on edit-check, seems to be working fine

Apr 20 2026, 9:23 AM · Machine-Learning-Team (Q4 FY2025-26), Patch-For-Review
DPogorzelski-WMF moved T423149: Fix securityContext propagation in liftwing from Q4 FY2025-26 to 2025-2026 Q2 Done on the Machine-Learning-Team board.
Apr 20 2026, 9:01 AM · Machine-Learning-Team (Q4 FY2025-26), SRE, Lift-Wing
DPogorzelski-WMF closed T423149: Fix securityContext propagation in liftwing as Resolved.
Apr 20 2026, 8:59 AM · Machine-Learning-Team (Q4 FY2025-26), SRE, Lift-Wing

Apr 16 2026

DPogorzelski-WMF added a comment to T423149: Fix securityContext propagation in liftwing.

i think the difference lies in the fact that without the initContainer field the ClusterStorageContainer is not used at all to construct the storage-initializer container; what is used instead are the defaults from the configmap, but the securityContext is not carried over

Apr 16 2026, 3:35 PM · Machine-Learning-Team (Q4 FY2025-26), SRE, Lift-Wing
DPogorzelski-WMF added a comment to T423149: Fix securityContext propagation in liftwing.

ok it works, it was this missing bit
workloadType: initContainer
in

Apr 16 2026, 3:21 PM · Machine-Learning-Team (Q4 FY2025-26), SRE, Lift-Wing
DPogorzelski-WMF added a comment to T423149: Fix securityContext propagation in liftwing.

Is there a diff in this call between local and staging?

kubectl get mutatingwebhookconfiguration -n kserve -o json | \
  jq -r '.items[] | .metadata.name as $name | .webhooks[] | 
  select(.rules[].resources[] | contains("pods")) | 
  "\($name) -> \(.name)"'

I am wondering if, for some reason, there is a race condition in the webhooks..

Apr 16 2026, 1:50 PM · Machine-Learning-Team (Q4 FY2025-26), SRE, Lift-Wing
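
To diff the webhook listing between two clusters it can help to run the same filter over saved JSON. The sketch below mirrors the jq filter in the comment above in Python; `pod_webhooks` and the sample data are hypothetical. One difference is deliberate: the jq `select` emits a line per matching resource, so a webhook can appear more than once, while this version reports each webhook once.

```python
def pod_webhooks(config_list: dict) -> list[str]:
    """List 'configuration -> webhook' pairs whose rules mention pods."""
    matches = []
    for item in config_list.get("items", []):
        config_name = item["metadata"]["name"]
        for webhook in item.get("webhooks", []):
            resources = [
                resource
                for rule in webhook.get("rules", [])
                for resource in rule.get("resources", [])
            ]
            # Substring match, like jq's contains("pods"): also hits "pods/status".
            if any("pods" in resource for resource in resources):
                matches.append(f"{config_name} -> {webhook['name']}")
    return matches


# Hypothetical sample shaped like `kubectl get mutatingwebhookconfiguration -o json`.
sample = {
    "items": [
        {
            "metadata": {"name": "example-config"},
            "webhooks": [
                {"name": "pod-mutator.example", "rules": [{"resources": ["pods"]}]},
                {"name": "other.example", "rules": [{"resources": ["services"]}]},
            ],
        }
    ]
}
print(pod_webhooks(sample))  # ['example-config -> pod-mutator.example']
```

Running it on the output saved from local and from staging and comparing the two lists would show whether a different set of webhooks touches pods in each environment.
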
DPogorzelski-WMF added a comment to T423149: Fix securityContext propagation in liftwing.

nvm, i had a typo, it doesn't actually solve anything. i'll keep looking

Apr 16 2026, 1:47 PM · Machine-Learning-Team (Q4 FY2025-26), SRE, Lift-Wing
DPogorzelski-WMF added a comment to T423149: Fix securityContext propagation in liftwing.

adding

seccompProfile:
  type: RuntimeDefault

to the chart values

Apr 16 2026, 1:22 PM · Machine-Learning-Team (Q4 FY2025-26), SRE, Lift-Wing
DPogorzelski-WMF added a comment to T423149: Fix securityContext propagation in liftwing.

the issue seems to be solved locally by simply appending the securityContext to the container, but the same doesn't seem to work on staging:

kubectl apply -n kserve-test -f - <<EOF
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "sklearn-iris"
  namespace: kserve-test
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn
      storageUri: "gs://kfserving-examples/models/sklearn/1.0/model"
      resources:
        requests:
          cpu: "100m"
          memory: "512Mi"
        limits:
          cpu: "1"
          memory: "1Gi"
      securityContext:
        runAsNonRoot: true
        allowPrivilegeEscalation: false
        capabilities:
          drop:
            - ALL
        seccompProfile:
          type: RuntimeDefault
EOF
Apr 16 2026, 12:45 PM · Machine-Learning-Team (Q4 FY2025-26), SRE, Lift-Wing
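
A quick way to see whether the settings from the manifest above actually reach a given container is to check its rendered `securityContext`. The following Python sketch assumes the context has already been extracted as a dict (for example from `kubectl get pod -o json`); the function name `missing_hardening` is hypothetical, and it only checks the four settings used in the manifest.

```python
def missing_hardening(security_context: dict) -> list[str]:
    """Return the hardening settings from the manifest that are absent or wrong."""
    problems = []
    if security_context.get("runAsNonRoot") is not True:
        problems.append("runAsNonRoot")
    if security_context.get("allowPrivilegeEscalation") is not False:
        problems.append("allowPrivilegeEscalation")
    dropped = security_context.get("capabilities", {}).get("drop", [])
    if "ALL" not in dropped:
        problems.append("capabilities.drop")
    if security_context.get("seccompProfile", {}).get("type") != "RuntimeDefault":
        problems.append("seccompProfile.type")
    return problems
```

Applying it to every entry under both `containers` and `initContainers` of the pod spec would point out a container, such as storage-initializer, that did not inherit the context.
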
DPogorzelski-WMF added a comment to T423149: Fix securityContext propagation in liftwing.

the issue can be reproduced locally with a simple kserve "hello world"

Apr 16 2026, 8:32 AM · Machine-Learning-Team (Q4 FY2025-26), SRE, Lift-Wing

Apr 13 2026

DPogorzelski-WMF added projects to T423149: Fix securityContext propagation in liftwing: Lift-Wing, SRE.
Apr 13 2026, 3:15 PM · Machine-Learning-Team (Q4 FY2025-26), SRE, Lift-Wing
DPogorzelski-WMF moved T423149: Fix securityContext propagation in liftwing from Ready To Go to In Progress on the Machine-Learning-Team board.
Apr 13 2026, 3:14 PM · Machine-Learning-Team (Q4 FY2025-26), SRE, Lift-Wing
DPogorzelski-WMF created T423149: Fix securityContext propagation in liftwing.
Apr 13 2026, 3:14 PM · Machine-Learning-Team (Q4 FY2025-26), SRE, Lift-Wing
DPogorzelski-WMF moved T421924: [SRE] Remove mi300 node taints from Q4 FY2025-26 to 2025-2026 Q2 Done on the Machine-Learning-Team board.
Apr 13 2026, 3:11 PM · Machine-Learning-Team, Essential-Work
DPogorzelski-WMF moved T422253: Transparent DNS Routing for LiftWing Services (eqiad vs Multi-DC) from Q4 FY2025-26 to Ready To Go on the Machine-Learning-Team board.
Apr 13 2026, 3:11 PM · Traffic, Machine-Learning-Team (Q4 FY2025-26), Essential-Work, Semantic Search, Lift-Wing

Apr 9 2026

DPogorzelski-WMF changed the status of T420507: MI300 machines need startup tweaks from Open to In Progress.
Apr 9 2026, 8:42 AM · Machine-Learning-Team (Q4 FY2025-26), Patch-For-Review, Essential-Work
DPogorzelski-WMF moved T420507: MI300 machines need startup tweaks from Q4 FY2025-26 to In Progress on the Machine-Learning-Team board.
Apr 9 2026, 8:41 AM · Machine-Learning-Team (Q4 FY2025-26), Patch-For-Review, Essential-Work

Apr 8 2026

DPogorzelski-WMF reassigned T400626: Q4:rack/setup/install ml-serve101[45] from Jclark-ctr to klausman.
Apr 8 2026, 1:52 PM · Recommendation-API, SRE, ops-eqiad, DC-Ops
DPogorzelski-WMF added a comment to T419722: Experiment with new kserve version on ml-staging-codfw.

cool, i'll try

Apr 8 2026, 1:27 PM · Machine-Learning-Team (Q4 FY2025-26), Patch-For-Review
DPogorzelski-WMF added a comment to T420507: MI300 machines need startup tweaks.

I'll look into this.
In the last technical we decided to have 8 partitions on 3 hosts and 2 partitions on 1 host.

Apr 8 2026, 1:25 PM · Machine-Learning-Team (Q4 FY2025-26), Patch-For-Review, Essential-Work
DPogorzelski-WMF claimed T420507: MI300 machines need startup tweaks.
Apr 8 2026, 1:20 PM · Machine-Learning-Team (Q4 FY2025-26), Patch-For-Review, Essential-Work
DPogorzelski-WMF claimed T422253: Transparent DNS Routing for LiftWing Services (eqiad vs Multi-DC).
Apr 8 2026, 1:08 PM · Traffic, Machine-Learning-Team (Q4 FY2025-26), Essential-Work, Semantic Search, Lift-Wing
DPogorzelski-WMF moved T367048: Update kserve to 0.15.2 from Ready To Go to 2025-2026 Q2 Done on the Machine-Learning-Team board.
Apr 8 2026, 8:32 AM · Essential-Work, Patch-For-Review, Machine-Learning-Team
DPogorzelski-WMF closed T367048: Update kserve to 0.15.2 as Declined.

kserve was updated to 0.17 via another task, closing

Apr 8 2026, 8:31 AM · Essential-Work, Patch-For-Review, Machine-Learning-Team
DPogorzelski-WMF moved T408690: Move inference-services repo from Gerrit to GitLab from Ready To Go to 2025-2026 Q2 Done on the Machine-Learning-Team board.
Apr 8 2026, 8:31 AM · Machine-Learning-Team
DPogorzelski-WMF closed T408690: Move inference-services repo from Gerrit to GitLab as Declined.

I'm going to close this one as it didn't seem like this was a change the ML team wanted.

Apr 8 2026, 8:30 AM · Machine-Learning-Team
DPogorzelski-WMF closed T421924: [SRE] Remove mi300 node taints as Resolved.

The taints seem to be removed already so it shouldn't be required to specify taint tolerations inside inference services

Apr 8 2026, 8:24 AM · Machine-Learning-Team, Essential-Work
DPogorzelski-WMF changed the status of T421924: [SRE] Remove mi300 node taints from Open to In Progress.
Apr 8 2026, 7:48 AM · Machine-Learning-Team, Essential-Work

Apr 7 2026

DPogorzelski-WMF closed T419722: Experiment with new kserve version on ml-staging-codfw as Resolved.
Apr 7 2026, 3:02 PM · Machine-Learning-Team (Q4 FY2025-26), Patch-For-Review

Mar 31 2026

DPogorzelski-WMF created T421924: [SRE] Remove mi300 node taints.
Mar 31 2026, 3:26 PM · Machine-Learning-Team, Essential-Work
DPogorzelski-WMF closed T403599: Setup & experiments for MI300x GPUs used for LiftWing as Resolved.
Mar 31 2026, 3:25 PM · Machine-Learning-Team
DPogorzelski-WMF moved T403599: Setup & experiments for MI300x GPUs used for LiftWing from In Progress to 2025-2026 Q2 Done on the Machine-Learning-Team board.
Mar 31 2026, 3:25 PM · Machine-Learning-Team
DPogorzelski-WMF moved T403697: Experiment with amd-smi and the new AMD GPUs MI300x from In Progress to 2025-2026 Q2 Done on the Machine-Learning-Team board.
Mar 31 2026, 3:25 PM · Machine-Learning-Team
DPogorzelski-WMF closed T403697: Experiment with amd-smi and the new AMD GPUs MI300x, a subtask of T403599: Setup & experiments for MI300x GPUs used for LiftWing, as Resolved.
Mar 31 2026, 3:24 PM · Machine-Learning-Team
DPogorzelski-WMF closed T403697: Experiment with amd-smi and the new AMD GPUs MI300x as Resolved.
Mar 31 2026, 3:24 PM · Machine-Learning-Team
DPogorzelski-WMF added a comment to T403697: Experiment with amd-smi and the new AMD GPUs MI300x.

I'll close the task as we currently have ML workloads using these GPUs.
If any followup is required i'll open a specific task for it.
I will track taint removal in another task.

Mar 31 2026, 3:23 PM · Machine-Learning-Team

Mar 26 2026

DPogorzelski-WMF added a comment to T419722: Experiment with new kserve version on ml-staging-codfw.

I will skip that for now as it's getting more complex than i initially anticipated. all services on staging work in the current setup and i'll ship kserve first and then circle back to this part

Mar 26 2026, 1:33 PM · Machine-Learning-Team (Q4 FY2025-26), Patch-For-Review
DPogorzelski-WMF added a comment to T419722: Experiment with new kserve version on ml-staging-codfw.

Latest knative supported by kserve 0.17 seems to require a more recent kubernetes version:

{"severity":"EMERGENCY","timestamp":"2026-03-25T15:53:12.556721777Z","logger":"net-istio-controller","caller":"sharedmain/main.go:463","message":"Version check failed","commit":"1dc9b2d-dirty","knative.dev/pod":"net-istio-controller-6cc6d48947-v2t74","error":"kubernetes version \"1.31.4\" is not compatible, need at least \"1.33.0-0\" (this can be overridden with the env var \"KUBERNETES_MIN_VERSION\")","stacktrace":"knative.dev/pkg/injection/sharedmain.CheckK8sClientMinimumVersionOrDie\n\t/go/github.com/knative/net-istio/vendor/knative.dev/pkg/injection/sharedmain/main.go:463\nknative.dev/pkg/injection/sharedmain.MainWithConfig\n\t/go/github.com/knative/net-istio/vendor/knative.dev/pkg/injection/sharedmain/main.go:271\nknative.dev/pkg/injection/sharedmain.MainWithContext\n\t/go/github.com/knative/net-istio/vendor/knative.dev/pkg/injection/sharedmain/main.go:226\nmain.main\n\t/go/github.com/knative/net-istio/cmd/controller/main.go:31\nruntime.main\n\t/go/pkg/mod/golang.org/toolchain@v0.0.1-go1.24.0.linux-amd64/src/runtime/proc.go:283"}
Mar 26 2026, 8:33 AM · Machine-Learning-Team (Q4 FY2025-26), Patch-For-Review
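
The failure in the log above comes down to a minimum-version comparison. The Python sketch below is a simplified illustration of that comparison, not knative's actual implementation: `parse_version` and `is_compatible` are hypothetical helpers, and any suffix after `-` is ignored rather than treated with full semver rules.

```python
def parse_version(version: str) -> tuple[int, ...]:
    """Turn '1.33.0-0' or 'v1.31.4' into (1, 33, 0), ignoring any suffix after '-'."""
    core = version.lstrip("v").split("-", 1)[0]
    return tuple(int(part) for part in core.split("."))


def is_compatible(cluster: str, minimum: str) -> bool:
    """True when the cluster version is at least the required minimum."""
    return parse_version(cluster) >= parse_version(minimum)


print(is_compatible("1.31.4", "1.33.0-0"))  # False
```

The log message itself notes that the threshold can be overridden with the `KUBERNETES_MIN_VERSION` environment variable; whether running below the stated minimum is safe is a separate question this check does not answer.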

Mar 18 2026

DPogorzelski-WMF added a comment to T414485: Upgrade ML clusters to kubernetes 1.31.

This is done @MLechvien-WMF

Mar 18 2026, 1:06 PM · ServiceOps new, Machine-Learning-Team, Kubernetes, Prod-Kubernetes

Mar 13 2026

DPogorzelski-WMF added a comment to T419040: kserve helm status is broken across ml clusters.

Awesome! Then I must have done something wrong

Mar 13 2026, 9:40 AM · Patch-For-Review, Machine-Learning-Team
DPogorzelski-WMF added a comment to T419040: kserve helm status is broken across ml clusters.

Sorry captured the wrong change there, but pretty sure did test on the side with removing the whole entry, can try again though

Mar 13 2026, 9:27 AM · Patch-For-Review, Machine-Learning-Team
DPogorzelski-WMF added a comment to T419040: kserve helm status is broken across ml clusters.

I can try again, but as per the screenshot above it's something i have tried and then reverted because it had no effect

Mar 13 2026, 9:23 AM · Patch-For-Review, Machine-Learning-Team

Mar 12 2026

DPogorzelski-WMF added a comment to T419040: kserve helm status is broken across ml clusters.

yea i did test this:

image.png (1,114×1,702 px, 248 KB)

i think i'll re-check this after kserve update, could be pointless trying to fix it if we want to update kserve

Mar 12 2026, 1:40 PM · Patch-For-Review, Machine-Learning-Team

Mar 11 2026

DPogorzelski-WMF created T419722: Experiment with new kserve version on ml-staging-codfw.
Mar 11 2026, 3:26 PM · Machine-Learning-Team (Q4 FY2025-26), Patch-For-Review
DPogorzelski-WMF closed T418722: Incident: 2026-02-23 ml-serve as Resolved.
Mar 11 2026, 2:44 PM · Machine-Learning-Team
DPogorzelski-WMF removed a subtask for T418722: Incident: 2026-02-23 ml-serve: T419040: kserve helm status is broken across ml clusters.
Mar 11 2026, 2:43 PM · Machine-Learning-Team
DPogorzelski-WMF removed a parent task for T419040: kserve helm status is broken across ml clusters: T418722: Incident: 2026-02-23 ml-serve.
Mar 11 2026, 2:43 PM · Patch-For-Review, Machine-Learning-Team
DPogorzelski-WMF added a comment to T419040: kserve helm status is broken across ml clusters.

i tested it and it always works on first sync, but the problem comes back on following syncs.
will check again

Mar 11 2026, 2:42 PM · Patch-For-Review, Machine-Learning-Team

Mar 6 2026

DPogorzelski-WMF added a subtask for T398948: Q1 FY2025-26 Goal: Operational Excellence - LiftWing Platform Updates & Improvements: T419235: Fix revertrisk Pyrra SLO.
Mar 6 2026, 1:35 PM · Goal, Machine-Learning-Team
DPogorzelski-WMF added a parent task for T419235: Fix revertrisk Pyrra SLO: T398948: Q1 FY2025-26 Goal: Operational Excellence - LiftWing Platform Updates & Improvements.
Mar 6 2026, 1:35 PM · Machine-Learning-Team
DPogorzelski-WMF created T419235: Fix revertrisk Pyrra SLO.
Mar 6 2026, 1:34 PM · Machine-Learning-Team

Mar 5 2026

DPogorzelski-WMF added a comment to T419040: kserve helm status is broken across ml clusters.

It seems that no matter how you handle it, the result is the same.
needs more investigation on the cert-manager side

Mar 5 2026, 2:08 PM · Patch-For-Review, Machine-Learning-Team
DPogorzelski-WMF added a comment to T419040: kserve helm status is broken across ml clusters.

what works once only:
kubectl delete crd inferenceservices.serving.kserve.io --cascade=true
helmfile -e ml-staging-codfw sync
then the issue comes back.
i will try to remove the
caBundle: Cg== from the chart which is just an empty line

Mar 5 2026, 1:10 PM · Patch-For-Review, Machine-Learning-Team
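
The claim that `Cg==` is just an empty line can be confirmed by decoding it. A minimal check in Python, using only the standard library:

```python
import base64

# caBundle values are base64-encoded; decode the placeholder from the chart.
decoded = base64.b64decode("Cg==")
print(decoded)  # b'\n'
```

The decoded value is a single newline byte, so the field carries no certificate data.
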
DPogorzelski-WMF added a comment to T419040: kserve helm status is broken across ml clusters.

I'll try a few fixes on the side on staging

Mar 5 2026, 12:40 PM · Patch-For-Review, Machine-Learning-Team

Mar 2 2026

DPogorzelski-WMF closed T394778: Build and push images to the docker registry from ml-lab as Resolved.
Mar 2 2026, 3:13 PM · Machine-Learning-Team
DPogorzelski-WMF created T418722: Incident: 2026-02-23 ml-serve.
Mar 2 2026, 10:06 AM · Machine-Learning-Team

Feb 12 2026

DPogorzelski-WMF added a comment to T414485: Upgrade ML clusters to kubernetes 1.31.

all etcd machines are updated

Feb 12 2026, 12:59 PM · ServiceOps new, Machine-Learning-Team, Kubernetes, Prod-Kubernetes

Feb 9 2026

DPogorzelski-WMF added a comment to T414485: Upgrade ML clusters to kubernetes 1.31.

roger

Feb 9 2026, 4:33 PM · ServiceOps new, Machine-Learning-Team, Kubernetes, Prod-Kubernetes
DPogorzelski-WMF added a comment to T416580: Kserve helm chart.

Btw the chart does work fine locally, for what it's worth. Bartosz also tested it.

Feb 9 2026, 12:39 PM · Kubernetes, SRE
DPogorzelski-WMF added a comment to T416580: Kserve helm chart.

The README contains things that need to be added to avoid the issues that you want to fix with small iterations, so since we know it beforehand I am not 100% sure why you want to rediscover them another time.

I don't want to. In fact what I want is to import the chart and deploy it. Then add what is missing on top using the readme and feedback from the deployment as a guideline. Also because not everything is clear to me so this is also a way of absorbing the internal know how.

Feb 9 2026, 12:31 PM · Kubernetes, SRE
DPogorzelski-WMF added a comment to T416580: Kserve helm chart.

Will do but I would argue it's much better to deploy it, test it, see what's broken, fix and iterate until it's working as intended. Quick, small iterations.
It's a big chart, and planning this waterfall style is not going to result in something that works 100%, no matter how much time one spends evaluating the differences.
In the end, even if we don't import the chart we will end up copying a big portion of it anyways because we still need kserve so one way or another it will end up in the repo, perhaps adapted but still.

Feb 9 2026, 11:44 AM · Kubernetes, SRE
DPogorzelski-WMF added a comment to T416580: Kserve helm chart.

It seems that the major difference is the fact that we have a calico network policy but the chart doesn't (unsurprisingly). Perhaps we can supply that out of band.
Our images expect /usr/bin/manager but upstream uses /manager and this is not configurable. We might want to update our images to use the upstream path.

Feb 9 2026, 9:52 AM · Kubernetes, SRE

Feb 6 2026

DPogorzelski-WMF added a comment to T416580: Kserve helm chart.

@JMeybohm will check and update the ticket, cheers!

Feb 6 2026, 6:12 PM · Kubernetes, SRE
DPogorzelski-WMF closed T412524: New WMF docker registry credentials, a subtask of T394778: Build and push images to the docker registry from ml-lab, as Resolved.
Feb 6 2026, 12:23 PM · Machine-Learning-Team
DPogorzelski-WMF closed T412524: New WMF docker registry credentials as Resolved.
Feb 6 2026, 12:23 PM · Kubernetes, ServiceOps new, SRE

Feb 5 2026

DPogorzelski-WMF added a comment to T416580: Kserve helm chart.

to be noted that we already use kserve in the ML context installed via:
https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/helmfile.d/admin_ng/kserve/
and
https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/helmfile.d/admin_ng/knative-serving/

Feb 5 2026, 11:56 AM · Kubernetes, SRE
DPogorzelski-WMF created T416580: Kserve helm chart.
Feb 5 2026, 11:55 AM · Kubernetes, SRE

Jan 30 2026

DPogorzelski-WMF added a comment to P87998 amd-rocm70 directory exists in the Wikimedia APT repo, but the packages are missing.

fixed https://apt-browser.toolforge.org/bookworm-wikimedia/thirdparty/amd-rocm70/

Jan 30 2026, 9:10 AM

Jan 14 2026

DPogorzelski-WMF created T414576: Failing docker registry httpbb tests.
Jan 14 2026, 1:13 PM · Kubernetes, ServiceOps new, SRE

Dec 22 2025

DPogorzelski-WMF added a comment to P86741 embeddings isvc deployment in experimental ns failing because of insufficient GPUs.

ignore my suggestion above, it seems that the mi210 gpus are both taken by revise-tone-task, one running in the revise-tone-task-generator and one in the experimental namespace .
I would suggest removing revise-tone-task-generator from the experimental namespace since we also have it in its own namespace on staging. that should free up 1 gpu

Dec 22 2025, 10:20 AM · Machine-Learning-Team
DPogorzelski-WMF added a comment to P86741 embeddings isvc deployment in experimental ns failing because of insufficient GPUs.

i think you can try to remove amd.com/gpu: "1"

Dec 22 2025, 10:10 AM · Machine-Learning-Team

Dec 12 2025

DPogorzelski-WMF created T412524: New WMF docker registry credentials.
Dec 12 2025, 2:18 PM · Kubernetes, ServiceOps new, SRE

Dec 11 2025

DPogorzelski-WMF reassigned T412357: Install AMD GPU + torch version of ML Labs machines from DPogorzelski-WMF to klausman.
Dec 11 2025, 11:38 AM · Machine-Learning-Team

Dec 10 2025

DPogorzelski-WMF created T412213: Relabel ml-lab1001->ml-build1001.
Dec 10 2025, 12:42 PM · DC-Ops

Dec 9 2025

DPogorzelski-WMF closed T411993: dpogorzelski gpg key as Resolved.
Dec 9 2025, 9:46 AM · SRE

Dec 8 2025

DPogorzelski-WMF created T411993: dpogorzelski gpg key.
Dec 8 2025, 10:02 AM · SRE

Dec 4 2025

DPogorzelski-WMF added a comment to T411753: Wrong disk order on ml-lab1001?.

let's solve this by removing
[4:0:0:0] disk ATA Micron_5400_MTFD U002 /dev/sda
[5:0:0:0] disk ATA Micron_5400_MTFD U002 /dev/sdc

Dec 4 2025, 12:51 PM · SRE, ops-eqiad, DC-Ops
DPogorzelski-WMF created T411753: Wrong disk order on ml-lab1001?.
Dec 4 2025, 9:05 AM · SRE, ops-eqiad, DC-Ops

Dec 2 2025

DPogorzelski-WMF added a comment to T403599: Setup & experiments for MI300x GPUs used for LiftWing.

The above could be false positive, might be happening when the plugin is restarted

Dec 2 2025, 1:58 PM · Machine-Learning-Team
DPogorzelski-WMF added a comment to T403599: Setup & experiments for MI300x GPUs used for LiftWing.

perhaps this is relevant:

Dec 2 2025, 1:46 PM · Machine-Learning-Team
DPogorzelski-WMF added a comment to T403599: Setup & experiments for MI300x GPUs used for LiftWing.

Most likely, I'm currently looking at the builder machine so will come back to this

Dec 2 2025, 12:43 PM · Machine-Learning-Team
DPogorzelski-WMF added a comment to T394778: Build and push images to the docker registry from ml-lab.

Thanks for the nice discussion everyone. Overall, I think with the suggestion of building images on a dedicated ML machine and with the precautions discussed, we are OK with moving forward and unblocking this.

  • we can also block access to major public registries in the http_proxy or via iptables on the host:

iptables -A OUTPUT -d your-internal-registry.com -j ACCEPT

iptables -A OUTPUT -d registry-1.docker.io -j REJECT
iptables -A OUTPUT -d index.docker.io -j REJECT
iptables -A OUTPUT -d quay.io -j REJECT
iptables -A OUTPUT -d ghcr.io -j REJECT
iptables -A OUTPUT -d gcr.io -j REJECT

while iptables rules can be changed by people i trust everyone in the team so this is mostly to prevent shooting ourselves in the foot and pulling from outside by accident

The machines will need specific Docker configuration anyway (setting up the proxy for all operations) to be able to reach to the outside, this is probably not needed. And if someone decides to mess with the configuration (which requires root) and fetch outside image, no iptables rule would save us.

  • let's start with having the machine wiped and configured for ML team access, docker-pkg installed and the host whitelisted to push to the WMF registry, we can take the gitlab enrollment in a second step. but just so you know, a gitlab runner can be tied to specific groups or even specific repos making it unavailable for anyone/anything outside of that scope. so this won't be a shared runner but rather an ML only one. in other words it would only accept jobs from ML specific repos and push to the WMF registry.

It would be nice if it could also push only under a specific hierarchy, e.g. /repos/<insert-start-of-ml-hierarchy>/. (/repos being the start of the Gitlab managed hierarchy of Docker images IIRC). We already have /releng (and dedicated username/password pairs for that) so there is prior art.

I'll look into it.

Dec 2 2025, 10:01 AM · Machine-Learning-Team
DPogorzelski-WMF added a comment to T403599: Setup & experiments for MI300x GPUs used for LiftWing.

hmmm doesn't seem like it.

Dec 2 2025, 8:24 AM · Machine-Learning-Team

Nov 28 2025

DPogorzelski-WMF added a comment to T394778: Build and push images to the docker registry from ml-lab.

I would like to resume this discussion and take a practical stab at making the ml-lab1001.eqiad.wmnet an official build machine for the ML team.
Let's start with the basics:

  • wipe the machine and manage the basics with puppet
  • the machine will have docker installed
  • the machine will be enrolled into gitlab as a gitlab runner
  • the machine should be able to push images to the current WMF registry ( we can go back to investigate a proper registry solution once the build machine is ready otherwise there are too many topics flying around)
  • SSH root access for ML SRE's and non-root access for the ML team. this however should be an exception, most of the time the builder can be used via plain Gitlab Pipelines so SSH shouldn't be needed; we can repurpose the other lab machine down the road as an experiment playground, one that is not allowed to publish any image anywhere so that the ML team can actually experiment with build steps more freely (WMF needs to learn to trust people it hires and security needs to work in function of the teams/projects not the other way around)

If the above is fine i'm going to start looking at the first steps.
Feel free to comment or add interested parties to the discussion.

Thank you for picking this up, @DPogorzelski-WMF. If you proceed with the plan to wipe ml-lab1001, could you please move the contents of my (and/or other people's) home directory to ml-lab1002? Thanks in advance.

Nov 28 2025, 1:14 PM · Machine-Learning-Team

Nov 27 2025

DPogorzelski-WMF added a comment to T394778: Build and push images to the docker registry from ml-lab.

cool, i'll shoot a message in IRC to the sig regarding "You create a new production-images-like repository in gerrit and commit to it only ML Dockerfiles, with their changelogs etc.."
inference-services repo will most likely move to gitlab so we can probably store the specific ML dockerfiles there.
also good point about docker-pkg, let's keep that for now but probably not needed down the road.

Nov 27 2025, 3:34 PM · Machine-Learning-Team
DPogorzelski-WMF added a comment to T394778: Build and push images to the docker registry from ml-lab.
  • we can also block access to major public registries in the http_proxy or via iptables on the host:
Nov 27 2025, 1:30 PM · Machine-Learning-Team
DPogorzelski-WMF added a comment to T394778: Build and push images to the docker registry from ml-lab.
  • let's start with having the machine wiped and configured for ML team access, docker-pkg installed and the host whitelisted to push to the WMF registry, we can take the gitlab enrollment in a second step. but just so you know, a gitlab runner can be tied to specific groups or even specific repos making it unavailable for anyone/anything outside of that scope. so this won't be a shared runner but rather an ML only one. in other words it would only accept jobs from ML specific repos and push to the WMF registry. regarding "making sure no weird stuff is pushed to the internal registry" i don't have an immediate solution beyond: due diligence, Gitlab CI steps blocking merge requests containing images from external sources. on a related note though, afaik we still use pip to install python dependencies from outside so we are not fully isolated/immune to the supply chain issues
  • we could simply set ip: 127.0.0.1 in /etc/docker/daemon.json so that you can't bind containers to 0.0.0.0, this effectively disarms anything left running, also the machine is not exposed to the outside afaik.
Nov 27 2025, 1:25 PM · Machine-Learning-Team
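
The idea of a CI step that blocks images from external sources could look roughly like the Python sketch below. Everything named here is an assumption for illustration: `ALLOWED_REGISTRIES`, `registry_of`, and `external_images` are hypothetical, and `registry.example.internal` stands in for the real internal registry. The host detection follows the usual Docker convention that the first path component counts as a registry only when it contains a dot or colon or is `localhost`.

```python
ALLOWED_REGISTRIES = {"registry.example.internal"}  # hypothetical internal registry


def registry_of(image_ref: str) -> str:
    """Return the registry host of an image reference, defaulting to docker.io."""
    first, _, rest = image_ref.partition("/")
    # Docker treats the first component as a registry only if it looks like a host.
    if rest and ("." in first or ":" in first or first == "localhost"):
        return first
    return "docker.io"


def external_images(image_refs: list[str]) -> list[str]:
    """Return the references that would be pulled from outside the allow-list."""
    return [ref for ref in image_refs if registry_of(ref) not in ALLOWED_REGISTRIES]
```

A pipeline job could collect the image references from the `FROM` lines of changed Dockerfiles and fail the merge request when `external_images` returns a non-empty list. Like the iptables rules, this guards against accidents rather than against someone with root on the builder.
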
DPogorzelski-WMF added a comment to T394778: Build and push images to the docker registry from ml-lab.

I would like to resume this discussion and take a practical stab at making the ml-lab1001.eqiad.wmnet an official build machine for the ML team.
Let's start with the basics:

  • wipe the machine and manage the basics with puppet
  • the machine will have docker installed
  • the machine will be enrolled into gitlab as a gitlab runner
  • the machine should be able to push images to the current WMF registry ( we can go back to investigate a proper registry solution once the build machine is ready otherwise there are too many topics flying around)
  • SSH root access for ML SRE's and non-root access for the ML team. this however should be an exception, most of the time the builder can be used via plain Gitlab Pipelines so SSH shouldn't be needed; we can repurpose the other lab machine down the road as an experiment playground, one that is not allowed to publish any image anywhere so that the ML team can actually experiment with build steps more freely (WMF needs to learn to trust people it hires and security needs to work in function of the teams/projects not the other way around)
Nov 27 2025, 9:16 AM · Machine-Learning-Team