User Details
- User Since
- Oct 20 2025, 12:04 PM (34 w, 5 h)
- Availability
- Available
- LDAP User
- Unknown
- MediaWiki User
- DPogorzelski-WMF [ Global Accounts ]
Thu, Jun 4
i think the original question was whether we could use a single endpoint like inference.discovery.wmnet and be routed to the closest location when a service is present in both, or to the one specific location when the service is only present there, without having to hardcode location expectations upfront via inference.svc.eqiad.wmnet.
but now i realize this is simply geolocation-based name resolution, so once you have resolved to a specific location you go there and that's it.
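A minimal sketch of how the two kinds of names could be compared from a client. The hostnames are the illustrative ones from the comment above and may not be real records, and `getent` is assumed to be available on the client:

```shell
# Resolve a name via the system resolver and print its unique addresses, one per line.
resolve() {
  getent hosts "$1" | awk '{print $1}' | sort -u
}

# Illustrative names taken from the comment above; substitute the real service records.
for name in inference.discovery.wmnet inference.svc.eqiad.wmnet; do
  printf '%s -> %s\n' "$name" "$(resolve "$name" | tr '\n' ' ')"
done
```

If the discovery name and the site-specific name return the same address from a given client, that client is being sent to that site. A name that does not resolve prints an empty right-hand side.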
Wed, Jun 3
we are already in an active/active setup so it would be enough to have a curl command reference to confirm what the discovery looks like from a client perspective :)
but in that case we would see a good amount of 404 since some models don't exist in codfw, yet we don't, so what i'd like to check is what information is included in the service discovery response. i suspect it contains only the endpoints where the service actually exists
Wed, May 27
sure thing
Thu, May 21
Wed, May 20
Apr 28 2026
Apr 20 2026
tested on edit-check, seems to be working fine
Apr 16 2026
i think the difference lies in the fact that without the initContainer field the ClusterStorageContainer is not used at all to construct the storage-initializer container; the defaults from the configmap are used instead, but the securityContext is not carried over
ok it works, it was this missing bit
workloadType: initContainer
in
nvm, i had a typo, it doesn't actually solve anything. i'll keep looking
adding
seccompProfile:
  type: RuntimeDefault
to the chart values
the issue seems to be solved locally by simply appending the securityContext to the container, but the same doesn't seem to work on staging:
kubectl apply -n kserve-test -f - <<EOF
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "sklearn-iris"
  namespace: kserve-test
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn
      storageUri: "gs://kfserving-examples/models/sklearn/1.0/model"
      resources:
        requests:
          cpu: "100m"
          memory: "512Mi"
        limits:
          cpu: "1"
          memory: "1Gi"
      securityContext:
        runAsNonRoot: true
        allowPrivilegeEscalation: false
        capabilities:
          drop:
            - ALL
        seccompProfile:
          type: RuntimeDefault
EOF
the issue can be reproduced locally with a simple kserve "hello world"
Apr 13 2026
Apr 9 2026
Apr 8 2026
cool, i'll try
I'll look into this.
In the last technical we decided to have 8 partitions on 3 hosts and 2 partitions on 1 host.
kserve was updated to 0.17 via another task, closing
I'm going to close this one as it didn't seem to be a change the ML team wanted.
The taints seem to be removed already so it shouldn't be required to specify taint tolerations inside inference services
Apr 7 2026
Mar 31 2026
I'll close the task as we currently have ML workloads using these GPUs.
If any followup is required i'll open a specific task for it.
I will track taint removal in another task.
Mar 26 2026
I will skip that for now as it's getting more complex than i initially anticipated. all services on staging work in the current setup and i'll ship kserve first and then circle back to this part
Latest knative supported by kserve 0.17 seems to require a more recent kubernetes version:
{"severity":"EMERGENCY","timestamp":"2026-03-25T15:53:12.556721777Z","logger":"net-istio-controller","caller":"sharedmain/main.go:463","message":"Version check failed","commit":"1dc9b2d-dirty","knative.dev/pod":"net-istio-controller-6cc6d48947-v2t74","error":"kubernetes version \"1.31.4\" is not compatible, need at least \"1.33.0-0\" (this can be overridden with the env var \"KUBERNETES_MIN_VERSION\")","stacktrace":"knative.dev/pkg/injection/sharedmain.CheckK8sClientMinimumVersionOrDie\n\t/go/github.com/knative/net-istio/vendor/knative.dev/pkg/injection/sharedmain/main.go:463\nknative.dev/pkg/injection/sharedmain.MainWithConfig\n\t/go/github.com/knative/net-istio/vendor/knative.dev/pkg/injection/sharedmain/main.go:271\nknative.dev/pkg/injection/sharedmain.MainWithContext\n\t/go/github.com/knative/net-istio/vendor/knative.dev/pkg/injection/sharedmain/main.go:226\nmain.main\n\t/go/github.com/knative/net-istio/cmd/controller/main.go:31\nruntime.main\n\t/go/pkg/mod/golang.org/toolchain@v0.0.1-go1.24.0.linux-amd64/src/runtime/proc.go:283"}
Mar 18 2026
This is done @MLechvien-WMF
Mar 13 2026
Awesome! Then I must have done something wrong
Sorry captured the wrong change there, but pretty sure did test on the side with removing the whole entry, can try again though
I can try again but as per screenshot above it's something i have tried and then reverted because it didn't have effect
Mar 12 2026
yea i did test this:
i think i'll re-check this after kserve update, could be pointless trying to fix it if we want to update kserve
Mar 11 2026
i tested it and it always works on first sync, but the problem comes back on following syncs.
will check again
Mar 6 2026
Mar 5 2026
It seems that no matter how you handle it, the result is the same.
needs more investigation on the cert-manager side
what works once only:
kubectl delete crd inferenceservices.serving.kserve.io --cascade=true
helmfile -e ml-staging-codfw sync
then the issue comes back.
i will try to remove the
caBundle: Cg== from the chart which is just an empty line
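For reference, `Cg==` is the base64 encoding of a single newline byte, which is why the value amounts to an empty line. A quick check, assuming `base64` and `od` are available on the host:

```shell
# Decode the placeholder caBundle value and dump the raw bytes as hex.
decoded_hex=$(printf '%s' 'Cg==' | base64 -d | od -An -tx1 | tr -d ' \n')
echo "caBundle decodes to byte(s): 0x${decoded_hex}"
# prints: caBundle decodes to byte(s): 0x0a
```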
I'll try a few fixes on the side on staging
Mar 2 2026
Feb 12 2026
all etcd machines are updated
Feb 9 2026
roger
Btw the chart does work fine locally, for what it's worth. Bartosz also tested it.
The README contains things that need to be added to avoid the issues that you want to fix with small iterations, so since we know it beforehand I am not 100% sure why you want to rediscover them another time.
I don't want to. In fact what I want is to import the chart and deploy it. Then add what is missing on top using the readme and feedback from the deployment as a guideline. Also because not everything is clear to me so this is also a way of absorbing the internal know how.
Will do but I would argue it's much better to deploy it, test it, see what's broken, fix and iterate until it's working as intended. Quick, small iterations.
It's a big chart, and planning this waterfall style is not going to result in something that works 100%, no matter how much time one spends evaluating the differences.
In the end, even if we don't import the chart we will end up copying a big portion of it anyways because we still need kserve so one way or another it will end up in the repo, perhaps adapted but still.
It seems that the major difference is the fact that we have a calico network policy but the chart doesn't (unsurprisingly). Perhaps we can supply that out of band.
Our images expect /usr/bin/manager but upstream uses /manager and this is not configurable. We might want to update our images to use the upstream path.
Feb 6 2026
@JMeybohm will check and update the ticket, cheers!
Feb 5 2026
to be noted that we already use kserve in the ML context installed via:
https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/helmfile.d/admin_ng/kserve/
and
https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/helmfile.d/admin_ng/knative-serving/
Jan 30 2026
Jan 14 2026
Dec 22 2025
ignore my suggestion above, it seems that the mi210 gpus are both taken by revise-tone-task, one running in the revise-tone-task-generator and one in the experimental namespace .
I would suggest removing revise-tone-task-generator from the experimental namespace since we also have it in its own namespace on staging. that should free up 1 gpu
i think you can try to remove amd.com/gpu: "1"
Dec 12 2025
Dec 11 2025
Dec 10 2025
Dec 9 2025
Dec 8 2025
Dec 4 2025
let's solve this by removing
[4:0:0:0] disk ATA Micron_5400_MTFD U002 /dev/sda
[5:0:0:0] disk ATA Micron_5400_MTFD U002 /dev/sdc
Dec 2 2025
The above could be a false positive; it might be happening when the plugin is restarted
perhaps this is relevant:
Most likely, I'm currently looking at the builder machine so will come back to this
I'll look into it.
hmmm doesn't seem like it.
Nov 28 2025
Nov 27 2025
cool, i'll shoot a message in IRC to the sig regarding "You create a new production-images-like repository in gerrit and commit to it only ML Dockerfiles, with their changelogs etc.."
inference-services repo will most likely move to gitlab so we can probably store the specific ML dockerfiles there.
also good point about docker-pkg, let's keep that for now but probably not needed down the road.
- we can also block access to major public registries in the http_proxy or via iptables on the host:
- let's start with having the machine wiped and configured for ML team access, docker-pkg installed and the host whitelisted to push to the WMF registry, we can take the gitlab enrollment in a second step. but just so you know, a gitlab runner can be tied to specific groups or even specific repos, making it unavailable for anyone/anything outside of that scope. so this won't be a shared runner but rather an ML only one. in other words it would only accept jobs from ML specific repos and push to the WMF registry. regarding "making sure no weird stuff is pushed to the internal registry" i don't have an immediate solution beyond: due diligence, Gitlab CI steps blocking merge requests containing images from external sources. on a related note though, afaik we still use pip to install python dependencies from outside so we are not fully isolated/immune to supply chain issues
- we could simply set ip: 127.0.0.1 in /etc/docker/daemon.json so that you can't bind containers to 0.0.0.0, this effectively disarms anything left running, also the machine is not exposed to the outside afaik.
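A minimal sketch of the `daemon.json` fragment from the last bullet, written to a scratch file rather than the live config. The `ip` key sets the default address Docker uses when publishing container ports; as far as I know, a port published with an explicit address can still override that default, so this is a guard rail rather than a hard block. Any existing keys in the real `/etc/docker/daemon.json` would need to be merged by hand.

```shell
# Sketch: write the fragment to a scratch file instead of /etc/docker/daemon.json,
# so nothing on the host changes until the content has been reviewed.
example_file="$(mktemp)"
cat > "$example_file" <<'JSON'
{
  "ip": "127.0.0.1"
}
JSON
cat "$example_file"
```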
I would like to resume this discussion and take a practical stab at making ml-lab1001.eqiad.wmnet an official build machine for the ML team.
Let's start with the basics:
- wipe the machine and manage the basics with puppet
- the machine will have docker installed
- the machine will be enrolled into gitlab as a gitlab runner
- the machine should be able to push images to the current WMF registry ( we can go back to investigate a proper registry solution once the build machine is ready otherwise there are too many topics flying around)
- SSH root access for ML SREs and non-root access for the ML team. this however should be an exception, most of the time the builder can be used via plain Gitlab Pipelines so SSH shouldn't be needed; we can repurpose the other lab machine down the road as an experiment playground, one that is not allowed to publish any image anywhere so that the ML team can actually experiment with build steps more freely (WMF needs to learn to trust people it hires and security needs to work in service of the teams/projects, not the other way around)
