Fri, Mar 6
Thu, Mar 5
It seems that no matter how you handle it, the result is the same.
needs more investigation on the cert-manager side
what works, but only once:
kubectl delete crd inferenceservices.serving.kserve.io --cascade=true
helmfile -e ml-staging-codfw sync
then the issue comes back.
I will try to remove caBundle: Cg== from the chart; it is just a base64-encoded newline, i.e. an effectively empty value.
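As a quick local sanity check, Cg== base64-decodes to a single newline byte, which is why the bundle is effectively empty. If the CRD is annotated for cert-manager's CA injector, this field would normally get overwritten, which may be part of why the empty placeholder interacts badly with it:

```shell
# Decode the caBundle value from the chart: Cg== is base64 for a lone
# newline (0x0a), i.e. an effectively empty certificate bundle.
printf 'Cg==' | base64 -d | od -An -tx1
```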
I'll try a few fixes on the side on staging
Mon, Mar 2
Thu, Feb 12
all etcd machines are updated
Mon, Feb 9
roger
Btw the chart does work fine locally, for what it's worth. Bartosz also tested it.
The README contains things that need to be added to avoid the very issues you want to fix with small iterations, so since we know them beforehand I am not 100% sure why you want to rediscover them a second time.
I don't want to. What I want is to import the chart and deploy it, then add what is missing on top, using the README and feedback from the deployment as a guideline. Not everything is clear to me, so this is also a way of absorbing the internal know-how.
Will do, but I would argue it's much better to deploy it, test it, see what's broken, fix, and iterate until it works as intended. Quick, small iterations.
It's a big chart, and planning this waterfall-style is not going to result in something that works 100%, no matter how much time one spends evaluating the differences.
In the end, even if we don't import the chart, we will end up copying a big portion of it anyway, because we still need KServe; one way or another it will end up in the repo, perhaps adapted, but still.
It seems the major difference is that we have a Calico network policy and the chart (unsurprisingly) doesn't. Perhaps we can supply that out of band.
Our images expect /usr/bin/manager but upstream uses /manager, and this is not configurable. We might want to update our images to use the upstream path.
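If we don't want to relocate the binary right away, one possible stopgap in our image builds is a symlink so the upstream chart's hard-coded /manager path resolves to our existing location. The base image name below is a placeholder, not our actual image:

```dockerfile
# Placeholder base image; substitute our actual kserve controller image.
FROM example.org/kserve-controller:latest

# Upstream charts invoke /manager; our builds ship /usr/bin/manager.
# A symlink satisfies both paths without moving the binary.
RUN ln -sf /usr/bin/manager /manager
```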
Feb 6 2026
@JMeybohm will check and update the ticket, cheers!
Feb 5 2026
To be noted that we already use KServe in the ML context, installed via:
https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/helmfile.d/admin_ng/kserve/
and
https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/helmfile.d/admin_ng/knative-serving/
Jan 30 2026
Jan 14 2026
Dec 22 2025
Ignore my suggestion above; it seems that both MI210 GPUs are taken by revise-tone-task, one running in the revise-tone-task-generator namespace and one in the experimental namespace.
I would suggest removing revise-tone-task-generator from the experimental namespace, since we also have it in its own namespace on staging. That should free up 1 GPU.
I think you can try to remove amd.com/gpu: "1"
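For illustration, that line would sit under the container's resources; everything below except the amd.com/gpu line is an assumption about the surrounding shape:

```yaml
# Hypothetical predictor resources block; dropping the amd.com/gpu line
# stops the pod from claiming one of the two MI210s.
resources:
  limits:
    cpu: "4"          # placeholder value
    memory: 8Gi       # placeholder value
    # amd.com/gpu: "1"   <- remove this line to free the GPU
```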
Dec 12 2025
Dec 11 2025
Dec 10 2025
Dec 9 2025
Dec 8 2025
Dec 4 2025
let's solve this by removing
[4:0:0:0] disk ATA Micron_5400_MTFD U002 /dev/sda
[5:0:0:0] disk ATA Micron_5400_MTFD U002 /dev/sdc
Dec 2 2025
The above could be a false positive; it might be happening when the plugin is restarted.
perhaps this is relevant:
Most likely. I'm currently looking at the builder machine, so I will come back to this.
I'll look into it.
hmmm doesn't seem like it.
Nov 28 2025
Nov 27 2025
Cool, I'll shoot a message in IRC to the SIG regarding "You create a new production-images-like repository in gerrit and commit to it only ML Dockerfiles, with their changelogs etc."
The inference-services repo will most likely move to GitLab, so we can probably store the ML-specific Dockerfiles there.
Also good point about docker-pkg; let's keep that for now, but it's probably not needed down the road.
- we can also block access to major public registries in the http_proxy or via iptables on the host
- let's start with having the machine wiped and configured for ML team access, docker-pkg installed, and the host whitelisted to push to the WMF registry; we can take the GitLab enrollment in a second step. But just so you know, a GitLab runner can be tied to specific groups or even specific repos, making it unavailable to anyone/anything outside that scope. So this won't be a shared runner but rather an ML-only one; in other words, it would only accept jobs from ML-specific repos and push to the WMF registry. Regarding "making sure no weird stuff is pushed to the internal registry", I don't have an immediate solution beyond due diligence and GitLab CI steps blocking merge requests containing images from external sources. On a related note though, afaik we still use pip to install Python dependencies from outside, so we are not fully isolated/immune to supply chain issues.
- we could simply set ip: 127.0.0.1 in /etc/docker/daemon.json so that containers can't be bound to 0.0.0.0; this effectively disarms anything left running. Also, the machine is not exposed to the outside, afaik.
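A minimal sketch of that daemon.json change; "ip" is the dockerd option that sets the default host address used when publishing container ports:

```json
{
  "ip": "127.0.0.1"
}
```

With this in place, a plain -p 8080:8080 binds to 127.0.0.1:8080 rather than all interfaces. Caveat: an explicit -p 0.0.0.0:8080:8080 would still override the default, so this is a safety net, not a hard block.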
I would like to resume this discussion and take a practical stab at making ml-lab1001.eqiad.wmnet an official build machine for the ML team.
Let's start with the basics:
- wipe the machine and manage the basics with puppet
- the machine will have docker installed
- the machine will be enrolled into gitlab as a gitlab runner
- the machine should be able to push images to the current WMF registry (we can go back to investigating a proper registry solution once the build machine is ready; otherwise there are too many topics flying around)
- SSH root access for ML SREs and non-root access for the ML team. This, however, should be an exception: most of the time the builder can be used via plain GitLab pipelines, so SSH shouldn't be needed. We can repurpose the other lab machine down the road as an experimental playground, one that is not allowed to publish any image anywhere, so that the ML team can experiment with build steps more freely (WMF needs to learn to trust the people it hires, and security needs to work in function of the teams/projects, not the other way around)
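The enrollment step could later look roughly like this; the URL, token, tags, and default image are placeholders, and the exact flags should be checked against our GitLab version:

```shell
# Hypothetical runner registration, scoped to the ML group via a group
# registration token so only ML repos can schedule jobs on this host.
sudo gitlab-runner register \
  --non-interactive \
  --url "https://gitlab.wikimedia.org" \
  --registration-token "<ML_GROUP_TOKEN>" \
  --executor docker \
  --docker-image "debian:bullseye" \
  --tag-list "ml-build"
```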
Nov 21 2025
Nov 12 2025
The ML team needs a place to store large LLM Docker container images, and the GitLab registry is a solution we would like to test.
At the same time we intend to use GitLab's pipelines, so keeping the entire production flow in a single place will streamline the product development workflow.
Nov 10 2025
Nov 7 2025
@Eevans I guess we can just start with a set of shared credentials and split later if needed.
Regarding 1., I suspect I can just re-use this part: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1029139/
Regarding 2., would flipping egress to true here be sufficient? https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/charts/kserve-inference/values.yaml#43 Or perhaps a specific policy under GlobalNetworkPolicies in ml-serve.yaml?
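For 2., the change would be a values override along these lines; the exact key path should be taken from the linked values.yaml, as the names below are assumptions:

```yaml
# Hypothetical override for the kserve-inference chart, flipping the
# egress toggle referenced above.
networkpolicy:
  egress:
    enabled: true
```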
Nov 6 2025
Egress seems to be disabled, but that could just be the default chart value, since some services can clearly make egress calls to fetch models, Kafka, etc. Need to check.
For local workflows it might be good to have it in a Docker Compose setup.
Is Cassandra running on the prod network? If yes, it should be reachable at a given address/port with a set of credentials, no?
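For the local-workflow idea, a minimal sketch of a Compose file with a throwaway Cassandra for testing; the image tag, port binding, and environment values are assumptions:

```yaml
# Hypothetical docker-compose.yml for local development only.
services:
  cassandra:
    image: cassandra:4.1        # placeholder tag
    ports:
      - "127.0.0.1:9042:9042"   # CQL port, bound to localhost only
    environment:
      CASSANDRA_CLUSTER_NAME: local-dev
```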
Nov 4 2025
Oct 30 2025
Cool, then I can keep it public :)
It's a good question. Let's say I create a README.md and put in it a summary of the information from this link: https://netbox.wikimedia.org/search/?q=ml-&per_page=1000 . The information in Netbox is only available after authentication, so my reasoning was that it's not intended for a public audience, and therefore any "remix" of that information shouldn't be public either. I'm happy to be wrong :)
Oct 29 2025
@calbon if you can please approve :)
Oct 28 2025
Oct 27 2025
Hello, would it be possible to have it approved?
Thanks!
Oct 24 2025
Invite accepted; 2FA has always been on :)