
DPogorzelski-WMF (Dawid Pogorzelski)
User

Projects (2)

User Details

User Since
Oct 20 2025, 12:04 PM (20 w, 12 h)
Availability
Available
LDAP User
Unknown
MediaWiki User
DPogorzelski-WMF [ Global Accounts ]

Recent Activity

Fri, Mar 6

DPogorzelski-WMF added a subtask for T398948: Q1 FY2025-26 Goal: Operational Excellence - LiftWing Platform Updates & Improvements: T419235: Fix revertrisk Pyrra SLO.
Fri, Mar 6, 1:35 PM · Goal, Machine-Learning-Team
DPogorzelski-WMF added a parent task for T419235: Fix revertrisk Pyrra SLO: T398948: Q1 FY2025-26 Goal: Operational Excellence - LiftWing Platform Updates & Improvements.
Fri, Mar 6, 1:35 PM · Machine-Learning-Team
DPogorzelski-WMF created T419235: Fix revertrisk Pyrra SLO.
Fri, Mar 6, 1:34 PM · Machine-Learning-Team

Thu, Mar 5

DPogorzelski-WMF added a comment to T419040: kserve helm status is broken across ml clusters.

It seems that no matter how you handle it, the result is the same.
Needs more investigation on the cert-manager side.

Thu, Mar 5, 2:08 PM · Machine-Learning-Team
DPogorzelski-WMF added a comment to T419040: kserve helm status is broken across ml clusters.

What works, but only once:
kubectl delete crd inferenceservices.serving.kserve.io --cascade=true
helmfile -e ml-staging-codfw sync
Then the issue comes back.
I will try to remove the caBundle: Cg== from the chart, which is just an empty line.
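For reference, Cg== is the Base64 encoding of a single newline byte (0x0A), which is why the field reads as an empty line. A quick way to check this, assuming a base64 that accepts -d (GNU coreutils, BusyBox, recent macOS); the helper name decode_hex is hypothetical:

```shell
# Decode a Base64 string and print its raw bytes as hex, without separators.
# decode_hex is an illustrative helper, not part of any chart or tool.
decode_hex() {
  printf '%s' "$1" | base64 -d | od -An -tx1 | tr -d ' \n'
}

# The caBundle value from the chart decodes to one byte: 0a (a newline).
decode_hex 'Cg=='
echo
```

Whether removing the field changes how the CA gets injected is a separate question that this check does not answer.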

Thu, Mar 5, 1:10 PM · Machine-Learning-Team
DPogorzelski-WMF added a comment to T419040: kserve helm status is broken across ml clusters.

I'll try a few fixes on the side on staging

Thu, Mar 5, 12:40 PM · Machine-Learning-Team

Mon, Mar 2

DPogorzelski-WMF closed T394778: Build and push images to the docker registry from ml-lab as Resolved.
Mon, Mar 2, 3:13 PM · Machine-Learning-Team
DPogorzelski-WMF created T418722: Incident: 2026-02-23 ml-serve.
Mon, Mar 2, 10:06 AM · Machine-Learning-Team

Thu, Feb 12

DPogorzelski-WMF added a comment to T414485: Upgrade ML clusters to kubernetes 1.31.

all etcd machines are updated

Thu, Feb 12, 12:59 PM · Machine-Learning-Team, Kubernetes, Prod-Kubernetes

Mon, Feb 9

DPogorzelski-WMF added a comment to T414485: Upgrade ML clusters to kubernetes 1.31.

roger

Mon, Feb 9, 4:33 PM · Machine-Learning-Team, Kubernetes, Prod-Kubernetes
DPogorzelski-WMF added a comment to T416580: Kserve helm chart.

Btw the chart does work fine locally, for what it's worth. Bartosz also tested it.

Mon, Feb 9, 12:39 PM · Kubernetes, SRE
DPogorzelski-WMF added a comment to T416580: Kserve helm chart.

The README contains things that need to be added to avoid the issues that you want to fix with small iterations, so since we know it beforehand I am not 100% sure why you want to rediscover them another time.

I don't want to. In fact, what I want is to import the chart and deploy it, then add what is missing on top, using the README and feedback from the deployment as a guideline. Not everything is clear to me yet, so this is also a way of absorbing the internal know-how.

Mon, Feb 9, 12:31 PM · Kubernetes, SRE
DPogorzelski-WMF added a comment to T416580: Kserve helm chart.

Will do, but I would argue it's much better to deploy it, test it, see what's broken, fix and iterate until it's working as intended. Quick, small iterations.
It's a big chart, and planning this waterfall-style is not going to result in something that works 100%, no matter how much time one spends evaluating the differences.
In the end, even if we don't import the chart, we will end up copying a big portion of it anyway because we still need kserve, so one way or another it will end up in the repo, perhaps adapted but still.

Mon, Feb 9, 11:44 AM · Kubernetes, SRE
DPogorzelski-WMF added a comment to T416580: Kserve helm chart.

It seems that the major difference is the fact that we have a calico network policy but the chart doesn't (unsurprisingly). Perhaps we can supply that out of band.
Our images expect /usr/bin/manager but upstream uses /manager and this is not configurable. We might want to update our images to use the upstream path.

Mon, Feb 9, 9:52 AM · Kubernetes, SRE

Feb 6 2026

DPogorzelski-WMF added a comment to T416580: Kserve helm chart.

@JMeybohm will check and update the ticket, cheers!

Feb 6 2026, 6:12 PM · Kubernetes, SRE
DPogorzelski-WMF closed T412524: New WMF docker registry credentials, a subtask of T394778: Build and push images to the docker registry from ml-lab, as Resolved.
Feb 6 2026, 12:23 PM · Machine-Learning-Team
DPogorzelski-WMF closed T412524: New WMF docker registry credentials as Resolved.
Feb 6 2026, 12:23 PM · Kubernetes, ServiceOps new, SRE

Feb 5 2026

DPogorzelski-WMF added a comment to T416580: Kserve helm chart.

to be noted that we already use kserve in the ML context installed via:
https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/helmfile.d/admin_ng/kserve/
and
https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/helmfile.d/admin_ng/knative-serving/

Feb 5 2026, 11:56 AM · Kubernetes, SRE
DPogorzelski-WMF created T416580: Kserve helm chart.
Feb 5 2026, 11:55 AM · Kubernetes, SRE

Jan 30 2026

DPogorzelski-WMF added a comment to P87998 amd-rocm70 directory exists in the Wikimedia APT repo, but the packages are missing.

fixed https://apt-browser.toolforge.org/bookworm-wikimedia/thirdparty/amd-rocm70/

Jan 30 2026, 9:10 AM

Jan 14 2026

DPogorzelski-WMF created T414576: Failing docker registry httpbb tests.
Jan 14 2026, 1:13 PM · Kubernetes, ServiceOps new, SRE

Dec 22 2025

DPogorzelski-WMF added a comment to P86741 embeddings isvc deployment in experimental ns failing because of insufficient GPUs.

ignore my suggestion above, it seems that the mi210 gpus are both taken by revise-tone-task, one running in the revise-tone-task-generator and one in the experimental namespace.
I would suggest removing revise-tone-task-generator from the experimental namespace since we also have it in its own namespace on staging. That should free up 1 gpu.

Dec 22 2025, 10:20 AM · Machine-Learning-Team
DPogorzelski-WMF added a comment to P86741 embeddings isvc deployment in experimental ns failing because of insufficient GPUs.

i think you can try to remove amd.com/gpu: "1"

Dec 22 2025, 10:10 AM · Machine-Learning-Team

Dec 12 2025

DPogorzelski-WMF created T412524: New WMF docker registry credentials.
Dec 12 2025, 2:18 PM · Kubernetes, ServiceOps new, SRE

Dec 11 2025

DPogorzelski-WMF reassigned T412357: Install AMD GPU + torch version of ML Labs machines from DPogorzelski-WMF to klausman.
Dec 11 2025, 11:38 AM · Machine-Learning-Team

Dec 10 2025

DPogorzelski-WMF created T412213: Relabel ml-lab1001->ml-build1001.
Dec 10 2025, 12:42 PM · DC-Ops

Dec 9 2025

DPogorzelski-WMF closed T411993: dpogorzelski gpg key as Resolved.
Dec 9 2025, 9:46 AM · SRE

Dec 8 2025

DPogorzelski-WMF created T411993: dpogorzelski gpg key.
Dec 8 2025, 10:02 AM · SRE

Dec 4 2025

DPogorzelski-WMF added a comment to T411753: Wrong disk order on ml-lab1001?.

let's solve this by removing
[4:0:0:0] disk ATA Micron_5400_MTFD U002 /dev/sda
[5:0:0:0] disk ATA Micron_5400_MTFD U002 /dev/sdc

Dec 4 2025, 12:51 PM · SRE, ops-eqiad, DC-Ops
DPogorzelski-WMF created T411753: Wrong disk order on ml-lab1001?.
Dec 4 2025, 9:05 AM · SRE, ops-eqiad, DC-Ops

Dec 2 2025

DPogorzelski-WMF added a comment to T403599: Setup & experiments for MI300x GPUs used for LiftWing.

The above could be a false positive; it might be happening when the plugin is restarted.

Dec 2 2025, 1:58 PM · Machine-Learning-Team
DPogorzelski-WMF added a comment to T403599: Setup & experiments for MI300x GPUs used for LiftWing.

perhaps this is relevant:

Dec 2 2025, 1:46 PM · Machine-Learning-Team
DPogorzelski-WMF added a comment to T403599: Setup & experiments for MI300x GPUs used for LiftWing.

Most likely, I'm currently looking at the builder machine so will come back to this

Dec 2 2025, 12:43 PM · Machine-Learning-Team
DPogorzelski-WMF added a comment to T394778: Build and push images to the docker registry from ml-lab.

Thanks for the nice discussion everyone. Overall, I think with the suggestion of building images on a dedicated ML machine and with the precautions discussed, we are OK with moving forward and unblocking this.

  • we can also block access to major public registries in the http_proxy or via iptables on the host:

iptables -A OUTPUT -d your-internal-registry.com -j ACCEPT

iptables -A OUTPUT -d registry-1.docker.io -j REJECT
iptables -A OUTPUT -d index.docker.io -j REJECT
iptables -A OUTPUT -d quay.io -j REJECT
iptables -A OUTPUT -d ghcr.io -j REJECT
iptables -A OUTPUT -d gcr.io -j REJECT

While iptables rules can be changed by people, I trust everyone in the team, so this is mostly to prevent shooting ourselves in the foot and pulling from outside by accident.
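The REJECT rules above can be generated from a list rather than typed by hand. The sketch below only prints the commands (a dry run) and applies nothing; the function name print_registry_rules is hypothetical. One caveat worth keeping in mind: iptables resolves a hostname passed to -d once, when the rule is added, so rules built from registry hostnames can go stale when the underlying addresses change.

```shell
# Dry run: print one REJECT rule per registry host instead of applying it.
# print_registry_rules is an illustrative helper name.
print_registry_rules() {
  for host in "$@"; do
    printf 'iptables -A OUTPUT -d %s -j REJECT\n' "$host"
  done
}

# Registry list taken from the comment above.
print_registry_rules registry-1.docker.io index.docker.io quay.io ghcr.io gcr.io
```

Reviewing the printed output before piping it to a root shell keeps the change auditable.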

The machines will need specific Docker configuration anyway (setting up the proxy for all operations) to be able to reach the outside, so this is probably not needed. And if someone decides to mess with the configuration (which requires root) and fetch an outside image, no iptables rule would save us.

  • let's start with having the machine wiped and configured for ML team access, docker-pkg installed and the host whitelisted to push to the WMF registry, we can take the gitlab enrollment in a second step. but just so you know, a gitlab runner can be tied to specific groups or even specific repos making it unavailable for anyone/anything outside of that scope. so this won't be a shared runner but rather an ML only one. in other words it would only accept jobs from ML specific repos and push to the WMF registry.

It would be nice if it could also push only under a specific hierarchy, e.g. /repos/<insert-start-of-ml-hierarchy>/. (/repos being the start of the Gitlab managed hierarchy of Docker images IIRC). We already have /releng (and dedicated username/password pairs for that) so there is prior art.

I'll look into it.

Dec 2 2025, 10:01 AM · Machine-Learning-Team
DPogorzelski-WMF added a comment to T403599: Setup & experiments for MI300x GPUs used for LiftWing.

hmmm doesn't seem like it.

Dec 2 2025, 8:24 AM · Machine-Learning-Team

Nov 28 2025

DPogorzelski-WMF added a comment to T394778: Build and push images to the docker registry from ml-lab.

I would like to resume this discussion and take a practical stab at making the ml-lab1001.eqiad.wmnet an official build machine for the ML team.
Let's start with the basics:

  • wipe the machine and manage the basics with puppet
  • the machine will have docker installed
  • the machine will be enrolled into gitlab as a gitlab runner
  • the machine should be able to push images to the current WMF registry (we can go back to investigate a proper registry solution once the build machine is ready, otherwise there are too many topics flying around)
  • SSH root access for ML SREs and non-root access for the ML team. this however should be an exception, most of the time the builder can be used via plain Gitlab Pipelines so SSH shouldn't be needed; we can repurpose the other lab machine down the road as an experiment playground, one that is not allowed to publish any image anywhere so that the ML team can actually experiment with build steps more freely (WMF needs to learn to trust people it hires and security needs to work in service of the teams/projects, not the other way around)

If the above is fine i'm going to start looking at the first steps.
Feel free to comment or add interested parties to the discussion.

Thank you for picking this up, @DPogorzelski-WMF. If you proceed with the plan to wipe ml-lab1001, could you please move the contents of my (and/or other people's) home directory to ml-lab1002? Thanks in advance.

Nov 28 2025, 1:14 PM · Machine-Learning-Team

Nov 27 2025

DPogorzelski-WMF added a comment to T394778: Build and push images to the docker registry from ml-lab.

cool, i'll shoot a message in IRC to the sig regarding "You create a new production-images-like repository in gerrit and commit to it only ML Dockerfiles, with their changelogs etc.."
inference-services repo will most likely move to gitlab so we can probably store the specific ML dockerfiles there.
also good point about docker-pkg, let's keep that for now but probably not needed down the road.

Nov 27 2025, 3:34 PM · Machine-Learning-Team
DPogorzelski-WMF added a comment to T394778: Build and push images to the docker registry from ml-lab.
  • we can also block access to major public registries in the http_proxy or via iptables on the host:
Nov 27 2025, 1:30 PM · Machine-Learning-Team
DPogorzelski-WMF added a comment to T394778: Build and push images to the docker registry from ml-lab.
  • let's start with having the machine wiped and configured for ML team access, docker-pkg installed and the host whitelisted to push to the WMF registry, we can take the gitlab enrollment in a second step. but just so you know, a gitlab runner can be tied to specific groups or even specific repos making it unavailable for anyone/anything outside of that scope. so this won't be a shared runner but rather an ML only one. in other words it would only accept jobs from ML specific repos and push to the WMF registry. regarding "making sure no weird stuff is pushed to the internal registry" i don't have an immediate solution beyond: due diligence, Gitlab CI steps blocking merge requests containing images from external sources. on a related note though, afaik we still use pip to install python dependencies from outside so we are not fully isolated/immune to the supply chain issues
  • we could simply set ip: 127.0.0.1 in /etc/docker/daemon.json so that you can't bind containers to 0.0.0.0, this effectively disarms anything left running, also the machine is not exposed to the outside afaik.
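A minimal sketch of the daemon.json setting from the last bullet, written to a scratch file rather than to /etc/docker/daemon.json. The helper name and the scratch path are assumptions for illustration. Note that, to my understanding, the ip key sets the default address for published ports that do not name one, so a container started with an explicit 0.0.0.0 binding may still bypass it; that is worth verifying against the Docker daemon documentation before relying on it.

```shell
# Write a candidate daemon.json to the given path.
# write_daemon_json is an illustrative helper; the real target would be
# /etc/docker/daemon.json, followed by a restart of the Docker daemon.
write_daemon_json() {
  cat > "$1" <<'EOF'
{
  "ip": "127.0.0.1"
}
EOF
}

# Use a scratch file so nothing on the host is modified.
out="$(mktemp)"
write_daemon_json "$out"
cat "$out"
```

In practice this file would be managed through puppet alongside the proxy settings discussed in the same task.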
Nov 27 2025, 1:25 PM · Machine-Learning-Team
DPogorzelski-WMF added a comment to T394778: Build and push images to the docker registry from ml-lab.

I would like to resume this discussion and take a practical stab at making the ml-lab1001.eqiad.wmnet an official build machine for the ML team.
Let's start with the basics:

  • wipe the machine and manage the basics with puppet
  • the machine will have docker installed
  • the machine will be enrolled into gitlab as a gitlab runner
  • the machine should be able to push images to the current WMF registry (we can go back to investigate a proper registry solution once the build machine is ready, otherwise there are too many topics flying around)
  • SSH root access for ML SREs and non-root access for the ML team. this however should be an exception, most of the time the builder can be used via plain Gitlab Pipelines so SSH shouldn't be needed; we can repurpose the other lab machine down the road as an experiment playground, one that is not allowed to publish any image anywhere so that the ML team can actually experiment with build steps more freely (WMF needs to learn to trust people it hires and security needs to work in service of the teams/projects, not the other way around)
Nov 27 2025, 9:16 AM · Machine-Learning-Team

Nov 21 2025

DPogorzelski-WMF updated the task description for T410752: Requesting access to ml-lab-users for ttaylor.
Nov 21 2025, 3:24 PM · SRE, SRE-Access-Requests
DPogorzelski-WMF created T410752: Requesting access to ml-lab-users for ttaylor.
Nov 21 2025, 3:18 PM · SRE, SRE-Access-Requests

Nov 12 2025

DPogorzelski-WMF added a comment to T304845: gitlab: consider enabling docker container registry.

The ML team needs a place to store large LLM docker container images, and the gitlab registry is a solution we would like to test.
At the same time we intend to use gitlab's pipelines so keeping the entire production flow in a single place will streamline the product development workflow.

Nov 12 2025, 4:08 PM · collaboration-services, cloud-services-team, Release-Engineering-Team (Priority Backlog 📥), GitLab (Administration, Settings & Policy)

Nov 10 2025

DPogorzelski-WMF added a comment to T409414: Configure Lift Wing isvc Integration with Cassandra.

@achou which services should be able to connect to cassandra? to know where to enable egress
@Eevans I would need to know the cassandra endpoint and possibly a set of credentials :)

Nov 10 2025, 2:05 PM · Machine-Learning-Team

Nov 7 2025

DPogorzelski-WMF added a comment to T409414: Configure Lift Wing isvc Integration with Cassandra.

@Eevans i guess we can just start with a set of shared credentials and split later if needed

Nov 7 2025, 4:23 PM · Machine-Learning-Team
DPogorzelski-WMF added a comment to T409414: Configure Lift Wing isvc Integration with Cassandra.

regarding 1. i suspect i can just re-use this part https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1029139/
regarding 2. would flipping egress to true here be sufficient? https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/charts/kserve-inference/values.yaml#43 or perhaps a specific policy in GlobalNetworkPolicies in ml-serve.yaml?

Nov 7 2025, 4:19 PM · Machine-Learning-Team

Nov 6 2025

DPogorzelski-WMF added a comment to T409414: Configure Lift Wing isvc Integration with Cassandra.

Egress seems to be disabled but that could be just the default chart value since some services can clearly make egress calls to fetch models, kafka, etc. need to check

Nov 6 2025, 1:34 PM · Machine-Learning-Team
DPogorzelski-WMF added a comment to T409414: Configure Lift Wing isvc Integration with Cassandra.

for local workflows it might be good to have it in a docker compose

Nov 6 2025, 12:22 PM · Machine-Learning-Team
DPogorzelski-WMF added a comment to T409414: Configure Lift Wing isvc Integration with Cassandra.

is Cassandra running on the prod network? if yes it should be reachable at a given address/port with a set of credentials, no?

Nov 6 2025, 12:21 PM · Machine-Learning-Team

Nov 4 2025

DPogorzelski-WMF moved T408702: Promote dpogorzelski from ops-limited to ops from Unsorted to Blocked on the Machine-Learning-Team board.
Nov 4 2025, 7:31 AM · SRE, SRE-Access-Requests, Machine-Learning-Team

Oct 30 2025

DPogorzelski-WMF closed T408784: GitLab Private Repository Request for: ML infrastructure repository as Resolved.

Cool, then i can keep it public :)

Oct 30 2025, 4:31 PM · GitLab
DPogorzelski-WMF assigned T408702: Promote dpogorzelski from ops-limited to ops to mark.
Oct 30 2025, 11:29 AM · SRE, SRE-Access-Requests, Machine-Learning-Team
DPogorzelski-WMF closed T408788: Posix group membership: dpogorzelski ->ml-lab-users as Invalid.
Oct 30 2025, 10:34 AM · SRE, SRE-Access-Requests
DPogorzelski-WMF created T408788: Posix group membership: dpogorzelski ->ml-lab-users.
Oct 30 2025, 10:21 AM · SRE, SRE-Access-Requests
DPogorzelski-WMF added a comment to T408784: GitLab Private Repository Request for: ML infrastructure repository.

It's a good question, let's say I create a README.md and put in it the summary of the information from this link https://netbox.wikimedia.org/search/?q=ml-&per_page=1000 . The information under netbox is only available after authentication so my reasoning was that it's not intended to be shared with a public audience and therefore any "remix" of that information shouldn't be public either. I'm happy to be wrong :)

Oct 30 2025, 9:36 AM · GitLab
DPogorzelski-WMF created T408784: GitLab Private Repository Request for: ML infrastructure repository.
Oct 30 2025, 9:24 AM · GitLab

Oct 29 2025

DPogorzelski-WMF added a comment to T408579: Add dpogorzelski to ML and Data Platform posix groups.

@calbon if you can please approve :)

Oct 29 2025, 3:46 PM · Data-Engineering, SRE, SRE-Access-Requests
DPogorzelski-WMF updated subscribers of T408579: Add dpogorzelski to ML and Data Platform posix groups.
Oct 29 2025, 3:45 PM · Data-Engineering, SRE, SRE-Access-Requests
DPogorzelski-WMF created T408702: Promote dpogorzelski from ops-limited to ops.
Oct 29 2025, 3:06 PM · SRE, SRE-Access-Requests, Machine-Learning-Team

Oct 28 2025

DPogorzelski-WMF updated the task description for T408579: Add dpogorzelski to ML and Data Platform posix groups.
Oct 28 2025, 3:20 PM · Data-Engineering, SRE, SRE-Access-Requests
DPogorzelski-WMF created T408579: Add dpogorzelski to ML and Data Platform posix groups.
Oct 28 2025, 3:19 PM · Data-Engineering, SRE, SRE-Access-Requests
DPogorzelski-WMF created T408519: security@wikimedia.org mailing list subscription.
Oct 28 2025, 8:33 AM · SecTeam-Processed, Security-Team

Oct 27 2025

DPogorzelski-WMF added a comment to T407955: Requesting access to ops-limited for dpogorzelski.

Hello, would it be possible to have it approved?
Thanks!

Oct 27 2025, 8:26 AM · SRE, SRE-Access-Requests

Oct 24 2025

DPogorzelski-WMF added a comment to T407839: Add Dawid Pogorzelski to WMF GitHub organization.

Invite accepted, 2fa has always been on :)

Oct 24 2025, 7:31 AM · Machine-Learning-Team, Wikimedia-GitHub

Oct 22 2025

DPogorzelski-WMF created T407955: Requesting access to ops-limited for dpogorzelski.
Oct 22 2025, 9:31 AM · SRE, SRE-Access-Requests