
Add Dragonfly to the ML k8s clusters
Closed, ResolvedPublic5 Estimated Story Points

Description

Starting point https://wikitech.wikimedia.org/wiki/Dragonfly

Dragonfly is a p2p network, composed of the k8s worker nodes and supernodes, that sits between every Kubelet and the Docker Registry. It acts as a distributed cache, avoiding hitting the Docker registry too many times when the content can instead be streamed from other peers.
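For context, in Dragonfly 1.x the dfdaemon on each worker is typically exposed to Docker as a registry mirror, so every image pull transparently goes through the p2p network. A minimal sketch of the Docker side, assuming dfdaemon's default proxy port 65001 (both the mechanism and the port are taken from upstream Dragonfly docs, not from our puppetization):

```json
{
  "registry-mirrors": ["http://127.0.0.1:65001"]
}
```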

High level questions:

  • Should we create new supernodes only for us, or is it fine to use the ones that Wikikube uses?

Event Timeline

I think it's fine to use the existing supernodes. They act as coordinators only, so there is not much load or network traffic even during mw-deployments.

Change 1009548 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] role::ml_k8s::staging::worker: add Dragonfly

https://gerrit.wikimedia.org/r/1009548

Change 1009758 had a related patch set uploaded (by Elukey; author: Elukey):

[labs/private@master] Add Docker secret for Dragonfly cache to ML K8s staging

https://gerrit.wikimedia.org/r/1009758

Change 1009758 merged by Elukey:

[labs/private@master] Add Docker secret for Dragonfly cache to ML K8s staging

https://gerrit.wikimedia.org/r/1009758

Change 1009548 merged by Elukey:

[operations/puppet@production] role::ml_k8s::staging::worker: add Dragonfly

https://gerrit.wikimedia.org/r/1009548

Dragonfly deployed to staging, now we need to test it and see how it works :)

isarantopoulos moved this task from Unsorted to In Progress on the Machine-Learning-Team board.

Today we tested the deployment of a new image in staging, and everything worked as expected. Some notes:

  • The new image was correctly downloaded from the Registry the first time.
  • I cordoned the 2001 node (the first one that got the new image/pod), killed the pod and waited for the new image to be pulled via p2p from Dragonfly. It seems to have worked: I saw some chunks streamed from 2001 to 2002, but I am not 100% sure how things are supposed to work, so more reading and testing is needed.

Next steps:

  • Test on staging more use cases
  • Deploy to prod

Change 1010534 had a related patch set uploaded (by Elukey; author: Elukey):

[labs/private@master] Add fake Docker secret config for Dragonfly on ml-serve k8s

https://gerrit.wikimedia.org/r/1010534

Change 1010534 merged by Elukey:

[labs/private@master] Add fake Docker secret config for Dragonfly on ml-serve k8s

https://gerrit.wikimedia.org/r/1010534

Change 1010535 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] Add Dragonfly p2p cache to ml-serve k8s

https://gerrit.wikimedia.org/r/1010535

dfdaemon logs on 2001 (first pull of the image in the cluster):

[..]
2024-03-12 13:24:04.218 INFO sign:2010962 : dfget url:https://docker-registry.discovery.wmnet/v2/wikimedia/machinelearning-liftwing-inference-services-readability/blobs/sha256:fd9f79863c2d0c030047c3fe23a317131a1cf5cea2fe2dda92576da18c4258df [SUCCESS] cost:1.193s
2024-03-12 13:24:04.940 INFO sign:2010962 : dfget url:https://docker-registry.discovery.wmnet/v2/wikimedia/machinelearning-liftwing-inference-services-readability/blobs/sha256:22411a39cf8600086f2138bd46e86b5ca9e587f5e0d3a2beeccb793df0534419 [SUCCESS] cost:1.279s
2024-03-12 13:24:16.540 INFO sign:2010962 : dfget url:https://docker-registry.discovery.wmnet/v2/wikimedia/machinelearning-liftwing-inference-services-readability/blobs/sha256:74b022bb79e87f21256273490fdc55c781ccf33b1bc120529df4dd4e6715b15c [SUCCESS] cost:12.887s

Then on 2002 (image already present on 2001):

[..]
2024-03-12 13:30:51.372 INFO sign:3239705 : start download url:https://docker-registry.discovery.wmnet/v2/wikimedia/machinelearning-liftwing-inference-services-readability/blobs/sha256:74b022bb79e87f21256273490fdc55c781ccf33b1bc120529df4dd4e6715b15c to 79e0074f-7fad-4084-9ab7-f47f6d998c8e in repo
2024-03-12 13:30:51.788 INFO sign:3239705 : dfget url:https://docker-registry.discovery.wmnet/v2/wikimedia/machinelearning-liftwing-inference-services-readability/blobs/sha256:22411a39cf8600086f2138bd46e86b5ca9e587f5e0d3a2beeccb793df0534419 [SUCCESS] cost:0.553s
2024-03-12 13:31:01.650 INFO sign:3239705 : dfget url:https://docker-registry.discovery.wmnet/v2/wikimedia/machinelearning-liftwing-inference-services-readability/blobs/sha256:74b022bb79e87f21256273490fdc55c781ccf33b1bc120529df4dd4e6715b15c [SUCCESS] cost:10.278s
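As a quick sanity check, the per-layer cost figures in these logs can be totalled with a small shell helper (a sketch; the sample values are the ones from the dfget lines quoted above):

```shell
# Sum the "cost:<seconds>s" figures that dfget logs per layer, to compare
# the overall pull time on each node.
sum_costs() {
  sed -n 's/.*cost:\([0-9.]*\)s.*/\1/p' | awk '{t += $1} END {printf "%.3f\n", t}'
}

printf '%s\n' 'cost:1.193s' 'cost:1.279s' 'cost:12.887s' | sum_costs  # ml-staging2001: 15.359
printf '%s\n' 'cost:0.553s' 'cost:10.278s' | sum_costs                # ml-staging2002: 10.831
```

On a real node the same function can be fed directly from the dfdaemon log file instead of inline samples.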

The time taken on 2002 for the last 3 layers is considerably lower than on 2001. But when I checked 74b022bb79e87f21256273490fdc55c781ccf33b1bc120529df4dd4e6715b15c in registry2001's nginx access log, I see:

  • two entries for the supernode2001 host
  • 30 for ml-staging2001
  • 35 for ml-staging2002

Afaics from the logs the clients were fetching chunks of data from the registry every time (not the entire content at once), but I am wondering whether the whole Docker image on ml-staging2002 should have been pulled entirely from ml-staging2001. Is there anything I am missing @JMeybohm ?
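A sketch of how such a per-client count can be pulled out of the registry's access log; the sample lines below use a made-up, simplified log format and a truncated digest purely for illustration, on the real host you would feed nginx's access.log through the same pipeline after grepping for the digest:

```shell
# Count how many access-log entries each client produced for one blob digest.
# Field 1 is assumed to be the client host in this simplified sample format.
awk '{count[$1]++} END {for (h in count) print count[h], h}' <<'EOF' | sort -rn
ml-staging2002 GET /v2/wikimedia/some-image/blobs/sha256:74b022bb 206
ml-staging2001 GET /v2/wikimedia/some-image/blobs/sha256:74b022bb 206
ml-staging2001 GET /v2/wikimedia/some-image/blobs/sha256:74b022bb 206
supernode2001 GET /v2/wikimedia/some-image/blobs/sha256:74b022bb 200
EOF
```

The 206 statuses in the sample reflect that the peers fetch byte ranges (chunks) rather than whole blobs, which matches what the real log showed.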

Change 1010535 merged by Elukey:

[operations/puppet@production] Add Dragonfly p2p cache to ml-serve k8s

https://gerrit.wikimedia.org/r/1010535

Dragonfly deployed to ml serve production clusters. The last step is to figure out if everything works as expected, see T359416#9624256

Afaics from the logs the clients were fetching chunks of data from the registry every time (not the entire content at once), but I am wondering whether the whole Docker image on ml-staging2002 should have been pulled entirely from ml-staging2001. Is there anything I am missing @JMeybohm ?

I'm not sure that's what is supposed to happen. I don't recall exactly off the top of my head, but IIRC Dragonfly tries to be a bit clever and will still use the seed (the docker registry) as one of the sources (especially if there is only one source within the p2p network itself).

Ack, I can try to check a broader use case in production; I was mainly wondering whether my understanding was right. I naively thought that once a k8s worker got a new image from the registry, the other nodes would fetch it only from that worker.

I also have another question: you mentioned that the docker registry is the seed, but I am wondering what the supernode does. In the Dragonfly YouTube video linked on Wikitech, the speaker seems to suggest that the supernode downloads the content from the source (IIUC the registry, in our case) and then acts as the seed. I am asking because the supernode VMs are tiny: if they have to download content from the registry and act as seeds, there may be network bandwidth issues when multiple new images are requested at once (it shouldn't be a big concern, but I'd like to clear up my doubts :D).

That's right. By default the supernode does act as a CDN in front of the docker-registry, but I intentionally disabled that behavior as there's no benefit to it in our infra. I would assume the supernodes still query the registry for some sanity-check data, but the seeding happens directly from the docker-registry instances.
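For reference, in Dragonfly 1.x the supernode's CDN behavior is controlled by its cdnPattern setting; a sketch of the relevant supernode config, with key names assumed from the upstream docs rather than taken from our actual puppet-managed file:

```yaml
# supernode config sketch (Dragonfly 1.x); key names assumed from upstream docs
base:
  # "local": the supernode fetches from the registry and seeds peers itself (CDN mode)
  # "source": peers fetch missing pieces straight from the registry; the
  #           supernode only coordinates/schedules, as described above
  cdnPattern: source
```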

Rolled out Dragonfly to all ml clusters!