
Add Dragonfly to the ML k8s clusters
Closed, ResolvedPublic5 Estimated Story Points

Description

Starting point https://wikitech.wikimedia.org/wiki/Dragonfly

Dragonfly is a p2p network, composed of the k8s worker nodes and supernodes, that sits between every Kubelet and the Docker Registry. It acts as a distributed cache, avoiding hitting the Docker registry too many times when the content can instead be streamed from other peers.
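For context, in Dragonfly 1.x the dfdaemon on each worker is typically exposed to Docker as a registry mirror, so every image pull transparently goes through the p2p network. A minimal sketch of the Docker side, assuming dfdaemon's default proxy port 65001 (both the mechanism and the port are taken from upstream Dragonfly docs, not from our puppetization):

```json
{
  "registry-mirrors": ["http://127.0.0.1:65001"]
}
```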

High level questions:

  • Should we create new supernodes only for us, or is it fine to use the ones that Wikikube uses?

Event Timeline

I think it's fine to use the existing supernodes. They act as coordinators only, so there is not much load or network traffic even during mw-deployments.

Change 1009548 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] role::ml_k8s::staging::worker: add Dragonfly

https://gerrit.wikimedia.org/r/1009548

Change 1009758 had a related patch set uploaded (by Elukey; author: Elukey):

[labs/private@master] Add Docker secret for Dragonfly cache to ML K8s staging

https://gerrit.wikimedia.org/r/1009758

Change 1009758 merged by Elukey:

[labs/private@master] Add Docker secret for Dragonfly cache to ML K8s staging

https://gerrit.wikimedia.org/r/1009758

Change 1009548 merged by Elukey:

[operations/puppet@production] role::ml_k8s::staging::worker: add Dragonfly

https://gerrit.wikimedia.org/r/1009548

Dragonfly deployed to staging, now we need to test it and see how it works :)

isarantopoulos moved this task from Unsorted to In Progress on the Machine-Learning-Team board.

Today we tested the deployment of a new image in staging, and everything worked as expected. Some notes:

  • The new image was correctly downloaded from the Registry the first time.
  • I cordoned the 2001 node (the first one that got the new image/pod), killed the pod and waited for the new image to be pulled via p2p from Dragonfly. It seems to have worked: I saw some chunks streamed from 2001 to 2002, but I am not 100% sure how things are supposed to work, so more reading and testing is needed.

Next steps:

  • Test on staging more use cases
  • Deploy to prod

Change 1010534 had a related patch set uploaded (by Elukey; author: Elukey):

[labs/private@master] Add fake Docker secret config for Dragonfly on ml-serve k8s

https://gerrit.wikimedia.org/r/1010534

Change 1010534 merged by Elukey:

[labs/private@master] Add fake Docker secret config for Dragonfly on ml-serve k8s

https://gerrit.wikimedia.org/r/1010534

Change 1010535 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] Add Dragonfly p2p cache to ml-serve k8s

https://gerrit.wikimedia.org/r/1010535

dfdaemon logs on 2001 (first pull of the image in the cluster):

[..]
2024-03-12 13:24:04.218 INFO sign:2010962 : dfget url:https://docker-registry.discovery.wmnet/v2/wikimedia/machinelearning-liftwing-inference-services-readability/blobs/sha256:fd9f79863c2d0c030047c3fe23a317131a1cf5cea2fe2dda92576da18c4258df [SUCCESS] cost:1.193s
2024-03-12 13:24:04.940 INFO sign:2010962 : dfget url:https://docker-registry.discovery.wmnet/v2/wikimedia/machinelearning-liftwing-inference-services-readability/blobs/sha256:22411a39cf8600086f2138bd46e86b5ca9e587f5e0d3a2beeccb793df0534419 [SUCCESS] cost:1.279s
2024-03-12 13:24:16.540 INFO sign:2010962 : dfget url:https://docker-registry.discovery.wmnet/v2/wikimedia/machinelearning-liftwing-inference-services-readability/blobs/sha256:74b022bb79e87f21256273490fdc55c781ccf33b1bc120529df4dd4e6715b15c [SUCCESS] cost:12.887s

Then on 2002 (image already present on 2001):

[..]
2024-03-12 13:30:51.372 INFO sign:3239705 : start download url:https://docker-registry.discovery.wmnet/v2/wikimedia/machinelearning-liftwing-inference-services-readability/blobs/sha256:74b022bb79e87f21256273490fdc55c781ccf33b1bc120529df4dd4e6715b15c to 79e0074f-7fad-4084-9ab7-f47f6d998c8e in repo
2024-03-12 13:30:51.788 INFO sign:3239705 : dfget url:https://docker-registry.discovery.wmnet/v2/wikimedia/machinelearning-liftwing-inference-services-readability/blobs/sha256:22411a39cf8600086f2138bd46e86b5ca9e587f5e0d3a2beeccb793df0534419 [SUCCESS] cost:0.553s
2024-03-12 13:31:01.650 INFO sign:3239705 : dfget url:https://docker-registry.discovery.wmnet/v2/wikimedia/machinelearning-liftwing-inference-services-readability/blobs/sha256:74b022bb79e87f21256273490fdc55c781ccf33b1bc120529df4dd4e6715b15c [SUCCESS] cost:10.278s
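As a quick sanity check, the per-layer cost figures in these logs can be totalled with a small shell helper (a sketch; the sample values are the ones from the dfget lines quoted above):

```shell
# Sum the "cost:<seconds>s" figures that dfget logs per layer, to compare
# the overall pull time on each node.
sum_costs() {
  sed -n 's/.*cost:\([0-9.]*\)s.*/\1/p' | awk '{t += $1} END {printf "%.3f\n", t}'
}

printf '%s\n' 'cost:1.193s' 'cost:1.279s' 'cost:12.887s' | sum_costs  # ml-staging2001: 15.359
printf '%s\n' 'cost:0.553s' 'cost:10.278s' | sum_costs                # ml-staging2002: 10.831
```

On a real node the same function can be fed directly from the dfdaemon log file instead of inline samples.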

The time taken on 2002 for the last 3 layers is considerably lower than on 2001. But when I checked 74b022bb79e87f21256273490fdc55c781ccf33b1bc120529df4dd4e6715b15c in registry2001's nginx access log, I see:

  • two entries for the supernode2001 host
  • 30 for ml-staging2001
  • 35 for ml-staging2002

Afaics from the logs the clients were fetching chunks of data from the registry every time (not the entire content at once), but I am wondering whether the whole Docker image on ml-staging2002 should have been pulled entirely from ml-staging2001. Is there anything I am missing @JMeybohm ?
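A sketch of how such a per-client count can be pulled out of the registry's access log; the sample lines below use a made-up, simplified log format and a truncated digest purely for illustration, on the real host you would feed nginx's access.log through the same pipeline after grepping for the digest:

```shell
# Count how many access-log entries each client produced for one blob digest.
# Field 1 is assumed to be the client host in this simplified sample format.
awk '{count[$1]++} END {for (h in count) print count[h], h}' <<'EOF' | sort -rn
ml-staging2002 GET /v2/wikimedia/some-image/blobs/sha256:74b022bb 206
ml-staging2001 GET /v2/wikimedia/some-image/blobs/sha256:74b022bb 206
ml-staging2001 GET /v2/wikimedia/some-image/blobs/sha256:74b022bb 206
supernode2001 GET /v2/wikimedia/some-image/blobs/sha256:74b022bb 200
EOF
```

The 206 statuses in the sample reflect that the peers fetch byte ranges (chunks) rather than whole blobs, which matches what the real log showed.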

Change 1010535 merged by Elukey:

[operations/puppet@production] Add Dragonfly p2p cache to ml-serve k8s

https://gerrit.wikimedia.org/r/1010535

Dragonfly deployed to ml serve production clusters. The last step is to figure out if everything works as expected, see T359416#9624256

Afaics from the logs the clients were fetching chunks of data from the registry every time (not the entire content at once), but I am wondering whether the whole Docker image on ml-staging2002 should have been pulled entirely from ml-staging2001. Is there anything I am missing @JMeybohm ?

I'm not sure that's what is supposed to happen. I don't recall exactly off the top of my head, but IIRC Dragonfly tries to be a bit clever and will still use the seed (the docker registry) as one of the sources (especially if there is only one source within the p2p network itself).

Ack, I can try to check a broader use case in production; I was mainly wondering whether my understanding was right. I naively thought that once a k8s worker got a new image from the registry, the other nodes would fetch it only from that worker.

I also have another question: you mentioned that the docker registry is the seed, but I am wondering what the supernode does. In the Dragonfly YouTube video linked on Wikitech, the speaker seems to suggest that the supernode downloads the content from the source (IIUC the registry, in our case) and then acts as the seed. I am asking because the supernode VMs are tiny: if they have to download content from the registry and act as seeds, there may be network bandwidth issues when multiple new images are requested at once (it shouldn't be a big concern, but I'd like to clear up my doubts :D).

That's right. By default the supernode does act as a CDN in front of the docker-registry, but I intentionally disabled that behavior as there's no benefit to it in our infra. I would assume the supernodes still query the registry for some sanity-check data, but the seeding happens directly from the docker-registry instances.
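For reference, in Dragonfly 1.x the supernode's CDN behavior is controlled by its cdnPattern setting; a sketch of the relevant supernode config, with key names assumed from the upstream docs rather than taken from our actual puppet-managed file:

```yaml
# supernode config sketch (Dragonfly 1.x); key names assumed from upstream docs
base:
  # "local": the supernode fetches from the registry and seeds peers itself (CDN mode)
  # "source": peers fetch missing pieces straight from the registry; the
  #           supernode only coordinates/schedules, as described above
  cdnPattern: source
```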

Rolled out Dragonfly to all ml clusters!