
Evaluate Dragonfly for distribution of docker images
Closed, Resolved, Public

Description

T264209 has shown that we'll need a different way of distributing docker images to kubernetes nodes.

I did an evaluation of Kraken and Dragonfly as docker-registry P2P layers (https://wikitech.wikimedia.org/wiki/User:JMeybohm/Docker-Registry-P2P) and settled on Dragonfly for a first test.

Notes I took during initial packaging, testing and writing first puppet code:

Details

Project                    Branch      Lines +/-
operations/puppet          production  +2 -0
operations/puppet          production  +1 -0
operations/puppet          production  +13 -97
operations/puppet          production  +4 -8
operations/puppet          production  +5 -4
operations/puppet          production  +1 -1
operations/puppet          production  +15 -4
operations/puppet          production  +1 -1
operations/puppet          production  +6 -1
operations/puppet          production  +16 -8
operations/puppet          production  +11 -11
operations/debs/dragonfly  master      +18 -18
operations/puppet          production  +1 -1
operations/puppet          production  +90 -2
operations/debs/dragonfly  master      +291 -0
integration/config         master      +5 -0
operations/puppet          production  +11 -1
operations/puppet          production  +2 -0
operations/puppet          production  +18 -1
operations/puppet          production  +8 -3
operations/puppet          production  +1 -1
operations/puppet          production  +15 -6
operations/puppet          production  +1 -1
operations/puppet          production  +1 -0
operations/puppet          production  +4 -4
operations/puppet          production  +28 -16
operations/puppet          production  +13 -6
operations/puppet          production  +11 -0
operations/puppet          production  +0 -7
operations/puppet          production  +203 -0

Event Timeline

JMeybohm triaged this task as High priority. Jul 2 2021, 2:22 PM

Change 701530 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] dragonfly: Add dragonfly supernode and client (dfdaemon) modules

https://gerrit.wikimedia.org/r/701530

Change 701530 merged by JMeybohm:

[operations/puppet@production] dragonfly: Add dragonfly supernode and client (dfdaemon) modules

https://gerrit.wikimedia.org/r/701530

Change 702979 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] dragonfly: Remove fetching of $docker_registry_fqdn cert

https://gerrit.wikimedia.org/r/702979

Change 702982 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] site/install_server: Add dragonfly-supernode1001 to DHCP and site.pp

https://gerrit.wikimedia.org/r/702982

Change 702979 merged by JMeybohm:

[operations/puppet@production] dragonfly: Remove fetching of $docker_registry_fqdn cert

https://gerrit.wikimedia.org/r/702979

Change 702982 merged by JMeybohm:

[operations/puppet@production] site/install_server: Add dragonfly-supernode1001 to DHCP and site.pp

https://gerrit.wikimedia.org/r/702982

Change 704318 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] dragonfly::dfdaemon: Make profile and module ensureable

https://gerrit.wikimedia.org/r/704318

Change 704322 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] kubernetes::*::worker: include dragonfly dfdaemon

https://gerrit.wikimedia.org/r/704322

Change 704360 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] dragonfly: Trim newlines in config files

https://gerrit.wikimedia.org/r/704360

Change 704318 merged by JMeybohm:

[operations/puppet@production] dragonfly::dfdaemon: Make profile and module ensureable

https://gerrit.wikimedia.org/r/704318

Change 704360 merged by JMeybohm:

[operations/puppet@production] dragonfly: Trim newlines in config files

https://gerrit.wikimedia.org/r/704360

Change 704322 merged by JMeybohm:

[operations/puppet@production] kubernetes::*::worker: include dragonfly dfdaemon

https://gerrit.wikimedia.org/r/704322

Change 704755 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] dragonfly::dfdaemon: Write certificate to /etc/dragonfly

https://gerrit.wikimedia.org/r/704755

Change 704755 merged by JMeybohm:

[operations/puppet@production] dragonfly::dfdaemon: Write certificate to /etc/dragonfly

https://gerrit.wikimedia.org/r/704755

Change 704758 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] dragonfly::dfdaemon: Fix HTTPS_PROXY URI to actually use HTTPS

https://gerrit.wikimedia.org/r/704758

Change 704758 merged by JMeybohm:

[operations/puppet@production] dragonfly::dfdaemon: Fix HTTPS_PROXY URI to actually use HTTPS

https://gerrit.wikimedia.org/r/704758

Change 704921 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] dragonfly: Don't run pki::get_cert in ensure=absent case

https://gerrit.wikimedia.org/r/704921

Change 704921 merged by JMeybohm:

[operations/puppet@production] dragonfly: Don't run pki::get_cert in ensure=absent case

https://gerrit.wikimedia.org/r/704921

Change 705382 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] dragonfly::dfdaemon: Fix typo in dfdaemon config template

https://gerrit.wikimedia.org/r/705382

Change 705382 merged by JMeybohm:

[operations/puppet@production] dragonfly::dfdaemon: Fix typo in dfdaemon config template

https://gerrit.wikimedia.org/r/705382

Change 705627 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] dragonfly::dfdaemon

https://gerrit.wikimedia.org/r/705627

Change 705627 merged by JMeybohm:

[operations/puppet@production] dragonfly::dfdaemon

https://gerrit.wikimedia.org/r/705627

Change 705639 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] dragonfly: Enable dfdaemon on eqiad kubernetes nodes

https://gerrit.wikimedia.org/r/705639

Change 705646 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] prometheus::ops: Add jobs to scrape dragonfly supernodes

https://gerrit.wikimedia.org/r/705646

Change 705646 merged by JMeybohm:

[operations/puppet@production] prometheus::ops: Add scraping config for dragonfly supernodes

https://gerrit.wikimedia.org/r/705646

Change 705639 merged by JMeybohm:

[operations/puppet@production] dragonfly: Enable dfdaemon on eqiad kubernetes nodes

https://gerrit.wikimedia.org/r/705639

Yesterday I ran the same ramp-up test that I did for plain registry pulls, this time with dragonfly on (more or less) default settings (see Wikitech for details on the process).
The tests were run in eqiad this time, and I used the eqiad docker registry nodes as well (to keep traffic DC-local, apart from the initial nginx cache warmup).

With the current settings, the average download time for the images has been pretty much constant, regardless of the number of nodes pulling in parallel (which is great). The network load on the docker-registries also dropped overall, to a very acceptable level [1]. Unfortunately, the average download time for the image (as well as the standard deviation) increased significantly [2], to around 5 minutes.

I think there is quite some room for improvement by tuning dragonfly config. I'll be looking into that.

Interactive dashboards:

Change 708068 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] dragonfly::dfdaemon: Allow to specify ratelimit for dfdaemon

https://gerrit.wikimedia.org/r/708068

Change 708068 merged by JMeybohm:

[operations/puppet@production] dragonfly::dfdaemon: Allow to specify ratelimit for dfdaemon

https://gerrit.wikimedia.org/r/708068

I went through various config options for the different components involved, but the only one with a significant impact is rate limiting (surprise), which is set to 20M by default when dfget is started via dfdaemon (the dfget default is unlimited).

With a rate limit of 100M, the average download time (15 parallel nodes) is around 00:01:35 with a standard deviation of 00:00:50. This is much closer to what we saw when running plain docker pull against 6 registry nodes (average download time with 15 nodes: 00:01:18, std dev: 00:00:24), but obviously without the massive network usage on the registry nodes.
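
To put the two 15-node runs side by side, the quoted durations can be converted to seconds and compared. This is only a minimal sketch: the helper names are made up for illustration, and the only inputs are the four figures quoted above.

```python
def to_seconds(hms: str) -> int:
    """Convert an HH:MM:SS string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def relative_change(new: str, baseline: str) -> float:
    """Relative change of `new` compared to `baseline`, as a fraction."""
    return to_seconds(new) / to_seconds(baseline) - 1.0


# Figures quoted above for 15 nodes pulling in parallel.
dragonfly_mean, dragonfly_std = "00:01:35", "00:00:50"  # dragonfly, 100M rate limit
plain_mean, plain_std = "00:01:18", "00:00:24"          # plain docker pull, 6 registries

mean_overhead = relative_change(dragonfly_mean, plain_mean)
std_ratio = to_seconds(dragonfly_std) / to_seconds(plain_std)

print(f"mean pull time with dragonfly: +{mean_overhead:.0%}")
print(f"std dev ratio (dragonfly / plain): {std_ratio:.2f}x")
```

With these inputs, the mean pull time through dragonfly is roughly a fifth longer than the plain pull, and the spread is about twice as wide.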

Change 708483 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/debs/dragonfly@master] Add debian directory

https://gerrit.wikimedia.org/r/708483

Change 708534 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/debs/dragonfly@master] Create dragonfly user via systemd-sysusers

https://gerrit.wikimedia.org/r/708534

Change 708536 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[integration/config@master] Debian glue for operations/debs/dragonfly

https://gerrit.wikimedia.org/r/708536

Change 708536 merged by jenkins-bot:

[integration/config@master] Debian glue for operations/debs/dragonfly

https://gerrit.wikimedia.org/r/708536

Change 708483 merged by jenkins-bot:

[operations/debs/dragonfly@master] Add debian directory

https://gerrit.wikimedia.org/r/708483

Change 708534 merged by jenkins-bot:

[operations/debs/dragonfly@master] Create dragonfly user via systemd-sysusers

https://gerrit.wikimedia.org/r/708534

Mentioned in SAL (#wikimedia-operations) [2021-08-03T09:11:50Z] <jayme> importing dragonfly 1.0.6-2 to buster-wikimedia and stretch-wikimedia - T286054

Change 709703 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] dragonfly: Enable metric scraping for dfdaemon

https://gerrit.wikimedia.org/r/709703

Change 709704 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] prometheus::ops: Scrape metrics from dfdaemon

https://gerrit.wikimedia.org/r/709704

Change 709719 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] Add a temporary role for appservers plus docker and dragonfly

https://gerrit.wikimedia.org/r/709719

Change 709740 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] site: Switch a bunch of eqiad appservers to appserver_dragonfly role

https://gerrit.wikimedia.org/r/709740

Change 709719 merged by JMeybohm:

[operations/puppet@production] Add a temporary role for appservers plus docker and dragonfly

https://gerrit.wikimedia.org/r/709719

Change 709952 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] mediawiki::appserver_dragonfly: Fix docker package name

https://gerrit.wikimedia.org/r/709952

Change 709952 merged by JMeybohm:

[operations/puppet@production] mediawiki::appserver_dragonfly: Fix docker package name

https://gerrit.wikimedia.org/r/709952

I've just reverted the CR adding systemd-sysusers to the dragonfly packages (https://gerrit.wikimedia.org/r/c/operations/debs/dragonfly/+/708534) as it segfaults on mw1384. The patch should be added back when T256098 is fixed.

Mentioned in SAL (#wikimedia-operations) [2021-08-04T10:29:40Z] <jayme> importing dragonfly 1.0.6-1 (downgrade from 1.0.6-2) to buster-wikimedia and stretch-wikimedia - T286054

Change 709740 merged by JMeybohm:

[operations/puppet@production] site: Switch a bunch of eqiad appservers to appserver_dragonfly role

https://gerrit.wikimedia.org/r/709740

Mentioned in SAL (#wikimedia-operations) [2021-08-04T10:48:44Z] <jayme> switch most eqiad appservers to appserver_dragonly role for testing - T286054

The test with a maximum of 73 nodes shows a pull time of 00:01:46 with a standard deviation of 00:00:33 [1], which is pretty close to the numbers from the test with 15 nodes. When looking at those numbers, keep in mind that the appserver nodes have to pull every layer of the image, while the kubernetes nodes may skip some: those layers are shared with other images (like the debian base one) and therefore cannot be removed prior to testing.
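
As a rough illustration of "pretty close", the growth in node count can be set against the growth in mean pull time. This sketch assumes the 15-node baseline is the 00:01:35 run with the 100M rate limit reported earlier in this task; the variable names are illustrative only.

```python
from datetime import timedelta


def parse_duration(hms: str) -> timedelta:
    """Parse an HH:MM:SS string into a timedelta."""
    hours, minutes, seconds = map(int, hms.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


# Mean pull time keyed by number of nodes pulling in parallel.
# The 15-node value is assumed to be the earlier 100M rate limit run.
mean_pull_time = {
    15: parse_duration("00:01:35"),
    73: parse_duration("00:01:46"),
}

node_factor = 73 / 15
# Dividing two timedeltas yields a plain float ratio.
time_factor = mean_pull_time[73] / mean_pull_time[15]

print(f"nodes: x{node_factor:.2f}, mean pull time: x{time_factor:.2f}")
```

The two runs used different node types (appservers versus kubernetes nodes), so the ratio is indicative at best.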

The network load on the registries in eqiad peaked at around 10 MB/s during the whole test run, and the system load of the supernode is nowhere near problematic, peaking somewhere around 25% CPU usage / load5 of 0.34 and less than 500 MiB of used memory. So I suppose we can even lower the specs (at least memory-wise) for the supernode in production.

Dynamic dashboards for the test time frame:

Change 709703 merged by JMeybohm:

[operations/puppet@production] dragonfly: Enable metric scraping for dfdaemon

https://gerrit.wikimedia.org/r/709703

Change 710247 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] site/install_server: Add dragonfly-supernode2001 to DHCP and site.pp

https://gerrit.wikimedia.org/r/710247

Change 710247 merged by JMeybohm:

[operations/puppet@production] site/install_server: Add dragonfly-supernode2001 to DHCP and site.pp

https://gerrit.wikimedia.org/r/710247

Change 710261 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] dragonfly: Switch codfw peers to codfw supernode

https://gerrit.wikimedia.org/r/710261

Change 710261 merged by JMeybohm:

[operations/puppet@production] dragonfly: Switch codfw peers to codfw supernode

https://gerrit.wikimedia.org/r/710261

FTR: I also did some tests pulling the images cross-DC (which we usually do because of the current active/passive nature of docker-registry), and that did not yield significantly different results. Pulling from just one registry (which I did by accident) also went totally fine.

I also checked failure scenarios such as the supernode being unreachable when a pull starts and becoming unreachable during a pull. Both situations are handled transparently by dfdaemon, which then falls back to pulling the chunks from docker-registry directly.
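
The fallback behaviour observed in these tests can be pictured with a small sketch. This is not dfdaemon's actual code: the exception class, the function names and the "stay on the registry once P2P has failed" rule are all assumptions made for illustration.

```python
from typing import Callable


class SupernodeUnreachable(Exception):
    """Hypothetical error raised when the supernode cannot be reached."""


def fetch_with_fallback(
    chunk_ids: list[str],
    fetch_p2p: Callable[[str], bytes],
    fetch_registry: Callable[[str], bytes],
) -> dict[str, bytes]:
    """Fetch every chunk, preferring P2P and falling back to the registry."""
    chunks: dict[str, bytes] = {}
    p2p_available = True
    for chunk_id in chunk_ids:
        if p2p_available:
            try:
                chunks[chunk_id] = fetch_p2p(chunk_id)
                continue
            except SupernodeUnreachable:
                # Supernode is gone (at start or mid-pull): stop using P2P
                # and fetch this and all remaining chunks from the source.
                p2p_available = False
        chunks[chunk_id] = fetch_registry(chunk_id)
    return chunks


def flaky_p2p(chunk_id: str) -> bytes:
    # Simulate the supernode disappearing once chunk "c" is requested.
    if chunk_id == "c":
        raise SupernodeUnreachable("supernode timed out")
    return f"p2p:{chunk_id}".encode()


def registry(chunk_id: str) -> bytes:
    return f"registry:{chunk_id}".encode()


result = fetch_with_fallback(["a", "b", "c", "d"], flaky_p2p, registry)
print(result)
```

In this simulation chunks "a" and "b" arrive via P2P and "c" and "d" come from the registry, so the caller still ends up with the complete set, which is the "transparent" part.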

As all the (unencrypted) P2P traffic will stay DC-local and only root users have access to kubernetes nodes, we decided to accept the risk of the no-TLS and credentials-leak issues from the task summary for now and continue with rolling out dragonfly to codfw.

Change 709704 merged by JMeybohm:

[operations/puppet@production] prometheus::ops: Scrape metrics from dfdaemon

https://gerrit.wikimedia.org/r/709704

Change 710295 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] dragonfly::dfdaemon: Add local hostname to dfdaemon certificate

https://gerrit.wikimedia.org/r/710295

Change 710295 merged by JMeybohm:

[operations/puppet@production] dragonfly::dfdaemon: Add local hostname to dfdaemon certificate

https://gerrit.wikimedia.org/r/710295

Change 710484 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] dragonfly::dfdaemon: Notify dfdaemon on certificate change

https://gerrit.wikimedia.org/r/710484

Change 710485 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] appserver_dragonfly: Remove experimental stuff from appservers

https://gerrit.wikimedia.org/r/710485

Change 710487 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] Clean up appserver_dragonfly test role

https://gerrit.wikimedia.org/r/710487

Change 710484 merged by JMeybohm:

[operations/puppet@production] dragonfly::dfdaemon: Notify dfdaemon on certificate change

https://gerrit.wikimedia.org/r/710484

Change 710485 merged by JMeybohm:

[operations/puppet@production] appserver_dragonfly: Remove experimental stuff from appservers

https://gerrit.wikimedia.org/r/710485

Change 710487 merged by JMeybohm:

[operations/puppet@production] Clean up appserver_dragonfly test role

https://gerrit.wikimedia.org/r/710487

Change 710522 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] dragonfly::dfdaemon: Ensure on codfw kubernetes nodes

https://gerrit.wikimedia.org/r/710522

Change 710522 merged by JMeybohm:

[operations/puppet@production] dragonfly::dfdaemon: Ensure on codfw kubernetes nodes

https://gerrit.wikimedia.org/r/710522

Change 710528 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] Add dragonfly-peer and supernode cumin aliases

https://gerrit.wikimedia.org/r/710528

Dragonfly rolled out and active on main and staging clusters in eqiad and codfw.

Change 710528 merged by JMeybohm:

[operations/puppet@production] Add dragonfly-peer and supernode cumin aliases

https://gerrit.wikimedia.org/r/710528

Mentioned in SAL (#wikimedia-operations) [2021-08-27T10:56:42Z] <akosiaris> sudo cumin 'mw*' 'ip ro ls dev docker0 && sysctl net.ipv4.ip_forward=0' to clear up the docker remnants of the dragonfly evaluation. T286054