
Run stress tests on docker images infrastructure
Closed, Resolved (Public)

Authored By: akosiaris, Sep 30 2020, 3:59 PM

Description

Intro

One of the things that came to mind in T259817 is the fact that while we may use a variety of mechanisms to trim down our mediawiki docker images, it's quite possible they'll end up being rather large. We also can't rule out that, no matter what trimming we do, they will eventually start growing in size again over time.

What we would like to know is which bottlenecks in our infrastructure could or would end up causing issues during a deployment of mediawiki. Arguably we use scap almost daily and things don't break, but the container image approach is different enough to warrant this investigation.

Some things that might end up having issues:

  • The docker registry. If too many servers end up reaching out to it simultaneously to fetch the various image layers, we might see saturation of some resource (e.g. connections, network).
  • Swift. It's the backing store for the registry, so we could just end up saturating swift.
  • The datacenter network switches. Assuming the registry and swift don't saturate and are able to send out enough data, we could end up saturating some network uplink on some switch.
  • Something else I am currently missing.

Of the above, I'd rank them in this order in terms of probability of having issues: docker-registry, swift, network.

Plan

One simple way of testing this is to manually create a number of docker images of various sizes (say 1G to 40G), push them to the registry, and then fetch them simultaneously from as many servers as possible in the backup DC during a scheduled maintenance window. That should keep the risk low and ensure any consequences are not felt by end-users.
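
For reference, a minimal sketch of how such test images could be created, assuming shell access to a build host with docker and push rights to the registry; the stress-test repository name, tag, and the 1G size below are made up for illustration:

# Build a single-layer dummy image of roughly SIZE_GB gigabytes from random
# data (random data so the layer doesn't compress away and the on-the-wire
# size matches the nominal size) and push it to the registry.
SIZE_GB=1
dd if=/dev/urandom of=payload.bin bs=1M count=$((SIZE_GB * 1024))
cat > Dockerfile <<'EOF'
FROM scratch
COPY payload.bin /payload.bin
EOF
docker build -t "docker-registry.discovery.wmnet/stress-test:${SIZE_GB}g" .
docker push "docker-registry.discovery.wmnet/stress-test:${SIZE_GB}g"

Repeating this for a few sizes (1G, 5G, 10G, ...) would give a set of images to pull in parallel.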

Results

https://wikitech.wikimedia.org/wiki/User:JMeybohm/Docker-Registry-Stresstest

TO BE ADDED

Event Timeline

akosiaris triaged this task as Medium priority. Sep 30 2020, 3:59 PM
akosiaris created this task.
Joe raised the priority of this task from Medium to High. Oct 13 2020, 8:06 AM

This needs to be done while we have one DC turned off for most traffic as we do right now, IMHO.

I'm looking forward to seeing the results of this stress test. Does the docker registry have any controls to limit the number of connections and/or total bandwidth?

Mentioned in SAL (#wikimedia-operations) [2020-10-14T14:01:52Z] <akosiaris> push a 6GB image, named docker-registry.discovery.wmnet/mwcachedir:0.0.1, containing the cache/ dir of a mediawiki installation to the registry. T264209

1st obstacle found already. The push failed with a '500 internal server error'. Logs indicate:

2020/10/14 14:07:47 [crit] 2912#2912: *2696510 pwrite() "/var/lib/nginx/body/0000000958" failed (28: No space left on device)

/var/lib/nginx is a tmpfs with a size of 1G on the registry hosts. It is used for buffering request bodies when client_body_buffer_size is not sufficient. The image is a single-layer 6GB one. The error happens at ~4.4GB despite the 1GB size of /var/lib/nginx; my guess right now is that compression is involved. I'll bump the tmpfs size a bit and see how it helps.
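
For reference, a tmpfs can be grown in place with a remount, which is a quick way to test whether a larger buffer area is enough before making the change permanent in puppet (the 4G figure below is just an example, not the value actually chosen):

# Temporarily grow the tmpfs nginx uses for buffering large request bodies.
mount -o remount,size=4G /var/lib/nginx
# Verify the new size took effect.
df -h /var/lib/nginx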

Mentioned in SAL (#wikimedia-operations) [2020-10-15T10:00:26Z] <akosiaris> T264209. Initiate a docker pull of docker-registry.discovery.wmnet/mwcachedir:0.0.1 from all kubernetes and kubernetes staging nodes.
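
As an aside, a simultaneous pull like that can be kicked off fleet-wide with cumin; a rough sketch, run from a cluster management host, where the 'A:kubernetes-worker' host alias is a guess and may not match the actual alias name:

# Trigger the same docker pull on all matching hosts at (roughly) the same time.
sudo cumin 'A:kubernetes-worker' \
    'docker pull docker-registry.discovery.wmnet/mwcachedir:0.0.1'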

The first pull test was successful: 34 hosts pulled from the registry simultaneously. The test lasted about 5 minutes.

Observations (from most innocuous to worst):

Innocuous to indifferent for now

  • The network did not notice much. Graphs (not public) showed an increase of 1Gbps (more on that later), but uplink-wise the switches are at most at 25%, so we are ok on that front even if we saw 10 times as much traffic.
  • Swift did not complain at all as far as I can tell. Graphs for network traffic [4] show the traffic going out, but the rest indicate nothing noticeable.

image.png (976×1 px, 98 KB)

It's important to note that the tests happened while swift was rebalancing itself, so that's great.

Good news

There is compression at play here. nginx reported the following in its access logs; the error logs were empty:

10.192.16.138 - - [15/Oct/2020:10:04:40 +0000] "GET /v2/mwcachedir/blobs/sha256:fcf2546deda0a9f5f15e9e8c5671bd18022372a929edfdd1c056399a6e221d14 HTTP/1.1" 200 1378740282 "-" "docker/1.12.6 go/go1.6.4 git-commit/78d1802 kernel/4.9.0-12-amd64 os/linux arch/amd64 UpstreamClient(D

So a 5.99GB image resulted in transferring only 1378740282 bytes, i.e. ~1.28GB. This isn't HTTP Content-Encoding compression; the blobs comprising the images are shipped as gzipped tarballs.

There is a downside to this. On the client side, the images need to be decompressed, which consumes some time. A relevant issue is https://github.com/moby/moby/issues/1266, where people ask to disable that for speed reasons. As I point out below, the compression has, for now, been beneficial.
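
Since docker itself only reports uncompressed sizes, the compressed (on-the-wire) size of each layer can be read from the registry's v2 manifest; a sketch, assuming jq is installed and that the registry allows unauthenticated reads of this repository:

# Ask the registry for the schema2 manifest and print each layer's digest and
# compressed size in bytes (the "size" field is the size of the gzipped blob).
curl -s -H 'Accept: application/vnd.docker.distribution.manifest.v2+json' \
    https://docker-registry.discovery.wmnet/v2/mwcachedir/manifests/0.0.1 \
  | jq '.layers[] | {digest, compressed_bytes: .size}'

This is also a quick way to check whether any single layer exceeds a given compressed-size threshold.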

Somewhat concerning

  • Registry-wise, the 95th percentile for fetching a blob exploded to 1m. [3]

image.png (958×1 px, 76 KB)

That's misleading, however. You can tell immediately by the way it flatlines at 1m, when the test lasted 5 minutes and there was a single blob to download (just 35 times). It's also a prometheus histogram, and my guess is that the largest bucket is 1m.

  • The registry's 95th percentile for getting data from Swift also doubled [5].

image.png (972×1 px, 58 KB)

I'd say that's expected given the size of the blob.

I'd expect those to increase even more with more hosts pulling the data, which would make all deployments relying on this infrastructure slower. On the bright side, however, neither nginx nor the docker-registry software hit any limits.

Bad news

We are currently capped at fetching at most 2Gbps in total due to the number of docker-registry servers we have (2). See:

  • registry2001 [1]

image.png (266×583 px, 16 KB)

  • registry2002 [2]

image.png (262×592 px, 16 KB)

Of course the saturation led to packet loss and TCP retransmits [1] and [2]:

image.png (269×597 px, 17 KB)

The hosts also did not have a good time CPU-wise. Interestingly, about 20% of this was user CPU, with the other 38% and 12% being system and softirq respectively.

  • registry2001 [1]

image.png (263×596 px, 22 KB)

  • registry2002 [2]

image.png (268×591 px, 22 KB)

Preliminary conclusions

  • We are probably ok on the network infrastructure side, and we should be until deployments start moving around 40Gbps or more in total bandwidth.
  • Swift is also probably ok for now. Tests should be re-run, however, after the registry infrastructure has been upgraded.
  • We are limited on the docker-registry infrastructure side. For now, this leads to degraded performance when a massive deployment takes place, and the packet loss exacerbates it further; the unavoidable network throttling makes deployments slower. Given the retries kubelet will do, even fetch failures might be hidden for quite some time, but eventually we might experience failed deployments. While that's improbable for now, it's good to keep in mind.
  • We need to permanently bump the tmpfs /var/lib/nginx size if we want to be able to consistently push images with blobs that are larger than 1 GB compressed. This is going to be interesting and difficult to communicate, as:
    • docker tools only list the uncompressed size
    • The limit is per layer, not per image. So even a 5GB image might be fine if all of its individual layers are smaller than 1GB compressed.
  • We might eventually want to disable docker compression (if that's possible at all) to speed up deployments. However, to do that we'll first need to make sure we address the registry capacity issues mentioned above.

[1] https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=registry2001&var-datasource=thanos&var-cluster=misc&from=1602755631799&to=1602756642823
[2] https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=registry2002&var-datasource=thanos&var-cluster=misc&from=1602755631799&to=1602756642823
[3] https://grafana.wikimedia.org/d/StcefURWz/docker-registry?viewPanel=16&orgId=1&from=1602755822548&to=1602756680459&var-datasource=codfw%20prometheus%2Fops
[4] https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?viewPanel=17&orgId=1&from=1602755933148&to=1602756460049&var-DC=codfw&var-prometheus=codfw%20prometheus%2Fops.
[5] https://grafana.wikimedia.org/d/StcefURWz/docker-registry?viewPanel=22&orgId=1&from=1602755822548&to=1602756680459&var-datasource=codfw%20prometheus%2Fops.

We need to permanently bump the tmpfs /var/lib/nginx size if we want to be able to consistently push images with blobs that are larger than 1 GB compressed

Couldn't we get around this by using a (bigger) non-tmpfs filesystem as client_body_temp_path?
Not sure how much the upload performance would suffer in this case, but we could test that...

+1 on this suggestion. For small requests, there will be minimal writing to a real filesystem for files that exist briefly. These writes would be background I/O in most cases.

The main issue will be that large pushes to the registry will become slower, hence CI will take longer overall. That being said, we should probably try to optimize for a combination of client_body_buffer_size and a larger but slower filesystem that addresses the most common patterns in our CI.

That being said, with compression on the client before the push also taking a significant amount of time (per people's reports in https://github.com/moby/moby/issues/1266), the delay from lower IOPS might not be the biggest contributing factor here (and there doesn't seem to be anything we can do about the compression itself).

I'll try and devise a couple of tests to run to get numbers on this.
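
One possible shape for such a test is simply timing pushes of a large single-layer image under each nginx configuration (the current tmpfs vs. a disk-backed client_body_temp_path); a rough sketch, reusing the 6GB test image, with the caveat that the blob must not already exist in the registry or docker will skip the upload entirely:

# Time a large push under the current configuration, then reconfigure nginx
# (and clean up the test repository so the blob gets re-uploaded) and repeat.
time docker push docker-registry.discovery.wmnet/mwcachedir:0.0.1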

Regarding "We are limited on the docker-registry infrastructure side": the sanest way out of this (until we hit the next bottleneck) is to scale out, i.e. just add more docker-registry VMs. That should be easily doable, as we have the capacity. The VMs should be split across the rack rows for higher availability.

The interesting question here is: how many? I am going to impose a constraint that it should be a multiple of 4 (per DC) to match our failure domains (aka availability zones).

We could do 4, which would double the current number and could possibly halve transfer times. Or we could do 8 or 12 and reap even more benefits (and be more future-proof for when we have 4 or 5 times the current number of fetching nodes in production).

However, there is another factor to consider here: the decompression that happens after the blobs are fetched. We also need numbers on how much time decompressing the layers takes. I'm putting this requirement in as well for the next test.

@akosiaris did you run these tests? @JMeybohm and I were discussing this today and weren't sure if we're still bottlenecked on network or whether it'll be CPU for (de)compression.

No, I never managed to finish them. I proposed this as an OKR for this quarter but it was not chosen.

My gut tells me that we will be bottlenecked on both eventually.

The main idea I had was to first run ab -n X -c Y against docker-registry.discovery.wmnet/mwcachedir:0.0.1, varying X and Y over some low values (1-5; I expect the client node to become saturated quickly), graph request times and network throughput, then perform the same from range(1,50) nodes simultaneously and, assuming the behavior stabilizes somewhat after a point, do a linear regression to extrapolate up to ~300 (which is the number of nodes we expect to have).
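
The single-node part of that could look roughly like the sketch below, pointing ab at the blob endpoint directly (the digest is taken from the access log excerpt above; the request counts are arbitrary placeholders for X):

# Fetch the blob with increasing concurrency and record the timings ab reports.
DIGEST=sha256:fcf2546deda0a9f5f15e9e8c5671bd18022372a929edfdd1c056399a6e221d14
for c in 1 2 3 4 5; do
    ab -n $((c * 10)) -c "$c" \
        "https://docker-registry.discovery.wmnet/v2/mwcachedir/blobs/${DIGEST}"
done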

For the CPU part, the idea was roughly:

# Repeatedly remove and re-pull the image, timing each pull.
for i in {1..100}; do
    docker image rm docker-registry.discovery.wmnet/mwcachedir:0.0.1
    time docker image pull docker-registry.discovery.wmnet/mwcachedir:0.0.1
done

and then average the times across those runs and subtract the average network transfer times. That's because I expect image extraction times, once the layers have been pulled onto the host, to always be in the same ballpark.

That way we would have an estimate of the time required to fetch the images network-wise as a graph, plus a constant time for extracting the layers. After that we'd know where we are more bottlenecked, have an estimate of when it will become worse (node-wise), and could research mitigations/solutions.

Let's hope we can do that this Q.

For the record, we're now building the actual multiversion images of mediawiki; it would be interesting to do all the testing using those. In particular, it's interesting imho to work on the layering so that we reduce the number of layers we need to download for each release.

Change 693382 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] Add 4 new docker-regisry nodes in codfw

https://gerrit.wikimedia.org/r/693382

Change 693383 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] conftool: Add 4 new docker-registry hosts

https://gerrit.wikimedia.org/r/693383

Change 693382 merged by JMeybohm:

[operations/puppet@production] Add 4 new docker-regisry nodes in codfw

https://gerrit.wikimedia.org/r/693382

Change 693383 merged by JMeybohm:

[operations/puppet@production] conftool: Add 4 new docker-registry hosts

https://gerrit.wikimedia.org/r/693383

Change 694330 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] docker-registry: Add caching config for nginx

https://gerrit.wikimedia.org/r/694330

Change 694552 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] httpbb: Add tests for docker-registry

https://gerrit.wikimedia.org/r/694552

Change 694330 merged by JMeybohm:

[operations/puppet@production] docker-registry: Add caching config for nginx

https://gerrit.wikimedia.org/r/694330

Change 694552 merged by JMeybohm:

[operations/puppet@production] httpbb: Add tests for docker-registry

https://gerrit.wikimedia.org/r/694552

Change 695439 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] httpbb: Allow tests to be templates

https://gerrit.wikimedia.org/r/695439

Change 696035 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[labs/private@master] Add dummy secrets for httpbb tests

https://gerrit.wikimedia.org/r/696035

Change 696035 merged by JMeybohm:

[labs/private@master] Add dummy secrets for httpbb tests

https://gerrit.wikimedia.org/r/696035

Change 695439 merged by JMeybohm:

[operations/puppet@production] httpbb: Allow tests to be templates

https://gerrit.wikimedia.org/r/695439

I ran a couple of "ramp up" tests with two docker-registry nodes, 6 registry nodes, and 6 registry nodes with local nginx caches.
Method and results are written down at https://wikitech.wikimedia.org/wiki/User:JMeybohm/Docker-Registry-Stresstest. The TL;DR is: without further optimization (e.g. one registry per rack row and the k8s nodes pinned to them) we will easily hit limitations of the networking infrastructure. I had to stop the tests due to service degradation issues that are probably more related to the repetitive pulling of the mediawiki image on k8s nodes running in ganeti than to the networking itself. But the linear increase in average pull time, even with 6 registry nodes, suggests that we should look at alternatives like Dragonfly and Uber's Kraken.

I made a comparison between Dragonfly and Kraken, and while I do think Kraken is probably the better solution technically, its lack of documentation and the more straightforward integration of Dragonfly led to the decision to at least try Dragonfly first and see if it solves our problem. I'll (ab)use registry2008 as a Dragonfly supernode in codfw for testing.

Change 701530 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] dragonfly: Add dragonfly supernode and client (dfdaemon) modules

https://gerrit.wikimedia.org/r/701530

I'm boldly closing this.