
Run stress tests on docker images infrastructure
Open, High, Public

Description

Intro

One of the things that came to mind in T259817 is that while we may use a variety of mechanisms to trim down our mediawiki docker images, it's quite possible they'll end up being rather large. We also can't rule out that, no matter what trimming we do, they will eventually start growing in size again over time.

What we would like to know is which bottlenecks in our infrastructure could end up causing issues during a deployment of mediawiki. Arguably we use scap almost daily and things don't break, but the container image approach is different enough to warrant this investigation.

Some things that might end up having issues:

  • The docker registry. If too many servers end up reaching out to it simultaneously to fetch the various image layers, we might see saturation of some resource (e.g. connections, network)
  • Swift. It's the backing store for the registry, so we could just end up saturating swift.
  • The datacenter network switches. Assuming the registry and swift don't saturate and are able to send out enough data, we could end up saturating some network uplink on some switch.
  • Something else I am currently missing.

Of the above, I'd give this order in terms of probability of having issues: docker-registry, swift, network.

Plan

One simple way of testing this is to manually create a number of docker images of various sizes (say 1G to 40G), push them to the registry, and then fetch them simultaneously from as many servers as possible in the backup DC, during a scheduled maintenance window. That should keep the risk low and any consequences unfelt by end-users.

Results

TO BE ADDED

Event Timeline

akosiaris triaged this task as Medium priority.Wed, Sep 30, 3:59 PM
akosiaris created this task.
Joe raised the priority of this task from Medium to High.Tue, Oct 13, 8:06 AM

This needs to be done while we have one DC turned off for most traffic as we do right now, IMHO.

dancy added a subscriber: dancy.Tue, Oct 13, 4:21 PM
dancy added a comment.Tue, Oct 13, 4:34 PM

I'm looking forward to seeing the results of this stress test. Does the docker registry have any controls to limit the number of connections and/or total bandwidth?

Mentioned in SAL (#wikimedia-operations) [2020-10-14T14:01:52Z] <akosiaris> push a 6GB image, named docker-registry.discovery.wmnet/mwcachedir:0.0.1, containing the cache/ dir of a mediawiki installation to the registry. T264209

1st obstacle found already. The push failed with a '500 internal server error'. The logs indicate:

2020/10/14 14:07:47 [crit] 2912#2912: *2696510 pwrite() "/var/lib/nginx/body/0000000958" failed (28: No space left on device)

/var/lib/nginx is a tmpfs with a size of 1G on the registry hosts. It is used to buffer request bodies when client_body_buffer_size is not sufficient. The image is a single-layer 6GB one. The error happens at ~4.4GB of the push despite the 1GB size of /var/lib/nginx; my guess right now is that compression explains the difference. I'll bump the tmpfs size a bit and see whether it helps.
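If the compression guess is right, the bytes the client counts and the bytes nginx buffers diverge because the blob travels gzipped, and redundant data (like a mediawiki cache/ dir) compresses heavily. A synthetic sketch, not the actual layer data:

```python
import gzip

# Synthetic stand-in for a highly redundant layer (e.g. a cache/ dir).
raw = b"cache entry " * 1_000_000          # ~12 MB uncompressed
blob = gzip.compress(raw)                  # what actually crosses the wire
print(f"{len(raw)} raw -> {len(blob)} compressed "
      f"({len(raw) / len(blob):.0f}x)")
```

With that kind of ratio, several GB of uncompressed layer can pass through before 1GB of buffered compressed body fills the tmpfs.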

Mentioned in SAL (#wikimedia-operations) [2020-10-15T10:00:26Z] <akosiaris> T264209. Initiate a docker pull of docker-registry.discovery.wmnet/mwcachedir:0.0.1 from all kubernetes and kubernetes staging nodes.

The first pull test was successful. 34 hosts pulled from the registry simultaneously. The test lasted about 5 minutes.

Observations (from most innocuous to worst):

Innocuous to indifferent for now

  • The network did not notice much. Graphs (not public) showed an increase of 1Gbps (more on that later), but uplink-wise the switches are at most at 25%, so we are ok on that front even if we saw 10x as much traffic
  • Swift did not complain at all as far as I can tell. Graphs for network traffic [4] show the traffic going out, but the rest indicate nothing noticeable.

It's important to note that the tests happened while swift was rebalancing itself, so that's great.

Good news

There is compression at play here. nginx reported the following in the access logs; the error logs were empty.

10.192.16.138 - - [15/Oct/2020:10:04:40 +0000] "GET /v2/mwcachedir/blobs/sha256:fcf2546deda0a9f5f15e9e8c5671bd18022372a929edfdd1c056399a6e221d14 HTTP/1.1" 200 1378740282 "-" "docker/1.12.6 go/go1.6.4 git-commit/78d1802 kernel/4.9.0-12-amd64 os/linux arch/amd64 UpstreamClient(D

So a 5.99GB image resulted in transferring only 1378740282 bytes, or ~1.28GB. This isn't HTTP Content-Encoding compression; the blobs comprising the image are shipped as gzipped tarballs.

There is a downside to this: on the client side, the images need to be decompressed, which consumes some time. A relevant issue is https://github.com/moby/moby/issues/1266, where people ask to disable compression for speed reasons. As I point out below, though, the compression has for now been beneficial.
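For reference, the effective ratio from the access log entry above works out as follows (treating the reported GB as GiB):

```python
# Numbers from the nginx access log entry above; GB treated as GiB.
image_bytes = 5.99 * 2**30     # uncompressed image size as reported by docker
blob_bytes = 1378740282        # bytes served for the single blob GET
print(f"{blob_bytes / 2**30:.2f} GiB on the wire, "
      f"{image_bytes / blob_bytes:.1f}x compression")
# → 1.28 GiB on the wire, 4.7x compression
```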

Somewhat concerning

  • Registry-wise, the 95th percentile for fetching a blob exploded to 1m [3]

It's misleading however. You can tell immediately by the way it flatlines at 1m, when the test lasted 5 minutes and there was a single blob to be downloaded (just 35 times). It's also a prometheus histogram, and my guess is that the largest bucket is 1m.
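The flatlining can be reproduced with a toy version of histogram-based quantile estimation (hypothetical bucket boundaries and counts; the real Prometheus histogram_quantile is more elaborate, but clamps the same way):

```python
def bucket_quantile(q, buckets):
    """Toy quantile estimate from cumulative histogram buckets
    [(upper_bound, cumulative_count), ...]: linear interpolation within
    the target bucket, clamped to the largest finite bound for +Inf."""
    rank = q * buckets[-1][1]
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf"):
                return prev_bound   # flatline at the largest finite bucket
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count

# Hypothetical buckets (seconds): all 35 fetches took longer than 60s,
# so the 95th percentile is clamped to 60 no matter how long they really took.
print(bucket_quantile(0.95, [(0.5, 0), (5, 0), (60, 0), (float("inf"), 35)]))  # → 60
```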

  • The registry's 95th percentile for getting data from Swift also doubled [5]

I'd say that's expected, given the size of the blob.

I'd expect those to increase even more with more hosts pulling the data, which would make all deployments relying on this infrastructure slower. On the bright side, however, neither nginx nor the docker-registry software hit any limits.

Bad news

We are currently capped at fetching at most 2Gbps due to the number of docker-registry servers we have (2). See

  • registry2001 [1]

  • registry2002 [2]

Of course the saturation led to packet loss and TCP retransmits [1] [2].
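Some back-of-the-envelope numbers on what the 2Gbps cap means for a pull storm (idealized model; ignores protocol overhead, retransmits and decompression):

```python
def pull_time_s(hosts, blob_gib, cap_gbps):
    """Idealized wall-clock seconds for `hosts` simultaneous pulls of one
    compressed blob through a shared bandwidth cap (no overhead, no loss)."""
    total_bits = hosts * blob_gib * 2**30 * 8
    return total_bits / (cap_gbps * 1e9)

# 34 hosts pulling the ~1.28 GiB compressed blob through the 2 Gbps cap
print(round(pull_time_s(34, 1.28, 2)))   # → 187, i.e. ~3 minutes at line rate
```

That lower bound is consistent with the observed ~5 minute test; decompression and overhead would account for the rest.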

The hosts also did not have a good time CPU-wise. Interestingly, 20% of this was user CPU, with the other 38% and 12% being system and softirq respectively.

  • registry2001 [1]

  • registry2002 [2]

Preliminary conclusions

  • We are probably ok on the network infrastructure side, and should remain so until deployments start moving around 40Gbps or more in total bandwidth
  • Swift is also probably ok for now. Tests should be re-run, however, after the registry infrastructure has been upgraded
  • We are limited on the docker-registry infrastructure side. For now this leads to degraded performance when a massive deployment takes place, exacerbated by the packet loss, and hence to slower deployments due to the unavoidable network throttling. Given the retries kubelet will do, even fetch failures might stay hidden for quite some time, but eventually we might experience failed deployments. While that's improbable for now, it's good to keep in mind.
  • We need to permanently bump the size of the /var/lib/nginx tmpfs if we want to be able to consistently push images with blobs that are larger than 1GB compressed. This is going to be interesting and difficult to communicate, as:
    • docker tools only list the uncompressed size
    • The limit is per layer, not per image. So even a 5GB image might be fine if all of its individual layers are smaller than 1GB compressed
  • We might eventually want (if it is possible at all) to disable docker compression to speed up deployments. However, to do that we'll first need to address the registry capacity issues mentioned above.
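The per-layer rather than per-image nature of the limit can be sketched as follows (hypothetical layer sizes; the 1GB figure is the current compressed-blob buffer limit):

```python
GIB = 2**30
LAYER_LIMIT = 1 * GIB   # compressed bytes nginx must buffer per pushed layer

def push_would_fail(compressed_layer_sizes):
    """True if any single layer blob exceeds the buffer limit;
    the total image size does not matter."""
    return any(size > LAYER_LIMIT for size in compressed_layer_sizes)

# a >5GB image split into seven 800 MiB layers pushes fine...
print(push_would_fail([800 * 2**20] * 7))          # → False
# ...while a smaller image with one 1.5 GiB layer does not
print(push_would_fail([1.5 * GIB, 100 * 2**20]))   # → True
```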

[1] https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=registry2001&var-datasource=thanos&var-cluster=misc&from=1602755631799&to=1602756642823
[2] https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=registry2002&var-datasource=thanos&var-cluster=misc&from=1602755631799&to=1602756642823
[3] https://grafana.wikimedia.org/d/StcefURWz/docker-registry?viewPanel=16&orgId=1&from=1602755822548&to=1602756680459&var-datasource=codfw%20prometheus%2Fops
[4] https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?viewPanel=17&orgId=1&from=1602755933148&to=1602756460049&var-DC=codfw&var-prometheus=codfw%20prometheus%2Fops
[5] https://grafana.wikimedia.org/d/StcefURWz/docker-registry?viewPanel=22&orgId=1&from=1602755822548&to=1602756680459&var-datasource=codfw%20prometheus%2Fops

> We need to permanently bump the tmpfs /var/lib/nginx size if we want to be able to consistently push images with blobs that are larger than 1 GB compressed

Couldn't we get around this by using a (bigger) non-tmpfs filesystem as client_body_temp_path?
Not sure how much the upload performance would suffer in this case, but we could test that...

dancy added a comment.Thu, Oct 15, 3:47 PM

>> We need to permanently bump the tmpfs /var/lib/nginx size if we want to be able to consistently push images with blobs that are larger than 1 GB compressed
>
> Couldn't we get around this by using a (bigger) non-tmpfs filesystem as client_body_temp_path?
> Not sure how much the upload performance would suffer in this case, but we could test that...

+1 on this suggestion. For small requests, there will be minimal writing to a real filesystem for files that exist briefly. These writes would be background I/O in most cases.


The main issue will be that large pushes to the registry will become slower, and hence CI will take longer overall. That being said, we should probably try to optimize for a combination of client_body_buffer_size and a larger-but-slower filesystem that addresses the most common patterns in our CI.

That being said, with compression on the client before the push also taking significant time (per people's reports in https://github.com/moby/moby/issues/1266), the delay from lower IOPS might not be the main contributing factor here (and there doesn't seem to be anything we can do about the compression time).

I'll try and devise a couple of tests to run to get numbers on this.

As for We are limited on the docker-registry infrastructure side: the sanest way out of this (until we hit the next bottleneck) is to scale out, i.e. just add more docker registry VMs. That should be easily doable; we have the capacity. The VMs should be split across the rack rows for higher availability.

The interesting question is: how many? I am going to add the constraint that the number should be a multiple of 4 (per DC), to match our failure domains (aka availability zones).

We could do 4, which would double the current number and could possibly halve transfer times. Or we could do 8 or 12 and reap even more benefits (and be more future-proof for when we have 4 or 5 times the current number of fetching nodes in production).
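Assuming ~1Gbps of usable bandwidth per registry VM (an assumption, matching the observed 2Gbps cap across the current 2 hosts) and perfect load spreading, the idealized transfer time scales inversely with the VM count:

```python
def scaled_pull_time_s(hosts, blob_gib, per_vm_gbps, vms):
    """Idealized seconds to pull one compressed blob to `hosts` nodes,
    assuming registry bandwidth adds up linearly across `vms` VMs."""
    return hosts * blob_gib * 2**30 * 8 / (per_vm_gbps * vms * 1e9)

# 34 hosts, the ~1.28 GiB blob from the first test, 1 Gbps per VM
for vms in (2, 4, 8, 12):
    print(vms, round(scaled_pull_time_s(34, 1.28, 1, vms)))
# → 2 187 / 4 93 / 8 47 / 12 31
```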

However, there is another factor to consider here: the decompression after the blobs are fetched. We also need numbers on how much time decompressing the layers takes. Putting this requirement in for the next test as well.
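A starting point for such numbers could be as simple as timing gzip on synthetic data (illustrative only; the real layers will compress and decompress differently):

```python
import gzip
import time

# ~51 MB of mildly repetitive synthetic data standing in for a layer
raw = bytes(range(256)) * 200_000
blob = gzip.compress(raw, compresslevel=6)

start = time.perf_counter()
restored = gzip.decompress(blob)
elapsed = time.perf_counter() - start
print(f"decompressed {len(restored) / 2**20:.0f} MiB in {elapsed:.3f}s")
```

Running the same measurement on the actual image layers on a kubernetes node would give the per-deployment decompression cost.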

Joe moved this task from Backlog to In Progress on the MW-on-K8s board.Mon, Oct 19, 8:52 AM