Intro
One of the things that came to mind in T259817 is that, while we may use a variety of mechanisms to trim down our mediawiki docker images, it's quite possible they'll end up being rather large. We also can't rule out that, no matter how much we trim them, they will start growing in size again over time.
What we would like to know is which bottlenecks in our infrastructure could end up causing issues during a deployment of mediawiki. Arguably we are using scap almost daily and things don't break, but the container image approach is different enough to warrant this investigation.
Some things that might end up having issues:
- The docker registry. If too many servers reach out to it simultaneously to fetch the various image layers, we might saturate some resource (e.g. connections, network).
- Swift. It's the backing store for the registry, so we could just end up saturating swift.
- The datacenter network switches. Assuming the registry and swift don't saturate and are able to send out enough data, we could end up saturating some network uplink on some switch.
- Something else I am currently missing.
Of the above, I'd rank the probability of issues in this order: docker-registry, swift, network.
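To get a feel for the scale involved, here is a back-of-envelope sketch of the minimum time needed to serve every client a full image over one shared link. The client count (100) and link speed (10 Gbit/s) are illustrative assumptions, not measurements of our infrastructure, and `transfer_seconds` is a hypothetical helper. It ignores layer reuse, compression, and protocol overhead, so real numbers will differ.

```python
def transfer_seconds(image_gib: float, clients: int, link_gbit: float) -> float:
    """Lower bound (seconds) to push `clients` full copies of an image
    through a single shared link of `link_gbit` Gbit/s."""
    total_bits = image_gib * clients * 2**30 * 8  # GiB -> bits, times clients
    return total_bits / (link_gbit * 1e9)         # Gbit/s -> bit/s


# Assumed values for illustration only: 100 clients, one 10 Gbit/s link.
for size_gib in (1, 10, 40):
    secs = transfer_seconds(size_gib, clients=100, link_gbit=10)
    print(f"{size_gib:>2} GiB x 100 clients over 10 Gbit/s: >= {secs / 60:.1f} min")
```

Whichever component sits behind the narrowest such link (registry host, swift frontend, or a switch uplink) sets the floor for how long a fleet-wide pull can take.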
Plan
One simple way of testing this is to manually create a number of docker images of various sizes (say 1G to 40G), push them to the registry, and then fetch them simultaneously from as many servers as possible in the backup DC during a scheduled maintenance window. That should keep the risk low and keep any consequences from being felt by end-users.
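A minimal sketch of how the test images could be produced and a pull timed is below. It is not the procedure actually used. The registry host `registry.example.org`, the `stresstest/blob` repository name, and all helper names are hypothetical. The payload is random data because layers are gzip-compressed when pushed, and a compressible payload would transfer far fewer bytes than its nominal size.

```python
import os
import subprocess
import time
from pathlib import Path

REGISTRY = "registry.example.org"  # hypothetical registry host


def write_payload(path: Path, size_bytes: int, chunk: int = 1 << 20) -> None:
    """Write `size_bytes` of random (incompressible) data to `path`."""
    with path.open("wb") as fh:
        remaining = size_bytes
        while remaining > 0:
            n = min(chunk, remaining)
            fh.write(os.urandom(n))
            remaining -= n


def dockerfile_for(payload_name: str) -> str:
    """Single-layer image that contains nothing but the payload file."""
    return f"FROM scratch\nCOPY {payload_name} /{payload_name}\n"


def image_ref(size_gib: int) -> str:
    """Image reference for a test image of the given nominal size."""
    return f"{REGISTRY}/stresstest/blob:{size_gib}g"


def build_and_push(context: Path, size_gib: int) -> str:
    """Create the payload and Dockerfile in `context`, then build and push."""
    payload = "payload.bin"
    write_payload(context / payload, size_gib * 2**30)
    (context / "Dockerfile").write_text(dockerfile_for(payload))
    ref = image_ref(size_gib)
    subprocess.run(["docker", "build", "-t", ref, str(context)], check=True)
    subprocess.run(["docker", "push", ref], check=True)
    return ref


def timed_pull(ref: str) -> float:
    """Pull `ref` on this host and return the wall-clock duration in seconds."""
    start = time.monotonic()
    subprocess.run(["docker", "pull", ref], check=True)
    return time.monotonic() - start
```

`build_and_push` would run once per image size on a build host, and `timed_pull` on each participating server at the same moment. How the pulls get triggered simultaneously across servers is left open here.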
Results
https://wikitech.wikimedia.org/wiki/User:JMeybohm/Docker-Registry-Stresstest
TO BE ADDED