Currently, if a disaster hits the Toolforge docker registry (VM shutdown, data corruption, etc), we have to rebuild and push all docker images to the registry again, which may take a long time and means downtime until it is done.
The proposed new HA method for the Toolforge docker registry is simple, cold-standby:
- docker images are built and they are pushed to the active registry node (using the DNS docker-registry.tools.wmflabs.org)
- the active registry node stores the image locally (usually /srv/registry)
- a daily cron job runs on the standby node to rsync the registry data from the active node
- if the active registry node suffers a disaster (VM shutdown, corruption, etc), we can switch the main DNS to the standby node and start the docker registry daemon there
- we may lose whatever was pushed to the registry since the last sync. That is easily solved by pushing those docker images again, which means only a few images instead of all of them.
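
As a rough illustration of the sync step and of working out what to push again after a failover, here is a minimal Python sketch. It is not the planned puppet code. The hostname, the rsync flags, running rsync over SSH, and the idea of having a record of push times are all assumptions made for the example; only the /srv/registry path comes from the description above.

```python
import subprocess
from datetime import datetime, timezone
from typing import Dict, List

# Assumptions for illustration only: the hostname is hypothetical.
ACTIVE_HOST = "registry-active.example"
# Path mentioned in the proposal as the usual storage location.
REGISTRY_DIR = "/srv/registry"


def build_rsync_command(active_host: str, registry_dir: str) -> List[str]:
    """Return the rsync argv that mirrors the active node's data locally."""
    src = f"{active_host}:{registry_dir}/"
    dst = f"{registry_dir}/"
    # --archive preserves permissions and timestamps.
    # --delete removes local files that no longer exist on the active node.
    return ["rsync", "--archive", "--delete", src, dst]


def images_to_repush(push_times: Dict[str, datetime],
                     last_sync: datetime) -> List[str]:
    """Return the names of images pushed after the last successful sync."""
    return sorted(name for name, pushed in push_times.items()
                  if pushed > last_sync)


def sync_from_active() -> None:
    """Run the sync on the standby node; raises if rsync fails."""
    subprocess.run(build_rsync_command(ACTIVE_HOST, REGISTRY_DIR), check=True)
```

A daily cron entry on the standby node could call `sync_from_active()`. One trade-off to keep in mind: with `--delete`, files removed or damaged on the active node are mirrored on the next run, so whether to use that flag is a judgment call for the real implementation.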
This cold-standby mechanism, even if not perfect from the automation point of view, is a robust improvement over the current situation.
It is also really simple to implement. The missing bits are:
- rsync puppet code
- admin docs generation in wikitech https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Docker-registry