We have been spending quite a lot of time investigating and finding workarounds for inconsistency issues caused by the Swift backend of Docker Distribution: T390251, T401533, T391935, T406392 (and possibly many more, tracked or not).
After a chat with Alexandros, we came up with the following high-level plan as a proposal to move things forward in Q3:
- Do basic testing with Docker Distribution and Ceph. Alexandros already started it: we have a separate Docker Distribution instance on the registry hosts that uses Ceph as backend (see T394476 for the Data Persistence part). We'd need to pick it up and complete it (reasonable load test, push/pull of various image sizes, etc.). We'd need to pay attention to bottlenecks in the new storage infrastructure, and to bugs (hopefully few and minor) in the Docker Distribution implementation of the Ceph driver/engine.
- Once we are reasonably confident in the new storage infrastructure and configuration, we could proceed with a simple test: we coordinate with Releng and, on a certain day, we flip the /restricted Docker Registry prefix to the Docker Distribution instance backed by Ceph. The scap workflow will push the new image, and the Wikikube worker nodes will pull it. The flip is easy since the main point of contact is still nginx (on the registry nodes), so we only need to change the backend in its config, and in case of problems we can revert the setting very quickly. The nice part is that no change is needed on the k8s workers: it will be transparent to them. We thought about more conservative approaches (for example, targeting only slices of Wikikube workers), but they may take a huge amount of time and we are not sure we'd gain more reliability.
- Move more prefixes/images to the new backend, working with Data Persistence along the way to make sure that we are good capacity-wise, etc.
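To make the "push/pull of various image sizes" part of the first step a bit more concrete, here is a minimal sketch of a round-trip consistency check. Everything in it is an assumption for illustration: `roundtrip_check`, `InMemoryRegistry` and the size list are hypothetical names and values, and the in-memory class only stands in for the real thing. An actual test would replace `push`/`pull` with calls against the Ceph-backed Docker Distribution instance (for example via the registry HTTP API or plain `docker push`/`docker pull`).

```python
import hashlib
import os
from typing import Callable, Dict, Iterable, List, Optional, Tuple


def blob_digest(data: bytes) -> str:
    """Return the content digest in the `sha256:<hex>` form used by registries."""
    return "sha256:" + hashlib.sha256(data).hexdigest()


def roundtrip_check(
    push: Callable[[str, bytes], None],
    pull: Callable[[str], Optional[bytes]],
    sizes: Iterable[int],
) -> List[Tuple[int, str]]:
    """Push one random blob per size, pull it back, and compare digests.

    Returns the (size, digest) pairs that did not survive the round trip.
    """
    failures: List[Tuple[int, str]] = []
    for size in sizes:
        data = os.urandom(size)
        digest = blob_digest(data)
        push(digest, data)
        pulled = pull(digest)
        # A missing blob or a digest mismatch is exactly the kind of
        # inconsistency we want to catch before flipping any prefix.
        if pulled is None or blob_digest(pulled) != digest:
            failures.append((size, digest))
    return failures


class InMemoryRegistry:
    """Hypothetical stand-in for the registry, only here to make the sketch runnable."""

    def __init__(self) -> None:
        self.blobs: Dict[str, bytes] = {}

    def push(self, digest: str, data: bytes) -> None:
        self.blobs[digest] = data

    def pull(self, digest: str) -> Optional[bytes]:
        return self.blobs.get(digest)


if __name__ == "__main__":
    registry = InMemoryRegistry()
    # Sizes are arbitrary placeholders (1 KiB, 1 MiB, 16 MiB), not a real test plan.
    failed = roundtrip_check(registry.push, registry.pull, [1024, 1024**2, 16 * 1024**2])
    print("failed round trips:", failed)
```

Keeping `push` and `pull` as plain callables means the same check could run against the current Swift-backed instance and the Ceph-backed one, which would give us a baseline to compare against.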
Just to clarify, this is a short/medium-term solution to hopefully move away from the aforementioned problems and frustrations. The long-term goal is more challenging, namely: do we want to stick with Docker Distribution? Do we want to move away from it, towards another open source solution? Answering that will require more time allocated by multiple teams, something that we don't have at the moment. So let's start with something easy enough to achieve :)
Thoughts and opinions are welcome!