Make swift containers for docker registry cross replicated.
Closed, Resolved (Public)

Description

The current Docker registry instance writes to a Swift container in codfw. To achieve HA we would like to maintain a "global" Docker registry container synced in each DC. Given the nature of what we are writing (Docker layers), some synchronization delay is acceptable.

The main idea is to use Swift's container sync feature and create two new containers (docker_registry_eqiad and docker_registry_codfw) configured to sync with each other, so a write on the codfw cluster is replicated to eqiad and vice versa.
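
As a rough illustration of the mutual setup described above: Swift container sync is driven by two container headers, X-Container-Sync-To (in the realm form //realm/cluster/account/container) and X-Container-Sync-Key. The sketch below only builds those headers for each side. The container names come from this task; the realm name, cluster names, account, and key are hypothetical placeholders, not the production values.

```python
def sync_to(realm: str, cluster: str, account: str, container: str) -> str:
    """Build an X-Container-Sync-To value in the realm form."""
    return f"//{realm}/{cluster}/{account}/{container}"


def sync_headers(realm: str, cluster: str, account: str,
                 container: str, key: str) -> dict:
    """Headers to set on a container so it syncs to the given peer."""
    return {
        "X-Container-Sync-To": sync_to(realm, cluster, account, container),
        "X-Container-Sync-Key": key,
    }


# Hypothetical values, for illustration only.
REALM = "registry_realm"
ACCOUNT = "AUTH_docker"
KEY = "not-a-real-secret"

# Each container points at its peer in the other cluster, with the same key.
eqiad_headers = sync_headers(REALM, "codfw", ACCOUNT, "docker_registry_codfw", KEY)
codfw_headers = sync_headers(REALM, "eqiad", ACCOUNT, "docker_registry_eqiad", KEY)

print(eqiad_headers["X-Container-Sync-To"])
print(codfw_headers["X-Container-Sync-To"])
```

The headers for docker_registry_eqiad point at the codfw container and the other way around, which is what makes the replication bidirectional.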

There are several open questions related to this task:

  • How container synchronization will perform. There are signs that it was tested and discarded before.
  • How to monitor container synchronization. A naive answer would be to write synchronization logs to a separate file and configure a log forwarder to ship them to ELK, plus a simple list diff check in Icinga. There seems to be no easy way to monitor it, as stated in the doc:

Additionally, it should be noted there is no way for an end user to detect sync progress or problems other than HEADing both containers and comparing the overall information

  • How to manage container creation in Puppet. It seems that current Swift containers are created by applications, and only Swift accounts and config are managed via Puppet, not containers.
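
The monitoring idea above (HEAD both containers and compare, plus a list diff) could be sketched roughly as follows. This is a minimal sketch under assumptions: it takes already-fetched HEAD response headers and object listings as plain Python values, so how they are fetched (and how the result is wired into an Icinga check) is left out. The header names X-Container-Object-Count and X-Container-Bytes-Used are the ones Swift returns on a container HEAD; the numbers and object names in the example are made up.

```python
FIELDS = ("X-Container-Object-Count", "X-Container-Bytes-Used")


def sync_lag(local_head: dict, remote_head: dict) -> dict:
    """Return local minus remote for the object count and bytes used."""
    # HTTP header names are case-insensitive, so normalise before lookup.
    local = {k.lower(): v for k, v in local_head.items()}
    remote = {k.lower(): v for k, v in remote_head.items()}
    return {f: int(local[f.lower()]) - int(remote[f.lower()]) for f in FIELDS}


def listing_diff(local_names, remote_names):
    """Return (objects only in local, objects only in remote), sorted."""
    local, remote = set(local_names), set(remote_names)
    return sorted(local - remote), sorted(remote - local)


# Made-up sample data, for illustration only.
eqiad_head = {"X-Container-Object-Count": "120", "X-Container-Bytes-Used": "5000"}
codfw_head = {"x-container-object-count": "100", "x-container-bytes-used": "4200"}

lag = sync_lag(eqiad_head, codfw_head)
only_eqiad, only_codfw = listing_diff(["a", "b", "c"], ["b", "c", "d"])
print(lag, only_eqiad, only_codfw)
```

A check built on this would probably need to tolerate a non-zero difference for some time, since some synchronization delay is expected and acceptable for this use case.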

Event Timeline

fsero triaged this task as Medium priority. Jan 21 2019, 10:56 AM
fsero created this task.

I've replicated this using a local SAIO setup and it seems to work. However, that setup obviously avoids network latency, hence the open question about performance.

Change 490073 had a related patch set uploaded (by Fsero; owner: Fsero):
[operations/puppet@production] Enabling docker registry swift replication

https://gerrit.wikimedia.org/r/490073

It seems there are some issues on the Swift side regarding container-real-synchronization. I'll hold this for now and work on bringing up an HA service in one DC (eqiad). Later on I will work with @fgiunchedi (or any other volunteer) to fix them and enable cross-DC replication on the Swift side.

fsero changed the task status from Open to Stalled. Feb 26 2019, 4:30 PM

I'm happy to help in the future, although it will also be a learning exercise for me :)

With the help of @CDanis, PCC now looks happy. @fgiunchedi, this is good to merge if you think so too.

fsero changed the task status from Stalled to Open. Feb 26 2019, 5:14 PM

Reporting it here too (previously a chat between me and @fsero): I'm happy to help with the Swift container synchronization work, but I am postponing it until the beginning of next quarter, when goal priorities are less stringent.

Change 490073 had a related patch set uploaded (by Fsero; owner: Fsero):
[operations/puppet@production] Enabling docker registry swift cross dc replication

https://gerrit.wikimedia.org/r/490073

Change 490073 merged by Filippo Giunchedi:
[operations/puppet@production] Enabling docker registry swift cross dc replication

https://gerrit.wikimedia.org/r/490073

Mentioned in SAL (#wikimedia-operations) [2019-04-10T08:46:28Z] <godog> roll-restart swift frontends - T214289

Change 503001 had a related patch set uploaded (by Fsero; owner: Fsero):
[operations/puppet@production] puppet exec { } doesn't like a bash builtin.

https://gerrit.wikimedia.org/r/503001

Change 503001 merged by Fsero:
[operations/puppet@production] puppet exec { } doesn't like a bash builtin.

https://gerrit.wikimedia.org/r/503001

Change 503938 had a related patch set uploaded (by Fsero; owner: Fsero):
[operations/puppet@production] registryha: minor fixes

https://gerrit.wikimedia.org/r/503938

Change 503938 merged by Fsero:
[operations/puppet@production] registryha: minor fixes

https://gerrit.wikimedia.org/r/503938

Change 503949 had a related patch set uploaded (by Fsero; owner: Fsero):
[operations/puppet@production] registryha: swift authurl != storageURL

https://gerrit.wikimedia.org/r/503949

Change 503949 merged by Fsero:
[operations/puppet@production] registryha: swift authurl != storageURL

https://gerrit.wikimedia.org/r/503949

Change 503962 had a related patch set uploaded (by Fsero; owner: Fsero):
[operations/puppet@production] registryha: replication configuration needs to omit 'cluster_' from cluster_name

https://gerrit.wikimedia.org/r/503962

Change 503962 merged by Fsero:
[operations/puppet@production] registryha: replication configuration needs to omit 'cluster_' from cluster_name

https://gerrit.wikimedia.org/r/503962
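
For context on the fix above: in Swift's container-sync-realms.conf, clusters are declared as cluster_<name> options inside a realm section, while the X-Container-Sync-To value refers to the cluster by <name> alone, without the cluster_ prefix. A minimal sketch of that mapping, assuming a made-up realm, key, and endpoint URLs (none of these are the production values):

```python
import configparser

# Hypothetical realms configuration, for illustration only.
CONF = """
[registry_realm]
key = not-a-real-secret
cluster_eqiad = https://swift-eqiad.example.org/v1/
cluster_codfw = https://swift-codfw.example.org/v1/
"""

PREFIX = "cluster_"


def cluster_names(conf_text: str, realm: str) -> dict:
    """Map bare cluster names (prefix stripped) to their endpoint URLs."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(conf_text)
    return {
        option[len(PREFIX):]: url
        for option, url in parser.items(realm)
        if option.startswith(PREFIX)
    }


clusters = cluster_names(CONF, "registry_realm")
# The sync target uses the bare name, e.g. "codfw", not "cluster_codfw".
target = f"//registry_realm/codfw/AUTH_docker/docker_registry_codfw"
print(clusters, target)
```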

Change 503966 had a related patch set uploaded (by Fsero; owner: Fsero):
[operations/puppet@production] registryha: registry needs the specific swift authURL

https://gerrit.wikimedia.org/r/503966

Change 503966 merged by Fsero:
[operations/puppet@production] registryha: registry needs the specific swift authURL

https://gerrit.wikimedia.org/r/503966

Change 503968 had a related patch set uploaded (by Fsero; owner: Fsero):
[operations/puppet@production] registryha: missing redis addr on healthcheck

https://gerrit.wikimedia.org/r/503968

Change 503968 merged by Fsero:
[operations/puppet@production] registryha: missing redis addr on healthcheck

https://gerrit.wikimedia.org/r/503968

Change 503994 had a related patch set uploaded (by Fsero; owner: Fsero):
[operations/puppet@production] registryha: typo on template for redis password

https://gerrit.wikimedia.org/r/503994

Change 503994 merged by Fsero:
[operations/puppet@production] registryha: typo on template for redis password

https://gerrit.wikimedia.org/r/503994

Change 504063 had a related patch set uploaded (by Fsero; owner: Fsero):
[operations/puppet@production] registryha: added 1 VM on eqiad and 2 on codfw

https://gerrit.wikimedia.org/r/504063

Change 504063 merged by Fsero:
[operations/puppet@production] registryha: added 1 VM on eqiad and 2 on codfw

https://gerrit.wikimedia.org/r/504063

I enabled cross replication for Swift today and it seems to work.

Replication seems to be quite slow: two hours later, synchronization has not completed. The first registry was set up in eqiad and its content populated with some images.

Tests performed:

  • Added empty swift container -> objects started to appear due to replication.
  • Deleted object on codfw -> gets deleted on eqiad sometime after.
  • Deleted object on eqiad -> did not get deleted yet since synchronization hasn't ended.

We should think carefully about whether we want to allow this as an active-active model. However, the scope of this task is done, so resolving it!

Yeah, if the replication model is eventual consistency, I think we just want a single discovery record that we make active/passive, and a cookbook (it's literally two commands to implement) for the switchover.