The reimage of cp2023 from varnish-be to ats-be yesterday (T227432#5679084) means that there is no varnish backend left in codfw. This breaks request routing from eqiad varnish-be to codfw varnish-be for active/passive services that are active in codfw only. At the time of this writing, the only such service should be docker-registry.wikimedia.org; see T238792. However, we might for operational reasons have to switch a/p services over to codfw, and that is now broken.
The approach proposed by @CDanis in https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/552142/ is clever, but (1) there is no ipsec on ats-be hosts, (2) we have no idea whether varnish-be talking to ats-be would interoperate properly, and in general (3) the whole design is based on the premise that the only clients of ats-be are varnish-fe.
@Joe and I have explored a few options on IRC, but so far the most promising one seems to be roughly:
- re-reimage one codfw ats-be host, let's say cp2023, back to varnish-be to fix routing of eqiad docker-registry requests to codfw
- finish the migration of eqiad to ATS except for one host so that requests from cp2023 varnish-be to eqiad varnish-be keep on working
- depool cp2023
- finish reimaging eqiad
- reimage cp2023
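To make the ordering of the steps above easier to follow, here is a minimal sketch that tracks how many pooled varnish-be hosts each datacenter has after every step. It is only a toy model, not actual tooling: cp2023 comes from the plan, but the eqiad host names (`eqiad-a`, `eqiad-b`), the `Host` class, and the assumption that a reimaged host comes back pooled are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Host:
    backend: str          # "varnish-be" or "ats-be"
    pooled: bool = True


def pooled_varnish(dc):
    """Count pooled varnish-be hosts in one datacenter."""
    return sum(1 for h in dc.values() if h.backend == "varnish-be" and h.pooled)


# Hypothetical inventory: cp2023 is from the plan, the eqiad names are placeholders.
codfw = {"cp2023": Host("ats-be")}
eqiad = {"eqiad-a": Host("varnish-be"), "eqiad-b": Host("varnish-be")}


def reimage(dc, name, backend):
    # Assumption of this model: a reimaged host comes back pooled.
    dc[name] = Host(backend)


def depool(dc, name):
    dc[name].pooled = False


# The five steps of the plan, in order.
steps = [
    ("re-reimage cp2023 to varnish-be", lambda: reimage(codfw, "cp2023", "varnish-be")),
    ("migrate eqiad to ATS except one host", lambda: reimage(eqiad, "eqiad-a", "ats-be")),
    ("depool cp2023", lambda: depool(codfw, "cp2023")),
    ("finish reimaging eqiad", lambda: reimage(eqiad, "eqiad-b", "ats-be")),
    ("reimage cp2023", lambda: reimage(codfw, "cp2023", "ats-be")),
]

history = [("start", pooled_varnish(eqiad), pooled_varnish(codfw))]
for label, action in steps:
    action()
    history.append((label, pooled_varnish(eqiad), pooled_varnish(codfw)))

for label, e, c in history:
    print(f"{label:40} eqiad varnish-be={e} codfw varnish-be={c}")
```

The point of the ordering is visible in the counts: codfw regains a varnish-be before eqiad shrinks to its last one, and both sides reach zero only at the end.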
All this could be avoided if we could DNS depool docker-registry (and any other service needing a switchover) in eqiad and make sure that all clients get the codfw IP instead, but I do not think this can be done?