
Request routing to active/passive services active in codfw only stopped working
Closed, Resolved · Public

Description

The reimage of cp2023 from varnish-be to ats-be yesterday (T227432#5679084) means that there is no varnish backend left in codfw. This breaks request routing from eqiad varnish-be to codfw varnish-be for active/passive services that are active in codfw only. At the time of writing, the only such service should be docker-registry.wikimedia.org (see T238792). However, for operational reasons we might have to switch other a/p services over to codfw, and that is now broken.

The approach proposed by @CDanis in https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/552142/ is clever, but (1) there is no ipsec on ats-be hosts, (2) we have no idea whether varnish-be talking to ats-be would interoperate properly, and, more generally, (3) the whole design assumes that the only clients of ats-be are varnish-fe.

@Joe and I have explored a few options on IRC, but so far the most promising one seems to be roughly:

  • re-reimage one codfw ats-be host, let's say cp2023, back to varnish-be to fix routing of eqiad docker-registry requests to codfw
  • finish the migration of eqiad to ATS except for one host so that requests from cp2023 varnish-be to eqiad varnish-be keep on working
  • depool cp2023
  • finish reimaging eqiad
  • reimage cp2023

All this could be avoided if we could DNS depool docker-registry (and any other service needing a switchover) in eqiad and make sure that all clients get the codfw IP instead, but I do not think this can be done?

Event Timeline

Restricted Application added a subscriber: Aklapper.
ema triaged this task as High priority. (Nov 21 2019, 9:24 AM)
ema moved this task from Backlog to Caching on the Traffic board.

Change 552218 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] Revert "cache: reimage cp2023 as text_ats"

https://gerrit.wikimedia.org/r/552218

Mentioned in SAL (#wikimedia-operations) [2019-11-21T09:39:27Z] <ema> depool cp2023 and reimage back as varnish-be T238817 T227432

Change 552218 merged by Ema:
[operations/puppet@production] Revert "cache: reimage cp2023 as text_ats"

https://gerrit.wikimedia.org/r/552218

Script wmf-auto-reimage was launched by ema on cumin2001.codfw.wmnet for hosts:

['cp2023.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201911210942_ema_6885.log.

Completed auto-reimage of hosts:

['cp2023.codfw.wmnet']

Of which those FAILED:

['cp2023.codfw.wmnet']

HTTP routing of docker-registry looks good to me now:

$ curl -s --resolve docker-registry.wikimedia.org:443:208.80.154.224 -v https://docker-registry.wikimedia.org 2>&1 | egrep "< (HTTP|x-cache)"
< HTTP/2 200 
< x-cache: cp2023 pass, cp1087 pass, cp1085 pass
< x-cache-status: pass
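
For repeated checks, the x-cache header above can be parsed programmatically. The following is a minimal sketch, not an existing tool. It assumes that cache hosts are named cp<N> with the first digit of N identifying the data center (1 for eqiad, 2 for codfw, consistent with cp2023.codfw.wmnet above), and that the header is a comma-separated list of "host status" pairs. The names parse_x_cache and routed_via are hypothetical helpers introduced for illustration.

```python
import re

# Assumption: the first digit of the cp host number identifies the data center.
# Only the two sites discussed in this task are mapped here.
SITE_BY_PREFIX = {"1": "eqiad", "2": "codfw"}


def parse_x_cache(value):
    """Split an x-cache header value into (host, site, status) tuples."""
    hops = []
    for entry in value.split(","):
        parts = entry.split()
        if len(parts) < 2:
            continue  # skip empty or malformed entries
        host, status = parts[0], parts[1]
        match = re.fullmatch(r"cp(\d)\d+", host)
        site = SITE_BY_PREFIX.get(match.group(1), "unknown") if match else "unknown"
        hops.append((host, site, status))
    return hops


def routed_via(value, site):
    """Return True if any cache host in the header belongs to the given site."""
    return any(hop_site == site for _, hop_site, _ in parse_x_cache(value))


if __name__ == "__main__":
    header = "cp2023 pass, cp1087 pass, cp1085 pass"
    for host, site, status in parse_x_cache(header):
        print(host, site, status)
    print("via codfw:", routed_via(header, "codfw"))
```

With the header from the curl output above, routed_via(header, "codfw") confirms that the request passed through a codfw cache host, which is the behaviour this task set out to restore.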

\o/. Thanks for taking care of this!

Mentioned in SAL (#wikimedia-operations) [2019-12-19T12:52:29Z] <ema> depool cp2023 and cp1089 for ATS reimages T227432. Reimaged together because of T238817

ema claimed this task.

Now that the transition to ATS (T227432) is finished, there is no routing between cache backends anymore.