
Request routing to active/passive services active in codfw only stopped working
Open, High, Public

Description

The reimage of cp2023 from varnish-be to ats-be yesterday (T227432#5679084) means that there is no varnish backend left in codfw. This breaks request routing from eqiad varnish-be to codfw varnish-be for active/passive services that are active in codfw only. At the time of writing, the only such service should be docker-registry.wikimedia.org; see T238792. However, we might, for operational reasons, have to switch a/p services over to codfw, and that is now broken.
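The failure mode described above can be illustrated with a toy model. This is only a sketch under a simplifying assumption: cross-DC routing works if and only if both data centers have at least one varnish-be host. The function `can_route`, the inventory layout, and the host name `eqiad-be-1` are hypothetical; only cp2023 is taken from this task.

```python
def can_route(backends, src_dc, dst_dc):
    """Return True if a varnish-be in src_dc can forward to a varnish-be in dst_dc.

    Simplifying assumption: one varnish-be on each side is enough.
    """
    def has_varnish_be(dc):
        return any(kind == "varnish-be" for kind in backends.get(dc, {}).values())

    return has_varnish_be(src_dc) and has_varnish_be(dst_dc)


# Hypothetical inventory: host name -> backend software, grouped by data center.
backends = {
    "eqiad": {"eqiad-be-1": "varnish-be"},  # made-up host name
    "codfw": {"cp2023": "ats-be"},          # state after the reimage to ats-be
}

# No varnish-be left in codfw, so eqiad -> codfw routing is broken.
broken_state = can_route(backends, "eqiad", "codfw")

# Reimaging cp2023 back to varnish-be restores the path.
backends["codfw"]["cp2023"] = "varnish-be"
fixed_state = can_route(backends, "eqiad", "codfw")

print(broken_state, fixed_state)
```

The model deliberately ignores ipsec, pooling state, and per-service routing; it only captures the "no varnish backend left in codfw" condition.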

The approach proposed by @CDanis (https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/552142/) is clever, but (1) there is no ipsec on ats-be hosts, (2) we have no idea whether varnish-be talking to ats-be would interoperate properly, and in general (3) the whole design is based on the assumption that the only clients of ats-be are varnish-fe.

@Joe and I have explored a few options on IRC; so far the most promising one seems to be roughly:

  • re-reimage one codfw ats-be host, let's say cp2023, back to varnish-be to fix routing of eqiad docker-registry requests to codfw
  • finish the migration of eqiad to ATS except for one host so that requests from cp2023 varnish-be to eqiad varnish-be keep on working
  • depool cp2023
  • finish reimaging eqiad
  • reimage cp2023

All this could be avoided if we could DNS depool docker-registry (and any other service needing a switchover) in eqiad and make sure that all clients get the codfw IP instead, but I do not think this can be done?

Details

Related Gerrit Patches:
operations/puppet : production · Revert "cache: reimage cp2023 as text_ats"

Event Timeline

ema created this task. Thu, Nov 21, 9:24 AM
Restricted Application added a project: Operations. Thu, Nov 21, 9:24 AM
Restricted Application added a subscriber: Aklapper.
ema triaged this task as High priority. Thu, Nov 21, 9:24 AM
ema moved this task from Triage to Caching on the Traffic board.
ema updated the task description. Thu, Nov 21, 9:28 AM

Change 552218 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] Revert "cache: reimage cp2023 as text_ats"

https://gerrit.wikimedia.org/r/552218

Mentioned in SAL (#wikimedia-operations) [2019-11-21T09:39:27Z] <ema> depool cp2023 and reimage back as varnish-be T238817 T227432

ema updated the task description. Thu, Nov 21, 9:40 AM

Change 552218 merged by Ema:
[operations/puppet@production] Revert "cache: reimage cp2023 as text_ats"

https://gerrit.wikimedia.org/r/552218

Script wmf-auto-reimage was launched by ema on cumin2001.codfw.wmnet for hosts:

['cp2023.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201911210942_ema_6885.log.

Completed auto-reimage of hosts:

['cp2023.codfw.wmnet']

Of which those FAILED:

['cp2023.codfw.wmnet']

Mentioned in SAL (#wikimedia-operations) [2019-11-21T10:22:37Z] <ema> pool cp2023 with Varnish backend T238817 T227432

ema added a comment. Thu, Nov 21, 10:26 AM

HTTP routing of docker-registry looks good to me now:

$ curl -s --resolve docker-registry.wikimedia.org:443:208.80.154.224 -v https://docker-registry.wikimedia.org 2>&1 | egrep "< (HTTP|x-cache)"
< HTTP/2 200 
< x-cache: cp2023 pass, cp1087 pass, cp1085 pass
< x-cache-status: pass
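For repeated checks, the `x-cache` inspection above could be scripted. The following is a minimal sketch that assumes the header is a comma-separated list of `host status` pairs, as in the output shown. The helper names `parse_x_cache` and `traversed` are hypothetical and not part of any existing tooling.

```python
def parse_x_cache(value):
    """Split an x-cache header value into (host, status) tuples, in header order."""
    hops = []
    for entry in value.split(","):
        parts = entry.split()
        if len(parts) >= 2:
            hops.append((parts[0], " ".join(parts[1:])))
    return hops


def traversed(value, host):
    """Return True if the given cache host appears anywhere in the x-cache chain."""
    return any(hop_host == host for hop_host, _ in parse_x_cache(value))


# Header value copied from the curl output above.
header = "cp2023 pass, cp1087 pass, cp1085 pass"
hops = parse_x_cache(header)

# The request is routed correctly if it passed through cp2023 in codfw.
print(hops, traversed(header, "cp2023"))
```

Matching on the exact host name keeps the check independent of any host naming convention.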

\o/. Thanks for taking care of this!