Request routing to active/passive services active in codfw only stopped working
Closed, Resolved · Public

Authored By

	• ema
	Nov 21 2019, 9:24 AM

Description

Yesterday's reimage of cp2023 from varnish-be to ats-be (T227432#5679084) means that there is no Varnish backend left in codfw. This breaks request routing from eqiad varnish-be to codfw varnish-be for active/passive services that are active in codfw only. At the time of writing, the only such service should be docker-registry.wikimedia.org; see T238792. However, we might have to switch a/p services over to codfw for operational reasons, and that is now broken.

The approach proposed by @CDanis in https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/552142/ is clever, but (1) there is no ipsec on ats-be hosts, (2) we have no idea whether varnish-be talking to ats-be would interoperate properly, and (3) more generally, the whole design assumes that the only clients of ats-be are varnish-fe.

@Joe and I have explored a few options on IRC. So far the most promising one seems to be roughly:

1. Reimage one codfw ats-be host, let's say cp2023, back to varnish-be to fix routing of eqiad docker-registry requests to codfw.
2. Finish the migration of eqiad to ATS except for one host, so that requests from cp2023 varnish-be to eqiad varnish-be keep working.
3. Depool cp2023.
4. Finish reimaging eqiad.
5. Reimage cp2023.

All this could be avoided if we could DNS depool docker-registry (and any other service needing a switchover) in eqiad and make sure that all clients get the codfw IP instead, but I do not think this can be done?

Details

	Subject	Repo	Branch	Lines +/-
	Revert "cache: reimage cp2023 as text_ats"	operations/puppet	production	+8 -3

Related Objects

Mentioned In: T233474: Ensure graphs used by Performance account for Varnish-to-ATS migration
T238792: Wikifeeds deployment failed in staging
T227432: Replace Varnish backends with ATS on cache text nodes
Mentioned Here: T227432: Replace Varnish backends with ATS on cache text nodes
T238792: Wikifeeds deployment failed in staging

Event Timeline

• ema created this task. Nov 21 2019, 9:24 AM

Restricted Application added a project: SRE. Nov 21 2019, 9:24 AM

Restricted Application added a subscriber: Aklapper.

• ema triaged this task as High priority. Nov 21 2019, 9:24 AM

• ema moved this task from Backlog to Caching on the Traffic board.

• ema updated the task description. Nov 21 2019, 9:28 AM

Change 552218 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] Revert "cache: reimage cp2023 as text_ats"

https://gerrit.wikimedia.org/r/552218

gerritbot added a project: Patch-For-Review. Nov 21 2019, 9:36 AM

Mentioned in SAL (#wikimedia-operations) [2019-11-21T09:39:27Z] <ema> depool cp2023 and reimage back as varnish-be T238817 T227432

Stashbot mentioned this in T227432: Replace Varnish backends with ATS on cache text nodes. Nov 21 2019, 9:39 AM

• ema updated the task description. Nov 21 2019, 9:40 AM

Change 552218 merged by Ema:
[operations/puppet@production] Revert "cache: reimage cp2023 as text_ats"

https://gerrit.wikimedia.org/r/552218

Script wmf-auto-reimage was launched by ema on cumin2001.codfw.wmnet for hosts:

['cp2023.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201911210942_ema_6885.log.

akosiaris mentioned this in T238792: Wikifeeds deployment failed in staging. Nov 21 2019, 9:56 AM

Completed auto-reimage of hosts:

['cp2023.codfw.wmnet']

Of which those FAILED:

['cp2023.codfw.wmnet']

Maintenance_bot removed a project: Patch-For-Review. Nov 21 2019, 10:10 AM

Mentioned in SAL (#wikimedia-operations) [2019-11-21T10:22:37Z] <ema> pool cp2023 with Varnish backend T238817 T227432

HTTP routing of docker-registry looks good to me now:

$ curl -s --resolve docker-registry.wikimedia.org:443:208.80.154.224 -v https://docker-registry.wikimedia.org 2>&1 | egrep "< (HTTP|x-cache)"
< HTTP/2 200 
< x-cache: cp2023 pass, cp1087 pass, cp1085 pass
< x-cache-status: pass
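The same check can be scripted. The following is only a minimal sketch: it parses an x-cache value like the one above and reports whether the first listed hop is a codfw host. It assumes that the first entry is the backend-most cache and that the first digit of the host number identifies the site (1 for eqiad, 2 for codfw). The names parse_x_cache and reaches_codfw_backend are hypothetical helpers, not part of any existing tooling.

```python
from typing import List, NamedTuple

# Assumption: the first digit of the host number encodes the site.
SITE_BY_DIGIT = {"1": "eqiad", "2": "codfw"}


class Hop(NamedTuple):
    host: str
    status: str
    site: str


def parse_x_cache(value: str) -> List[Hop]:
    """Split an x-cache header value into (host, status, site) hops."""
    hops = []
    for entry in value.split(","):
        parts = entry.split()
        if len(parts) < 2:
            # Skip empty or malformed entries rather than guessing.
            continue
        host, status = parts[0], parts[1]
        digits = "".join(ch for ch in host if ch.isdigit())
        site = SITE_BY_DIGIT.get(digits[:1], "unknown")
        hops.append(Hop(host, status, site))
    return hops


def reaches_codfw_backend(value: str) -> bool:
    """True if the first listed hop (assumed backend-most) is in codfw."""
    hops = parse_x_cache(value)
    return bool(hops) and hops[0].site == "codfw"


if __name__ == "__main__":
    header = "cp2023 pass, cp1087 pass, cp1085 pass"
    for hop in parse_x_cache(header):
        print(hop.host, hop.status, hop.site)
    print("routed via codfw backend:", reaches_codfw_backend(header))
```

For the header shown above, the first hop is cp2023, so the check passes. A response that lists only cp1xxx hosts would fail it.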

\o/. Thanks for taking care of this!

Mentioned in SAL (#wikimedia-operations) [2019-12-19T12:52:29Z] <ema> depool cp2023 and cp1089 for ATS reimages T227432. Reimaged together because of T238817

Now that the transition to ATS (T227432) is finished, there is no routing between cache backends anymore.

Krinkle mentioned this in T233474: Ensure graphs used by Performance account for Varnish-to-ATS migration. Dec 24 2019, 1:11 AM
