Restrouter health checks fail when local wikifeeds instance is not pooled in discovery records
Open, Low, Public

Assigned To

None

Authored By

	akosiaris
	Dec 18 2019, 3:22 PM

Description

Today I went through the process of reinitializing the codfw Kubernetes cluster. Part of the process was to direct edge caches and discovery records away from codfw beforehand and to revert that afterwards.

Everything went according to plan and the entire cluster has been reinitialized using an etcd3 backing datastore.

Before repooling codfw in the discovery records, however, we noticed the following (times are UTC):

(15:00:32) icinga-wm: PROBLEM - Restrouter LVS codfw on restrouter.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase

then

(15:06:45) logmsgbot: !log akosiaris@cumin1001 conftool action : set/pooled=true; selector: name=codfw,dnsdisc=(eventgate.*|mathoid|citoid|restrouter|sessionstore|echostore|zotero|termbox|wikifeeds|cxserver|blubberoid)

and

(15:10:34) icinga-wm: RECOVERY - Restrouter LVS codfw on restrouter.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase

The timing is consistent with the 5m TTL for discovery DNS records.
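
As a quick sanity check of that claim, the arithmetic can be sketched as below. This is only an illustration: the three timestamps are copied from the log lines above, the 5m TTL is the value stated in this task, and the variable and function names are made up for the example.

```python
from datetime import datetime, timedelta

# TTL for discovery DNS records, as stated in this task.
DISCOVERY_TTL = timedelta(minutes=5)


def parse_utc(hms: str) -> datetime:
    """Parse an HH:MM:SS timestamp (all on the same day, UTC)."""
    return datetime.strptime(hms, "%H:%M:%S")


# Timestamps copied from the icinga-wm / logmsgbot lines above.
alert = parse_utc("15:00:32")     # PROBLEM reported
repool = parse_utc("15:06:45")    # conftool set/pooled=true for codfw
recovery = parse_utc("15:10:34")  # RECOVERY reported

# If recovery depends on cached discovery records expiring, it should
# arrive no later than one TTL after the repool.
repool_to_recovery = recovery - repool
within_ttl = repool_to_recovery <= DISCOVERY_TTL

print(repool_to_recovery, within_ttl)
```

The recovery lands 3m49s after the repool, inside one TTL window, which fits the explanation that restrouter only recovered once the cached discovery record expired and resolved to codfw again. It does not prove that explanation on its own.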

This roughly means (at this point, at least) that restrouter is unable to use the cross-DC wikifeeds instance.

What's even more concerning is that RESTBase did not complain in the exact same infrastructure situation, so this is quite possibly worth a look before we rely on RESTRouter more.

Related Objects

Mentioned In: T330850: [wm-checks-api] support EarlyWarningBot
Mentioned Here: T214068: Display Zuul status of jobs for a change on Gerrit UI
T330850: [wm-checks-api] support EarlyWarningBot

Event Timeline

akosiaris created this task. Dec 18 2019, 3:22 PM

Restricted Application added a subscriber: Aklapper. Dec 18 2019, 3:22 PM

akosiaris triaged this task as Low priority. Dec 18 2019, 3:22 PM

Eevans edited projects, added Platform Team Workboards (Clinic Duty Team); removed Platform Engineering. Jan 10 2020, 5:47 PM

WDoranWMF moved this task from Inbox to Backlog on the Platform Team Workboards (Clinic Duty Team) board. Mar 12 2020, 1:35 PM

• AMooney moved this task from Backlog to Teleport on the Platform Team Workboards (Clinic Duty Team) board. Mar 24 2020, 1:47 PM

WDoranWMF edited projects, added Platform Engineering (Icebox); removed Platform Team Workboards (Clinic Duty Team). Mar 24 2020, 9:48 PM

Mentioned in SAL (#wikimedia-operations) [2023-03-24T05:20:54Z] <hashar@deploy2002> Started deploy [gerrit/gerrit@c1cbda4]: Update js plugins for EarlyWarning bot (T330850) and displaying Zuul status on changes (T241068)

Mentioned in SAL (#wikimedia-operations) [2023-03-24T05:21:01Z] <hashar@deploy2002> Finished deploy [gerrit/gerrit@c1cbda4]: Update js plugins for EarlyWarning bot (T330850) and displaying Zuul status on changes (T241068) (duration: 00m 07s)

Mentioned in SAL (#wikimedia-operations) [2023-03-24T05:25:58Z] <hashar@deploy2002> Started deploy [gerrit/gerrit@c1cbda4]: Update js plugins for EarlyWarning bot (T330850) and displaying Zuul status on changes (T241068)

Mentioned in SAL (#wikimedia-operations) [2023-03-24T05:26:08Z] <hashar@deploy2002> Finished deploy [gerrit/gerrit@c1cbda4]: Update js plugins for EarlyWarning bot (T330850) and displaying Zuul status on changes (T241068) (duration: 00m 10s)

Those @Stashbot entries were intended for T214068: Display Zuul status of jobs for a change on Gerrit UI
