
Restrouter health checks fail when local wikifeeds instance is not pooled in discovery records
Open, Low, Public

Description

Today I went through the process of reinitializing the codfw Kubernetes cluster. Part of the process was to direct edge caches and discovery records away from codfw beforehand and to revert that afterwards.

Everything has gone according to plan and the entire cluster has been reinitialized using an etcd3 backing datastore.

Before repooling codfw in discovery, however, we noticed the following (times are in UTC):

(15:00:32) icinga-wm: PROBLEM - Restrouter LVS codfw on restrouter.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase

then

(15:06:45) logmsgbot: !log akosiaris@cumin1001 conftool action : set/pooled=true; selector: name=codfw,dnsdisc=(eventgate.*|mathoid|citoid|restrouter|sessionstore|echostore|zotero|termbox|wikifeeds|cxserver|blubberoid)

and

(15:10:34) icinga-wm: RECOVERY - Restrouter LVS codfw on restrouter.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase

The timing is consistent with the 5m TTL for discovery DNS records: the recovery arrived 3m49s after codfw was pooled again.
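As a quick sanity check of that timing, here is a minimal sketch that recomputes the intervals from the three timestamps quoted above. The variable names are made up for illustration, and the 5 minute TTL is simply the value stated in this task, not something read from the DNS configuration.

```python
from datetime import datetime, timedelta

# TTL as stated in the task description (assumption: not read from DNS config).
DISCOVERY_TTL = timedelta(minutes=5)


def parse_utc(ts: str) -> datetime:
    """Parse an HH:MM:SS timestamp as it appears in the IRC log lines."""
    return datetime.strptime(ts, "%H:%M:%S")


# Timestamps copied from the log lines in the description (all UTC, same day).
alert_raised = parse_utc("15:00:32")   # icinga-wm PROBLEM
codfw_pooled = parse_utc("15:06:45")   # conftool set/pooled=true
alert_cleared = parse_utc("15:10:34")  # icinga-wm RECOVERY

# How long the alert had been firing before codfw was pooled again.
failing_before_repool = codfw_pooled - alert_raised
# How long after the repool the check recovered.
recovery_lag = alert_cleared - codfw_pooled
# True when the recovery falls inside one TTL window after the repool.
within_ttl = recovery_lag <= DISCOVERY_TTL

print(f"failing before repool: {failing_before_repool}")
print(f"recovery lag after repool: {recovery_lag}")
print(f"recovery within one TTL: {within_ttl}")
```

The recovery lag also includes the monitoring check's own polling delay, so this only shows that the observation is compatible with cached DNS answers expiring, not that the TTL is the sole cause.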

This roughly means that, at this point at least, restrouter is unable to use the cross-DC wikifeeds instance.

What's even more concerning is that RESTBase did not complain in the exact same infrastructure situation, so this is quite possibly worth a look before we rely on RESTRouter more.

Event Timeline

Mentioned in SAL (#wikimedia-operations) [2023-03-24T05:20:54Z] <hashar@deploy2002> Started deploy [gerrit/gerrit@c1cbda4]: Update js plugins for EarlyWarning bot (T330850) and displaying Zuul status on changes (T241068)

Mentioned in SAL (#wikimedia-operations) [2023-03-24T05:21:01Z] <hashar@deploy2002> Finished deploy [gerrit/gerrit@c1cbda4]: Update js plugins for EarlyWarning bot (T330850) and displaying Zuul status on changes (T241068) (duration: 00m 07s)

Mentioned in SAL (#wikimedia-operations) [2023-03-24T05:25:58Z] <hashar@deploy2002> Started deploy [gerrit/gerrit@c1cbda4]: Update js plugins for EarlyWarning bot (T330850) and displaying Zuul status on changes (T241068)

Mentioned in SAL (#wikimedia-operations) [2023-03-24T05:26:08Z] <hashar@deploy2002> Finished deploy [gerrit/gerrit@c1cbda4]: Update js plugins for EarlyWarning bot (T330850) and displaying Zuul status on changes (T241068) (duration: 00m 10s)