Today I went through the process of reinitializing the codfw kubernetes cluster. Part of the process was to direct edge caches and discovery records away from codfw beforehand and to revert that afterwards.
Everything has gone according to plan and the entire cluster has been reinitialized using an etcd3 backing datastore.
Before repooling codfw in discovery, though, we noticed the following. Times are UTC.
(15:00:32) icinga-wm: PROBLEM - Restrouter LVS codfw on restrouter.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase
then
(15:06:45) logmsgbot: !log akosiaris@cumin1001 conftool action : set/pooled=true; selector: name=codfw,dnsdisc=(eventgate.*|mathoid|citoid|restrouter|sessionstore|echostore|zotero|termbox|wikifeeds|cxserver|blubberoid)
and
(15:10:34) icinga-wm: RECOVERY - Restrouter LVS codfw on restrouter.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
The timing is consistent with the 5m TTL for discovery DNS records.
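As a quick sanity check of that claim, the gap between the repool log entry and the recovery can be computed from the timestamps quoted above. This is only a sketch: the variable names are made up for illustration, and it assumes all three events happened on the same UTC day.

```python
from datetime import datetime, timedelta

FMT = "%H:%M:%S"
DISCOVERY_TTL = timedelta(minutes=5)  # the 5m discovery DNS TTL mentioned above

# Timestamps copied from the log lines above (UTC, same day assumed)
alert = datetime.strptime("15:00:32", FMT)     # icinga PROBLEM
repool = datetime.strptime("15:06:45", FMT)    # conftool set/pooled=true
recovery = datetime.strptime("15:10:34", FMT)  # icinga RECOVERY

repool_to_recovery = recovery - repool
alert_to_recovery = recovery - alert

print("repool -> recovery:", repool_to_recovery)
print("alert  -> recovery:", alert_to_recovery)
print("within one TTL of the repool:", repool_to_recovery <= DISCOVERY_TTL)
```

The recovery comes 3m49s after the repool, i.e. within a single TTL window, which is what one would expect if the check only started passing once the cached discovery records expired and resolved to codfw again.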
This roughly means (at this point at least) that restrouter is unable to use the cross-DC wikifeeds instance.
What's even more concerning is that RESTBase did not complain under the exact same infrastructure situation, so this is quite possibly worth a look before we rely on RESTRouter more.