Phabricator

Cloud DNS: investigate weird graph during cloudgw replacement operation
Closed, Invalid · Public

Description

As part of the work in T382356 (replace cloudgw100[12] with spare 'second region' dev servers cloudnet100[78]-dev, i.e. replacing one cloudgw hardware server with another), I noticed these DNS graphs:

image.png (468×1 px, 95 KB)

Which correlates with the time a secondary server was online:

image.png (239×1 px, 33 KB)

Before the definitive server took over the cloudgw routing work:

image.png (239×1 px, 44 KB)

Timeline is:

  • cloudgw1002 is active, cloudgw1001 is standby
  • manual fail-over happens, via shutting down keepalived on cloudgw1002
  • cloudgw1001 is active, cloudgw1002 is standby
  • we take cloudgw1002 down, reimage it with the ::insetup puppet role
  • we reimage cloudgw1004 with the ::cloudgw puppet role
  • manual fail-over happens, via shutting down keepalived on cloudgw1001
  • cloudgw1004 is active, cloudgw1001 is standby

Apparently, during the time cloudgw1001 was active, the BGP-based anycast VIP of the DNS recursor somehow changed its routing path, and DNS resolution errors increased.

Graph sources:

NOTE: No DNS malfunction was reported during this

Event Timeline

aborrero triaged this task as Low priority.
aborrero updated the task description.

There isn't any mystery here.

Anycast routing is hot-potato. If cloudgw1001 is the active one, with cloudsw1-c8-eqiad as its gateway, then traffic for the VIP will route to cloudservices1006, as it's in the same rack.

The cloudservices nodes announce the routes with no specific parameters set to make one or the other primary. So, all things being equal, the switch routes to the nearest instance.
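The "nearest instance wins" behaviour can be illustrated with a rough model of BGP best-path tie-breaking: when local-pref, AS-path length and MED are all equal, the lowest IGP cost to the next hop decides. Only cloudservices1006 is named in the task; the second node and all metric values below are invented for illustration:

```python
# Hot-potato tie-breaking sketch: with all BGP attributes equal, the
# switch picks the announcement with the lowest IGP cost to the next hop.
# Costs and the second hostname are invented; only cloudservices1006 is real.

def best_path(routes):
    # Prefer higher local-pref, then shorter AS path, then lower MED,
    # then lower IGP cost (the hot-potato tie-breaker).
    return min(
        routes,
        key=lambda r: (-r["local_pref"], r["as_path_len"], r["med"], r["igp_cost"]),
    )


routes = [
    # Both nodes announce the anycast VIP with identical attributes...
    {"node": "cloudservices1006", "local_pref": 100, "as_path_len": 1,
     "med": 0, "igp_cost": 1},   # same rack as cloudsw1-c8-eqiad
    {"node": "cloudservicesXXXX", "local_pref": 100, "as_path_len": 1,
     "med": 0, "igp_cost": 2},   # hypothetical instance in another rack
]

# ...so the same-rack instance wins purely on IGP cost.
assert best_path(routes)["node"] == "cloudservices1006"
```

When the active cloudgw moves to a different rack, the IGP cost ordering changes with it, which is enough to explain the path change seen in the graphs without any DNS malfunction.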