As part of the work in T382356 ("replace cloudgw100[12] with spare 'second region' dev servers cloudnet100[78]-dev"), i.e. replacing one cloudgw hardware server with another, I noticed anomalies in these DNS graphs (sources linked at the end):
The anomalies correlate with the time window in which the secondary server (cloudgw1001) was online, before the definitive server (cloudgw1004) took over the cloudgw routing work.
Timeline is:
- cloudgw1002 is active, cloudgw1001 is standby
- manual failover happens by shutting down keepalived on cloudgw1002
- cloudgw1001 is active, cloudgw1002 is standby
- we take cloudgw1002 down and reimage it with the ::insetup puppet role
- we reimage cloudgw1004 with the ::cloudgw puppet role
- manual failover happens by shutting down keepalived on cloudgw1001
- cloudgw1004 is active, cloudgw1001 is standby (a verification sketch follows this list)
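
For completeness, here is a minimal sketch of how the active/standby role can be checked from a cloudgw host, assuming keepalived exposes the VRRP VIP as an extra address on the uplink interface. The VIP (192.0.2.1) and interface name (eno1) are placeholders, not the production values:

```python
#!/usr/bin/env python3
"""Check whether this host currently holds the keepalived VRRP VIP.

Sketch only: the VIP and interface below are placeholders, not the
real cloudgw production values.
"""
import subprocess

VIP = "192.0.2.1"    # hypothetical cloudgw VIP (RFC 5737 test address)
INTERFACE = "eno1"   # hypothetical uplink interface

def is_active() -> bool:
    # When keepalived is in MASTER state, the VIP is configured as an
    # additional address on the interface, so it shows up in `ip addr`.
    out = subprocess.run(
        ["ip", "-o", "addr", "show", "dev", INTERFACE],
        capture_output=True, text=True, check=True,
    ).stdout
    return VIP in out

if __name__ == "__main__":
    state = "active (holds VIP)" if is_active() else "standby"
    print(f"this node is {state}")
```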
Apparently, during the time cloudgw1001 was active, the routing path of the BGP-based anycast VIP of the DNS recursor somehow changed, and DNS resolution errors increased in the same window.
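
To put numbers on that kind of error increase, a rough probe can be built with dnspython. The anycast VIP (192.0.2.53), the query name, and the probe cadence below are all illustrative assumptions, not the production values:

```python
#!/usr/bin/env python3
"""Probe the anycast DNS recursor VIP and count resolution errors.

Sketch only: 192.0.2.53 stands in for the real anycast VIP, and the
query name/interval are arbitrary choices.
"""
import time
import dns.resolver          # pip install dnspython
import dns.exception

ANYCAST_VIP = "192.0.2.53"   # placeholder for the recursor anycast VIP
QNAME = "wikimedia.org"
PROBES = 60
INTERVAL = 1.0               # seconds between probes

resolver = dns.resolver.Resolver(configure=False)
resolver.nameservers = [ANYCAST_VIP]
resolver.timeout = 2.0       # per-try timeout
resolver.lifetime = 2.0      # total time allowed per query

errors = 0
for _ in range(PROBES):
    try:
        resolver.resolve(QNAME, "A")
    except (dns.exception.Timeout, dns.resolver.NoNameservers):
        # Counts timeouts and SERVFAIL-style failures, the kind of
        # errors that spiked while cloudgw1001 was active.
        errors += 1
    time.sleep(INTERVAL)

print(f"{errors}/{PROBES} probes failed")
```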
Graph sources:
- https://grafana.wikimedia.org/d/000000579/wmcs-openstack-eqiad-summary?orgId=1&refresh=1m&from=1738664625106&to=1738686225106
- https://grafana.wikimedia.org/d/ded9b969-7207-4bde-9077-5f81457625c4/wmcs-openstack-eqiad-project-network-usage?orgId=1&from=1738664650739&to=1738686250740
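
The from/to parameters in those URLs are Unix epoch milliseconds; a small snippet to convert them to UTC, using the exact values from the first URL:

```python
#!/usr/bin/env python3
"""Convert the Grafana from/to URL parameters (epoch ms) to UTC."""
from datetime import datetime, timezone

for label, ms in [("from", 1738664625106), ("to", 1738686225106)]:
    ts = datetime.fromtimestamp(ms / 1000, tz=timezone.utc)
    print(f"{label}: {ts.isoformat()}")
```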


