Anycast recdns
Closed, Resolved (Public)

Description

The primary, titular goal here is to increase our reliability and kill some resolv.conf config complexity by having internal recdns anycast IPs that are reliable across our network. This also gets us some experience relevant to the (later) task of doing public anycast AuthDNS (parent task).

The simplest way to do this would be to add them as LVS public IPs for the existing LVS anycast services. LVS would still monitor backend pdns_recursor health, and handle BGP advertisements to the routers.

Another route we've been exploring here, however, is to take LVS out of the picture and have the recdns servers self-monitor and advertise the anycast IP directly. There are upside and downside tradeoffs to that design decision to be discussed!

Wiki on specifics: https://wikitech.wikimedia.org/wiki/Anycast_recursive_DNS
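
To make the second option concrete, here is a minimal sketch of the "self-monitor and advertise" decision, assuming each server runs a local check per anycast prefix and only hands passing prefixes to its routing daemon. The function name, the check callables, and the example addresses are hypothetical and are not taken from the actual deployment.

```python
from typing import Callable, Iterable, List, Tuple


def prefixes_to_advertise(
    checks: Iterable[Tuple[str, Callable[[], bool]]],
) -> List[str]:
    """Return the anycast prefixes whose local health check passes.

    `checks` pairs a prefix (hypothetical, e.g. "10.3.0.1/32") with a
    callable that returns True when the local resolver is healthy.
    A check that raises is treated as failed, so the prefix is withdrawn.
    """
    healthy = []
    for prefix, check in checks:
        try:
            ok = bool(check())
        except Exception:
            ok = False
        if ok:
            healthy.append(prefix)
    return healthy


# Example with stand-in checks; a real check would query the local recursor.
example_checks = [
    ("10.3.0.1/32", lambda: True),   # healthy: keep advertising
    ("10.3.0.2/32", lambda: False),  # unhealthy: withdraw
]
advertised = prefixes_to_advertise(example_checks)
```

The point of the sketch is the failure mode: anything other than a clean pass removes the prefix, so a broken resolver stops attracting traffic without an external balancer being involved.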

Event Timeline

BBlack triaged this task as Medium priority. Feb 5 2018, 6:33 PM
BBlack created this task.

Change 397723 had a related patch set uploaded (by Ayounsi; owner: Ayounsi):
[operations/puppet@production] Bird anycast: add anycast_healthchecker

https://gerrit.wikimedia.org/r/397723

Change 520284 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] anycast recdns: test via resolv.conf on 5 hosts

https://gerrit.wikimedia.org/r/520284

Change 520284 merged by BBlack:
[operations/puppet@production] anycast recdns: test via resolv.conf on 5 hosts

https://gerrit.wikimedia.org/r/520284

Change 520441 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] anycast recdns: use for LVS balancers

https://gerrit.wikimedia.org/r/520441

Change 520528 had a related patch set uploaded (by Ayounsi; owner: Ayounsi):
[operations/dns@master] Reserve 10.3.0.0/30 for recdns anycast (backup static route)

https://gerrit.wikimedia.org/r/520528

Change 520528 merged by Ayounsi:
[operations/dns@master] Reserve 10.3.0.0/30 for recdns anycast (backup static route)

https://gerrit.wikimedia.org/r/520528
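
For reference, the reserved 10.3.0.0/30 block is small; the sketch below simply enumerates it with Python's standard `ipaddress` module. Which of these addresses is actually assigned to the recdns service is not stated in this task, so nothing here should be read as the deployed configuration.

```python
import ipaddress

# The block reserved by change 520528.
net = ipaddress.ip_network("10.3.0.0/30")

# Every address contained in the /30.
all_addresses = [str(addr) for addr in net]

# Conventional host addresses (network and broadcast excluded).
host_addresses = [str(addr) for addr in net.hosts()]

print(net.num_addresses, all_addresses, host_addresses)
```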

Mentioned in SAL (#wikimedia-operations) [2019-07-03T20:58:58Z] <XioNoX> add static backup routes for anycast recdns on cr1/2-codfw/eqiad - T186550

Change 520643 had a related patch set uploaded (by Ayounsi; owner: Ayounsi):
[operations/puppet@production] Bird anycast, add monitoring for anycast-healthchecker

https://gerrit.wikimedia.org/r/520643

Everything in the scope of that task is completed.

It's my understanding that the steps necessary to restart our recursors are now reduced to a simple depool/repool, and that the previous, complex approach from
https://wikitech.wikimedia.org/wiki/Service_restarts#DNS_recursors_(in_production_and_labservices) is now obsolete, right?

It will eventually. Only a few servers are using the new IPs for now; I opened T228190 to roll it out.

It's my understanding that the steps necessary to restart our recursors are now reduced to a simple depool/repool, and that the previous, complex approach from
https://wikitech.wikimedia.org/wiki/Service_restarts#DNS_recursors_(in_production_and_labservices) is now obsolete, right?

https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/520441/ (not yet deployed) will get rid of the LVS<->recdns dependency mess documented there.

Change 520643 merged by Ayounsi:
[operations/puppet@production] Bird anycast, add monitoring for anycast-healthchecker

https://gerrit.wikimedia.org/r/520643

elukey reopened this task as Open. Edited Aug 1 2019, 7:24 AM
elukey added a subscriber: fgiunchedi.

Couple of notes about the anycast-healthchecker:

  1. the anycast-healthchecker is not in jessie-wikimedia, so puppet on lithium/wezen is currently broken:
     root@install1002:/srv/wikimedia# reprepro lsbycomponent anycast-healthchecker
     anycast-healthchecker | 0.8.2-1 | stretch-wikimedia | main | amd64, i386, source
     anycast-healthchecker | 0.8.2-1 |  buster-wikimedia | main | amd64, i386, source
  2. python3-docopt seems to be required by the healthchecker's nagios monitor, and it was missing on lithium. I installed it manually via apt on it and opened https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/526849/
  3. Due to 1), the anycast-healthchecker is not deployed on lithium/wezen (jessies), which in turn causes the nagios check to fire criticals in icinga. I have acked them while waiting for a fix, since I am not sure what is best.
  4. https://wikitech.wikimedia.org/wiki/Anycast_recursive_DNS#Anycast_healthchecker_not_running seems not present

Cc: @fgiunchedi as FYI

Thanks @elukey! Indeed anycast-healthchecker isn't in jessie-wikimedia; lithium is being decom'd, and if wezen gets reinstalled it'll be buster, and I installed anycast-healthchecker manually there. No real reason to skip jessie-wikimedia though, so I'll copy the packages there too.

Couple of notes about the anycast-healthchecker:

  2. python3-docopt seems to be required by the healthchecker's nagios monitor, and it was missing on lithium. I installed it manually via apt on it and opened https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/526849/

Thanks, merged.

  4. https://wikitech.wikimedia.org/wiki/Anycast_recursive_DNS#Anycast_healthchecker_not_running seems not present

Fixed.

1) and 3) addressed by Filippo above.

Change 554866 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] anycast healthchecker should be able to bind

https://gerrit.wikimedia.org/r/554866

Change 554867 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] anycast: bind to real service

https://gerrit.wikimedia.org/r/554867

Change 554868 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] bird: 10s delay after route withdraw

https://gerrit.wikimedia.org/r/554868

Change 554866 merged by BBlack:
[operations/puppet@production] anycast healthchecker should be able to bind

https://gerrit.wikimedia.org/r/554866

Change 554867 merged by BBlack:
[operations/puppet@production] anycast: bind to real service

https://gerrit.wikimedia.org/r/554867

Change 554868 merged by BBlack:
[operations/puppet@production] bird: 10s delay after route withdraw

https://gerrit.wikimedia.org/r/554868
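
The last change is titled only "10s delay after route withdraw"; the patch itself is not reproduced here. One plausible reading, sketched below purely as an assumption, is a shutdown sequence that withdraws the anycast route first, waits for the withdrawal to propagate, and only then stops the service. The callables and the helper name are hypothetical placeholders.

```python
import time
from typing import Callable


def graceful_stop(
    withdraw_route: Callable[[], None],
    stop_service: Callable[[], None],
    delay_seconds: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Withdraw the route, wait, then stop the service.

    The 10 second default mirrors the change title; the ordering and the
    reason for the wait are assumptions, not taken from the patch.
    """
    withdraw_route()      # stop attracting new queries
    sleep(delay_seconds)  # give routers time to converge
    stop_service()        # now safe to stop answering


# Dry run with stand-ins that only record what happened, in order.
events = []
graceful_stop(
    withdraw_route=lambda: events.append("withdraw"),
    stop_service=lambda: events.append("stop"),
    sleep=lambda seconds: events.append(f"sleep {seconds:g}s"),
)
```

Injecting `sleep` keeps the sketch testable without actually waiting; a real deployment would more likely express this ordering in the service manager or routing daemon configuration.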