Roll out Anycast RecDNS to more servers
Open, Normal · Public · 0 Story Points

Description

Now that the Anycast RecDNS service is up and running with live traffic, we can progressively make more servers use it in their resolv.conf.

Rollout status update: the following use the anycast recdns resolv.conf in production as of 2019-07-31:

  • All hosts in edge DCs (esams, ulsfo, eqsin)
  • All cp edge cache hosts globally
  • All LVS hosts globally
  • Canary MediaWiki API and Appserver hosts in both core DCs
  • Network devices
  • Install-time DNS (DHCP settings and the Debian installer)
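
One way to spot-check whether a given host in the list above has picked up the change is to parse its resolv.conf and look for the anycast address. The following is a minimal sketch, not the tooling used for this rollout. The address `192.0.2.53` is a documentation placeholder (the task does not state the anycast recdns IP), and the function names are hypothetical.

```python
# Placeholder only: the task does not state the real anycast recdns address.
ANYCAST_RECDNS_IP = "192.0.2.53"


def nameservers(resolv_conf_text):
    """Return the addresses on 'nameserver' lines, in file order."""
    servers = []
    for raw_line in resolv_conf_text.splitlines():
        line = raw_line.strip()
        # resolv.conf comments start with '#' or ';'.
        if not line or line[0] in "#;":
            continue
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "nameserver":
            servers.append(fields[1])
    return servers


def uses_anycast(resolv_conf_text, anycast_ip=ANYCAST_RECDNS_IP):
    """True if the anycast address is among the configured nameservers."""
    return anycast_ip in nameservers(resolv_conf_text)


if __name__ == "__main__":
    with open("/etc/resolv.conf") as f:
        print(uses_anycast(f.read()))
```

Long-running daemons may only read resolv.conf at startup, which is consistent with the pybal restarts logged in the timeline for the LVS hosts, so a check like this confirms the file contents rather than what a running process is using.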

Event Timeline

ayounsi triaged this task as Normal priority. Jul 16 2019, 4:50 PM
ayounsi created this task.
Restricted Application added a project: Operations. Jul 16 2019, 4:50 PM
Restricted Application added a subscriber: Aklapper.
BBlack moved this task from Triage to DNS Infra on the Traffic board. Jul 17 2019, 3:49 PM

Change 520441 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] anycast recdns: use for all LVS balancers

https://gerrit.wikimedia.org/r/520441

Change 525549 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] anycast recdns: use for eqsin LVS balancers

https://gerrit.wikimedia.org/r/525549

Change 525550 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] anycast recdns: use for esams LVS balancers

https://gerrit.wikimedia.org/r/525550

Change 525551 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] anycast recdns: use for ulsfo LVS balancers

https://gerrit.wikimedia.org/r/525551

Change 525552 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] anycast recdns: use for codfw LVS balancers

https://gerrit.wikimedia.org/r/525552

Change 524067 had a related patch set uploaded (by BBlack; owner: Ayounsi):
[operations/puppet@production] Anycast, make recdns VIP alerts page

https://gerrit.wikimedia.org/r/524067

Change 524067 merged by BBlack:
[operations/puppet@production] Anycast, make recdns VIP alerts page

https://gerrit.wikimedia.org/r/524067

Change 525556 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] Anycast recdns monitoring: use raw IP

https://gerrit.wikimedia.org/r/525556

Change 525556 merged by BBlack:
[operations/puppet@production] Anycast recdns monitoring: use raw IP

https://gerrit.wikimedia.org/r/525556

Change 525566 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] Anycast recdns mon: use raw IP, for real this time

https://gerrit.wikimedia.org/r/525566

Change 525566 merged by BBlack:
[operations/puppet@production] Anycast recdns mon: use raw IP, for real this time

https://gerrit.wikimedia.org/r/525566

Change 525549 merged by BBlack:
[operations/puppet@production] anycast recdns: use for eqsin LVS balancers

https://gerrit.wikimedia.org/r/525549

Mentioned in SAL (#wikimedia-operations) [2019-07-25T16:44:20Z] <bblack> lvs5003 - restart pybal for resolv.conf change - T228190

Mentioned in SAL (#wikimedia-operations) [2019-07-25T16:50:54Z] <bblack> lvs5002 - restart pybal for resolv.conf change - T228190

Mentioned in SAL (#wikimedia-operations) [2019-07-25T16:54:06Z] <bblack> lvs5001 - restart pybal for resolv.conf change - T228190

Change 525550 merged by BBlack:
[operations/puppet@production] anycast recdns: use for esams LVS balancers

https://gerrit.wikimedia.org/r/525550

Change 525551 merged by BBlack:
[operations/puppet@production] anycast recdns: use for ulsfo LVS balancers

https://gerrit.wikimedia.org/r/525551

Change 525552 merged by BBlack:
[operations/puppet@production] anycast recdns: use for codfw LVS balancers

https://gerrit.wikimedia.org/r/525552

Mentioned in SAL (#wikimedia-operations) [2019-07-25T21:07:39Z] <bblack> backup lvses in codfw, esams, ulsfo: restart pybal for resolv.conf changes - T228190

Mentioned in SAL (#wikimedia-operations) [2019-07-25T21:38:20Z] <bblack> primary high-traffic1 lvses in codfw, esams, ulsfo: restart pybal for resolv.conf changes - T228190

Mentioned in SAL (#wikimedia-operations) [2019-07-25T21:47:05Z] <bblack> primary high-traffic2 lvses in codfw, esams, ulsfo: restart pybal for resolv.conf changes - T228190

Change 520441 merged by BBlack:
[operations/puppet@production] anycast recdns: use for all LVS balancers

https://gerrit.wikimedia.org/r/520441

Mentioned in SAL (#wikimedia-operations) [2019-07-25T21:59:06Z] <bblack> lvs1016 - restart pybal for resolv.conf changes - T228190

Mentioned in SAL (#wikimedia-operations) [2019-07-25T22:02:45Z] <bblack> lvs1015 - restart pybal for resolv.conf changes - T228190

Mentioned in SAL (#wikimedia-operations) [2019-07-25T22:04:41Z] <bblack> lvs1014 - restart pybal for resolv.conf changes - T228190

Mentioned in SAL (#wikimedia-operations) [2019-07-25T22:07:03Z] <bblack> lvs1013 - restart pybal for resolv.conf changes - T228190

All the LVSes are now using the anycasted recdns, which gets rid of the LVS<->recdns dependency loop and simplifies recdns server downtime processes: https://wikitech.wikimedia.org/w/index.php?title=Service_restarts&type=revision&diff=1833705&oldid=1832671

Change 526163 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] anycast recdns: use for all cache_upload

https://gerrit.wikimedia.org/r/526163

Change 526164 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] anycast recdns: use for all cache_text

https://gerrit.wikimedia.org/r/526164

Change 526163 merged by BBlack:
[operations/puppet@production] anycast recdns: use for all cache_upload

https://gerrit.wikimedia.org/r/526163

Change 526169 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] anycast recdns: use for all hosts at edge sites

https://gerrit.wikimedia.org/r/526169

Change 526164 merged by BBlack:
[operations/puppet@production] anycast recdns: use for all cache_text

https://gerrit.wikimedia.org/r/526164

Change 526177 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] anycast recdns: use for all install-time DNS

https://gerrit.wikimedia.org/r/526177

Change 526178 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] anycast recdns: Add to calico filters

https://gerrit.wikimedia.org/r/526178

Mentioned in SAL (#wikimedia-operations) [2019-07-29T22:28:41Z] <XioNoX> roll out anycast DNS and syslog to all network devices - T228190

Change 526389 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/deployment-charts@master] Add anycast recdns to calico filters

https://gerrit.wikimedia.org/r/526389

Change 526389 merged by Alexandros Kosiaris:
[operations/deployment-charts@master] Add anycast recdns to calico filters

https://gerrit.wikimedia.org/r/526389

Change 526178 merged by Alexandros Kosiaris:
[operations/puppet@production] anycast recdns: Add to calico filters

https://gerrit.wikimedia.org/r/526178

Change 526465 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] recdns: refactor and rationalize resolv.conf

https://gerrit.wikimedia.org/r/526465

Change 526465 merged by BBlack:
[operations/puppet@production] recdns: refactor and rationalize resolv.conf

https://gerrit.wikimedia.org/r/526465

Change 526169 merged by BBlack:
[operations/puppet@production] anycast recdns: use for all hosts at edge sites

https://gerrit.wikimedia.org/r/526169

Change 526684 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] anycast recdns: edge sites via realm.pp (nop)

https://gerrit.wikimedia.org/r/526684

Change 526685 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] anycast recdns: set for canary api/appservers

https://gerrit.wikimedia.org/r/526685

Change 526684 merged by BBlack:
[operations/puppet@production] anycast recdns: edge sites via realm.pp (nop)

https://gerrit.wikimedia.org/r/526684

Change 526177 merged by BBlack:
[operations/puppet@production] anycast recdns: use for all install-time DNS

https://gerrit.wikimedia.org/r/526177

Script wmf-auto-reimage was launched by bblack on cumin1001.eqiad.wmnet for hosts:

['cp1008.wikimedia.org']

The log can be found in /var/log/wmf-auto-reimage/201907311739_bblack_92388.log.

Change 526685 merged by BBlack:
[operations/puppet@production] anycast recdns: set for canary api/appservers

https://gerrit.wikimedia.org/r/526685

Completed auto-reimage of hosts:

['cp1008.wikimedia.org']

Of which those FAILED:

['cp1008.wikimedia.org']

BBlack updated the task description. Wed, Jul 31, 6:32 PM

Change 526788 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] anycast recdns: enable for codfw clients

https://gerrit.wikimedia.org/r/526788

Change 528440 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] ATS: add profile::base::nameservers

https://gerrit.wikimedia.org/r/528440

Change 528440 merged by Ema:
[operations/puppet@production] ATS: add profile::base::nameservers

https://gerrit.wikimedia.org/r/528440

Change 528524 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] anycast recdns: config for many eqiad canaries

https://gerrit.wikimedia.org/r/528524

Change 528525 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] anycast recdns: enable globally

https://gerrit.wikimedia.org/r/528525

I'm not sure whether this belongs as a subtask here or under T167841 and/or T227808, but I'm recording it here so we don't forget. From an earlier IRC conversation:

As things stand, if eqiad or codfw were to lose both local recursors, and thus all local adverts of the anycast address, they would not differentiate between the possible remote fallback options (e.g. the opposite core site vs. much higher-latency alternatives like the edge sites), because we're letting the anycast advert propagate freely around our network from everywhere to everywhere. While core<->core is only a ~40ms fallback, reaching out to the far-flung edges for recursor fallback could push recdns latency for core-site services into the 200ms+ range, which would probably have a higher chance of negative impact. There are also simply many more services and much more code in the core sites (vs. the small stack at the edges), and there will probably always be some that use recdns in less-than-ideal ways and are therefore very sensitive to recdns latency.

What we discussed on IRC was changing the router filtering for the anycast adverts so that the core sites advertise the internal anycast space globally (acting as fallbacks for the opposite core site and all edges), while the edge sites do not advertise the anycast space back towards the core or other edge sites (so that if they lose their local recursors, they fall back to one of the core sites). We should probably block on this before doing any whole-site conversion of either eqiad or codfw to anycast recdns, just in case!
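
To make the difference between the current behaviour and the proposed filtering concrete, here is a toy model. It is a sketch under stated assumptions, not router configuration: the site names come from this task, the ~40 ms and 200 ms figures come from the comment above, and everything else (the function names, the 1 ms local latency, treating every core<->edge and edge<->edge path as 200 ms) is an illustrative placeholder.

```python
CORE = {"eqiad", "codfw"}
EDGE = {"esams", "ulsfo", "eqsin"}

# Illustrative latencies: ~40 ms core<->core and 200 ms for paths involving
# an edge site follow the discussion above; 1 ms local is a placeholder.
LOCAL_MS = 1
CORE_TO_CORE_MS = 40
VIA_EDGE_MS = 200


def latency_ms(src, dst):
    if src == dst:
        return LOCAL_MS
    if src in CORE and dst in CORE:
        return CORE_TO_CORE_MS
    return VIA_EDGE_MS


def possible_fallbacks(src, healthy, edges_export):
    """Sites whose anycast advert `src` could end up using.

    healthy: sites that still have working recursors (and thus an advert).
    edges_export: True models today's free propagation; False models the
    proposal, where edge sites do not advertise to other sites.
    """
    if src in healthy:
        return [src]  # a local advert is always preferred
    return sorted(
        site for site in healthy
        if site in CORE or edges_export
    )


def worst_case_latency_ms(src, healthy, edges_export):
    candidates = possible_fallbacks(src, healthy, edges_export)
    if not candidates:
        return None  # no recursor reachable at all
    return max(latency_ms(src, site) for site in candidates)


if __name__ == "__main__":
    # eqiad has lost both of its local recursors.
    healthy = (CORE | EDGE) - {"eqiad"}
    for export in (True, False):
        print(export,
              possible_fallbacks("eqiad", healthy, export),
              worst_case_latency_ms("eqiad", healthy, export))
```

In this model, with free propagation eqiad may land on any remaining site, so the worst case is the 200 ms edge path. With edge sites filtered, the only candidate is codfw at about 40 ms, and an edge site that loses its own recursors still sees both core sites.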