Create HA setup for DNS recursion
Closed, Resolved (Public)

Description

Too many things depend on the DNS recursors and break when the primary DNS recursor is down or slow to respond. This crashes the entire site. We should keep fixing these problems client-side as well, but an HA setup for DNS recursion is definitely a good idea at this point, as it can prevent a lot of downtime.

Details

Reference
rt292

Event Timeline

rtimport raised the priority of this task to Medium. (Dec 18 2014, 12:46 AM)
rtimport added a project: ops-core.
rtimport set Reference to rt292.

Status changed from 'new' to 'open' by mark

This is now done for eqiad, where the two DNS recursors sit behind LVS. pmtpa,
esams and ulsfo need to follow.
--
Mark Bergsma <mark at wikimedia>
Lead Operations Architect
Wikimedia Foundation

The current situation: we have HA DNS recursors set up behind LVS in eqiad and codfw, a single non-HA recursor in esams, and no recursors in ulsfo. In resolv.conf, each site has a primary and a backup resolver address (each pointing at one of the HA clusters or at the single esams host). The resolver mapping is:

eqiad: eqiad, codfw
codfw: codfw, eqiad
ulsfo: eqiad, codfw
esams: esams, eqiad
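
For illustration only, here is a minimal sketch of how that mapping could be turned into per-site resolv.conf content. The function name and the addresses are hypothetical (the IPs come from the RFC 5737 documentation range, not from our real LVS service addresses), and this is not how the configuration is actually generated.

```python
# Site -> (primary, backup) recursor location, copied from the mapping above.
RESOLVER_MAP = {
    "eqiad": ("eqiad", "codfw"),
    "codfw": ("codfw", "eqiad"),
    "ulsfo": ("eqiad", "codfw"),
    "esams": ("esams", "eqiad"),
}

# Hypothetical recursor addresses (RFC 5737 documentation range),
# standing in for the real HA cluster / single-host addresses.
RECURSOR_ADDR = {
    "eqiad": "192.0.2.10",
    "codfw": "192.0.2.20",
    "esams": "192.0.2.30",
}


def render_resolv_conf(site: str) -> str:
    """Return resolv.conf text listing the primary resolver, then the backup."""
    primary, backup = RESOLVER_MAP[site]
    lines = [f"nameserver {RECURSOR_ADDR[loc]}" for loc in (primary, backup)]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # ulsfo has no local recursors, so it resolves via eqiad, then codfw.
    print(render_resolv_conf("ulsfo"), end="")
```

The resolver tries nameserver entries in the order listed, so the first entry acts as the primary and the second as the backup.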

We still need to set up a second DNS server plus HA/LVS in esams, and set up (from scratch) HA recursors in ulsfo (and switch its resolv.conf to "ulsfo, codfw"). The standard setup now is to run both recdns and NTP on these machines, so we'd probably expand our NTP setup alongside this as well, virtually for free.

The latter may need new hardware, though recursors are not that resource-intensive. Maybe we could ship over some unused/older eqiad hw?

faidon changed the visibility from "WMF-NDA (Project)" to "Public (No Login Required)".
faidon changed the edit policy from "WMF-NDA (Project)" to "All Users".
faidon set Security to None.

We have redundant, HA recursors at all sites now via LVS. The next stages of this effort are anycast-related and are covered in T186550.