With Anycast recdns fully deployed for some time now, traffic to the LVS recdns service has dropped off substantially. Quick checks show only healthcheck monitoring and a few queries from hardware devices such as PDUs left to clean up. This task is to track down and eliminate those remaining cases, then decommission these service IPs and the associated LVS configuration.
In a sample I just took across all recdns for a little over 15 minutes of sniffer time looking for requests to the legacy LVS-based recdns IPs:
- ulsfo, eqsin, and esams had no traffic to them at all (yay! and makes basic sense)
- eqiad had a handful of requests from:
- codfw had more-interesting traffic from:
The PDUs I kind of expected. IIRC some of them can't be updated easily, and honestly they're not a huge problem. Will dig a bit more on those other cases!
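For the curious, the kind of capture behind these samples looks roughly like the sketch below. The interface name, packet count, and the choice to only `echo` the command (rather than run it, which needs root on the recdns hosts) are all assumptions for illustration; the two addresses are the legacy eqiad/codfw recdns IPs mentioned later in this ticket.

```shell
#!/bin/sh
# Sketch: watch for DNS queries still arriving at the legacy LVS recdns IPs.
# eth0 and the -c 1000 packet count are placeholders; adjust per host.
FILTER='dst port 53 and (dst host 22.214.171.124 or dst host 126.96.36.199)'
# Echoed here so the sketch is inert; drop the echo to capture for real.
echo tcpdump -ni eth0 -c 1000 "$FILTER"
```

Any source address that shows up in such a capture is a host (or PDU, or daemon) still pointed at the old IPs.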
Dug into the odd cases from install2002 and kraz. The common pattern here is that there are some daemons in the world which both (a) parse /etc/resolv.conf for themselves, because they use their own custom DNS client code, and (b) never re-read that file if it changes. A few of those are daemons we actually use, and they (or their hosts) happen not to have been restarted since our resolv.conf was switched to the new recdns IP a few months ago (~Aug-Sept timeframe; it was rolled out at different times to different places).
In these particular cases, install2002 needed a squid3 daemon restart (done). The kraz case is ircd, an old version of ircd-ratbox used for the mw_rc_irc stuff, which I haven't restarted because I'm not sure how fragile that stuff is.
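The check that flags these cases can be boiled down to one comparison: a daemon that started *before* resolv.conf was last modified may be holding stale resolver IPs. A minimal sketch, with example epoch timestamps rather than real data (in practice you'd feed in `stat -c %Y /etc/resolv.conf` and the daemon's start time from `ps`):

```shell
#!/bin/sh
# Sketch: does this daemon predate the last resolv.conf change?
# Args are epoch seconds; the literal values below are made-up examples.
needs_restart() {
    daemon_start=$1   # when the daemon started
    resolv_mtime=$2   # when /etc/resolv.conf last changed
    if [ "$daemon_start" -lt "$resolv_mtime" ]; then
        echo yes      # started before the switch: may hold old recdns IPs
    else
        echo no       # started after: picked up the new resolv.conf
    fi
}

needs_restart 1565000000 1567300000   # daemon predates the switch
needs_restart 1575000000 1567300000   # daemon restarted since
```

This only flags *candidates*, of course; daemons that use the libc resolver (or re-read resolv.conf on their own) are fine regardless of start time.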
Next week I might do a much longer sniff (hours), and see if I can find any more such edge cases.
Status: The actual LVS portion of this is now completely removed globally. The IP addresses themselves are also completely unconfigured and removed from service at all the edge sites, but not the core ones. What remains is that the legacy LVS recdns IPs 22.214.171.124 (eqiad) and 126.96.36.199 (codfw) are still statically-configured to avoid breaking any of the leftover dependencies on these IPs. Sniffer monitoring has shown that at least the ircd instance on kraz is still using outdated resolv.conf data and hitting these IPs, that several hardware PDUs are using them as well, and that there are possibly other such cases which are rarer and thus harder to observe in short samples (I've done up to 1h samples).
The static (as in non-LVS) configuration of these is puppetized, and the eqiad and codfw core routers have explicit static routes sending 188.8.131.52 to dns1002 and 184.108.40.206 to dns2002 (the 01 boxes are also acceptable backup targets if necessary). The routes in the Juniper configs are tagged with a comment referencing this ticket.
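For reference, the router side looks roughly like the following set-style Junos fragment. This is a sketch, not the actual config: the next-hop is a placeholder for dns1002's real address, and the annotate text stands in for the actual ticket-referencing comment.

```
# Hypothetical sketch of the cr-eqiad side; cr-codfw is analogous
# with 184.108.40.206 -> dns2002.
set routing-options static route 188.8.131.52/32 next-hop <dns1002-addr>
annotate routing-options "legacy recdns decom - see this ticket"
```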
Once we're sure we're ready to destroy these last remnants of the service (after the holidays! and after investigating the remaining PDU situation and kraz, and taking longer sniffs), what remains to finish decommissioning these and close this ticket up is:
- Remove the manual routes referenced above from cr-(eqiad|codfw)
- Merge and deploy https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/556178/ (carefully, one at a time on the dns00 hosts; removes the service IP listeners and IP address defs)
- Merge and deploy https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/556179/ (doesn't have to be as careful; just cleans up the bits that remain after the above, in the puppet sense)
- Merge and deploy the DNS patch https://gerrit.wikimedia.org/r/#/c/operations/dns/+/556230/ (removes the last comment lines noting that these IPs are still in use)
The kraz case is gone now (yay!) and hasn't recurred since the ircd restart above. What's left appears to be all infrastructure: PDUs, switches, firewalls, etc. I've picked up quite a few of them in a few hours of sniffing, so I'm going to let it run for a full 24h to try to capture them all, and then I'll file some sub-tasks to clean them up.