Decom LVS recdns
Closed, Resolved (Public)

Assigned To

Authored By

	BBlack
	Dec 6 2019, 2:00 PM

Description

With anycast recdns fully deployed for some time now, traffic to LVS recdns has dropped off substantially. Quick checks show only healthcheck monitoring and a few queries from hardware devices such as PDUs left to clean up. This task tracks finding and eliminating those remaining cases, then decommissioning these service IPs and the associated LVS configuration, etc.

Details

Subject	Repo	Branch	Lines +/-
Remove all legacy_vip entries	operations/puppet	production	+1 -23
lvs recdns: remove legacy IP definition, step 2	operations/puppet	production	+0 -15
lvs recdns: remove legacy IP definition, step 1	operations/puppet	production	+3 -10
lvs recnds: remove last remaining revdns comments	operations/dns	master	+0 -2
lvs recdns: get rid of legacy recursor hostnames	operations/dns	master	+0 -15
lvs recdns: clean up realserver def	operations/puppet	production	+0 -3
lvs recdns: eqiad and codfw keep old addr, for now	operations/puppet	production	+30 -5
lvs recdns: decom lvs-specific parts	operations/puppet	production	+1 -73
lvs recdns: switch DNS aliases to anycast	operations/dns	master	+17 -22

Related Objects

		Status	Subtype	Assigned	Task
		Resolved		BCornwall	T239993 Decom LVS recdns
		Resolved		BCornwall	T254178 Fix recdns config on various hardware devices

Event Timeline

BBlack created this task. Dec 6 2019, 2:00 PM

Restricted Application added a project: SRE. Dec 6 2019, 2:00 PM

Restricted Application added a subscriber: Aklapper.

BBlack mentioned this in T211131: DNS recursors TCP retransmits. Dec 6 2019, 2:01 PM

BBlack moved this task from Backlog to Some old column on the Traffic board.

Change 555520 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/dns@master] lvs recdns: switch DNS aliases to anycast

https://gerrit.wikimedia.org/r/555520

gerritbot added a project: Patch-For-Review. Dec 6 2019, 3:56 PM

Change 555537 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] lvs recdns decom

https://gerrit.wikimedia.org/r/555537

Change 555538 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] lvs recdns post-decom cleanup

https://gerrit.wikimedia.org/r/555538

Change 555539 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/dns@master] lvs recdns: get rid of legacy recursor hostnames

https://gerrit.wikimedia.org/r/555539

I just took a sample across all recdns hosts, a little over 15 minutes of sniffer time, looking for requests to the legacy LVS-based recdns IPs:

- ulsfo, eqsin, and esams had no traffic to them at all (yay! and makes basic sense)
- eqiad had a handful of requests from:
  - ps1-d7-eqiad.mgmt.eqiad.wmnet
  - ps1-d2-eqiad.mgmt.eqiad.wmnet
  - ps1-c1-eqiad.mgmt.eqiad.wmnet
- codfw had more-interesting traffic from:
  - ps1-a8-codfw.mgmt.codfw.wmnet
  - ps1-22-ulsfo.mgmt.ulsfo.wmnet
  - install2002.wikimedia.org
  - kraz.wikimedia.org

The PDUs I kind of expected. IIRC some of them can't be updated easily, and honestly they're not a huge problem. Will dig a bit more on those other cases!
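For anyone repeating this kind of sample, here is a minimal sketch of how the sniffer output could be summarised per client. It is not the tooling actually used in this task: it assumes the capture was printed as text with `tcpdump -nn`, and the function name `count_legacy_clients` is hypothetical. The two legacy IPs are the eqiad and codfw addresses named elsewhere in this task.

```python
import re
from collections import Counter

# Legacy LVS recdns service IPs (eqiad, codfw) as named in this task.
LEGACY_IPS = {"208.80.154.254", "208.80.153.254"}

# Matches the "IP src.port > dst.port:" part of a `tcpdump -nn` text line
# (assumption: IPv4 traffic printed with numeric addresses and ports).
LINE_RE = re.compile(
    r"\bIP (\d+\.\d+\.\d+\.\d+)\.(\d+) > (\d+\.\d+\.\d+\.\d+)\.(\d+):"
)


def count_legacy_clients(lines, legacy_ips=LEGACY_IPS):
    """Count DNS queries (destination port 53) sent to legacy IPs, per source IP."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if not match:
            continue  # not an IPv4 packet line; ignore it
        src, _sport, dst, dport = match.groups()
        if dst in legacy_ips and dport == "53":
            counts[src] += 1
    return counts
```

Feeding it the lines of a saved capture gives a per-source tally, which can then be reverse-resolved to find hostnames like the ones listed above.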

Change 555520 merged by BBlack:
[operations/dns@master] lvs recdns: switch DNS aliases to anycast

https://gerrit.wikimedia.org/r/555520

Dug into the odd cases from install2002 and kraz. The common pattern is that some daemons both (a) parse /etc/resolv.conf themselves, because they use their own custom DNS client code, and (b) never re-read that file when it changes. A few of those are daemons we actually use, and neither the daemon nor the host has been restarted since our resolv.conf was switched to the new recdns IP a few months ago (~Aug-Sept timeframe; it was rolled out at different times to different places).

In these particular cases, install2002 needed a squid3 daemon restart (done). On kraz it's ircd, an old version of ircd-ratbox used for mw_rc_irc stuff, which I haven't restarted because I'm not sure how fragile that stuff is.
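The stale-resolver pattern described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the helper names `nameservers` and `may_hold_stale_resolvers` are hypothetical, and the caller is assumed to supply the process start time and the file's modification time as epoch seconds (for example, the latter from `os.path.getmtime("/etc/resolv.conf")`).

```python
def nameservers(resolv_conf_text):
    """Return the nameserver addresses listed in resolv.conf-style text."""
    servers = []
    for raw in resolv_conf_text.splitlines():
        line = raw.strip()
        # Lines starting with '#' or ';' are comments in resolv.conf.
        if not line or line[0] in "#;":
            continue
        parts = line.split()
        if len(parts) >= 2 and parts[0] == "nameserver":
            servers.append(parts[1])
    return servers


def may_hold_stale_resolvers(process_start_epoch, resolv_conf_mtime_epoch):
    """True if the process started before resolv.conf was last modified.

    Such a process may still be using the old nameserver list if it parsed
    the file once at startup and never re-reads it.
    """
    return process_start_epoch < resolv_conf_mtime_epoch
```

A check like this only flags candidates. Whether a given daemon really caches the file contents still has to be confirmed by sniffing, as was done here.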

Next week I might do a much longer sniff (hours), and see if I can find any more such edge cases.

colewhite triaged this task as Medium priority. Dec 6 2019, 11:19 PM

Change 556177 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] lvs recdns: eqiad and codfw keep old addr, for now

https://gerrit.wikimedia.org/r/556177

Change 556178 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] lvs recdns: remove legacy IP definition, step 1

https://gerrit.wikimedia.org/r/556178

Change 556179 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] lvs recdns: remove legacy IP definition, step 2

https://gerrit.wikimedia.org/r/556179

Mentioned in SAL (#wikimedia-operations) [2019-12-10T16:23:05Z] <bblack> cr[12]-codfw: Adding static route for 208.80.153.254 (legacy lvs recdns IP) to dns2002.wikimedia.org - T239993

Mentioned in SAL (#wikimedia-operations) [2019-12-10T16:25:23Z] <bblack> cr[12]-eqiad: Adding static route for 208.80.154.254 (legacy lvs recdns IP) to dns1002.wikimedia.org - T239993

Mentioned in SAL (#wikimedia-operations) [2019-12-10T16:37:40Z] <bblack> lvs* + dns*: puppet disabled for lvs recdns decom work - T239993

Change 555537 merged by BBlack:
[operations/puppet@production] lvs recdns: decom lvs-specific parts

https://gerrit.wikimedia.org/r/555537

Change 556177 merged by BBlack:
[operations/puppet@production] lvs recdns: eqiad and codfw keep old addr, for now

https://gerrit.wikimedia.org/r/556177

Change 555538 merged by BBlack:
[operations/puppet@production] lvs recdns: clean up realserver def

https://gerrit.wikimedia.org/r/555538

Change 556230 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/dns@master] lvs recnds: remove last remaining revdns comments

https://gerrit.wikimedia.org/r/556230

Change 555539 merged by BBlack:
[operations/dns@master] lvs recdns: get rid of legacy recursor hostnames

https://gerrit.wikimedia.org/r/555539

Status: The actual LVS portion of this is now completely removed globally. The IP addresses themselves are also completely unconfigured and removed from service at all the edge sites, but not the core ones. What remains is that the legacy LVS recdns IPs 208.80.154.254 (eqiad) and 208.80.153.254 (codfw) are still statically configured to avoid breaking any of the leftover dependencies on these IPs. Sniffer monitoring has shown that at least the ircd instance on kraz is still using outdated resolv.conf data and hitting these IPs, several hardware PDUs are using them as well, and there are possibly other such cases which are rarer and thus harder to observe in short samples (I've done up to 1h samples).

The static (as in non-LVS) configuration of these is puppetized, and the eqiad and codfw core routers have explicit static routes sending 208.80.154.254 to dns1002 and 208.80.153.254 to dns2002 (the 01 boxes are also acceptable backup targets if necessary). The routes in the juniper configs are tagged with a comment referencing this ticket.

Once we're sure we're ready to destroy these last remnants of service (after the holidays! and investigating the remaining PDUs situation and kraz and taking longer sniffs), what remains to finish decomming these and close this ticket up is:

- Remove the manual routes referenced above from cr[12]-(eqiad|codfw)
- Merge and deploy https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/556178/ (carefully, one at a time on dns[12]00[12]; this removes the service IP listeners and IP address defs)
- Merge and deploy https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/556179/ (doesn't have to be as careful; it just cleans up the bits that remain after the above in the puppet sense)
- Merge and deploy the DNS patch https://gerrit.wikimedia.org/r/#/c/operations/dns/+/556230/ (removes the last comment lines noting that these IPs are still in use)

jcrespo moved this task from Backlog to Acknowledged on the SRE board. Dec 11 2019, 5:30 PM

Mentioned in SAL (#wikimedia-operations) [2020-05-20T16:53:05Z] <bblack> kraz.wikimedia.org ( https://wikitech.wikimedia.org/wiki/IRCD ) - stopping ircecho then ircd, then restarting them in reverse order - T239993

The kraz case is gone now (yay!) and hasn't recurred since the ircd restart above. What's left appears to be all infrastructure stuff: PDUs, switches, firewalls, etc. I've picked up quite a few of them in a few hours, so I'm going to let it run for a full 24h to try to capture them all, and then I'll make some sub-tasks to clean them up.

@BBlack FYI, the step:

Merge and deploy the DNS patch https://gerrit.wikimedia.org/r/#/c/operations/dns/+/556230/ (removes the last comment lines noting that these IPs are still in use)

has already been done by the migration to Netbox's autogenerated data. If the IPs with final octet 254 should be reserved, they can be marked as such in Netbox.

The swap of Traffic for Traffic-Icebox in this ticket's set of tags was based on a bulk action for all such tickets that haven't been updated in 6 months or more. This does not imply any human judgement about the validity or importance of the task, and is simply the first step in a larger task cleanup effort. Further manual triage and/or requests for updates will happen this month for all such tickets. For more detail, have a look at the extended explanation on the main page of Traffic-Icebox. Thank you!

@BBlack
Looking at router config I found:

/* Temporary for T239993 */
route 208.80.153.254/32 {
    next-hop 208.80.153.111;
    readvertise;
    no-resolve;
}

As 208.80.153.254 doesn't seem to exist anymore I guess it's safe to remove?

Change 556230 merged by BBlack:

[operations/dns@master] lvs recnds: remove last remaining revdns comments

https://gerrit.wikimedia.org/r/556230

In T239993#7467564, @ayounsi wrote:
@BBlack
Looking at router config I found:
/* Temporary for T239993 */
route 208.80.153.254/32 {
    next-hop 208.80.153.111;
    readvertise;
    no-resolve;
}
As 208.80.153.254 doesn't seem to exist anymore I guess it's safe to remove?

These IPs came up the other day in an IRC conversation. They do still exist in the dns[12]00[12] puppetized live config. They're both still intended to be functional, but they're lacking some accountability between the DNS repo and netbox. I went ahead and removed the last DNS repo reference to them, and I'm adding DNS names to them in netbox (and referencing this ticket), so that clears up the accounting issues a bit.

Before we can remove them from service (from router config, dns host config, and netbox), we have to eliminate the remaining device configs which still rely on these IPs. The last hosts (e.g. kraz) were fixed a while back, but there are still infra devices using these IPs for DNS resolution (PDUs, switches, SRXs, etc). I haven't sniffed for the remaining cases lately, so we should probably re-check and see how widespread the problem still is.

FTR - I did a quick 1-hour capture on these today just to see whether there was any sign of remaining cases, and there still are some. A 24+ hour capture would probably catch more of them, but the few that showed up in an hour today were: ps1-a5-codfw, ps1-d3-codfw, ps1-b2-codfw, ps1-22-ulsfo. At some point (no rush!) we should probably just audit the PDUs and other bits to see which still haven't been converted to using 10.3.0.1 for recdns.

ayounsi mentioned this in T295668: Update PDUs name-server config. Nov 15 2021, 9:35 AM

As there are a lot of PDUs, auditing them manually is quite time-consuming, so running a 24h capture and only updating the ones that show up seems more efficient.
I opened T295668 for DCops to update the ones you found.

BBlack moved this task from Backlog to Revive/Active? on the Traffic-Icebox board. Apr 7 2022, 9:35 PM

@BBlack/@KOfori it has been a year now; can we remove the "temporary" static routes from the routers? I'd like to keep our config lean and I worry this gets forgotten (I see it in Traffic-Icebox).

ayounsi added a subscriber: KOfori. Nov 30 2022, 8:01 AM

@ayounsi I've started a tcpdump on the dns hosts to see which devices are still reaching out. It's on our radar and I intend to address the remaining hosts (or poke dcops to do it for us if we cannot)!

BCornwall closed subtask T254178: Fix recdns config on various hardware devices as Resolved. Dec 21 2022, 3:50 PM

ayounsi claimed this task. Dec 22 2022, 7:44 AM

Assigning the task to myself to remove the routers' static routes after the break.

Mentioned in SAL (#wikimedia-operations) [2023-01-10T07:03:10Z] <XioNoX> remove static routes for legacy dns-rec-lb IPs - T239993

Static routes removed!

Next step is to remove the IPs from the servers:
That means removing everything related to "legacy_vip" in Puppet
https://github.com/wikimedia/puppet/blob/production/hieradata/role/codfw/dnsbox.yaml
https://github.com/wikimedia/puppet/blob/production/hieradata/role/eqiad/dnsbox.yaml
https://github.com/wikimedia/puppet/blob/production/modules/profile/manifests/dns/recursor.pp
Then manually deleting the IP from the host (I don't think interface::ip can take care of removing it)

Then delete them from Netbox (their DNS records have already been removed by setting them to Deprecated):
https://netbox.wikimedia.org/ipam/ip-addresses/4168/
https://netbox.wikimedia.org/ipam/ip-addresses/4171/
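
For the manual cleanup step above, a quick way to confirm whether a legacy IP is still configured on a host is to inspect the JSON output of `ip -j addr show`. The sketch below is only an illustration: it assumes an iproute2 version with JSON output, and the function name `interfaces_with_address` is hypothetical.

```python
import json


def interfaces_with_address(ip_json_text, address):
    """Return names of interfaces that still carry `address`.

    `ip_json_text` is expected to be the output of `ip -j addr show`:
    a JSON list of links, each with an "addr_info" list of address entries.
    """
    links = json.loads(ip_json_text)
    return [
        link.get("ifname")
        for link in links
        if any(info.get("local") == address for info in link.get("addr_info", []))
    ]
```

An empty list for both 208.80.154.254 and 208.80.153.254 on the dns hosts would indicate that the addresses are gone from the live interface config.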

Change 879107 had a related patch set uploaded (by BCornwall; author: BCornwall):

[operations/puppet@production] Remove all legacy_vip entries

https://gerrit.wikimedia.org/r/879107

Change 556178 abandoned by BBlack:

[operations/puppet@production] lvs recdns: remove legacy IP definition, step 1

Reason:

https://gerrit.wikimedia.org/r/556178

Change 556179 abandoned by BBlack:

[operations/puppet@production] lvs recdns: remove legacy IP definition, step 2

Reason:

https://gerrit.wikimedia.org/r/556179

Mentioned in SAL (#wikimedia-operations) [2023-01-11T18:52:20Z] <brett> Removing legacy vips from dns servers - T239993

Change 879107 merged by BCornwall: