
Automation to change a server's vlan
Open, Medium, Public

Description

Forking the Renumbering section of T327938: Codfw row A/B top-of-rack switch refresh to its own task:

To allow for renumbering, some development is needed to support a "--renumber" toggle for the reimage cookbook, which should delete the host's existing IP allocation and add a new one.

Renumbering presents additional challenges for the services running on the hosts if they come back online with different IPs. A few things we need to consider (there are likely more):

  • DNS needs to be updated, and old entries can linger in DNS caches
    • Is it possible to change the DNS TTLs in advance to help us here?
  • We may have hardcoded IPs in puppet for certain things. Possibly the renumbering script could perform a git grep of the IP in multiple repositories to look for these (like the decommissioning cookbook):
    • Puppet
    • Puppet private
    • Mediawiki-config
    • Deployment charts
    • homer-public
  • DNS records resolved at catalog compile time by the Puppet master, and those resolved at reload time by ferm (or any other service), will need to be refreshed, either by forcing a Puppet run, a ferm reload, or a specific service reload/restart.
  • Databases:
    • DB grants are issued per-IP
    • mediawiki connects to the DB via IP
    • dbctl has the IPs of the servers and gives it to the mediawiki config stored in etcd
  • Backend servers behind LVS: TBD
  • Ganeti servers: depends on the whole Ganeti discussion
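For the hardcoded-IP point above, here is a minimal sketch of what the search could look like. It is not the cookbook's code: the task suggests `git grep`, while this walks plain checkouts in Python so it stays self-contained, and the names `ip_pattern` and `find_hardcoded_ip` are made up for illustration. It only handles IPv4; IPv6 would need normalizing first, since one address can be written several ways. The main point is the boundary check, so that 10.192.0.1 does not also match 10.192.0.10.

```python
import re
from pathlib import Path


def ip_pattern(ip: str) -> re.Pattern:
    """Regex matching `ip` only as a whole IPv4 address (hypothetical helper).

    The lookarounds reject a leading digit or dot and a trailing digit
    or ".<digit>", so 10.0.0.1 is not found inside 10.0.0.10 or 110.0.0.1,
    but "10.0.0.1:443" and "10.0.0.1." (end of sentence) still match.
    """
    return re.compile(r"(?<![\d.])" + re.escape(ip) + r"(?!\d|\.\d)")


def find_hardcoded_ip(ip: str, repo_dirs: list[Path]) -> list[tuple[Path, int, str]]:
    """Return (file, line number, line) for every occurrence of `ip`."""
    pattern = ip_pattern(ip)
    hits = []
    for repo in repo_dirs:
        for path in sorted(repo.rglob("*")):
            if not path.is_file() or ".git" in path.parts:
                continue
            try:
                text = path.read_text(encoding="utf-8")
            except (UnicodeDecodeError, OSError):
                continue  # skip binary or unreadable files
            for lineno, line in enumerate(text.splitlines(), start=1):
                if pattern.search(line):
                    hits.append((path, lineno, line.strip()))
    return hits
```

The repositories to pass in would be local checkouts of the ones listed above (Puppet, Puppet private, Mediawiki-config, Deployment charts, homer-public).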

Event Timeline


There will be special use cases, but if we can tackle all the regular servers (e.g. 1 uplink, 1 IP, 1 ...), then we will be in a great spot.

The ideal/cleanest way is to go through a re-image, but that might not be easy/doable for all the hosts, so we should look at an "in-place" renumbering as well.

The automation should then all go through a single cookbook for ease of use, especially as any service owner should be able to run it.

Something like sre.hosts.renumber <hostname>.

First with a --check or --preflight parameter that prints out a "report" on whether the host can be easily renumbered:

  • Does it have multiple IPs?
  • Is it in a vlan that requires/supports a re-numbering?
  • Does it have hardcoded IPs (v4 or v6) in any repositories?
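A rough sketch of what such a report could look like, assuming the three answers have already been gathered elsewhere. `PreflightReport` and its fields are hypothetical names, not an existing cookbook class. Hardcoded IPs are treated as a warning rather than a blocker, since the proposal is to pause and let the user fix them rather than refuse to run.

```python
from dataclasses import dataclass, field


@dataclass
class PreflightReport:
    """Answers to the preflight questions for one host (hypothetical shape)."""

    hostname: str
    ip_count: int
    vlan_needs_renumbering: bool
    hardcoded_hits: list[str] = field(default_factory=list)

    @property
    def blockers(self) -> list[str]:
        """Conditions under which the host is not a simple renumbering case."""
        found = []
        if self.ip_count != 1:
            found.append(f"host has {self.ip_count} IPs, expected exactly 1")
        if not self.vlan_needs_renumbering:
            found.append("current vlan does not require/support renumbering")
        return found

    @property
    def warnings(self) -> list[str]:
        """Hardcoded IPs the user has to update by hand before continuing."""
        return [f"hardcoded IP found in {hit}" for hit in self.hardcoded_hits]

    @property
    def can_renumber(self) -> bool:
        return not self.blockers

    def render(self) -> str:
        status = "OK" if self.can_renumber else "BLOCKED"
        lines = [f"{self.hostname}: {status}"]
        lines += [f"  blocker: {b}" for b in self.blockers]
        lines += [f"  warning: {w}" for w in self.warnings]
        return "\n".join(lines)
```

Calling `print(PreflightReport("example1001.codfw.wmnet", 1, True).render())` would print a one-line summary for a host with no blockers; the hostname here is a placeholder.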

Then either with a --reimage or --inplace parameter.

That will first run checks:

  • Does it have multiple IPs?
  • Is it in a vlan that requires/supports a re-numbering?

Then start working on the renumbering:

  • If possible depool the server, or ask if it has been depooled to continue
  • Downtime in monitoring
  • Update the IP and vlan in Netbox, display the new IP to the user for the next step (optionally keep the old IP marked as reserved for easier rollback)
  • Pause to let the user update the hardcoded IPs the cookbook was able to find (or force a yes to continue). Once done, the user types yes to continue; otherwise the cookbook reverts the IP/vlan allocation.

If --reimage:

  • Run the sre.network.configure-switch-interfaces cookbook
  • Run the re-image cookbook

If --inplace:

  • Replace the IPs in /etc/network/interfaces, reload the networking service
  • Run the sre.network.configure-switch-interfaces cookbook
  • Run Puppet to update PuppetDB, run the ImportPuppetDB Netbox script.

I'm most likely missing some steps, but that should be a good base to discuss the cookbook's implementation.

Another approach, which is more complex and which I'm not sure is that useful:
Add a --reserve parameter to the cookbook.
When used, the cookbook will only reserve a v4 and v6 IP, so service owners can pre-populate any ACL with the future IP where possible.

Then during the renumbering run, the cookbook can check if there is a matching reserved IP and use it.
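A minimal sketch of the matching step, assuming reservations are available as a simple hostname-to-addresses mapping. How they would actually be stored in Netbox is not decided here, and `pick_reserved` is a hypothetical name. The one check worth having is that a reserved address is only reused if it sits inside the prefix of the vlan the host is moving to.

```python
import ipaddress
from typing import Optional, Union

IPAddress = Union[ipaddress.IPv4Address, ipaddress.IPv6Address]


def pick_reserved(
    hostname: str, target_prefix: str, reservations: dict[str, list[str]]
) -> Optional[IPAddress]:
    """Return the IP reserved for `hostname` inside `target_prefix`, if any.

    `reservations` is an assumed in-memory mapping of hostname to reserved
    addresses (v4 and/or v6). Returning None means the cookbook should fall
    back to allocating a fresh IP.
    """
    network = ipaddress.ip_network(target_prefix)
    for address in reservations.get(hostname, []):
        ip = ipaddress.ip_address(address)
        # Only compare addresses of the same family as the target prefix.
        if ip.version == network.version and ip in network:
            return ip
    return None
```

The function would be called once per address family, for example with the new vlan's v4 prefix and then with its v6 prefix.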

Some random additions:

  • I would probably add a grep for the IP on at least /etc on the host too, to check if it's hardcoded somewhere else in addition to /etc/network/interfaces.
  • Add a call to the wipe-cache cookbook to clean up the direct and reverse DNS records in all recursors

I'm not sure how safe the --inplace procedure would be without even a reboot, given that other services may need to detect the new IP and might not (not respecting DNS TTLs, not refreshing their data, etc.). Things like pybal or cassandra, etc.

Change 979040 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/software/spicerack@master] Netbox module: add get/set for primary IPs and access vlan

https://gerrit.wikimedia.org/r/979040

Change 979121 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/software/spicerack@master] Netbox: add generic function to execute a Netbox script

https://gerrit.wikimedia.org/r/979121

Change 981349 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/cookbooks@master] Move git search related classes to __init__

https://gerrit.wikimedia.org/r/981349

Change 981472 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/cookbooks@master] [WIP] Cookbook to renumber a host while moving its vlan

https://gerrit.wikimedia.org/r/981472

  • I would probably add a grep for the IP on at least /etc on the host too, to check if it's hardcoded somewhere else in addition to /etc/network/interfaces.

I had a look at multiple servers, and when there is a hardcoded value it's set by Puppet, so running Puppet once or twice after the re-imaging should be enough. Otherwise we will have many false positives.

  • Add a call to the wipe-cache cookbook to clean up the direct and reverse DNS records in all recursors

Done

I'm not sure how safe the --inplace procedure would be without even a reboot, given that other services may need to detect the new IP and might not (not respecting DNS TTLs, not refreshing their data, etc.). Things like pybal or cassandra, etc.

Yeah, the in-place procedure is more complex than the reimage, and should only be used when a re-image is not possible.

More checks or direct updates should be added to cover all the edge cases, but if we can already automate most of it, that would be a great step forward.

Change 981349 merged by jenkins-bot:

[operations/cookbooks@master] Move git search related classes to __init__

https://gerrit.wikimedia.org/r/981349

Change 979121 merged by jenkins-bot:

[operations/software/spicerack@master] Netbox: add generic function to execute a Netbox script

https://gerrit.wikimedia.org/r/979121

Change 979040 merged by jenkins-bot:

[operations/software/spicerack@master] Netbox module: add get/set for primary IPs and access vlan

https://gerrit.wikimedia.org/r/979040

Mentioned in SAL (#wikimedia-operations) [2024-02-29T10:24:43Z] <claime> Cordoning kubernetes2023.codfw.wmnet for vlan change cookbook tests - T350152

@ayounsi I've drained kubernetes2023.codfw.wmnet for you to test the cookbook

Change 1007652 had a related patch set uploaded (by Volans; author: Volans):

[operations/cookbooks@master] sre.hosts.reimage: add support for VLAN move

https://gerrit.wikimedia.org/r/1007652

Mentioned in SAL (#wikimedia-operations) [2024-05-22T14:33:24Z] <jayme> drained, cordoned and pooled=inactive kubernetes2023 and kubernetes2032 for cookbook testing - T350152 T365571

Change #981472 merged by jenkins-bot:

[operations/cookbooks@master] sre.hosts.move-vlan: add new cookbook

https://gerrit.wikimedia.org/r/981472

Change #1007652 merged by jenkins-bot:

[operations/cookbooks@master] sre.hosts.reimage: add support for VLAN move

https://gerrit.wikimedia.org/r/1007652

Change #1036642 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/cookbooks@master] move-vlan: remove unused variable definition

https://gerrit.wikimedia.org/r/1036642

Change #1036642 merged by jenkins-bot:

[operations/cookbooks@master] move-vlan: remove unused variable definition

https://gerrit.wikimedia.org/r/1036642