Manage DHCP from Netbox
Closed, ResolvedPublic

Description

Provisioning a new server currently requires a Puppet commit that statically sets the primary NIC's MAC/hostname/IP mapping in the DHCP config.
This takes time and is error-prone. A better option would be to generate the mapping from Netbox data.
The only data not present in Netbox is option pxelinux.pathprefix, which is used when a host needs a Debian version other than the default one.

Physical hosts

The prerequisite is to configure DHCP option 82 (see T221388, and an example of dhcp config). This feature adds the switch interface and interface description to the DHCP request. At this point the switch port is already configured based on Netbox and enabled with Homer.
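
As a rough illustration of what option 82 makes possible, the sketch below renders an ISC dhcpd host block keyed on the relay agent circuit ID rather than on a MAC address. The function name, the "switch:interface" circuit ID format, and all sample values are assumptions made for illustration; this is not the production template.

```python
def option82_host_block(hostname, switch, interface, ipv4):
    """Render an ISC dhcpd host block matched on the option 82 circuit ID."""
    # Assumption: the relay encodes the circuit ID as "<switch>:<interface>".
    circuit_id = f"{switch}:{interface}"
    return (
        f"host {hostname} {{\n"
        f'    host-identifier option agent.circuit-id "{circuit_id}";\n'
        f"    fixed-address {ipv4};\n"
        f"}}\n"
    )


# Hypothetical sample values (documentation IP range, made-up names).
print(option82_host_block("example1001", "asw-example", "ge-0/0/1.0", "192.0.2.10"))
```

With this kind of match the server is identified by where it is plugged in, so the MAC address does not need to be known in advance.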

Then there are several options to configure DHCP. Feel free to add other options or expand on these.
(From a previous discussion with Riccardo)

Cron to generate the config on the install servers

The prerequisite is to add a custom field to the devices, to specify whether they need a non-default option pxelinux.pathprefix.
Then have a script that runs regularly, fetches all servers with a planned status and a connected cable, and generates the matching DHCP configuration.
Downsides are:

  • need for an extra Netbox field
  • regular querying of the (slow) Netbox API
  • have to wait for the cron to run before booting the host

Upsides:

  • Probably the easiest to set up
  • works for multiple hosts in parallel
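
The generation step of this option could look roughly like the sketch below. It works on plain dictionaries instead of calling the Netbox API, so the record shape (`status`, `cable_connected`, `switch`, `interface`, `ipv4`, `pathprefix`) and the sample values are hypothetical stand-ins for whatever the real query would return.

```python
def select_hosts(devices):
    """Keep only servers that are planned and already cabled."""
    return [d for d in devices if d["status"] == "planned" and d["cable_connected"]]


def render_config(devices):
    """Render one dhcpd host block per selected device, sorted by name."""
    blocks = []
    for d in sorted(select_hosts(devices), key=lambda d: d["name"]):
        lines = [
            f"host {d['name']} {{",
            f'    host-identifier option agent.circuit-id "{d["switch"]}:{d["interface"]}";',
            f"    fixed-address {d['ipv4']};",
        ]
        # The custom field is only emitted when a non-default installer is needed.
        if d.get("pathprefix"):
            lines.append(f'    option pxelinux.pathprefix "{d["pathprefix"]}";')
        lines.append("}")
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks) + "\n"


# Hypothetical records; a real script would build these from Netbox data.
devices = [
    {"name": "example1002", "status": "planned", "cable_connected": True,
     "switch": "asw-example", "interface": "ge-0/0/2.0", "ipv4": "192.0.2.12",
     "pathprefix": "http://example.org/tftpboot/alt-installer/"},
    {"name": "example1003", "status": "active", "cable_connected": True,
     "switch": "asw-example", "interface": "ge-0/0/3.0", "ipv4": "192.0.2.13"},
]
print(render_config(devices))
```

Sorting by name keeps the output stable between runs, which makes it cheap to detect whether the generated file actually changed.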

Generate the config on Netbox hosts and cron to pull it

Similar to the above, but works around the Netbox API limitation by having the config pushed to a git repo (or made fetchable via HTTPS) on the Netbox hosts.

Cookbook

Pass the hostname(s) and (if not default) option pxelinux.pathprefix as command line arguments.
It will then generate the relevant DHCP config and push it to the relevant install server. Maybe even run the DHCP and display its logs?
Then maybe pause until the operator continues the script, and clean up the config once done.
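
A minimal sketch of that write, pause, and clean-up flow is shown below, assuming one snippet file per host. The function names and the `.conf` suffix are hypothetical, and a local directory stands in for the install server. Opening the file in exclusive mode means a second operator working on the same host gets an error instead of silently overwriting the first snippet.

```python
import contextlib
from pathlib import Path


@contextlib.contextmanager
def temporary_dhcp_snippet(directory, hostname, config):
    """Write a per-host snippet, yield its path, and always remove it afterwards."""
    path = Path(directory) / f"{hostname}.conf"
    # Mode "x" raises FileExistsError if a snippet for this host already exists.
    with path.open("x") as handle:
        handle.write(config)
    try:
        yield path
    finally:
        path.unlink()


def provision(directory, hostname, config, wait=input):
    """Keep the snippet in place until the operator confirms, then clean up."""
    with temporary_dhcp_snippet(directory, hostname, config) as path:
        wait(f"Snippet written to {path}; boot the host, then press Enter to clean up: ")
```

Because each host gets its own file, provisioning different hosts in parallel does not touch shared state.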

Upsides:

  • Less hard on the Netbox API
  • No need for an extra Netbox field
  • More control over what's generated
  • Can do other checks on the way (maybe)

Downsides:

  • More complex to set up?
  • Yet another cookbook to run?
  • Might cause race conditions if several people provision several hosts in parallel?

DHCP Hooks

Prerequisite: upgrade from ISC-DHCP to ISC-KEA, as I can't find a similar feature for ISC-DHCP.

Kea has hooks that could potentially query Netbox in real time. The risk is that Netbox takes too long to respond and the host's DHCP request times out.
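
The timeout risk can be modelled as a lookup with a hard deadline. The Python below is not Kea hook code (Kea hooks are native libraries); it only sketches the timing logic, and `lookup_with_deadline` and its parameters are hypothetical names.

```python
import concurrent.futures
import time

# Shared worker pool so a slow lookup never blocks the caller past its deadline.
_POOL = concurrent.futures.ThreadPoolExecutor(max_workers=4)


def lookup_with_deadline(lookup, key, timeout_s):
    """Return lookup(key), or None if it does not finish within timeout_s seconds."""
    future = _POOL.submit(lookup, key)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        # The caller must decide what to do: ignore the request, or serve a cached answer.
        return None


def slow_backend(key):
    """Stand-in for a slow inventory query."""
    time.sleep(0.3)
    return f"config for {key}"


print(lookup_with_deadline(slow_backend, "example1001", timeout_s=0.05))
```

Whatever the deadline is, it has to be shorter than the DHCP client's own retry timeout, otherwise the reply arrives after the host has given up.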

VMs

As we can sync MAC addresses from Ganeti to Netbox, the easiest approach might be to generate a config similar to the one we have now (so no need for option 82).
And the sre.ganeti.makevm cookbook could take care of updating the config using whichever choice is made above (either force the cron/fetch/etc. or generate/push the config directly).
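
For VMs the generated entry would be keyed on the MAC address, as in the sketch below. The function name and sample values are hypothetical; the MAC is validated and lower-cased before rendering so that a malformed value synced from elsewhere fails early instead of producing a broken config.

```python
import re

# Six colon-separated pairs of hex digits, e.g. aa:00:00:12:34:56.
MAC_RE = re.compile(r"^([0-9a-f]{2}:){5}[0-9a-f]{2}$")


def mac_host_block(hostname, mac, ipv4):
    """Render an ISC dhcpd host block matched on the MAC address."""
    mac = mac.lower()
    if not MAC_RE.match(mac):
        raise ValueError(f"invalid MAC address for {hostname}: {mac!r}")
    return (
        f"host {hostname} {{\n"
        f"    hardware ethernet {mac};\n"
        f"    fixed-address {ipv4};\n"
        f"}}\n"
    )


# Hypothetical sample values.
print(mac_host_block("examplevm1001", "AA:00:00:12:34:56", "192.0.2.20"))
```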

Event Timeline

Change 662641 had a related patch set uploaded (by Ayounsi; owner: Ayounsi):
[operations/homer/public@master] Add option-82 to prod vlans

https://gerrit.wikimedia.org/r/662641

Change 662641 merged by jenkins-bot:
[operations/homer/public@master] Add option-82 to prod vlans

https://gerrit.wikimedia.org/r/662641

Mentioned in SAL (#wikimedia-operations) [2021-02-08T16:30:10Z] <XioNoX> adding option-82 to all prod vlans DHCP - T269855

Bonus thought: should it distribute v6 IPs as well? That way there would be fewer hacks during host bootstrapping. And it paves the way to a v6-only future.

I think this is worth exploring, as it gets rid of the interface tokenisation hack we do in d-i, and I think it would also allow us to drop the add_ipv6_mapped define altogether. I would be interested in @BBlack's thoughts.

Change 675932 had a related patch set uploaded (by CRusnov; author: CRusnov):

[operations/software/spicerack@master] dhcp: Add module for manipulating dynamic DHCP entries

https://gerrit.wikimedia.org/r/675932

Change 675932 merged by CRusnov:

[operations/software/spicerack@master] dhcp: Add module for manipulating dynamic DHCP entries

https://gerrit.wikimedia.org/r/675932

joanna_borun changed the task status from Open to In Progress. (Sep 21 2021, 4:04 PM)
Volans triaged this task as Medium priority.

Change 727387 had a related patch set uploaded (by Volans; author: Volans):

[operations/puppet@production] dhcp: remove all physical hosts hardcoded config

https://gerrit.wikimedia.org/r/727387

Change 727411 had a related patch set uploaded (by Volans; author: Volans):

[operations/cookbooks@master] sre.experimental.reimage: remove legacy code

https://gerrit.wikimedia.org/r/727411

Change 727412 had a related patch set uploaded (by Volans; author: Volans):

[operations/cookbooks@master] sre.hosts.reimage: renamed from experimental

https://gerrit.wikimedia.org/r/727412

Change 727415 had a related patch set uploaded (by Volans; author: Volans):

[operations/puppet@production] cumin: remove wmf-auto-reimage scripts

https://gerrit.wikimedia.org/r/727415

Mentioned in SAL (#wikimedia-operations) [2021-10-11T07:58:17Z] <volans> migrating physical hosts DHCP to the new reimage process - T269855

Change 727387 merged by Volans:

[operations/puppet@production] dhcp: remove all physical hosts hardcoded config

https://gerrit.wikimedia.org/r/727387

Change 727415 merged by Volans:

[operations/puppet@production] cumin: remove wmf-auto-reimage scripts

https://gerrit.wikimedia.org/r/727415

The following files have been manually removed from the cumin hosts (cumin[2001-2002].codfw.wmnet,cumin1001.eqiad.wmnet):

  • /usr/local/sbin/wmf-auto-reimage-host
  • /usr/local/sbin/wmf-auto-reimage
  • /usr/local/lib/python3.9/dist-packages/wmf_auto_reimage_lib.py or /usr/local/lib/python3.7/dist-packages/wmf_auto_reimage_lib.py

Change 727411 merged by jenkins-bot:

[operations/cookbooks@master] sre.experimental.reimage: remove legacy code

https://gerrit.wikimedia.org/r/727411

Change 727412 merged by jenkins-bot:

[operations/cookbooks@master] sre.hosts.reimage: renamed from experimental

https://gerrit.wikimedia.org/r/727412

Removed the old directory for the renamed cookbook:

$ sudo cumin 'A:cumin' 'rm -rfv /srv/deployment/spicerack/cookbooks/sre/experimental'
3 hosts will be targeted:
cumin[2001-2002].codfw.wmnet,cumin1001.eqiad.wmnet
Ok to proceed on 3 hosts? Enter the number of affected hosts to confirm or "q" to quit 3
===== NODE GROUP =====
(1) cumin2002.codfw.wmnet
----- OUTPUT of 'rm -rfv /srv/dep...sre/experimental' -----
removed '/srv/deployment/spicerack/cookbooks/sre/experimental/__pycache__/reimage.cpython-39.pyc'
removed '/srv/deployment/spicerack/cookbooks/sre/experimental/__pycache__/__init__.cpython-39.pyc'
removed directory '/srv/deployment/spicerack/cookbooks/sre/experimental/__pycache__'
removed directory '/srv/deployment/spicerack/cookbooks/sre/experimental'
===== NODE GROUP =====
(1) cumin1001.eqiad.wmnet
----- OUTPUT of 'rm -rfv /srv/dep...sre/experimental' -----
removed '/srv/deployment/spicerack/cookbooks/sre/experimental/__pycache__/__init__.cpython-37.pyc'
removed '/srv/deployment/spicerack/cookbooks/sre/experimental/__pycache__/reimage.cpython-37.pyc'
removed directory '/srv/deployment/spicerack/cookbooks/sre/experimental/__pycache__'
removed directory '/srv/deployment/spicerack/cookbooks/sre/experimental'
================
PASS |█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100% (3/3) [00:00<00:00,  3.09hosts/s]
FAIL |                                                                                                                                 |   0% (0/3) [00:00<?, ?hosts/s]
100.0% (3/3) success ratio (>= 100.0% threshold) for command: 'rm -rfv /srv/dep...sre/experimental'.
100.0% (3/3) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.

Cookbook cookbooks.sre.hosts.reimage was started by volans@cumin1001 for host sretest1001.eqiad.wmnet

Cookbook cookbooks.sre.hosts.reimage was started by volans@cumin2002 for host sretest1002.eqiad.wmnet

Cookbook cookbooks.sre.hosts.reimage started by volans@cumin1001 for host sretest1001.eqiad.wmnet completed:

  • sretest1001 (PASS)
    • Downtimed on Icinga
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202110110824_volans_27007_sretest1001.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by volans@cumin2002 for host sretest1002.eqiad.wmnet completed:

  • sretest1002 (PASS)
    • Downtimed on Icinga
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202110110840_volans_2465396_sretest1002.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Change 730030 had a related patch set uploaded (by Volans; author: Volans):

[operations/software/spicerack@master] dhcp: add support for MAC address based config

https://gerrit.wikimedia.org/r/730030

As the /etc/dhcp/ directory managed by Puppet doesn't have purge=true, I've also removed the old file with the hardcoded MAC addresses.

$ sudo cumin 'A:installserver-light' 'rm -fv /etc/dhcp/linux-host-entries.ttyS1-115200'
5 hosts will be targeted:
install[1003,2003,3001,4001,5001].wikimedia.org
Ok to proceed on 5 hosts? Enter the number of affected hosts to confirm or "q" to quit 5
===== NODE GROUP =====
(5) install[1003,2003,3001,4001,5001].wikimedia.org
----- OUTPUT of 'rm -fv /etc/dhcp...ies.ttyS1-115200' -----
removed '/etc/dhcp/linux-host-entries.ttyS1-115200'
================
PASS |███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100% (5/5) [00:02<00:00,  1.94hosts/s]
FAIL |                                                                                                                                           |   0% (0/5) [00:02<?, ?hosts/s]
100.0% (5/5) success ratio (>= 100.0% threshold) for command: 'rm -fv /etc/dhcp...ies.ttyS1-115200'.
100.0% (5/5) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.

cookbooks.sre.hosts.decommission executed by volans@cumin2002 for hosts: testvm2009.codfw.wmnet

  • testvm2009.codfw.wmnet (WARN)
    • Host not found on Icinga, unable to downtime it
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster ganeti01.svc.codfw.wmnet to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster ganeti01.svc.codfw.wmnet to Netbox

cookbooks.sre.hosts.decommission executed by volans@cumin2002 for hosts: testvm2009.codfw.wmnet

  • testvm2009.codfw.wmnet (FAIL)
    • Host steps raised exception:

ERROR: some step on some host failed, check the bolded items above

cookbooks.sre.hosts.decommission executed by volans@cumin2002 for hosts: testvm2009.codfw.wmnet

  • testvm2009.codfw.wmnet (WARN)
    • Host not found on Icinga, unable to downtime it
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster ganeti01.svc.codfw.wmnet to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster ganeti01.svc.codfw.wmnet to Netbox

Change 730416 had a related patch set uploaded (by Volans; author: Volans):

[operations/puppet@production] install_server: uniform DHCP snippet automation

https://gerrit.wikimedia.org/r/730416

Change 730030 merged by jenkins-bot:

[operations/software/spicerack@master] dhcp: add support for MAC address based config

https://gerrit.wikimedia.org/r/730030

Change 730416 merged by Volans:

[operations/puppet@production] install_server: uniform DHCP snippet automation

https://gerrit.wikimedia.org/r/730416

The automation of the DHCP for physical hosts has been completed and there are no more MAC addresses hardcoded in Puppet for those hosts.
See all the details in: https://wikitech.wikimedia.org/wiki/Server_Lifecycle/Reimage#DHCP_Automation

The remaining work for virtual hosts in Ganeti is tracked in T297133.