
Core routers: replace bootp with dhcp-relay
Closed, Resolved (Public)

Description

The SRXs are moving to dhcp-relay with https://gerrit.wikimedia.org/r/c/operations/homer/public/+/841460, and the QFX switches already use that feature.
bootp is no longer an option on multiple platforms.

To keep the configuration consistent, we should migrate the core routers to dhcp-relay as well.

Related Objects

Status | Subtype | Assigned | Task
Resolved | | cmooney |
Resolved | | Papaul |

Event Timeline

ayounsi created this task.
Restricted Application added a subscriber: Aklapper.

Agreed, we should add it to the CRs; I can't think of any reason not to.

I'll also consider it as part of the l3_switch template consolidation. They should get the same config, but with an 'if' around the section that adds the Option 82 info:

cmooney@asw1-b13-drmrs> show configuration forwarding-options dhcp-relay relay-option-82 
circuit-id {
    prefix {
        host-name;
    }
}
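
The 'if' around the Option 82 section could look roughly like the sketch below. It uses plain Python string building as a stand-in for the real template, and the function name `render_dhcp_relay` and the flag `add_option_82` are hypothetical, not taken from the homer repository. Only the `relay-option-82` stanza itself comes from the output above.

```python
def render_dhcp_relay(add_option_82: bool) -> str:
    """Render a dhcp-relay stanza, optionally including the Option 82 section.

    Hypothetical helper: the real configuration is generated from templates,
    this only illustrates the conditional.
    """
    lines = ["forwarding-options {", "    dhcp-relay {"]
    if add_option_82:
        # Same stanza as shown on asw1-b13-drmrs above.
        lines += [
            "        relay-option-82 {",
            "            circuit-id {",
            "                prefix {",
            "                    host-name;",
            "                }",
            "            }",
            "        }",
        ]
    lines += ["    }", "}"]
    return "\n".join(lines)


if __name__ == "__main__":
    # Device type that should insert Option 82 itself.
    print(render_dhcp_relay(add_option_82=True))
    # Device type that should not.
    print(render_dhcp_relay(add_option_82=False))
```

With the flag set, the output contains the `relay-option-82` block; without it, only the enclosing `dhcp-relay` stanza is emitted.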

Marking this task dependent on DHCP option 97 to reduce the risk of DHCP oddities related to Option 82.

Change 905946 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/homer/public@master] cr: switch bootp to dhcp-relay; asw-drmrs: manage dhcp

https://gerrit.wikimedia.org/r/905946

Change 905946 abandoned by Ayounsi:

[operations/homer/public@master] cr: switch bootp to dhcp-relay; asw-drmrs: manage dhcp

Reason:

Fixed in If0a55fa9b9ea038e77ceb896c7a0046f24cae884

https://gerrit.wikimedia.org/r/905946

Change 908346 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/homer/public@master] Automate and update DHCP relay configuration

https://gerrit.wikimedia.org/r/908346

Change 908346 merged by jenkins-bot:

[operations/homer/public@master] Automate and update DHCP relay configuration

https://gerrit.wikimedia.org/r/908346

Mentioned in SAL (#wikimedia-operations) [2023-05-18T11:56:25Z] <topranks> reconfiguring DHCP relay function on eqiad core routers (T320508)

Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1001 for host sretest1002.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1001 for host sretest1002.eqiad.wmnet with OS bookworm executed with errors:

  • sretest1002 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1001 for host sretest1002.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1001 for host sretest1002.eqiad.wmnet with OS bookworm executed with errors:

  • sretest1002 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1001 for host sretest1002.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1001 for host sretest1002.eqiad.wmnet with OS bookworm executed with errors:

  • sretest1002 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1001 for host sretest1002.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1001 for host sretest1002.eqiad.wmnet with OS bookworm executed with errors:

  • sretest1002 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1001 for host sretest1002.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1001 for host sretest1002.eqiad.wmnet with OS bookworm executed with errors:

  • sretest1002 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1001 for host sretest1002.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1001 for host sretest1002.eqiad.wmnet with OS bookworm executed with errors:

  • sretest1002 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1001 for host sretest1002.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1001 for host sretest1002.eqiad.wmnet with OS bookworm executed with errors:

  • sretest1002 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1001 for host sretest1002.eqiad.wmnet with OS bookworm

Change 921054 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/homer/public@master] Add trust-option-82 to dhcp relay conf for core routers

https://gerrit.wikimedia.org/r/921054

Change 921054 merged by jenkins-bot:

[operations/homer/public@master] Add trust-option-82 to dhcp relay conf for core routers

https://gerrit.wikimedia.org/r/921054

> Marking this task dependent on DHCP option 97 to reduce the risk of DHCP oddities related to Option 82.

Ironically, I hadn't seen this comment until today for some reason, and then spent the whole morning and afternoon dealing with DHCP oddities related to Option 82 :)

Anyway, things are now working well with the 'dhcp-relay' config on the core routers in place of the old 'bootp' setup. The core routers need one additional piece of config that the L3 switches don't have: trusting DHCP packets that already have Option 82 set. That's merged now and reimage is working in tests.

So we seem to be good here as things are, and we can remove the dependency and work on the Option 97 stuff separately.
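
As a rough illustration of why the trust setting matters, here is a simplified model of the relay's accept/drop decision. It assumes the Junos default of dropping client packets that arrive with a giaddr of zero and Option 82 already present, unless Option 82 is trusted. The class and function names are hypothetical, and this is not how the router is implemented, only a sketch of the rule.

```python
from dataclasses import dataclass


@dataclass
class DhcpPacket:
    giaddr: str          # relay agent address; "0.0.0.0" if no relay has set it yet
    has_option_82: bool  # True if a downstream device already inserted Option 82


def relay_accepts(packet: DhcpPacket, trust_option_82: bool) -> bool:
    """Return True if the relay would process the packet under this model."""
    if packet.giaddr == "0.0.0.0" and packet.has_option_82:
        # Assumed default: treated as untrusted and dropped unless trusted.
        return trust_option_82
    return True


if __name__ == "__main__":
    # A packet that already carries Option 82 when it reaches the core router.
    pkt = DhcpPacket(giaddr="0.0.0.0", has_option_82=True)
    print(relay_accepts(pkt, trust_option_82=False))
    print(relay_accepts(pkt, trust_option_82=True))
```

Under this model, a request that already carries Option 82 is dropped by the relay until the trust setting is enabled.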

Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1001 for host sretest1002.eqiad.wmnet with OS bookworm executed with errors:

  • sretest1002 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1001 for host sretest1002.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1001 for host sretest1002.eqiad.wmnet with OS bullseye completed:

  • sretest1002 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202305181553_cmooney_1698881_sretest1002.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB