Setup zero touch provisioning (ZTP) for network devices
Open, Medium, Public

Description

The goal is to productionize @Papaul's efforts to perform zero touch provisioning of the network devices, both on the Puppet side and with a cookbook that will set up the DHCP side and then run Homer on the device to complete the setup.

The rough plan is to have:

Puppet

  • DHCP generic config to set the ZTP values required for it to work (merged)
  • puppet/private: bash script to be executed by Juniper modules/secret/secrets/install_server/ztp-juniper.sh
  • labs/private: dummy script to be executed by Juniper (merged)

Spicerack

  • expand the existing spicerack.dhcp.DHCPConfMgmt class to be a bit more generic and match also this use case based on the vendor. (to be reviewed)

Cookbook

  • Create a new sre.network.provision cookbook that should take care of the process (merged). Current idea is to:
    • in Netbox, assign an IP to the em0 interface (which must exist), set its DNS record, and disable all the other fixed interfaces
    • run the sre.dns.netbox cookbook
    • set up the DHCP
    • poll for the device to be "up" in some way [TBD how exactly]
    • once the device is reachable, run Homer on it
    • final checks and clean up
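
The polling step is still marked TBD above. As a minimal sketch only, one possible criterion is "the SSH port accepts a TCP connection". The function name `wait_for_tcp`, the port, and the retry values are assumptions for illustration, not what the cookbook actually does:

```python
import socket
import time


def wait_for_tcp(host: str, port: int = 22, attempts: int = 30,
                 delay: float = 10.0, timeout: float = 5.0) -> bool:
    """Return True as soon as a TCP connection to host:port succeeds.

    Assumption: "up" means the SSH port accepts connections; the task
    leaves the exact criterion to be defined.
    """
    for attempt in range(1, attempts + 1):
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            # Not reachable yet: refused, timed out, or unresolvable.
            if attempt < attempts:
                time.sleep(delay)
    return False
```

A caller would only proceed to the Homer run when this returns True, and could offer a retry prompt otherwise.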

Pre-requisite

The current proposed workflow requires only that the device is already in Netbox with name and location and, thanks to the device type template, the fixed interfaces.
There are some details to be defined for the fixed interfaces, see T336485#8844448

Event Timeline

Volans triaged this task as Medium priority. Thu, May 11, 11:21 AM
Volans created this task.
Restricted Application added a subscriber: Aklapper.

Change 919037 had a related patch set uploaded (by Volans; author: Volans):

[labs/private@master] secrets: add ZTP script for install_server

https://gerrit.wikimedia.org/r/919037

Change 919037 merged by Volans:

[labs/private@master] secrets: add ZTP script for install_server

https://gerrit.wikimedia.org/r/919037

Thanks for filing this one! I'm happy with the script in the private repo, but I think it would help if @ayounsi also had a quick look.

Process

To summarize a chat from IRC, these are the steps @Volans explained to me:

  • dcops creates the device on netbox with name, location, etc...
  • dcops runs the cookbook: sre.network.provision some-name
    • the cookbook adds a management-only em0 interface to Netbox (if missing), assigning it an IP and a DNS record
    • the cookbook runs the dns cookbook
    • the cookbook sets up the dhcp
    • the cookbook polls for the device to be "up" in some way
    • once the device is reachable the cookbook runs homer
    • final checks and cleanup

Fixed interfaces

Broadly that looks good to me. I do wonder, however, about some elements that may cause problems. Specifically on the QFX5120 series there are two in-built interfaces, 'em1' (a second physical management int), and 'vme' ('virtual-management-ethernet'). We don't use either, but if they are not present in Netbox then Homer will try to remove them when it runs, which fails.

Previously when I hit this I just added them in Netbox and set to disabled. This is probably the best way forward, unless we want to modify Homer to automatically generate config for them based on platform.

In terms of how we get them into Netbox it strikes me that:

  1. The cookbook described above could also add them, set to disabled.
  2. They could be added to the 'device role' template for the specific platforms, so they are there by default when the device is created.

If doing the latter we might also consider whether to have 'em0' in the template, and indeed fixed-name interfaces like those for the QSFP ports. To be discussed.
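
To make option 1 concrete, here is a minimal sketch of how a cookbook could work out which built-in interfaces still need to be created (as disabled) in Netbox. The `BUILTIN_UNUSED` mapping, its key format, and the function name are hypothetical; only the QFX5120 entry reflects the interfaces mentioned above:

```python
from typing import Iterable, List

# Hypothetical mapping of platform -> built-in interfaces we do not use.
# Only the QFX5120 entry comes from the discussion above.
BUILTIN_UNUSED = {
    "qfx5120": ("em1", "vme"),
}


def missing_builtin_interfaces(platform: str,
                               netbox_interfaces: Iterable[str]) -> List[str]:
    """Return built-in interfaces that should be added to Netbox as disabled."""
    present = set(netbox_interfaces)
    expected = BUILTIN_UNUSED.get(platform.lower(), ())
    return [name for name in expected if name not in present]
```

For a QFX5120 that already has `em0` and `vme` in Netbox, this would report only `em1` as missing. The same mapping could instead seed the template if option 2 is preferred.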

Change 919052 had a related patch set uploaded (by Volans; author: Volans):

[operations/software/spicerack@master] dhcp: expand support for hostname based match

https://gerrit.wikimedia.org/r/919052

Change 919076 had a related patch set uploaded (by Volans; author: Volans):

[operations/puppet@production] installserver: enable ZTP for network devices

https://gerrit.wikimedia.org/r/919076

Change 919159 had a related patch set uploaded (by Volans; author: Volans):

[labs/private@master] installserver: set dummy ZTP temporary root passwd

https://gerrit.wikimedia.org/r/919159

Change 919167 had a related patch set uploaded (by Volans; author: Volans):

[labs/private@master] Revert "secrets: add ZTP script for install_server"

https://gerrit.wikimedia.org/r/919167

Change 919159 merged by Volans:

[labs/private@master] installserver: set dummy ZTP temporary root passwd

https://gerrit.wikimedia.org/r/919159

Change 919167 merged by Volans:

[labs/private@master] Revert "secrets: add ZTP script for install_server"

https://gerrit.wikimedia.org/r/919167

@Volans I have some switches ready for testing: 2 leaves in different rows and the 2 spines.
lsw1-a8
lsw1-b8
ssw1-a1
ssw1-a8

Change 919263 had a related patch set uploaded (by Volans; author: Volans):

[labs/private@master] installserver: rename temporary juniper ZTP passwd

https://gerrit.wikimedia.org/r/919263

Change 919263 merged by Volans:

[labs/private@master] installserver: rename temporary juniper ZTP passwd

https://gerrit.wikimedia.org/r/919263

Change 919276 had a related patch set uploaded (by Volans; author: Volans):

[operations/puppet@production] install_server: simplify DHCP config

https://gerrit.wikimedia.org/r/919276

Change 919277 had a related patch set uploaded (by Volans; author: Volans):

[operations/puppet@production] install_server: convert dhcpd.conf to template

https://gerrit.wikimedia.org/r/919277

Change 919282 had a related patch set uploaded (by Volans; author: Volans):

[operations/puppet@production] install_server: remove mgmt subnet already managed

https://gerrit.wikimedia.org/r/919282

One question did occur to me, I'll mention it here but not sure we need to focus on it, at least initially.

Should we consider a mechanism to return the "image-file-name" DHCP option, providing a link to a JunOS image?

option NEW_OP.image-file-name "/dist/images/jinstall-ex-4200-13.2R1.1-domestic-signed.tgz";

Potentially we could have a toggle for the cookbook? Where we either supply the filename ourselves, or it knows to grab the latest one we want based on platform?

Anyway maybe it's something to look at after we get the base functionality working. Could be helpful I think.

I've seen that option and decided that was not relevant for new host's ztp, but lmk if we need it too.
The general usage for that seems to me more for a "reimage" concept of upgrading junos, but for that we already have a cookbook, so not sure if we want to go into this direction.

> The general usage for that seems to me more for a "reimage" concept of upgrading junos, but for that we already have a cookbook, so not sure if we want to go into this direction.

Yeah 100% that's a good comparison. And the prepare-upgrade cookbook does allow us to stage an OS image, so it's helpful, but the device still needs a manual reboot afterwards.

We pretty much always need to upgrade/downgrade the OS version for new switches as we are installing them, so I wonder if it might be worth having that as part of the process.

It would be useful during the initial provisioning to have the device running the Junos version we want on day 1. That would save some time from DCops and Netops I believe.

That string could, for example, be a Netbox custom field for device types (or derived from it). Probably better to track it in a different task.

Is it ok to start testing without it? Based on how we want the workflow to go we would need a change in Spicerack or in Netbox/cookbook

sgtm as it's an additional feature and to prevent scope creep but might be worth looking at implementing it sooner than later.

I do agree that we can also handle the Junos image upgrade during the process. Our first goal here was to get the ZTP process for the configuration going; once we have tested that and it is working, we can focus on the Junos upgrade part. That will involve adding some lines to the DHCP server to tell the switch where to find the Junos image, the same way it is told where to find the configuration script. Just keep in mind that for the OS images we will have to define different images based on the switch model, something like below:

# Classes for different type of devices

class "juniper-ex4200" {
    match if option vendor-class-identifier = "Juniper-ex4200-48t";
    vendor-option-space ZTP;
    # option ZTP.image-file-name "jinstall-ex-4200-15.1R7-S8.1-domestic-signed.tgz";
    option ZTP.image-file-name "jinstall-ex-4200-15.1R7.9-domestic-signed.tgz";
}

class "juniper-srx300" {
    match if option vendor-class-identifier = "Juniper-srx300";
    vendor-option-space ZTP;
    option ZTP.image-file-name "junos-srxsme-12.3X48-D105.4-domestic.tgz";
}
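
If these per-model classes were generated rather than written by hand, the rendering could look roughly like the sketch below. The function name `render_ztp_class` is an assumption for illustration, and the output presumes the `ZTP` option space and its `image-file-name` option are already declared elsewhere in the DHCP configuration:

```python
def render_ztp_class(name: str, vendor_class_id: str, image: str) -> str:
    """Render one dhcpd class stanza that maps a device model to an OS image."""
    return (
        f'class "{name}" {{\n'
        f'    match if option vendor-class-identifier = "{vendor_class_id}";\n'
        f'    vendor-option-space ZTP;\n'
        f'    option ZTP.image-file-name "{image}";\n'
        f'}}\n'
    )


# Values taken from the example classes above.
MODELS = [
    ("juniper-ex4200", "Juniper-ex4200-48t",
     "jinstall-ex-4200-15.1R7.9-domestic-signed.tgz"),
    ("juniper-srx300", "Juniper-srx300",
     "junos-srxsme-12.3X48-D105.4-domestic.tgz"),
]

if __name__ == "__main__":
    print("\n".join(render_ztp_class(*model) for model in MODELS))
```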

Do we want to hardcode that in the dhcp settings? Or better to pass it dynamically to the cookbook?
Based on that it's a change in spicerack or puppet... so I'll need to know

Change 919052 merged by jenkins-bot:

[operations/software/spicerack@master] dhcp: expand support for hostname based match

https://gerrit.wikimedia.org/r/919052

> Do we want to hardcode that in the dhcp settings? Or better to pass it dynamically to the cookbook?

I've mixed feelings on that one. In my mind what might be ideal (and as Arzhel says this can all wait until we've the basics up and running) would be if we:

  • Add a custom_field in Netbox on each "device_type", called something like "default_sw_image"
  • Cookbook by default looks up device.device_type.custom_fields.default_sw_image in Netbox, and uses that filename in the DHCP snippet it creates
  • The cookbook has a parameter that can be used to manually override this image name.

Just a thought though. Esp on the last point not sure if we need that much control.
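
As a rough sketch of the lookup order in the three points above (manual override first, then the Netbox custom field, otherwise no image): the function name is hypothetical, and `custom_fields` stands in for whatever mapping the Netbox client returns for `device.device_type.custom_fields`:

```python
from typing import Mapping, Optional


def resolve_sw_image(custom_fields: Mapping[str, Optional[str]],
                     override: Optional[str] = None) -> Optional[str]:
    """Pick the OS image filename to put in the DHCP snippet, if any.

    A manual override wins; otherwise fall back to the proposed
    "default_sw_image" custom field. None means "do not set the option".
    """
    if override:
        return override
    return custom_fields.get("default_sw_image") or None
```

Returning None when neither source is set would let the cookbook keep today's behaviour of not sending an image option at all.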

Change 919793 had a related patch set uploaded (by Volans; author: Volans):

[operations/cookbooks@master] sre.network.provision: add new cookbook

https://gerrit.wikimedia.org/r/919793

Change 919076 merged by Volans:

[operations/puppet@production] installserver: enable ZTP for network devices

https://gerrit.wikimedia.org/r/919076

Mentioned in SAL (#wikimedia-operations) [2023-05-15T13:33:24Z] <volans> disabling puppet on the install hosts to deploy changes for T336485

Change 919276 merged by Volans:

[operations/puppet@production] install_server: simplify DHCP config

https://gerrit.wikimedia.org/r/919276

Change 919277 merged by Volans:

[operations/puppet@production] install_server: convert dhcpd.conf to template

https://gerrit.wikimedia.org/r/919277

Change 919282 merged by Volans:

[operations/puppet@production] install_server: remove mgmt subnet already managed

https://gerrit.wikimedia.org/r/919282

I've tested a reimage of a physical host and it worked fine. We still have a bit of duplication of requests; do you spot anything anomalous (like the unknown network segment)?

May 15 13:48:56 install1004 dhcpd[1813999]: DHCPDISCOVER from d0:94:66:5f:67:20 via 198.18.0.1: unknown network segment
May 15 13:48:56 install1004 dhcpd[1813999]: DHCPDISCOVER from d0:94:66:5f:67:20 via 10.64.48.3
May 15 13:48:56 install1004 dhcpd[1813999]: DHCPOFFER on 10.64.48.138 to d0:94:66:5f:67:20 via 10.64.48.3
May 15 13:48:56 install1004 dhcpd[1813999]: DHCPDISCOVER from d0:94:66:5f:67:20 via 198.18.0.1: unknown network segment
May 15 13:48:56 install1004 dhcpd[1813999]: DHCPDISCOVER from d0:94:66:5f:67:20 via 10.64.48.3
May 15 13:48:56 install1004 dhcpd[1813999]: DHCPOFFER on 10.64.48.138 to d0:94:66:5f:67:20 via 10.64.48.3
May 15 13:48:56 install1004 dhcpd[1813999]: DHCPDISCOVER from d0:94:66:5f:67:20 via 10.64.48.3
May 15 13:48:56 install1004 dhcpd[1813999]: DHCPOFFER on 10.64.48.138 to d0:94:66:5f:67:20 via 10.64.48.3
May 15 13:48:56 install1004 dhcpd[1813999]: DHCPDISCOVER from d0:94:66:5f:67:20 via 10.64.48.2
May 15 13:48:56 install1004 dhcpd[1813999]: DHCPOFFER on 10.64.48.138 to d0:94:66:5f:67:20 via 10.64.48.2
May 15 13:48:56 install1004 dhcpd[1813999]: DHCPDISCOVER from d0:94:66:5f:67:20 via 10.64.48.2
May 15 13:48:56 install1004 dhcpd[1813999]: DHCPOFFER on 10.64.48.138 to d0:94:66:5f:67:20 via 10.64.48.2
May 15 13:48:56 install1004 dhcpd[1813999]: DHCPDISCOVER from d0:94:66:5f:67:20 via 10.64.48.2
May 15 13:48:56 install1004 dhcpd[1813999]: DHCPOFFER on 10.64.48.138 to d0:94:66:5f:67:20 via 10.64.48.2
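
To quantify the duplication, a small tally per message type and relay address can help. This is only a sketch: the regular expression is an assumption based on the sample lines above and may not cover other dhcpd message formats:

```python
import re
from collections import Counter

# Matches lines such as:
#   ... dhcpd[123]: DHCPDISCOVER from <mac> via <relay>: unknown network segment
#   ... dhcpd[123]: DHCPOFFER on <ip> to <mac> via <relay>
LINE_RE = re.compile(
    r"dhcpd\[\d+\]: (?P<msg>DHCP[A-Z]+) .*? via (?P<relay>[0-9.]+)(?P<rest>.*)$"
)


def summarize(lines):
    """Count log lines by (message type, relay address, status)."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match is None:
            continue  # Not a line in the expected format.
        if "unknown network segment" in match["rest"]:
            status = "unknown-segment"
        else:
            status = "ok"
        counts[(match["msg"], match["relay"], status)] += 1
    return counts
```

Feeding it the journal output for one reimage shows at a glance which relay addresses the repeated requests arrive through and which of them are rejected.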

Mentioned in SAL (#wikimedia-operations) [2023-05-15T13:56:58Z] <volans> re-enabled puppet on the install hosts to deploy changes for T336485

Change 919793 merged by jenkins-bot:

[operations/cookbooks@master] sre.network.provision: add new cookbook

https://gerrit.wikimedia.org/r/919793

@Volans is it possible to get a full pcap of those "unknown network segment" requests?

I've run some tests on ssw1-a1-codfw and fixed a couple of minor typos that prevented the cookbook from getting to the DHCP part; now it runs fine up to there. I've found another small bug there that writes the IP with the subnet instead of without. I'll fix it tomorrow as I have to run right now. Will update after the next tests.

Change 920349 had a related patch set uploaded (by Volans; author: Volans):

[operations/cookbooks@master] sre.network.provision: bugfix and improvements

https://gerrit.wikimedia.org/r/920349

Change 920349 merged by jenkins-bot:

[operations/cookbooks@master] sre.network.provision: bugfix and improvements

https://gerrit.wikimedia.org/r/920349

Change 920366 had a related patch set uploaded (by Volans; author: Volans):

[operations/cookbooks@master] sre.network.provision: allow to retry polling

https://gerrit.wikimedia.org/r/920366

Change 920366 merged by jenkins-bot:

[operations/cookbooks@master] sre.network.provision: allow to retry polling

https://gerrit.wikimedia.org/r/920366

Change 920374 had a related patch set uploaded (by Volans; author: Volans):

[operations/puppet@production] install_server: fix ztp-juniper script

https://gerrit.wikimedia.org/r/920374

Result of the testing with Cathal.
I first want to thank @cmooney for all the help with JunOS-magics, that was precious and fundamental to find the culprit of why it was not working.

ssw1-a1-codfw could not be properly tested as it already has a password and it's not in the ZTP DHCP loop like the others. I'm not sure what was done there, @Papaul do you know? With Cathal we tried all the known passwords and couldn't even log in.

We've run the tests against ssw1-a8-codfw that also had the console. After some trial and error and fixing minor bugs we've found the correct configuration that lets the cookbook complete and the device load and configure itself with ZTP (see the various patches above).

One thing that we had to overcome is that there is no access to the TFTP servers from the management network. HTTP is already open to the apt.w.o servers so we could try that approach, although the Juniper docs say that DNS is not supported, so it might not work. The other option is to open access (maybe with some restrictions) to the TFTP servers (the install servers). @ayounsi thoughts?

We have also played a bit with iptables to exclude some things, but most likely there is nothing to touch there.

Awesome work getting it working @Volans big thanks to you too :)

> HTTP is already open to the apt.w.o servers so we could try that approach, although the Juniper docs say that DNS is not supported, so it might not work.

It might be worth a try. I noticed that the first time I used the "route" command in the BSD shell it showed the name of the mgmt router, presumably obtained from a PTR lookup:

# route get
Destination               Type Nexthop                  
default                   user mr1-codfw.mgmt.codfw.wmne

Or we could see if we can generate the URL with the IPv4 address of the apt server in it. Not sure if the apt server will respond to such a request? Does the hostname in the HTTP header have to match?

> We also have played a bit with iptables to exclude some things but most likely there is nothing to touch there.

Yeah I'm unsure about that. If we go down the tftp route we may need to add a rule on the install server to allow tftp from mgmt ranges.

Change 920374 merged by Volans:

[operations/puppet@production] install_server: fix ztp-juniper script

https://gerrit.wikimedia.org/r/920374

@Volans I had a word with @ayounsi on this and we both feel if we can make it work via HTTP to the apt server that's probably best.

I confirmed the apt server works fine with an IP instead of the hostname in there, like http://208.80.154.30/xxxx, so that is potentially an option.

I'd guess Juniper are correct that we can't use a domain-name in the URL we return in DHCP. But perhaps worth a shot? Otherwise I suppose we can try to return the IP in the URL, is there a way we can dynamically do that based on the current A record for apt.wikimedia.org? Or would we need to hard code the IP?

If there is any major blocker I think tftp is acceptable though.

Ok that sounds like a plan. Let's first try whether the FQDN link works, and if not we'll fall back to the IP. Based on the test we might add this feature in different places (cookbook vs spicerack), but it should be easy to test manually first and then decide based on that.

From previous tests, DNS resolution is not available at this stage on Junos devices, so better to go directly with the IP.
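
On the question of deriving the IP dynamically instead of hardcoding it, one option is to resolve the A record on the cookbook side and embed the result in the URL returned via DHCP. A minimal sketch, assuming plain HTTP and a hypothetical helper name `ztp_url`; the resolver is injectable so it can be swapped or tested without DNS:

```python
import socket
from typing import Callable


def ztp_url(hostname: str, path: str,
            resolver: Callable[[str], str] = socket.gethostbyname) -> str:
    """Build an IP-based HTTP URL from the current IPv4 address of hostname.

    The device cannot resolve DNS at this stage, so the lookup happens
    here and only the resulting IPv4 address ends up in the URL.
    """
    ip = resolver(hostname)
    return f"http://{ip}/{path.lstrip('/')}"
```

Because the address is looked up at cookbook run time, a change to the apt.wikimedia.org A record would be picked up on the next run without touching the DHCP configuration.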