Test dhcp-option 82
Closed, Resolved · Public · 0 Estimated Story Points

Description

Added the following to asw2-b-eqiad:

[edit vlan cloud-hosts1-b-eqiad]
+    forwarding-options {
+        dhcp-security {
+            option-82 {
+                remote-id {
+                    host-name;
+                }
+            }
+        }
+    }

Ran tcpdump on install1002: sudo tcpdump -s0 -w dhcp-82.pcap port 67

We can see the new option being added:

Option: (82) Agent Information Option
    Length: 60
    Option 82 Suboption: (1) Agent Circuit ID
        Length: 32
        Agent Circuit ID: 78652d342f302f33392e303a636c6f75642d686f73747331...
        (which translates to xe-4/0/39.0:cloud-hosts1-b-eqiad)
    Option 82 Suboption: (2) Agent Remote ID
        Length: 24
        Agent Remote ID: 617377322d622d65716961643a78652d342f302f33392e30
        (which translates to asw2-b-eqiad:xe-4/0/39.0)
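
For anyone wanting to check the decoding above by hand, here is a minimal Python sketch. It assumes the option 82 payload is a plain sequence of (code, length, value) sub-options, as the dissected output suggests. The function name parse_option_82 is made up for illustration, and the payload is rebuilt from the decoded strings rather than read from the pcap, since the Circuit ID hex above is truncated.

```python
def parse_option_82(payload: bytes) -> dict[int, bytes]:
    """Split an option 82 payload into {sub-option code: raw value}."""
    suboptions = {}
    offset = 0
    while offset < len(payload):
        if offset + 2 > len(payload):
            raise ValueError("truncated sub-option header")
        code, length = payload[offset], payload[offset + 1]
        value = payload[offset + 2 : offset + 2 + length]
        if len(value) != length:
            raise ValueError("truncated sub-option value")
        suboptions[code] = value
        offset += 2 + length
    return suboptions


# Circuit ID as decoded in the capture (the hex shown there is truncated).
circuit_id = b"xe-4/0/39.0:cloud-hosts1-b-eqiad"
# Remote ID hex copied verbatim from the capture.
remote_id = bytes.fromhex("617377322d622d65716961643a78652d342f302f33392e30")

# Rebuild the payload: one code byte, one length byte, then the value.
payload = (
    bytes([1, len(circuit_id)]) + circuit_id
    + bytes([2, len(remote_id)]) + remote_id
)

parsed = parse_option_82(payload)
print(len(payload))               # 60, matches "Length: 60" in the capture
print(parsed[1].decode("ascii"))  # xe-4/0/39.0:cloud-hosts1-b-eqiad
print(parsed[2].decode("ascii"))  # asw2-b-eqiad:xe-4/0/39.0
```

The lengths line up with the capture: 2 + 32 bytes for the Circuit ID plus 2 + 24 bytes for the Remote ID gives the 60 bytes reported for the option.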

Event Timeline

ayounsi triaged this task as Low priority.
ayounsi created this task.
Restricted Application added a subscriber: Aklapper.
Volans claimed this task.

Re-opening as we're aiming to implement it this quarter.

Mentioned in SAL (#wikimedia-operations) [2021-02-03T15:32:42Z] <volans> disabling puppet on install1003 for a quick test for T221388

Mentioned in SAL (#wikimedia-operations) [2021-02-03T16:13:14Z] <volans> enabled puppet on install1003 after the test T221388

I tested the config with:

host sretest1001 {
    host-identifier option agent.circuit-id "ge-3/0/15.0:private1-d-eqiad";
    fixed-address sretest1001.eqiad.wmnet;
}

And it seemed to work as expected. I still need to run a more in-depth test: I want to reimage sretest1001 with the same config to ensure that this works across the whole reimage process. It looks promising, though.
Another bit to look at is the Junos side of the configuration.
The tested configuration was:

ayounsi@asw2-d-eqiad# show | compare 
[edit vlans private1-d-eqiad]
+    forwarding-options {
+        dhcp-security {
+            option-82 {
+                remote-id {
+                    host-name;
+                }
+            }
+        }
+    }

We can check if we could have the switch hostname as part of the injected data.

After adding the circuit-id prefix host-name setting and removing the remote-id, which we're not going to use, the circuit ID includes the switch hostname too and becomes asw2-d-eqiad:ge-3/0/15.0:private1-d-eqiad. That should be enough to identify each host uniquely in the DHCP config.

References used:
https://www.juniper.net/documentation/en_US/junos/topics/topic-map/example-setting-up-dhcp-option82-no-relay.html#id-setting-up-dhcp-option-82-on-the-switch-with-no-relay-els

http://www.miquels.cistron.nl/isc-dhcpd/

The config for the above test used:

circuit-id {
    prefix {
        host-name;
    }
}

Change 663233 had a related patch set uploaded (by Volans; owner: Volans):
[operations/puppet@production] dhcpd: create and include files for option 82

https://gerrit.wikimedia.org/r/663233

Change 663234 had a related patch set uploaded (by Volans; owner: Volans):
[operations/puppet@production] dhcpd: move sretest1002 to option 82

https://gerrit.wikimedia.org/r/663234

Change 663233 merged by Volans:
[operations/puppet@production] dhcpd: create and include files for option 82

https://gerrit.wikimedia.org/r/663233

Change 663234 merged by Volans:
[operations/puppet@production] dhcpd: move sretest1002 to option 82

https://gerrit.wikimedia.org/r/663234

With the above patches merged, and with:

root@install1003:/etc/dhcp# cat opt82-entries.ttyS1-115200
host sretest1002 {
    host-identifier option agent.circuit-id "asw2-d-eqiad:ge-6/0/5.0:private1-d-eqiad";
    fixed-address sretest1002.eqiad.wmnet;
}

The DHCP requests from sretest1002 seem to work fine. I'll also test a reimage shortly.
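
As a rough illustration of how such an entry could be generated rather than written by hand, here is a small Python sketch. The helper names build_circuit_id and render_host_stanza are hypothetical and not taken from the puppet or spicerack changes. It only assumes the switch:interface:vlan layout and the stanza format shown above.

```python
def build_circuit_id(switch: str, interface: str, vlan: str) -> str:
    """Join the three parts the switch injects as the circuit ID."""
    return f"{switch}:{interface}:{vlan}"


def render_host_stanza(hostname: str, circuit_id: str, fixed_address: str) -> str:
    """Render a dhcpd host block keyed on the option 82 circuit ID."""
    return (
        f"host {hostname} {{\n"
        f'    host-identifier option agent.circuit-id "{circuit_id}";\n'
        f"    fixed-address {fixed_address};\n"
        "}\n"
    )


circuit_id = build_circuit_id("asw2-d-eqiad", "ge-6/0/5.0", "private1-d-eqiad")
stanza = render_host_stanza("sretest1002", circuit_id, "sretest1002.eqiad.wmnet")
print(stanza, end="")
```

With the values from the test host this prints the same block as the opt82-entries file above. The fixed_address argument is passed through as-is, so it can be either a hostname or an IP address.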

I'm probably not up to date on concrete plans built on top of this, but it seems like having the numeric vlan id might be useful metadata here in addition to the abstract name of the vlan (e.g. scenarios where we might do vlan trunking on the main interface of the host and need to see or match that primary-vlan number in some interface setup scripts?)

From the doc:

Specify that the circuit ID suboption value contains the VLAN ID rather than the VLAN name (the default):

[edit vlans vlan-name forwarding-options dhcp-security option-82]
user@switch# set circuit-id use-vlan-id

So it seems possible.

@BBlack The option on the JunOS side lets us pick either the name or the ID, not both. In terms of guaranteed uniqueness we can certainly use the ID if the name can be duplicated (I don't think Netbox enforces unique names).

Script wmf-auto-reimage was launched by volans on cumin1001.eqiad.wmnet for hosts:

sretest1002.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202103011745_volans_18022_sretest1002_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['sretest1002.eqiad.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by volans on cumin1001.eqiad.wmnet for hosts:

sretest1002.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202103020952_volans_22946_sretest1002_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['sretest1002.eqiad.wmnet']

and were ALL successful.

Change 668041 had a related patch set uploaded (by Volans; owner: Volans):
[operations/puppet@production] dhcp: specify owner and group for option 82 config

https://gerrit.wikimedia.org/r/668041

Change 668041 merged by Volans:
[operations/puppet@production] dhcp: specify owner and group for option 82 config

https://gerrit.wikimedia.org/r/668041

Change 674574 had a related patch set uploaded (by Ayounsi; owner: Ayounsi):
[operations/homer/public@master] Option 82: use-vlan-id

https://gerrit.wikimedia.org/r/674574

Change 674574 merged by jenkins-bot:
[operations/homer/public@master] Option 82: use-vlan-id

https://gerrit.wikimedia.org/r/674574

Change 675932 had a related patch set uploaded (by CRusnov; author: CRusnov):

[operations/software/spicerack@master] dhcp: Add module for manipulating dynamic DHCP entries

https://gerrit.wikimedia.org/r/675932

I have just tested this and there is currently an issue: the DHCPDISCOVER and DHCPOFFER use the VLAN name but the DHCPREQUEST uses the VLAN ID, which causes isc-dhcp to fail.

Currently the switches relay a DHCPDISCOVER that uses the VLAN name. This means that if one specifies the VLAN ID in the DHCP config, e.g. asw2-d-eqiad:ge-6/0/5.0:1020, then isc-dhcp doesn't even send a DHCPOFFER and you see the following in the logs:

DHCPDISCOVER from 4c:d9:8f:60:af:47 via 10.64.48.3: network 10.64.48.0/22: no free leases

If we use the VLAN name in the isc-dhcp config, e.g. asw2-d-eqiad:ge-6/0/5.0:private1-d-eqiad, then we do receive a DHCPOFFER. However, the DHCPREQUEST sent by the switches now uses the VLAN ID, which causes the following error in the logs:

DHCPREQUEST for 10.64.48.139 (208.80.154.32) from 4c:d9:8f:60:af:47 via 10.64.48.3: unknown lease 10.64.48.1

I couldn't log on to the switch, so I'm not sure if this is a config issue (perhaps not all the bits were updated). However, it smells a bit more like a bug to me (i.e. you shouldn't be able to configure this behaviour, as it's broken).

I have left a tcpdump created while trying to image sretest1002 in install1003:/home/jbond/dhcp_sretest1002.pcap; see below for the main bits:

DHCPDISCOVER
14:50:21.675682 IP (tos 0x0, ttl 64, id 13544, offset 0, flags [none], proto UDP (17), length 633)
    10.64.48.3.bootps > install1003.wikimedia.org.bootps: [udp sum ok] BOOTP/DHCP, Request from 4c:d9:8f:60:af:47 (oui Unknown), length 605, hops 1, xid 0x9060af47, secs 4, Flags [Broadcast] (0x8000)
          Gateway-IP 10.64.48.3
          Client-Ethernet-Address 4c:d9:8f:60:af:47 (oui Unknown)
          Vendor-rfc1048 Extensions
            Magic Cookie 0x63825363
    ....
            Agent-Information Option 82, length 54: 
              Circuit-ID SubOption 1, length 40: asw2-d-eqiad:ge-6/0/5.0:private1-d-eqiad
              Remote-ID SubOption 2, length 10: ge-6/0/5.0
            END Option 255, length 0
            PAD Option 0, length 0, occurs 156
            xT21690 Option 21690, length 70: [|rfc1048 70]
DHCPOFFER
14:50:21.686454 IP (tos 0x0, ttl 64, id 40970, offset 0, flags [DF], proto UDP (17), length 467)
    install1003.wikimedia.org.bootps > 10.64.48.3.bootps: [bad udp cksum 0xa684 -> 0xf4a3!] BOOTP/DHCP, Reply, length 439, hops 1, xid 0x9060af47, secs 4, Flags [Broadcast] (0x8000)
          Your-IP 10.64.48.139
          Server-IP install1003.wikimedia.org
          Gateway-IP 10.64.48.3
          Client-Ethernet-Address 4c:d9:8f:60:af:47 (oui Unknown)
          file "lpxelinux.0"
          Vendor-rfc1048 Extensions
            Magic Cookie 0x63825363
...
            Agent-Information Option 82, length 54: 
              Circuit-ID SubOption 1, length 40: asw2-d-eqiad:ge-6/0/5.0:private1-d-eqiad
              Remote-ID SubOption 2, length 10: ge-6/0/5.0
            END Option 255, length 0
DHCPREQUEST
14:50:25.769773 IP (tos 0x0, ttl 64, id 15669, offset 0, flags [none], proto UDP (17), length 620)
    10.64.48.3.bootps > install1003.wikimedia.org.bootps: [udp sum ok] BOOTP/DHCP, Request from 4c:d9:8f:60:af:47 (oui Unknown), length 592, hops 1, xid 0x9060af47, secs 4, Flags [Broadcast] (0x8000)
          Gateway-IP 10.64.48.3
          Client-Ethernet-Address 4c:d9:8f:60:af:47 (oui Unknown)
          Vendor-rfc1048 Extensions
            Magic Cookie 0x63825363
...
            Agent-Information Option 82, length 42: 
              Circuit-ID SubOption 1, length 28: asw2-d-eqiad:ge-6/0/5.0:1020
              Remote-ID SubOption 2, length 10: ge-6/0/5.0
            END Option 255, length 0
            PAD Option 0, length 0, occurs 200

For now I think we should switch back to using the default VLAN name.
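
The mismatch in the capture can also be checked mechanically. Below is a minimal Python sketch that splits the two circuit IDs from the DHCPDISCOVER and DHCPREQUEST packets and compares them. The CircuitId type and parse_circuit_id helper are made up for illustration, and the parsing assumes that none of the three parts contains a colon.

```python
from typing import NamedTuple


class CircuitId(NamedTuple):
    switch: str
    interface: str
    vlan: str


def parse_circuit_id(raw: str) -> CircuitId:
    """Split 'switch:interface:vlan' (assumes no ':' inside any part)."""
    parts = raw.split(":")
    if len(parts) != 3:
        raise ValueError(f"unexpected circuit ID layout: {raw!r}")
    return CircuitId(*parts)


# Circuit IDs copied from the DHCPDISCOVER and DHCPREQUEST packets above.
discover = parse_circuit_id("asw2-d-eqiad:ge-6/0/5.0:private1-d-eqiad")
request = parse_circuit_id("asw2-d-eqiad:ge-6/0/5.0:1020")

same_port = (discover.switch, discover.interface) == (request.switch, request.interface)
vlan_mismatch = discover.vlan != request.vlan

print(same_port)      # True: same switch and interface in both packets
print(vlan_mismatch)  # True: VLAN name in one packet, VLAN ID in the other
```

Since the host entry matches on the whole circuit ID string, any single value in the dhcpd config can match only one of the two packets, which is consistent with the two log errors above.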

Change 682629 had a related patch set uploaded (by Jbond; author: Jbond):

[operations/homer/public@master] Option 82: don't use use-vlan-id

https://gerrit.wikimedia.org/r/682629

Change 682629 merged by jenkins-bot:

[operations/homer/public@master] Option 82: don't use use-vlan-id

https://gerrit.wikimedia.org/r/682629

Just to confirm: after removing use-vlan-id, re-imaging of sretest1002 worked fine.

Thanks for the confirmation and sorry about the trouble @jbond

Change 675932 merged by CRusnov:

[operations/software/spicerack@master] dhcp: Add module for manipulating dynamic DHCP entries

https://gerrit.wikimedia.org/r/675932

Change 723996 had a related patch set uploaded (by Volans; author: Volans):

[operations/puppet@production] install_server: standardize DHCP includes

https://gerrit.wikimedia.org/r/723996

Mentioned in SAL (#wikimedia-operations) [2021-09-27T11:02:20Z] <volans> disabling puppet on install hosts to deploy 723996 - T221388

Change 723996 merged by Volans:

[operations/puppet@production] install_server: standardize DHCP includes

https://gerrit.wikimedia.org/r/723996

Mentioned in SAL (#wikimedia-operations) [2021-09-27T11:09:31Z] <volans> re-enabled puppet on install hosts after deployment of g/723996 - T221388

Change 724337 had a related patch set uploaded (by Volans; author: Volans):

[operations/software/spicerack@master] dhcp: fix typo in opt82 file path

https://gerrit.wikimedia.org/r/724337

Change 724337 merged by jenkins-bot:

[operations/software/spicerack@master] dhcp: fix typo in opt82 file path

https://gerrit.wikimedia.org/r/724337

The test of option 82 has been successful and we're switching to this system for the DHCP settings of all physical hosts in T269855.
In the final setup we're using:

  • The VLAN name instead of the ID, because of the issue reported above
  • The primary IPv4 address configured in Netbox instead of the FQDN, to remove a direct dependency between DHCP and DNS and because the DNS configuration also comes from Netbox data.