Represent sub-interface and bridge device associations in Netbox
Open, Low, Public

Description

Recently Moritz hit an issue when re-imaging a Ganeti host.

Background

The re-image cookbook needs to know the switch, switch port, and vlan id that a given host will send DHCP DISCOVER messages on when it boots. The switch inserts this info into the DHCP packets when it relays them to install hosts. The re-image cookbook adds a DHCP config snippet to match this information, and return the correct IP assignment in the DHCP OFFER.

Mechanics

The re-image cookbook gets this information from Netbox. Specifically it takes the primary IP of the given device, looks at what interface it is configured on, finds the switch port it's connected to, and checks the untagged vlan defined for that port.

That works fine for standard hosts with their primary IP configured directly on a physical interface. For the Ganeti hosts, however, there is a problem. During the first puppet run the ganeti_init.sh script is run, which configures two bridge devices on the host, and moves the primary IP from the physical interface to the new "private" bridge device. The server's physical interface is then made a member of the bridge, so this works fine on the data plane.

The re-image cookbook calls the puppetdb import script as part of its execution, which takes the interface data from puppetdb and updates the device interfaces in Netbox based on that. In the case of Ganeti this causes 3 new virtual interfaces to be created: one vlan sub-interface (eno1.xxxx) and two bridge devices (private, public). The primary IP of the server is moved in Netbox from the physical interface (linked to the switch port) to the virtual "private" device.

As a result of all that, when the re-image cookbook gets the interface associated with the device's primary IP, it is a "virtual" interface (the private bridge device), which has no associated switch port. So getting the switch port fails.
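The lookup and its failure mode can be illustrated with a minimal sketch. This uses plain dictionaries rather than the real Netbox API, and the field names (`primary_ip_interface`, `connected_port`, `untagged_vlan`), the switch name and the vlan number are all hypothetical, chosen only for illustration.

```python
from typing import Optional

# Hypothetical, simplified stand-in for Netbox data (not the real API or schema).
def find_switch_port(device: dict) -> Optional[dict]:
    """Follow primary IP -> interface -> connected switch port -> untagged vlan."""
    iface = device["interfaces"][device["primary_ip_interface"]]
    port = iface.get("connected_port")
    if port is None:
        # Virtual interfaces (e.g. a bridge) have no cable, so the lookup fails
        return None
    return {"switch": port["switch"], "port": port["name"], "vlan": port["untagged_vlan"]}

# Made-up switch port, used for both example hosts
example_port = {"switch": "example-switch", "name": "ge-0/0/1", "untagged_vlan": 100}

# Standard host: primary IP configured directly on the physical interface
standard_host = {
    "primary_ip_interface": "eno1",
    "interfaces": {"eno1": {"connected_port": example_port}},
}

# Ganeti host after the PuppetDB import: primary IP sits on the "private" bridge
ganeti_host = {
    "primary_ip_interface": "private",
    "interfaces": {
        "eno1": {"connected_port": example_port},
        "private": {"connected_port": None},
    },
}

print(find_switch_port(standard_host))
print(find_switch_port(ganeti_host))
```

The second call returns `None`, which corresponds to the failure seen when re-imaging the Ganeti host.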

Data Model

Our current Netbox version does not allow us to define the relationship between physical ports and 802.1q sub-interfaces. Nor does it allow the creation of "bridge" devices/interfaces, which can be the "parent" of other interfaces. We have both of these types on our Ganeti hosts.

The good news is that the ability to define both of these has been added to more recent/upcoming Netbox versions:

Sub-interfaces: https://github.com/netbox-community/netbox/issues/1519

Bridges: https://github.com/netbox-community/netbox/issues/6346

I think it makes total sense for us to leverage these built-in types once we upgrade to a supporting Netbox release.

Re-image cookbook changes

If we did use these types the re-image cookbook would need to be changed to:

  1. Get the primary IP interface as it is already
  2. If the interface has a connection then proceed as normal
  3. If the interface has no connection, get the interface type:
    1. If the interface "type" is a bridge.
      1. Cycle through the interfaces on device that belong to bridge
        1. If member device has a connection, get switch port from that

NOTE: We'll never have more than 1 bridge member that is a physical device, i.e. we never want servers acting as actual switches shuffling frames between ports. The above also assumes that the "primary IP" will always be connected via regular untagged Ethernet, which is likely to remain the case in order to support initial deploy/DHCP. If we ever have primary IPs on vlan sub-interfaces the process has a few more steps, but we could still find the correct switch port deterministically.

Which hopefully is not too hard to implement.
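As a rough sketch of the steps listed above, assuming each interface is represented as a plain dictionary: the keys `type`, `bridge` and `connected_port` are hypothetical stand-ins for whatever the upgraded Netbox data model exposes, and the interface data is modelled loosely on ganeti2010.

```python
from typing import Optional

def find_connected_interface(interfaces: dict, primary: str) -> Optional[str]:
    """Return the name of the interface whose switch connection should be used."""
    iface = interfaces[primary]
    # Step 2: the primary IP interface is cabled, proceed as normal
    if iface.get("connected_port"):
        return primary
    # Step 3: no connection, so check whether the interface is a bridge
    if iface.get("type") == "bridge":
        # Cycle through the interfaces on the device that belong to the bridge
        for name, member in interfaces.items():
            if member.get("bridge") == primary and member.get("connected_port"):
                return name
    return None

# Hypothetical data; the switch port name is made up
ganeti = {
    "eno1": {"type": "physical", "bridge": "private", "connected_port": "ge-0/0/1"},
    "eno1.2003": {"type": "virtual", "bridge": "public", "connected_port": None},
    "private": {"type": "bridge", "connected_port": None},
    "public": {"type": "bridge", "connected_port": None},
}

print(find_connected_interface(ganeti, "private"))
```

Because at most one bridge member is expected to be a physical device, returning the first connected member is sufficient in this sketch.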

Puppet DB

A perhaps more difficult problem to solve is how to get the data into Netbox in the first place. This is added to Netbox by the puppetdb import script, however puppetdb does not define the interface types or relationships. For example:

cmooney@ganeti2010:~$ sudo facter -p networking
{
  domain => "codfw.wmnet",
  fqdn => "ganeti2010.codfw.wmnet",
  hostname => "ganeti2010",
  interfaces => {
    eno1 => {
      mac => "4c:d9:8f:6d:a0:85",
      mtu => 1500
    },
    eno1.2003 => {
      mac => "4c:d9:8f:6d:a0:85",
      mtu => 1500
    },
    eno2 => {
      mac => "4c:d9:8f:6d:a0:86",
      mtu => 1500
    },
    lo => {
      bindings => [
        {
          address => "127.0.0.1",
          netmask => "255.0.0.0",
          network => "127.0.0.0"
        }
      ],
      bindings6 => [
        {
          address => "::1",
          netmask => "ffff:ffff:ffff:ffff:ffff:ffff:ffff:ffff",
          network => "::1"
        }
      ],
      ip => "127.0.0.1",
      ip6 => "::1",
      mtu => 65536,
      netmask => "255.0.0.0",
      netmask6 => "ffff:ffff:ffff:ffff:ffff:ffff:ffff:ffff",
      network => "127.0.0.0",
      network6 => "::1"
    },
    private => {
      bindings => [
        {
          address => "10.192.32.139",
          netmask => "255.255.252.0",
          network => "10.192.32.0"
        }
      ],
      bindings6 => [
        {
          address => "2620:0:860:103:4ed9:8fff:fe6d:a085",
          netmask => "ffff:ffff:ffff:ffff::",
          network => "2620:0:860:103::"
        },
        {
          address => "fe80::4ed9:8fff:fe6d:a085",
          netmask => "ffff:ffff:ffff:ffff::",
          network => "fe80::"
        }
      ],
      ip => "10.192.32.139",
      ip6 => "2620:0:860:103:4ed9:8fff:fe6d:a085",
      mac => "4c:d9:8f:6d:a0:85",
      mtu => 1500,
      netmask => "255.255.252.0",
      netmask6 => "ffff:ffff:ffff:ffff::",
      network => "10.192.32.0",
      network6 => "2620:0:860:103::"
    },
    public => {
      bindings6 => [
        {
          address => "fe80::4ed9:8fff:fe6d:a085",
          netmask => "ffff:ffff:ffff:ffff::",
          network => "fe80::"
        }
      ],
      ip6 => "fe80::4ed9:8fff:fe6d:a085",
      mac => "4c:d9:8f:6d:a0:85",
      mtu => 1500,
      netmask6 => "ffff:ffff:ffff:ffff::",
      network6 => "fe80::"
    }
  },
  ip => "10.192.32.139",
  ip6 => "2620:0:860:103:4ed9:8fff:fe6d:a085",
  mac => "4c:d9:8f:6d:a0:85",
  mtu => 1500,
  netmask => "255.255.252.0",
  netmask6 => "ffff:ffff:ffff:ffff::",
  network => "10.192.32.0",
  network6 => "2620:0:860:103::",
  primary => "private"
}

The MAC address information might allow us to associate interfaces, but their exact types and which is parent/child would still be unavailable. The MACs also may not be the same (although by default they would be).
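To illustrate the limitation, here is a small sketch that groups the interfaces from the facter output above by MAC address. The function name is made up for the example; the input is an abbreviated copy of the ganeti2010 facts.

```python
from collections import defaultdict

def group_by_mac(interfaces: dict) -> dict:
    """Group interface names by MAC address, skipping interfaces without one."""
    groups = defaultdict(list)
    for name, data in interfaces.items():
        if "mac" in data:
            groups[data["mac"]].append(name)
    return {mac: sorted(names) for mac, names in groups.items()}

# Abbreviated from the 'facter -p networking' output for ganeti2010
interfaces = {
    "eno1": {"mac": "4c:d9:8f:6d:a0:85"},
    "eno1.2003": {"mac": "4c:d9:8f:6d:a0:85"},
    "eno2": {"mac": "4c:d9:8f:6d:a0:86"},
    "lo": {},
    "private": {"mac": "4c:d9:8f:6d:a0:85"},
    "public": {"mac": "4c:d9:8f:6d:a0:85"},
}

for mac, names in group_by_mac(interfaces).items():
    print(mac, names)
```

The grouping shows that eno1, eno1.2003, private and public are related, but nothing in it says which is a bridge, which is a vlan sub-interface, or which is the parent of which.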

Required Info

Ultimately we need to know what devices are bridges, and what other interfaces are members of each bridge:

cmooney@ganeti2010:~$ sudo brctl show
bridge name	bridge id		STP enabled	interfaces
private		8000.4cd98f6da085	no		eno1
public		8000.4cd98f6da085	no		eno1.2003

As well as what 802.1q sub-interfaces are defined, and what their parent interfaces are:

cmooney@ganeti2010:~$ ip -d link show type vlan 
5: eno1.2003@eno1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master public state UP mode DEFAULT group default qlen 1000
    link/ether 4c:d9:8f:6d:a0:85 brd ff:ff:ff:ff:ff:ff promiscuity 1 
    vlan protocol 802.1Q id 2003 <REORDER_HDR>

This is where my own knowledge sort of runs out though. I'm not sure how or if we can get this information exposed in puppetdb, so it is available to the import script to populate Netbox correctly.

Event Timeline

cmooney created this task.
Restricted Application added a subscriber: Aklapper.

Change 742948 had a related patch set uploaded (by Volans; author: Volans):

[operations/cookbooks@master] sre.hosts.reimage: support Ganeti hosts

https://gerrit.wikimedia.org/r/742948

Change 742948 merged by jenkins-bot:

[operations/cookbooks@master] sre.hosts.reimage: support Ganeti hosts

https://gerrit.wikimedia.org/r/742948

@cmooney we have the possibility to add custom facts to puppetdb (we already have a bunch of them), or to modify existing ones (probably less preferred, but check with jbond). Those are basically Ruby scripts, so if the information can be gathered on the host it shouldn't be a problem to expose it.
Do other hosts have the same issue? Cloudvirt or LVS hosts come to mind as possible hosts having similar issues.

@Volans thanks for the info. Sounds like we have a way forward if we want to do this. And certainly if we expand our use of bridges, sub-interfaces or more exotic network configurations on hosts we should definitely document them properly.

Right now I'm not aware of any other hosts that have the specific combination of elements that are causing a problem with the ganeti hosts.

  • LVS have multiple vlan sub-interfaces, but the primary IP for the server remains on the "untagged" physical interface.
  • Cloudvirt has a bridge defined, but the "external" member interface is actually a vlan one. The device primary IP remains on the physical interface untagged.

In both cases the changes could help us properly document what is set up and how it fits together, but neither is causing an operational issue like the ganeti one was.

Change 747099 had a related patch set uploaded (by Volans; author: Volans):

[operations/cookbooks@master] sre.hosts.dhcp: add support for Ganeti hosts

https://gerrit.wikimedia.org/r/747099

Change 747099 merged by jenkins-bot:

[operations/cookbooks@master] sre.hosts.dhcp: add support for Ganeti hosts

https://gerrit.wikimedia.org/r/747099

Change 812288 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/software/netbox-extras@master] Add parent support for servers interfaces creation

https://gerrit.wikimedia.org/r/812288

I took a naive approach with the above patch (it only checks for the usual "dot" delimiter).

Similarly I tested the following on netbox-dev:

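# Run from the Netbox shell (nbshell), which pre-loads the models;
# in a standalone script they need to be imported explicitly.
from dcim.models import Device, Interface

for d in Device.objects.all():
    for i in d.interfaces.all():
        # Skip interfaces without a "dot" delimiter or with a parent already set
        if '.' not in i.name or i.parent:
            continue
        parent_name = i.name.split('.')[0]
        try:
            parent = Interface.objects.get(name=parent_name, device_id=d.id)
        except Interface.DoesNotExist:
            continue
        print(f"{d.name}: {i.name} -> {parent.name}")
        # Dry run: uncomment the two lines below to actually set the parent
        # i.parent = parent
        # i.save()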

Which returned:

cloudgw1001: dataplane.1107 -> dataplane
cloudgw1001: dataplane.1120 -> dataplane
cloudgw1002: dataplane.1107 -> dataplane
cloudgw1002: dataplane.1120 -> dataplane
cloudgw2001-dev: eno2.2107 -> eno2
cloudgw2001-dev: eno2.2120 -> eno2
cloudgw2002-dev: eno2.2107 -> eno2
cloudgw2002-dev: eno2.2120 -> eno2
cloudnet1003: eno50.1105 -> eno50
cloudnet1003: eno50.1107 -> eno50
cloudnet1004: eno50.1105 -> eno50
cloudnet1004: eno50.1107 -> eno50
cloudnet2005-dev: eno2.2105 -> eno2
cloudnet2005-dev: eno2.2107 -> eno2
cloudnet2006-dev: eno2.2105 -> eno2
cloudnet2006-dev: eno2.2107 -> eno2
cloudsw1-c8-eqiad: xe-0/0/0.1000 -> xe-0/0/0
cloudsw1-c8-eqiad: xe-0/0/0.1102 -> xe-0/0/0
cloudsw1-c8-eqiad: lo0.5001 -> lo0
cloudsw1-d5-eqiad: xe-0/0/0.1100 -> xe-0/0/0
cloudsw1-d5-eqiad: xe-0/0/0.1103 -> xe-0/0/0
cloudsw1-d5-eqiad: lo0.5001 -> lo0
cloudsw1-e4-eqiad: lo0.5001 -> lo0
cloudsw1-f4-eqiad: lo0.5001 -> lo0
cloudvirt1016: enp4s0f1.1105 -> enp4s0f1
cloudvirt1017: enp4s0f1.1105 -> enp4s0f1
cloudvirt1019: eno50.1105 -> eno50
cloudvirt1020: eno50.1105 -> eno50
cloudvirt1021: eno2.1105 -> eno2
cloudvirt1022: eno2.1105 -> eno2
cloudvirt1025: eno2np1.1105 -> eno2np1
cloudvirt1026: eno2np1.1105 -> eno2np1
cloudvirt1027: eno2np1.1105 -> eno2np1
cloudvirt1028: eno2np1.1105 -> eno2np1
cloudvirt1029: eno2np1.1105 -> eno2np1
cloudvirt1030: eno2np1.1105 -> eno2np1
cloudvirt1031: eno2np1.1105 -> eno2np1
cloudvirt1032: eno2np1.1105 -> eno2np1
cloudvirt1033: eno2np1.1105 -> eno2np1
cloudvirt1034: eno2np1.1105 -> eno2np1
cloudvirt1035: eno2np1.1105 -> eno2np1
cloudvirt1036: eno2np1.1105 -> eno2np1
cloudvirt1037: eno2np1.1105 -> eno2np1
cloudvirt1038: eno2np1.1105 -> eno2np1
cloudvirt1039: eno2np1.1105 -> eno2np1
cloudvirt1040: eno2np1.1105 -> eno2np1
cloudvirt1041: eno2np1.1105 -> eno2np1
cloudvirt1042: eno2np1.1105 -> eno2np1
cloudvirt1043: eno2np1.1105 -> eno2np1
cloudvirt1044: eno2np1.1105 -> eno2np1
cloudvirt1045: eno2np1.1105 -> eno2np1
cloudvirt1046: eno2np1.1105 -> eno2np1
cloudvirt2001-dev: eno1.2105 -> eno1
cloudvirt2002-dev: eno2.2105 -> eno2
cloudvirt2003-dev: eno2.2105 -> eno2
cloudvirt-wdqs1001: eno2.1105 -> eno2
cloudvirt-wdqs1002: eno2.1105 -> eno2
cloudvirt-wdqs1003: eno2.1105 -> eno2
cr1-codfw: ae1.401 -> ae1
cr1-codfw: ae1.2001 -> ae1
cr1-codfw: ae1.2017 -> ae1
cr1-codfw: ae1.2201 -> ae1
cr1-codfw: ae2.2002 -> ae2
cr1-codfw: ae2.2018 -> ae2
cr1-codfw: ae2.2118 -> ae2
cr1-codfw: ae2.2120 -> ae2
cr1-codfw: ae2.2122 -> ae2
cr1-codfw: ae3.2003 -> ae3
cr1-codfw: ae3.2019 -> ae3
cr1-codfw: ae4.2004 -> ae4
cr1-codfw: ae4.2020 -> ae4
cr1-eqiad: et-1/0/2.100 -> et-1/0/2
cr1-eqiad: xe-3/0/4.1000 -> xe-3/0/4
cr1-eqiad: xe-3/0/4.1102 -> xe-3/0/4
cr1-eqiad: xe-4/2/2.12 -> xe-4/2/2
cr1-eqiad: xe-4/2/2.13 -> xe-4/2/2
cr1-eqiad: xe-4/2/2.16 -> xe-4/2/2
cr1-eqiad: ae1.401 -> ae1
cr1-eqiad: ae1.1001 -> ae1
cr1-eqiad: ae1.1017 -> ae1
cr1-eqiad: ae1.1030 -> ae1
cr1-eqiad: ae1.1117 -> ae1
cr1-eqiad: ae2.1002 -> ae2
cr1-eqiad: ae2.1018 -> ae2
cr1-eqiad: ae2.1021 -> ae2
cr1-eqiad: ae2.1202 -> ae2
cr1-eqiad: ae3.1003 -> ae3
cr1-eqiad: ae3.1019 -> ae3
cr1-eqiad: ae3.1022 -> ae3
cr1-eqiad: ae3.1119 -> ae3
cr1-eqiad: ae4.1004 -> ae4
cr1-eqiad: ae4.1020 -> ae4
cr1-eqiad: ae4.1023 -> ae4
cr2-codfw: xe-1/1/1:0.100 -> xe-1/1/1:0
cr2-codfw: ae1.402 -> ae1
cr2-codfw: ae1.2001 -> ae1
cr2-codfw: ae1.2017 -> ae1
cr2-codfw: ae1.2201 -> ae1
cr2-codfw: ae2.2002 -> ae2
cr2-codfw: ae2.2018 -> ae2
cr2-codfw: ae2.2118 -> ae2
cr2-codfw: ae2.2120 -> ae2
cr2-codfw: ae2.2122 -> ae2
cr2-codfw: ae3.2003 -> ae3
cr2-codfw: ae3.2019 -> ae3
cr2-codfw: ae4.2004 -> ae4
cr2-codfw: ae4.2020 -> ae4
cr2-drmrs: xe-0/1/1.16 -> xe-0/1/1
cr2-drmrs: xe-0/1/1.26 -> xe-0/1/1
cr2-eqiad: et-1/0/2.100 -> et-1/0/2
cr2-eqiad: xe-3/0/4.1100 -> xe-3/0/4
cr2-eqiad: xe-3/0/4.1103 -> xe-3/0/4
cr2-eqiad: ae1.402 -> ae1
cr2-eqiad: ae1.1001 -> ae1
cr2-eqiad: ae1.1017 -> ae1
cr2-eqiad: ae1.1030 -> ae1
cr2-eqiad: ae1.1117 -> ae1
cr2-eqiad: ae2.1002 -> ae2
cr2-eqiad: ae2.1018 -> ae2
cr2-eqiad: ae2.1021 -> ae2
cr2-eqiad: ae2.1202 -> ae2
cr2-eqiad: ae3.1003 -> ae3
cr2-eqiad: ae3.1019 -> ae3
cr2-eqiad: ae3.1022 -> ae3
cr2-eqiad: ae3.1119 -> ae3
cr2-eqiad: ae4.1004 -> ae4
cr2-eqiad: ae4.1020 -> ae4
cr2-eqiad: ae4.1023 -> ae4
cr2-eqsin: ae1.402 -> ae1
cr2-eqsin: ae1.510 -> ae1
cr2-eqsin: ae1.520 -> ae1
cr2-eqsin: ae1.530 -> ae1
cr2-esams: gr-0/1/0.2 -> gr-0/1/0
cr2-esams: ae1.100 -> ae1
cr2-esams: ae1.102 -> ae1
cr2-esams: ae1.103 -> ae1
cr2-esams: ae1.403 -> ae1
cr2-esams: ae1.404 -> ae1
cr2-esams: ae2.380 -> ae2
cr2-esams: ae2.381 -> ae2
cr2-knams: lo0.0 -> lo0
cr3-eqsin: ae1.401 -> ae1
cr3-eqsin: ae1.510 -> ae1
cr3-eqsin: ae1.520 -> ae1
cr3-eqsin: ae1.530 -> ae1
cr3-esams: gr-0/0/0.1 -> gr-0/0/0
cr3-esams: gr-0/0/0.2 -> gr-0/0/0
cr3-esams: ae1.100 -> ae1
cr3-esams: ae1.102 -> ae1
cr3-esams: ae1.103 -> ae1
cr3-esams: ae1.401 -> ae1
cr3-esams: ae1.402 -> ae1
cr3-knams: xe-0/1/5.13 -> xe-0/1/5
cr3-knams: xe-0/1/5.23 -> xe-0/1/5
cr3-knams: ae1.401 -> ae1
cr3-knams: ae1.403 -> ae1
cr3-ulsfo: et-0/0/1.401 -> et-0/0/1
cr3-ulsfo: et-0/0/1.501 -> et-0/0/1
cr3-ulsfo: et-0/0/1.1201 -> et-0/0/1
cr3-ulsfo: et-0/0/1.1211 -> et-0/0/1
cr3-ulsfo: et-0/0/1.1221 -> et-0/0/1
cr3-ulsfo: ae0.2 -> ae0
cr4-ulsfo: et-0/0/1.402 -> et-0/0/1
cr4-ulsfo: et-0/0/1.501 -> et-0/0/1
cr4-ulsfo: et-0/0/1.1201 -> et-0/0/1
cr4-ulsfo: et-0/0/1.1211 -> et-0/0/1
cr4-ulsfo: et-0/0/1.1221 -> et-0/0/1
cr4-ulsfo: gr-0/0/0.2 -> gr-0/0/0
cr4-ulsfo: ae0.2 -> ae0
lsw1-e1-eqiad: et-0/0/48.100 -> et-0/0/48
lsw1-e1-eqiad: lo0.1 -> lo0
lsw1-e2-eqiad: lo0.1 -> lo0
lsw1-e3-eqiad: lo0.1 -> lo0
lsw1-f1-eqiad: et-0/0/48.100 -> et-0/0/48
lsw1-f1-eqiad: lo0.1 -> lo0
lsw1-f2-eqiad: lo0.1 -> lo0
lsw1-f3-eqiad: lo0.1 -> lo0
lvs1017: eno1np0.1001 -> eno1np0
lvs1017: eno2np1.1002 -> eno2np1
lvs1017: eno2np1.1018 -> eno2np1
lvs1017: ens1f0np0.1003 -> ens1f0np0
lvs1017: ens1f0np0.1019 -> ens1f0np0
lvs1017: ens1f1np1.1004 -> ens1f1np1
lvs1017: ens1f1np1.1020 -> ens1f1np1
lvs1017: ens2f0np0.1031 -> ens2f0np0
lvs1017: ens2f0np0.1032 -> ens2f0np0
lvs1017: ens2f0np0.1033 -> ens2f0np0
lvs1017: ens2f0np0.1035 -> ens2f0np0
lvs1017: ens2f0np0.1036 -> ens2f0np0
lvs1017: ens2f0np0.1037 -> ens2f0np0
lvs1018: eno1np0.1002 -> eno1np0
lvs1018: eno2np1.1001 -> eno2np1
lvs1018: eno2np1.1017 -> eno2np1
lvs1018: ens1f0np0.1003 -> ens1f0np0
lvs1018: ens1f0np0.1019 -> ens1f0np0
lvs1018: ens1f0np0.1119 -> ens1f0np0
lvs1018: ens1f1np1.1004 -> ens1f1np1
lvs1018: ens1f1np1.1020 -> ens1f1np1
lvs1018: ens2f0np0.1031 -> ens2f0np0
lvs1018: ens2f0np0.1032 -> ens2f0np0
lvs1018: ens2f0np0.1033 -> ens2f0np0
lvs1018: ens2f0np0.1035 -> ens2f0np0
lvs1018: ens2f0np0.1036 -> ens2f0np0
lvs1018: ens2f0np0.1037 -> ens2f0np0
lvs1019: eno1np0.1003 -> eno1np0
lvs1019: eno2np1.1001 -> eno2np1
lvs1019: eno2np1.1017 -> eno2np1
lvs1019: ens1f0np0.1002 -> ens1f0np0
lvs1019: ens1f0np0.1018 -> ens1f0np0
lvs1019: ens1f1np1.1004 -> ens1f1np1
lvs1019: ens1f1np1.1020 -> ens1f1np1
lvs1019: ens2f0np0.1031 -> ens2f0np0
lvs1019: ens2f0np0.1032 -> ens2f0np0
lvs1019: ens2f0np0.1033 -> ens2f0np0
lvs1019: ens2f0np0.1035 -> ens2f0np0
lvs1019: ens2f0np0.1036 -> ens2f0np0
lvs1019: ens2f0np0.1037 -> ens2f0np0
lvs1020: eno1np0.1004 -> eno1np0
lvs1020: eno2np1.1001 -> eno2np1
lvs1020: eno2np1.1017 -> eno2np1
lvs1020: ens1f0np0.1002 -> ens1f0np0
lvs1020: ens1f0np0.1018 -> ens1f0np0
lvs1020: ens1f1np1.1003 -> ens1f1np1
lvs1020: ens1f1np1.1019 -> ens1f1np1
lvs1020: ens1f1np1.1119 -> ens1f1np1
lvs1020: ens2f0np0.1031 -> ens2f0np0
lvs1020: ens2f0np0.1032 -> ens2f0np0
lvs1020: ens2f0np0.1033 -> ens2f0np0
lvs1020: ens2f0np0.1035 -> ens2f0np0
lvs1020: ens2f0np0.1036 -> ens2f0np0
lvs1020: ens2f0np0.1037 -> ens2f0np0
lvs2007: ens2f0np0.2001 -> ens2f0np0
lvs2007: ens2f1np1.2002 -> ens2f1np1
lvs2007: ens2f1np1.2018 -> ens2f1np1
lvs2007: ens3f0np0.2003 -> ens3f0np0
lvs2007: ens3f0np0.2019 -> ens3f0np0
lvs2007: ens3f1np1.2004 -> ens3f1np1
lvs2007: ens3f1np1.2020 -> ens3f1np1
lvs2008: ens2f0np0.2002 -> ens2f0np0
lvs2008: ens2f1np1.2001 -> ens2f1np1
lvs2008: ens2f1np1.2017 -> ens2f1np1
lvs2008: ens3f0np0.2003 -> ens3f0np0
lvs2008: ens3f0np0.2019 -> ens3f0np0
lvs2008: ens3f1np1.2004 -> ens3f1np1
lvs2008: ens3f1np1.2020 -> ens3f1np1
lvs2009: ens2f0np0.2003 -> ens2f0np0
lvs2009: ens2f1np1.2001 -> ens2f1np1
lvs2009: ens2f1np1.2017 -> ens2f1np1
lvs2009: ens3f0np0.2002 -> ens3f0np0
lvs2009: ens3f0np0.2018 -> ens3f0np0
lvs2009: ens3f1np1.2004 -> ens3f1np1
lvs2009: ens3f1np1.2020 -> ens3f1np1
lvs2010: ens2f0np0.2004 -> ens2f0np0
lvs2010: ens2f1np1.2001 -> ens2f1np1
lvs2010: ens2f1np1.2017 -> ens2f1np1
lvs2010: ens3f0np0.2002 -> ens3f0np0
lvs2010: ens3f0np0.2018 -> ens3f0np0
lvs2010: ens3f1np1.2003 -> ens3f1np1
lvs2010: ens3f1np1.2019 -> ens3f1np1
lvs3005: ens3f0np0.100 -> ens3f0np0
lvs3006: ens3f0np0.100 -> ens3f0np0
lvs3007: ens3f0np0.100 -> ens3f0np0
lvs4005: enp5s0f0.1201 -> enp5s0f0
lvs4006: enp5s0f0.1201 -> enp5s0f0
lvs4007: enp5s0f0.1201 -> enp5s0f0
lvs5001: enp5s0f0.510 -> enp5s0f0
lvs5002: enp5s0f0.510 -> enp5s0f0
lvs5003: enp5s0f0.510 -> enp5s0f0
lvs6001: ens3f0np0.611 -> ens3f0np0
lvs6001: ens3f1np1.612 -> ens3f1np1
lvs6001: ens3f1np1.622 -> ens3f1np1
lvs6002: ens3f0np0.612 -> ens3f0np0
lvs6002: ens3f1np1.611 -> ens3f1np1
lvs6002: ens3f1np1.621 -> ens3f1np1
lvs6003: ens3f0np0.611 -> ens3f0np0
lvs6003: ens3f1np1.612 -> ens3f1np1
lvs6003: ens3f1np1.622 -> ens3f1np1
mr1-codfw: ge-0/0/1.401 -> ge-0/0/1
mr1-codfw: ge-0/0/1.402 -> ge-0/0/1
mr1-eqiad: ge-0/0/1.401 -> ge-0/0/1
mr1-eqiad: ge-0/0/1.402 -> ge-0/0/1
mr1-eqsin: ge-0/0/4.401 -> ge-0/0/4
mr1-eqsin: ge-0/0/4.402 -> ge-0/0/4
mr1-esams: ge-0/0/1.402 -> ge-0/0/1
mr1-esams: ge-0/0/1.404 -> ge-0/0/1
mr1-esams: ge-0/0/6.2483 -> ge-0/0/6
mr1-esams: ge-0/0/6.2484 -> ge-0/0/6
mr1-ulsfo: ge-0/0/4.401 -> ge-0/0/4
mr1-ulsfo: ge-0/0/4.402 -> ge-0/0/4

Running it on servers would require the above CR to be tested/merged to prevent discrepancies.

It doesn't solve the bridge point, but it is a step in the right direction as we could leverage it once the needed data gets into PuppetDB.

Super work!

I'll maybe try to dig into the puppet custom facts stuff, be a chance to learn some Ruby I guess :)

@Volans could you point me at any existing custom_facts and the code we're using to get them? Data can definitely be got easily from the hosts (at least with Python!), for example:

cmooney@ganeti1025:~$ cat /tmp/get_netinfo.py 
#!/bin/python3

import subprocess
import json

print("802.1q sub-interfaces:")
sub_ints = json.loads(subprocess.check_output("/usr/bin/ip -j -d link show type vlan", shell=True))
for sub_int in sub_ints:
  print(f"Device: {sub_int['ifname']}, Parent: {sub_int['link']}, Vlan ID: {sub_int['linkinfo']['info_data']['id']}")
print()

bridges = json.loads(subprocess.check_output("/usr/bin/ip -j -d link show type bridge", shell=True))
bridge_info = {}
for bridge in bridges:
  bridge_info[bridge['ifname']] = {}
  member_data = json.loads(subprocess.check_output(f"/usr/bin/ip -j -d link show master {bridge['ifname']}", shell=True))
  for bridge_member in member_data:
    # Members without an 'info_kind' (e.g. physical NICs) are recorded as None
    bridge_info[bridge['ifname']][bridge_member['ifname']] = bridge_member['linkinfo'].get('info_kind')

print("Bridges and member ports:")
for bridge_name, members in bridge_info.items():
  print(f"{bridge_name}: ", end="")
  for member_name, member_kind in members.items():
    if member_kind != "tun":
      print(f"{member_name}, ", end="")
  print()

Outputs:

cmooney@ganeti1025:~$ /tmp/get_netinfo.py
802.1q sub-interfaces:
Device: ens3f0np0.1001, Parent: ens3f0np0, Vlan ID: 1001
Device: ens3f0np0.1030, Parent: ens3f0np0, Vlan ID: 1030

Bridges and member ports:
private: ens3f0np0, 
public: ens3f0np0.1001, 
analytics: ens3f0np0.1030,

Depending on the depth of this rabbit hole, it might be better to focus on DHCP option 97 (which solves the same initial issue at a larger scale). Unless having this information in PuppetDB is useful for something else.

I agree it's not worth massive effort, Option 97 is the better way to resolve the initial problem for sure.

It would be good to have the network objects properly represented in Netbox though, and the info available in puppet db. I'll see how tricky it looks, if it's easy I reckon it's a good thing to have.

@Volans could you point me at any existing custom_facts and the code we're using to get them? Data can definitely be got easily from the hosts (at least with Python!), for example:

Everything in modules/base/lib/facter/ in the puppet repo, each file is a custom fact.
But what exactly would you like to add to PuppetDB? Where would the data be used? We're usually trying to not add too much stuff unless it's needed somewhere to not overwhelm PuppetDB itself. Check with John ;)

Change 812288 merged by jenkins-bot:

[operations/software/netbox-extras@master] Add parent support for servers interfaces creation

https://gerrit.wikimedia.org/r/812288

Change 821781 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/puppet@production] Add additional network device info to puppet facts

https://gerrit.wikimedia.org/r/821781

For a bit of context the above patch will augment the existing vars under networking.interfaces, potentially adding up to 4 keys under each interface if they apply:

Key            Value
kind           Netdev 'info_kind' reported, e.g. 'bridge' or 'vlan'
dot1q          802.1q tag configured if interface kind is vlan
parent_link    Parent interface if interface kind is vlan
parent_bridge  Parent bridge device if interface is bound to one

An example result is shown below for a vlan sub-int that is bound to a bridge device:

cmooney@ganeti1025:~$ facter --json --custom-dir /var/lib/puppet/lib/facter networking.interfaces | jq '."networking.interfaces"."ens3f0np0.1030"'
{
  "mac": "e4:3d:1a:7a:ca:40",
  "mtu": 1500,
  "kind": "vlan",
  "dot1q": 1030,
  "parent_link": "ens3f0np0",
  "parent_bridge": "analytics"
}
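A consumer such as the import script could derive the sub-interface and bridge relationships from these keys along the following lines. This is only a sketch, not the actual import script code: the function name is made up, and apart from the ens3f0np0.1030 entry shown above, the example facts for ganeti1025 are inferred from the get_netinfo.py output earlier in this task.

```python
from collections import defaultdict

def interface_relations(interfaces: dict):
    """Return (vlan children per parent link, members per bridge) from the facts."""
    vlan_children = defaultdict(list)
    bridge_members = defaultdict(list)
    for name, data in interfaces.items():
        if data.get("kind") == "vlan":
            vlan_children[data["parent_link"]].append((name, data["dot1q"]))
        if "parent_bridge" in data:
            bridge_members[data["parent_bridge"]].append(name)
    return dict(vlan_children), dict(bridge_members)

# Abbreviated facts; only the keys added by the patch are included
facts = {
    "ens3f0np0": {"parent_bridge": "private"},
    "ens3f0np0.1001": {"kind": "vlan", "dot1q": 1001,
                       "parent_link": "ens3f0np0", "parent_bridge": "public"},
    "ens3f0np0.1030": {"kind": "vlan", "dot1q": 1030,
                       "parent_link": "ens3f0np0", "parent_bridge": "analytics"},
    "private": {"kind": "bridge"},
    "public": {"kind": "bridge"},
    "analytics": {"kind": "bridge"},
}

vlan_children, bridge_members = interface_relations(facts)
print(vlan_children)
print(bridge_members)
```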

I've tested it manually on all our currently active Debian versions and it seems to be ok. Will merge tomorrow and take it from there.

Change 821781 merged by Cathal Mooney:

[operations/puppet@production] Add additional network device info to puppet facts

https://gerrit.wikimedia.org/r/821781

Change 822439 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/software/netbox-extras@master] Adjust Netbox PuppetDB import script to set bridge dev and vlan tags

https://gerrit.wikimedia.org/r/822439

The above patch uses the new puppet facts to define vlan sub-interface and bridge relations as described in the task description.

I've copied it to netbox-next to test, and I've run it for ganeti1025 and lvs1017 as examples. @ayounsi I'd be interested in your thoughts, esp. on how best to represent this in Netbox. My goal is to proceed and update the automation to set switch interface access/trunk and allowed vlans once this is merged.

So far Netbox is the source of truth for network devices. DCops configures switch ports during host provisioning and such setup is fairly static.
Having switch interfaces configs more dynamic brings some risks: dependency on Puppet, more possibilities of bugs, discrepancies (pending changes) between Netbox data and devices.
For example Debian's ENI interface naming caused issues in the past. Someone manually editing an interface on a host will update the source of truth.
The overall principle is that (with some exceptions, like VMs) Netbox drives the infrastructure, and not the other way around. Discrepancies between the ideal state (Netbox) and the live state (PuppetDB, LibreNMS) should be reported to be fixed (with Netbox reports).
Maybe this is worth an exception, but we should look at it through the whole server's lifecycle lens and make sure it solves the initial problem in a sustainable way.

Netbox drives the infrastructure, and not the other way around.

Fully agree that's best. But unfortunately it's not the case right now for host networking. It would be ideal if all elements got defined in Netbox first and pushed out, but we are where we are. In terms of a "source of truth" I think it's important the same source defines the config for both sides of a link, i.e. we shouldn't have two sources of data for that (ideally), which can potentially be inconsistent.

But fully agree that the host as the source of truth is not a good paradigm. I guess I was thinking more on eliminating the potential for discrepancies than where the data was coming from.

The main aim here was to remove the manual step we get pinged about after servers with sub-ints are provisioned. It's only a nuisance for LVS and Ganeti, as we don't have many of them. But I'd fear it'd become a chore if we have more of them in future, for instance for WMCS.

My goal is to proceed and update the automation to set switch interface access/trunk and allowed vlans once this is merged.

Perhaps we could start off with a report as you mention. We have the sub-int data on the host side now to drive that.

Would a cookbook be an idea possibly? That we could run ourselves to update a specific network port to match the host side, if the report fired and we were happy the host config is sane?

I think we can figure that out later depending on how often such report triggers and the conditions that lead to it.
If it's frequent, the cookbook might be more of a stopgap and working on the overall provisioning workflow might be a better use of our time.
If it's infrequent: a manual Netbox edit + running Homer/the switch interface config cookbook might be good enough.
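The core check such a report would need can be sketched as a plain function. This is not written against the Netbox report API; the function name, the result keys and the switch-side vlan list are hypothetical, and the host-side data follows the shape of the new puppet facts.

```python
def vlan_discrepancies(interfaces: dict, parent: str, switch_tagged_vlans: list) -> dict:
    """Compare vlan tags used by a host's sub-interfaces with the switch port config."""
    host_tags = {
        data["dot1q"]
        for data in interfaces.values()
        if data.get("kind") == "vlan" and data.get("parent_link") == parent
    }
    switch_tags = set(switch_tagged_vlans)
    return {
        "missing_on_switch": sorted(host_tags - switch_tags),
        "unused_on_switch": sorted(switch_tags - host_tags),
    }

# Host-side facts (abbreviated) and a made-up switch port that only trunks vlan 1001
facts = {
    "ens3f0np0": {},
    "ens3f0np0.1001": {"kind": "vlan", "dot1q": 1001, "parent_link": "ens3f0np0"},
    "ens3f0np0.1030": {"kind": "vlan", "dot1q": 1030, "parent_link": "ens3f0np0"},
}

result = vlan_discrepancies(facts, "ens3f0np0", [1001])
print(result)
```

A non-empty `missing_on_switch` list is the case that would warrant a manual Netbox edit or a cookbook run.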

I had a good discussion with @jbond on irc about how we model the host interfaces in Netbox, and based on that I think it might make most sense to do it in the way shown in the screenshot below.

[Screenshot: image.png]

Basically we configure the physical interface, over which the tagged frames flow, as mode "tagged". We set the correct "untagged vlan" on this port, as well as the tags supporting any child sub-interfaces in the list of "tagged vlans".

The child sub-interfaces we instead configure in mode "access", and set the "untagged vlan" to the vlan their traffic gets tagged with as it flows through the parent.

I think this most closely resembles what the actual kernel is doing in terms of tagging etc. The sub-ints aren't really "tagged" as such, it is just that traffic from them gets a tag applied when it goes over the parent. So sub-ints set to untagged, parent set to tagged.
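The modelling described above can be sketched as follows, assuming the host facts carry `kind`, `dot1q` and `parent_link` for vlan sub-interfaces. The function name and output keys are hypothetical; the output uses bare vlan IDs where Netbox itself would reference VLAN objects, the parent's untagged vlan (1017 here) is a made-up value that has to come from elsewhere since the facts do not contain it, and nested sub-interfaces are not handled.

```python
def netbox_vlan_model(interfaces: dict, native_vlans: dict) -> dict:
    """Derive per-interface mode and vlan settings from host facts."""
    model = {}
    for name, data in interfaces.items():
        if data.get("kind") != "vlan":
            continue
        # Child sub-interface: mode "access", untagged vlan set to its 802.1q tag
        model[name] = {"mode": "access", "untagged_vlan": data["dot1q"]}
        # Parent interface: mode "tagged", collecting the tags of all its children
        parent = model.setdefault(
            data["parent_link"],
            {
                "mode": "tagged",
                "untagged_vlan": native_vlans.get(data["parent_link"]),
                "tagged_vlans": [],
            },
        )
        parent["tagged_vlans"].append(data["dot1q"])
        parent["tagged_vlans"].sort()
    return model

facts = {
    "ens3f0np0": {},
    "ens3f0np0.1001": {"kind": "vlan", "dot1q": 1001, "parent_link": "ens3f0np0"},
    "ens3f0np0.1030": {"kind": "vlan", "dot1q": 1030, "parent_link": "ens3f0np0"},
}

model = netbox_vlan_model(facts, {"ens3f0np0": 1017})
for name, settings in model.items():
    print(name, settings)
```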

For a bit of context the above patch will augment the existing vars under networking.interfaces, potentially adding up to 4 keys under each interface if they apply

Just to be on the safe side, given that we got bit by this in the past, was the puppetdb size checked before/after the patch?
Fleet-wide facts can easily make that grow quite a bit.

I checked the puppetdb health after this patch was deployed and all looks good. There is increased activity after the first patch, as puppetdb needs to run an update statement on the factset table which is heavy (similar to provisioning a new host), but after that initial run everything is fine.