
Update puppet's topology.kubernetes.io/zone logic to take into account the new setup
Closed, ResolvedPublic

Event Timeline

Clement_Goubert triaged this task as Medium priority.

Change 980925 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/deployment-charts@master] BGPPeers: add codfw racks A1 to B8

https://gerrit.wikimedia.org/r/980925

Change 980927 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/puppet@production] k8s topology labels: add row to rack transition

https://gerrit.wikimedia.org/r/980927

Change 980925 merged by jenkins-bot:

[operations/deployment-charts@master] BGPPeers: add codfw racks A1 to B8

https://gerrit.wikimedia.org/r/980925

Mentioned in SAL (#wikimedia-operations) [2023-12-12T15:21:50Z] <claime> Deploying new calico BGPPeers for codfw rows a/b - T352893

Summary of the discussion on the linked CR:

  • LLDP-based logic runs the risk of losing calico connectivity if LLDP fails during a puppet run
  • The GraphQL paths from a host to its vlan or switch name are deep, and missing data, missing links, or exceptions make it difficult to export that information from Netbox to hiera
  • The primary IP of the node could be compared at puppet run time against the networks in modules/network/data/data.yaml to get the prefix/vlan name from there
  • It may be possible to add the prefix/vlan name as a fact, but populating facts from hiera is not a great pattern

I am left wondering if the fear of LLDP failure is justified or not, as I lack knowledge of that particular protocol and its stability on our architecture.
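As a rough illustration of the primary-IP option above (a sketch only, not the actual implementation): compare the host's primary IP against a prefix-to-vlan mapping of the shape that modules/network/data/data.yaml could provide. The prefixes and names below are illustrative placeholders.

```python
import ipaddress

# Illustrative prefix -> vlan-name mapping; the real data would come from
# modules/network/data/data.yaml at puppet run time.
NETWORKS = {
    "10.192.0.0/22": "private1-a-codfw",
    "10.192.16.0/22": "private1-b-codfw",
}

def vlan_for_ip(primary_ip: str):
    """Return the vlan name whose network contains the host's primary IP."""
    addr = ipaddress.ip_address(primary_ip)
    for prefix, vlan in NETWORKS.items():
        if addr in ipaddress.ip_network(prefix):
            return vlan
    return None
```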

> I am left wondering if the fear of LLDP failure is justified or not, as I lack knowledge of that particular protocol and its stability on our architecture.

I've been fearing this and started thinking about alternative paths. I think that if we fail all of puppet hard when the LLDP fact is missing, we are going to be ok. Something like:


if !$facts['lldp'] {
    fail('LLDP fact gathering failed; failing the entire puppet run to avoid unwanted kubelet topology changes')
} elsif $facts['lldp']['parent'] =~ /^lsw/ {
    # logic for the new per-rack setup
} else {
    # logic for the old per-row setup
}

What do you think?

> I am left wondering if the fear of LLDP failure is justified or not, as I lack knowledge of that particular protocol and its stability on our architecture.

Thanks for the succinct summary, Clem. The fear with LLDP is mostly that it might somehow not work, leaving the info missing from puppetdb when required. I'm not sure how likely that is.

We need to remember the other part of the feedback on the patch from @ayounsi, i.e. that we can't use the switch name to determine the required config under the current plan (which includes moving server uplinks to new switches this quarter, but leaving them on the current vlans/subnets initially - we can obviously review).

The last thinking was that that ruled out LLDP; however, I see it does include the vlan-id (not the name). That can potentially be used; for instance, puppetdb has this for kubestage2001:

{
    "eno1": {
        "mtu": 9192,
        "port": "ge-1/0/2",
        "descr": "kubestage2001",
        "vlans": {
            "mode": "access",
            "untagged_vlan": 2017
        },
        "router": true,
        "neighbor": "asw-a-codfw.mgmt.codfw.wmnet",
        "neighbors": [
            "asw-a-codfw.mgmt.codfw.wmnet"
        ]
    },
    "parent": "asw-a-codfw.mgmt.codfw.wmnet"
}

All k8s hosts in rows a/b are currently either on vlan 2017 (private1-a-codfw) or 2018 (private1-b-codfw), so possibly we could hard-code a check for those. Otherwise I think that rules out using LLDP and we need to explore the other options.
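A rough sketch (not the actual implementation) of that hard-coded check: scan the LLDP fact structure shown above for the untagged vlan and treat 2017/2018 as the old row-wide setup.

```python
# Sketch only: decide old-vs-new setup from the LLDP fact structure shown
# above for kubestage2001. Vlans 2017 (private1-a-codfw) and 2018
# (private1-b-codfw) are the two currently relevant row-wide vlans.
OLD_ROW_VLANS = {2017, 2018}

def uses_old_row_setup(lldp_fact: dict) -> bool:
    for value in lldp_fact.values():
        # Skip non-interface entries such as the top-level "parent" string.
        if isinstance(value, dict) and "vlans" in value:
            if value["vlans"].get("untagged_vlan") in OLD_ROW_VLANS:
                return True
    return False
```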

I might be missing context, but why can't we get that info from Netbox? Extracting it doesn't seem that hard; from a Netbox nbshell:

>>> from ipam.choices import PrefixStatusChoices
>>> d = Device.objects.get(name="kubernetes1009")
>>> Prefix.objects.get(prefix__net_contains=str(d.primary_ip.address.ip), status=PrefixStatusChoices.STATUS_ACTIVE).vlan.name
'private1-b-eqiad'

> I might be missing context, but why can't we get that info from Netbox? Extracting it doesn't seem that hard.

The hiera sync gets the info with a GraphQL query, and there was some hesitation because I don't think a single query would get it. However, the way you've done it with "prefix__net_contains" is very elegant; if that were available, maybe a single query is possible. Also, a second query would probably not be the end of the world.

> I've been fearing this and started thinking about alternative paths. And I think that if we fail all of puppet hard in case of LLDP fact failure, we are going to be ok.

That sounds like a great option to me! It's lean and would work well as a safety net while giving visibility/monitoring if there is any issue (e.g. flapping Puppet runs). If that happens we could revisit the approach.

> I think that if we fail all of puppet hard in case of LLDP fact failure, we are going to be ok.

LLDP with that additional protection ought to work ok. The problem remains that the switch name is not going to be enough to know what to do, so I think we need to use the vlan-id. Some options might be:

  • LLDP: get the numeric vlan-id from LLDP, look up the vlan name from that, and determine whether it is row-wide or rack-specific.
    • This probably needs a new export of vlan IDs and names from Netbox, but that doesn't seem tricky.
  • Primary IP: take the host's primary IP and look up the network it belongs to in the synced hiera data 'profile::netbox::data::prefixes' to get the vlan name.
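To make the first option concrete, here is a rough sketch. The vlan-id-to-name export is mocked with illustrative entries, and the assumption that rack-specific vlan names carry a rack digit (e.g. private1-a1-codfw) is mine, for illustration only.

```python
import re

# Illustrative stand-in for a vlan id -> name export from Netbox.
VLAN_NAMES = {
    2017: "private1-a-codfw",   # row-wide vlan (old setup)
    2060: "private1-a1-codfw",  # hypothetical rack-specific vlan
}

def zone_kind(vlan_id: int) -> str:
    """Guess whether a vlan is row-wide or rack-specific from its name."""
    name = VLAN_NAMES[vlan_id]
    match = re.match(r"^private1-([a-h])(\d*)-", name)
    if not match:
        raise ValueError(f"unrecognised vlan name {name!r}")
    return "rack" if match.group(2) else "row"
```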

Alternatively, we could decide not to move k8s hosts from old switches to new ones with the rest of the estate when we tackle them rack-by-rack. That way we can keep the "asw" = old style, "lsw" = new style logic. But it means that re-IP'ing and moving the hosts to the new setup needs to be coordinated with DC-ops in each case, rather than being something that can be done without any physical changes.

> The problem remains that the switch name is not going to be enough to know what to do, so I think we need to use the vlan-id.

Yep, I mentioned it in the loooong Gerrit CR comment thread, suggesting to just hardcode the 2 (or 4) relevant vlan IDs during the transition.

I sent PatchSet #4 implementing this approach (as well as the suggestion from Alex) so we have something to base our discussions on.

> Yep, I mentioned it in the loooong Gerrit CR comment thread, suggesting to just hardcode the 2 (or 4) relevant vlan IDs during the transition.

Sorry, missed that. Yeah, I'm happy to hardcode the vlan IDs; I don't think we need to over-engineer it.

Mentioned in SAL (#wikimedia-operations) [2024-01-18T15:45:36Z] <claime> stopping puppet on P:kubernetes::node to deploy 980927 - T352893

Change 980927 merged by Clément Goubert:

[operations/puppet@production] k8s topology labels: add row to rack transition

https://gerrit.wikimedia.org/r/980927

Mentioned in SAL (#wikimedia-operations) [2024-01-18T15:49:11Z] <claime> Running puppet on kubestage2002 - T352893

Summary of deployment from T352883: Test IP-renumbering on kubestage2002.codfw.wmnet:

  • No-op on all nodes except kubestage2002 (expected)
  • It changed the labels correctly on kubestage2002, and IPv4 BGP worked out of the box
  • IPv6 BGP needed a restart of the calico pod

Nice!!

The v6 one is probably just a fluke; we should investigate it only if it happens on other nodes.

> The v6 one is probably just a fluke; we should investigate it only if it happens on other nodes.

Considering this Resolved then.