error while resolving custom fact "lldp_neighbors" on ms-be105[1-9], ms-be205[1-6] and relforge100[3-4]
Closed, ResolvedPublic

Description

I ran into this error today on ms-be1051 (it seems harmless, though it is reproducible on every Puppet run):

Info: Using configured environment 'production'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Retrieving locales
Info: Loading facts
Error: Facter: error while resolving custom fact "lldp_neighbors": undefined method `[]' for nil:NilClass
Info: Caching catalog for ms-be1051.eqiad.wmnet
Info: Applying configuration version '(545b517cea) Cwhite - logging: clean up legacy logstash alerts'

Event Timeline

Restricted Application added a subscriber: Aklapper.

So, after a quick check this is what I found:

<?xml version="1.0" encoding="UTF-8"?>
<lldp label="LLDP neighbors"/>
  • All the affected hosts are on stretch, but of the ~375 hosts we still have on stretch they are the only ones showing the issue; the lldpd version seems to be the same.

For the actual error message I'm sending a fix, as it is trivial.
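The actual fix lives in the custom fact itself (see the patch below in the timeline). As a rough illustration of the empty case only, the following shell sketch reports whether lldpcli XML output lists any neighbour. It assumes that neighbours appear as <interface ...> elements inside <lldp>; the function name lldp_has_neighbors is hypothetical and not part of any existing tooling.

```shell
# Sketch only: report whether lldpcli XML output lists any neighbour.
# Assumption: each neighbour appears as an <interface ...> element inside <lldp>.
lldp_has_neighbors() {
    # Reads XML on stdin; exit status 0 if at least one <interface> element exists.
    grep -q '<interface[ >]'
}

# Hypothetical usage on a host with lldpd installed:
#   if lldpcli -f xml show neighbors | lldp_has_neighbors; then
#       echo "neighbours found"
#   else
#       echo "no LLDP neighbours (empty <lldp/> element)"
#   fi
```

With the empty <lldp label="LLDP neighbors"/> document shown above the function returns non-zero, which is the case the fact has to handle instead of indexing into a missing value.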

Change 721031 had a related patch set uploaded (by Volans; author: Volans):

[operations/puppet@production] facter: fix lldp_neighbors error on empty lldp

https://gerrit.wikimedia.org/r/721031

Volans triaged this task as Medium priority.Sep 14 2021, 5:13 PM

All the affected hosts are HP and seem to have an Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 02) network card and the i40e driver. I'm wondering if that might be the culprit in some form; we should check the driver versions.

This may be related to this reported bug. It seems these Intel cards have an on-board LLDP agent which, if enabled, causes the card to parse LLDP frames itself and not pass them to the kernel:

https://bugs.launchpad.net/maas/+bug/1750688

According to that, it can be disabled in one of two ways:

echo "lldp stop" > "/sys/kernel/debug/i40e/0000:5d:00.0/command"

or

ethtool --set-priv-flags eno5 disable-fw-lldp on

This is discussed a little here: https://www.thomas-krenn.com/en/wiki/Intel_Ethernet_700_Series_LACP_Configuration

That page mentions that at least firmware version NVM 6.01 (for the NIC) and a current driver version are required. According to ethtool, the X710 in ms-be1051 has firmware 6.8, which should be OK. But it doesn't show the LLDP disable option when I run the ethtool "--show-priv-flags" command:

cmooney@ms-be1051:~$ sudo /sbin/ethtool --show-priv-flags eno5 
Private flags for eno5:
MFP                    : off
LinkPolling            : off
flow-director-atr      : on
veb-stats              : off
hw-atr-eviction        : off
vf-true-promisc-support: off

Might be worth trying to echo to the "command" file and see if it works I guess?
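To check for the flag across several hosts without reading the output by eye, a small filter along these lines could be used. This is a sketch only: has_priv_flag is a hypothetical helper name, and it assumes the "name : on|off" layout shown in the output above.

```shell
# Sketch: check whether a private flag appears in `ethtool --show-priv-flags` output.
# $1: flag name (e.g. disable-fw-lldp); stdin: the ethtool output.
has_priv_flag() {
    awk -v flag="$1" -F':' '
        # Take the text before the first colon and strip trailing padding.
        { name = $1; sub(/[ \t]+$/, "", name) }
        name == flag { found = 1 }
        END { exit (found ? 0 : 1) }
    '
}

# Hypothetical usage:
#   sudo /sbin/ethtool --show-priv-flags eno5 | has_priv_flag disable-fw-lldp \
#       || echo "disable-fw-lldp not exposed on this host"
```

Run against the ms-be1051 output above, the check for disable-fw-lldp fails, while a check for veb-stats succeeds.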

More info: https://community.intel.com/t5/Ethernet-Products/X710-dropping-LLDP-frames/td-p/348508?start=0&tstart=0

Volans renamed this task from error while resolving custom fact "lldp_neighbors" on ms-be1051 to error while resolving custom fact "lldp_neighbors" on ms-be105[1-9], ms-be205[1-6] and relforge100[3-4].Sep 14 2021, 5:57 PM

We attempted disabling the NIC's own LLDP parser by echoing the command and it seems to have worked: LLDP frames from the switch are now visible in a tcpdump (they weren't before):

root@relforge1004:~# echo "lldp stop" > /sys/kernel/debug/i40e/0000\:5d\:00.0/command 
root@relforge1004:~# 
root@relforge1004:~# sudo tcpdump -c 10 -e -vvv -i eno5 -l -p -nn  ether proto 0x88cc 
tcpdump: listening on eno5, link-type EN10MB (Ethernet), capture size 262144 bytes
18:12:14.995620 38:4f:49:b4:c9:c8 > 01:80:c2:00:00:0e, ethertype LLDP (0x88cc), length 342: LLDP, length 328
	Chassis ID TLV (1), length 7
	  Subtype MAC address (4): 4c:16:fc:fb:47:80
	  0x0000:  044c 16fc fb47 80
	Port ID TLV (2), length 10
	  Subtype Interface Name (5): xe-2/0/37
	  0x0000:  0578 652d 322f 302f 3337
	Time to Live TLV (3), length 2: TTL 120s
	  0x0000:  0078
	System Name TLV (5), length 12: asw2-b-eqiad
	  0x0000:  6173 7732 2d62 2d65 7169 6164
	System Description TLV (6), length 165
	  Juniper Networks, Inc. qfx5100-48s-6q Ethernet Switch, kernel JUNOS 14.1X53-D46.7, Build date: 2017-11-23 22:13:11 UTC Copyright (c) 1996-2017 Juniper Networks, Inc.
	  0x0000:  4a75 6e69 7065 7220 4e65 7477 6f72 6b73
	  0x0010:  2c20 496e 632e 2071 6678 3531 3030 2d34
	  0x0020:  3873 2d36 7120 4574 6865 726e 6574 2053
	  0x0030:  7769 7463 682c 206b 6572 6e65 6c20 4a55
	  0x0040:  4e4f 5320 3134 2e31 5835 332d 4434 362e
	  0x0050:  372c 2042 7569 6c64 2064 6174 653a 2032
	  0x0060:  3031 372d 3131 2d32 3320 3232 3a31 333a
	  0x0070:  3131 2055 5443 2043 6f70 7972 6967 6874
	  0x0080:  2028 6329 2031 3939 362d 3230 3137 204a
	  0x0090:  756e 6970 6572 204e 6574 776f 726b 732c
	  0x00a0:  2049 6e63 2e
	System Capabilities TLV (7), length 4
	  System  Capabilities [Bridge, Router] (0x0014)
	  Enabled Capabilities [Bridge, Router] (0x0014)
	  0x0000:  0014 0014
	Management Address TLV (8), length 24
	  Management Address length 5, AFI IPv4 (1): 10.65.0.25
	  Interface Index Interface Numbering (2): 35
	  OID length 12\0x01\0x03\0x06\0x01\0x02\0x01\0x1f\0x01\0x01\0x01\0x01#
	  0x0000:  0501 0a41 0019 0200 0000 230c 0103 0601
	  0x0010:  0201 1f01 0101 0123
	Port Description TLV (4), length 12: relforge1004
	  0x0000:  7265 6c66 6f72 6765 3130 3034
	Organization specific TLV (127), length 9: OUI IEEE 802.3 Private (0x00120f)
	  MAC/PHY configuration/status Subtype (1)
	    autonegotiation [none] (0x00)
	    PMD autoneg capability [unknown] (0x8000)
	    MAU type Unknown (0x0000)
	  0x0000:  0012 0f01 0080 0000 00
	Organization specific TLV (127), length 9: OUI IEEE 802.3 Private (0x00120f)
	  Link aggregation Subtype (3)
	    aggregation status [supported], aggregation port ID 0
	  0x0000:  0012 0f03 0100 0000 00
	Organization specific TLV (127), length 6: OUI IEEE 802.3 Private (0x00120f)
	  Max frame size Subtype (4)
	    MTU size 9192
	  0x0000:  0012 0f04 23e8
	Organization specific TLV (127), length 6: OUI Ethernet bridged (0x0080c2)
	  Port VLAN Id Subtype (1)
	    port vlan id (PVID): 1021
	  0x0000:  0080 c201 03fd
	Organization specific TLV (127), length 16: OUI Juniper (0x009069)
	  0x0000:  0090 6901 5441 3337 3137 3236 3033 3438
	Organization specific TLV (127), length 16: OUI Ethernet bridged (0x0080c2)
	  VLAN name Subtype (3)
	    vlan id (VID): 1021
	    vlan name: vlan-1021
	  0x0000:  0080 c203 03fd 0976 6c61 6e2d 3130 3231
	End TLV (0), length 0

So this fix should work for affected hosts with these NICs. We'll need to work out the best way to apply it and how to deal with it for future builds with these NICs.
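When repeating this check on other hosts, the verbose tcpdump output can be reduced to the neighbour's system name and port. The following is a sketch only: lldp_summary is a hypothetical helper, and it assumes the switch sends its Port ID with the "Interface Name" subtype, as in the capture above.

```shell
# Sketch: reduce `tcpdump -vvv` LLDP output to "<system name> <port id>" per frame.
# Assumption: the Port ID TLV uses the "Interface Name" subtype, as in the capture above.
lldp_summary() {
    awk '
        /System Name TLV/        { sub(/^.*: /, ""); name = $0 }
        /Subtype Interface Name/ { sub(/^.*: /, ""); port = $0 }
        # The End TLV closes a frame: print what was collected and reset.
        /End TLV/                { print name, port; name = port = "" }
    '
}

# Hypothetical usage:
#   sudo tcpdump -c 1 -vvv -i eno5 -l -p -nn ether proto 0x88cc 2>/dev/null | lldp_summary
```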

Mentioned in SAL (#wikimedia-operations) [2021-09-15T09:46:21Z] <topranks> Disabling Intel X710 NIC on-board LLDP processing on relforge1003 (T290984)

Change now made on relforge1003 also.

During the change I ran "sudo ip monitor" and netlink did not report any change in link status. I also pinged relforge1003 in "rapid" mode from cr2-eqiad, which generated over 7,000 pings in a few seconds, and none were dropped. Likewise, the switch doesn't report a change in port status, and dmesg shows no kernel-level messages about the NIC.

So I think we can be confident that issuing the command with "echo" to the path in /sys/kernel/debug/i40e makes the required change, and doesn't reset or interrupt the NIC or packet processing when it does so.

We now need to:

  • Apply this on the remaining hosts on which the issue has been found.
    • This fixes the problem "right now".
  • Decide on a way to have this done at boot-time for affected hosts.
    • That also involves working out how to deal with this via automation. One difficulty is identifying hosts using the affected Intel NIC, and the PCI ID of the affected interface on each (which is part of the path the command gets echoed to).

These are the only affected hosts:

$ sudo cumin 'F:net_driver ~ "i40e"'
17 hosts will be targeted:
ms-be[2051-2056].codfw.wmnet,ms-be[1051-1059].eqiad.wmnet,relforge[1003-1004].eqiad.wmnet

For managing this debugfs setting in Puppet we have two basic options:

  1. Managing '/sys/kernel/debug/i40e/0000\:5d\:00.0/command' as a file resource to which we write "lldp stop".
  2. Using the sysfs::parameters define, which allows us to configure sysfs settings; we only need to detect the PCI ID in question.
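Whichever option is chosen, detecting the PCI IDs could be done by listing the driver's debugfs directory rather than hard-coding them. The sketch below assumes that every i40e port on the host should have its firmware LLDP agent stopped, that debugfs is mounted, and that the script runs as root. The function name i40e_stop_fw_lldp is hypothetical; how it gets invoked at boot (Puppet exec, systemd unit, or similar) is left open.

```shell
# Sketch: send "lldp stop" to every i40e port found under debugfs.
# $1 (optional): base directory, defaulting to the path used in this task.
i40e_stop_fw_lldp() {
    base="${1:-/sys/kernel/debug/i40e}"
    for dev in "$base"/*; do
        # Skip anything that is not a device directory with a writable command file.
        [ -w "$dev/command" ] || continue
        echo "lldp stop" > "$dev/command"
        echo "sent 'lldp stop' to ${dev##*/}"
    done
}

# Hypothetical usage, as root:
#   i40e_stop_fw_lldp
```

Because the loop simply skips a missing or empty directory, the same script could be shipped to hosts without these NICs and would do nothing there.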

All the affected systems run Stretch, so a backport of ethtool might also help with the LLDP disable option missing from the ethtool "--show-priv-flags" output.

Change 721031 merged by Volans:

[operations/puppet@production] facter: fix lldp_neighbors error on empty lldp

https://gerrit.wikimedia.org/r/721031

The puppet patch has been merged, so the error showing up in facter is now gone.

Mentioned in SAL (#wikimedia-operations) [2021-09-16T17:31:08Z] <volans> turn of lldp agent on NIC (both ports) on ms-be2051 - T290984

All hosts have the same identifiers:

$ sudo cumin 'ms-be105[1-9]*,ms-be205[2-6]*' 'ls -1 /sys/kernel/debug/i40e/'
14 hosts will be targeted:
ms-be[2052-2056].codfw.wmnet,ms-be[1051-1059].eqiad.wmnet
Ok to proceed on 14 hosts? Enter the number of affected hosts to confirm or "q" to quit 14
===== NODE GROUP =====
(14) ms-be[2052-2056].codfw.wmnet,ms-be[1051-1059].eqiad.wmnet
----- OUTPUT of 'ls -1 /sys/kernel/debug/i40e/' -----
0000:5d:00.0
0000:5d:00.1
================
PASS |████████████████████████████████████████████████████████████████████████████████████████| 100% (14/14) [00:00<00:00, 16.49hosts/s]
FAIL |                                                                                                 |   0% (0/14) [00:00<?, ?hosts/s]
100.0% (14/14) success ratio (>= 100.0% threshold) for command: 'ls -1 /sys/kernel/debug/i40e/'.
100.0% (14/14) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.

So to apply the fix to the remaining ones we can just run:

sudo cumin -b 1 -s 30 -m async 'ms-be105[1-9]*,ms-be205[2-6]*' 'echo "lldp stop" > /sys/kernel/debug/i40e/0000\:5d\:00.0/command' 'echo "lldp stop" > /sys/kernel/debug/i40e/0000\:5d\:00.1/command' 'facter -p lldp_parent'

Mentioned in SAL (#wikimedia-operations) [2021-09-16T17:54:12Z] <volans> turn of lldp agent on NIC (both ports) on ms-be105[1-9],ms-be205[2-6] - T290984

This was completed yesterday evening for all affected hosts, and all are now reporting an LLDP neighbour as expected.

The remaining work is to puppetize the change, so that the internal LLDP function on the NIC will be disabled again after a reboot.

joanna_borun changed the task status from Open to In Progress.Sep 21 2021, 1:24 PM

All the affected hosts are on stretch, but of the ~375 hosts we still have on stretch they are the only ones showing the issue; the lldpd version seems to be the same.

All the affected hosts are HP and seem to have an Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 02) network card and the i40e driver. I'm wondering if that might be the culprit in some form; we should check the driver versions.

We don't have any Stretch hosts left with those NICs. There are currently only nine Stretch hosts, according to PuppetBoard. All of those have Broadcom NICs with tg3 or bnxt drivers.

Unless there are any objections, I think it's safe to just close this task.

Sounds good! Even those remaining ones are hopefully gone in a few weeks.

Closed due to Stretch hosts having gone away.