
track NIC firmware version numbers across the fleet
Closed, Resolved · Public

Description

It came up in some investigation of weird CPU0 load issues on lvs1014 that not all of our Broadcom NICs in use on LVSen have the same firmware version loaded -- see P9437. Firmware issues have been a problem in the past: T203194#4880083 and onwards.

It'd be nice to be tracking what versions are in use where, ideally with some temporal history as well. Two options:

  • As a custom Puppet fact, likely extending the net_driver custom fact we already export. It does not seem too hard to have that Ruby code also invoke ethtool -i and copy the firmware-version: line of its output into a firmware_version key.
  • As a Prometheus metric, likely exported via a textfile exporter invoked by a systemd timer. The metric value would always be just 1 and the labels would specify all the data -- it'd look something like nic_firmware_version{instance="lvs1001:9xxx",interface="enp4s0f0",driver="bnx2x",firmware_version="FFV14.10.07 bc 7.14.11"} 1, in the same style as is recommended for exporting software versions.

Possibly we'd want to do both?
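
For illustration, here is a minimal Python sketch of the parsing step that both options share: read the key/value lines that ethtool -i prints and render the proposed metric line. This is not the code from the patches on this task. The function names and the sample ethtool output (including the bus-info value) are made up for the example; the metric and label names follow the proposal above.

```python
# Hypothetical sketch: turn `ethtool -i <iface>` output into a textfile-exporter line.
# SAMPLE_ETHTOOL_OUTPUT is illustrative data, not captured from a real host.

SAMPLE_ETHTOOL_OUTPUT = """driver: bnx2x
firmware-version: FFV14.10.07 bc 7.14.11
bus-info: 0000:04:00.0
"""


def parse_ethtool_info(text):
    """Parse 'key: value' lines into a dict, splitting on the first colon only."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines that contain no colon at all
            info[key.strip()] = value.strip()
    return info


def escape_label(value):
    """Escape backslash, double quote and newline for a Prometheus label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def format_metric(interface, info):
    """Render one constant-1 sample carrying the version data in its labels."""
    labels = {
        "interface": interface,
        "driver": info.get("driver", ""),
        "firmware_version": info.get("firmware-version", ""),
    }
    body = ",".join(f'{k}="{escape_label(v)}"' for k, v in labels.items())
    return f"nic_firmware_version{{{body}}} 1"


if __name__ == "__main__":
    print(format_metric("enp4s0f0", parse_ethtool_info(SAMPLE_ETHTOOL_OUTPUT)))
```

In a real exporter the text would come from running ethtool -i for each interface, and the instance label would be attached by Prometheus at scrape time rather than written into the file.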

Event Timeline

CDanis created this task. Oct 28 2019, 9:20 PM
Restricted Application added a subscriber: Aklapper. Oct 28 2019, 9:20 PM

Change 546953 had a related patch set uploaded (by CDanis; owner: CDanis):
[operations/puppet@production] net_driver fact: add firmware_version

https://gerrit.wikimedia.org/r/546953

Change 546953 merged by CDanis:
[operations/puppet@production] net_driver fact: add firmware_version

https://gerrit.wikimedia.org/r/546953

Ottomata assigned this task to CDanis. Oct 29 2019, 3:33 PM
Ottomata triaged this task as Medium priority.
Ottomata added a subscriber: Ottomata.

CDanis: assigning to you as part of clinic duty

Change 546973 had a related patch set uploaded (by CDanis; owner: CDanis):
[operations/puppet@production] net-driver fact: tweak regexp

https://gerrit.wikimedia.org/r/546973

Change 546973 merged by CDanis:
[operations/puppet@production] net-driver fact: tweak regexp

https://gerrit.wikimedia.org/r/546973

ema moved this task from Triage to Watching on the Traffic board. Oct 30 2019, 2:43 PM

Change 549683 had a related patch set uploaded (by CDanis; owner: CDanis):
[operations/puppet@production] prometheus: export NIC firmware versions

https://gerrit.wikimedia.org/r/549683

This might help with issues like T242481

Change 549683 merged by CDanis:
[operations/puppet@production] prometheus: export NIC firmware versions

https://gerrit.wikimedia.org/r/549683

CDanis closed this task as Resolved. May 19 2020, 3:59 PM

Prometheus metrics now exist, via a textfile exporter installed by Puppet on every physical host.

Sample output for the metric on an LVS machine:

# HELP node_nic_firmware_version A metric with a constant '1' value with labels indicating NIC interface name, driver name, and firmware version string.
# TYPE node_nic_firmware_version gauge
node_nic_firmware_version{device="eno1",driver="tg3",firmware_version="FFV7.
node_nic_firmware_version{device="ens2f0np0",driver="bnxt_en",firmware_version="214.0.253.1/pkg 21.40.25.31"} 1
node_nic_firmware_version{device="ens2f1np1",driver="bnxt_en",firmware_version="214.0.253.1/pkg 21.40.25.31"} 1
node_nic_firmware_version{device="ens3f0np0",driver="bnxt_en",firmware_version="214.0.253.1/pkg 21.40.25.31"} 1
node_nic_firmware_version{device="ens3f1np1",driver="bnxt_en",firmware_version="214.0.253.1/pkg 21.40.25.31"} 1

Since these get exported by node_exporter and then scraped by Prometheus, they wind up with the expected instance and cluster labels as well. So you could run a query like sum by (driver, firmware_version) (node_nic_firmware_version{cluster=~"lvs|cache.*"}) and get the breakdown of NIC driver/firmware version across all LVS or cp hosts.
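
To make the aggregation concrete, here is a rough Python sketch that does offline what the sum by (driver, firmware_version) query does in Prometheus: it groups samples by those two labels and adds up their constant 1 values. The function name and the regular expression are assumptions for this example; the sample lines are the bnxt_en ones from the output above.

```python
import re
from collections import Counter

# Matches label pairs such as driver="bnxt_en"; allows backslash escapes in values.
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')

SAMPLE_EXPOSITION = """\
# TYPE node_nic_firmware_version gauge
node_nic_firmware_version{device="ens2f0np0",driver="bnxt_en",firmware_version="214.0.253.1/pkg 21.40.25.31"} 1
node_nic_firmware_version{device="ens2f1np1",driver="bnxt_en",firmware_version="214.0.253.1/pkg 21.40.25.31"} 1
node_nic_firmware_version{device="ens3f0np0",driver="bnxt_en",firmware_version="214.0.253.1/pkg 21.40.25.31"} 1
node_nic_firmware_version{device="ens3f1np1",driver="bnxt_en",firmware_version="214.0.253.1/pkg 21.40.25.31"} 1
"""


def sum_by_driver_firmware(exposition_text):
    """Sum sample values grouped by (driver, firmware_version)."""
    totals = Counter()
    for line in exposition_text.splitlines():
        # Skip HELP/TYPE comments and any other metric families.
        if not line.startswith("node_nic_firmware_version{"):
            continue
        labels = dict(LABEL_RE.findall(line))
        value = float(line.rsplit(None, 1)[1])  # last whitespace-separated token
        totals[(labels.get("driver"), labels.get("firmware_version"))] += value
    return totals


if __name__ == "__main__":
    for (driver, firmware), total in sum_by_driver_firmware(SAMPLE_EXPOSITION).items():
        print(driver, firmware, total)
```

Because every sample has the value 1, each total is simply the number of interfaces reporting that driver and firmware combination.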