Netbox report to validate network equipment data
Closed, Resolved · Public · 0 Estimated Story Points

Description

In Netbox, we now have a PuppetDB report that validates the Netbox data, as entered by DC Ops folks, against what the systems themselves self-report (see T212526 for more). It has been very valuable for finding errors like typos, wrong statuses, etc. We have ideas to add more checks to it (see the latest at T220422).

I propose to create a similar thing to validate network devices. We should at least validate:

  • the Status field (the ones that should be online are, the ones that are not aren't);
  • that the recorded/self-reported serial numbers match
  • that the recorded/self-reported models match. Note that I fixed half a dozen wrongly reported models in the past few days (e.g. EX4200-24Ts documented as EX4200-48T, switch models that did not exist), even in relatively newly bought gear, so this is not a theoretical issue.
  • bonus points if we also cross-check inventory items (cf. T221506)

While we could do this by polling e.g. SNMP, I think the easiest and most appropriate way to do this would be to poll LibreNMS for this information. It looks like LibreNMS has an API, so we may not have to resort to polling its database. Open to implementation ideas, though :)

Thoughts?

Event Timeline

LibreNMS's API is very limited (e.g. it can't access the inventory in bulk for all devices, and it doesn't play well with LDAP auth), but it does make sense to query LibreNMS (most likely directly via its database, using a read-only user) instead of e.g. SNMP.

LibreNMS has what they call "inventory", which is all the linecards, optics, etc., as well as virtual chassis members.
See for example: https://librenms.wikimedia.org/device/device=160/tab=entphysical/

SELECT entPhysicalSerialNum, hostname, entPhysicalName
FROM entPhysical
INNER JOIN devices
ON entPhysical.device_id = devices.device_id
WHERE entPhysicalSerialNum != 'BUILTIN'
AND entPhysicalSerialNum != ''
AND entPhysicalVendorType = 'Juniper';

returns data like:

'XXXXX', 'asw-esams.mgmt.esams.wmnet', 'Power Supply 1 @ 3/1/*'
'XXXXX', 'asw-a-codfw.mgmt.codfw.wmnet', 'FPC: QFX5100-48S-6Q @ 2/*/*'
'XXXXX', 'cr2-knams.wikimedia.org', 'TFEB Intake temperature sensor'

The serial is also present in the devices table:

SELECT hostname, serial FROM librenms.devices;

In the case of a VCF, the serial of the primary node will be returned.
So some mangling will definitely be needed.

A second bonus would be to check PDUs and conservers as well, since they're in LibreNMS.

Some more thoughts about implementation:
It seems to make more sense to use serial numbers as the "key" rather than hostnames.

  • the Status field (the ones that should be online are, the ones that are not aren't);

All serial numbers in LibreNMS that are also present in Netbox should be either ACTIVE or STAGED in Netbox.
Down the road we could track what's showing as down in LibreNMS, but it's so uncommon that it doesn't justify the extra work.

  • that the recorded/self-reported serial numbers match
  • that the recorded/self-reported models match

This could maybe be done in two passes: first on the devices table for the easy ones, then on entPhysical for all the remaining ones.
Using entPhysical would require some parsing of entPhysicalName. This might also solve the inventory items.
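
As a rough illustration of the two-pass idea, here is a minimal Python sketch. It assumes the rows from the two queries shown earlier have already been fetched as dictionaries keyed by column name (serial, entPhysicalSerialNum); the function name and the input shapes are hypothetical and not taken from any existing report.

```python
def two_pass_match(netbox_serials, devices_rows, entphysical_rows):
    """Split Netbox serials by where (if anywhere) LibreNMS knows them.

    netbox_serials: set of serial strings recorded in Netbox.
    devices_rows / entphysical_rows: lists of dicts, one per SQL row
    (hypothetical shape, keyed by the column names used in the queries).
    """
    # Pass 1: the devices table, which holds one serial per device (if any).
    device_serials = {row["serial"] for row in devices_rows if row.get("serial")}
    in_devices = netbox_serials & device_serials

    # Pass 2: entPhysical, only for what pass 1 did not resolve.
    # Empty and 'BUILTIN' serials are skipped, as in the SQL filter.
    ent_serials = {
        row["entPhysicalSerialNum"]
        for row in entphysical_rows
        if row.get("entPhysicalSerialNum") not in (None, "", "BUILTIN")
    }
    remaining = netbox_serials - in_devices
    in_entphysical = remaining & ent_serials

    # Whatever is left is unknown to LibreNMS under that serial.
    missing = remaining - in_entphysical
    return in_devices, in_entphysical, missing
```

Keying everything on sets of serials keeps the comparison independent of hostnames, which is the point made above about using the serial as the key.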

All excellent points :) I especially like the PDU & scs suggestion!

To be honest, I wouldn't focus on the inventory part yet. Let's just start with some sanity checks for device status/serial/model, like with Servers & PuppetDB. That would get us 90% there :)

At a later point when we have the basic inventory parts of some of those devices (e.g. MX480s) then we can build the necessary tests in that report too. That's not as urgent I'd say.

herron triaged this task as Medium priority. Apr 23 2019, 2:35 PM
crusnov moved this task from Backlog to In Progress on the SRE-tools board.
crusnov moved this task from Backlog to In Progress on the User-crusnov board.

After digging and discussing, I believe this is the way forward, since the mapping between LibreNMS and Netbox is slightly ... weird:

  • test_nb_device_in_librenms: every Staged,Active asw Device in Netbox is checked to exist in the librenms entphysical table by device_serial. We can also match model and make here.
  • test_nb_inventory_in_librenms: every Staged,Active asw Device's inventory in Netbox is checked to exist in librenms entphysical table by device_serial.
  • test_librenms_in_nb: Every devices device in librenms is checked to exist as a Netbox Device by serial number.

@ayounsi is this correct?

  • test_nb_device_in_librenms: every Staged,Active asw Device in Netbox is checked to exist in the librenms entphysical table by device_serial. We can also match model and make here.

That's correct for asw.
Note that all other types of network devices should be checked against LibreNMS devices table.

  • test_nb_inventory_in_librenms: every Staged,Active asw Device's inventory in Netbox is checked to exist in librenms entphysical table by device_serial.

I think it could actually be extended to everything that has "inventory" items in Netbox. From https://netbox.wikimedia.org/dcim/inventory-items/ : asw, pfw, msw, fasw, cr.

  • test_librenms_in_nb: Every devices device in librenms is checked to exist as a Netbox Device by serial number.

As not every device has its serial in LibreNMS, I'd phrase it as "Every devices device with a serial# in librenms"

An important one is to check that all LibreNMS entphysical items exist in Netbox, either as devices or inventory.
This would catch any situation where we replace a part (e.g. a PSU) on the device and forget to update Netbox.
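
The LibreNMS-to-Netbox direction could be sketched roughly as below. This is only an illustration under stated assumptions: the function name, the lowercase status strings and the input shapes (a set of LibreNMS serials, a dict of Netbox device status by serial, a set of Netbox inventory serials) are all hypothetical stand-ins for whatever the real report would query.

```python
# Statuses a device seen in LibreNMS is expected to have in Netbox
# (string values are an assumption; Netbox may represent them differently).
ALLOWED_STATUSES = {"active", "staged"}


def librenms_parity_errors(librenms_serials, netbox_status_by_serial,
                           netbox_inventory_serials):
    """Return (serial, problem) pairs for LibreNMS serials that do not
    line up with Netbox, either as a device or as an inventory item."""
    errors = []
    # Items without a serial in LibreNMS cannot be matched, so skip them.
    for serial in sorted(s for s in librenms_serials if s):
        if serial in netbox_status_by_serial:
            status = netbox_status_by_serial[serial]
            if status.lower() not in ALLOWED_STATUSES:
                errors.append((serial, f"unexpected Netbox status: {status}"))
        elif serial not in netbox_inventory_serials:
            # Neither a device nor an inventory item, e.g. a swapped PSU.
            errors.append((serial, "missing from Netbox"))
    return errors
```

Inventory items are only checked for presence here, since they do not carry a status in the same sense as devices.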

Change 510256 had a related patch set uploaded (by CRusnov; owner: CRusnov):
[operations/software/netbox-reports@master] Add LibreNMS parity check report.

https://gerrit.wikimedia.org/r/510256

This does the above checks, but it checks neither device description parity nor device status. Description parity has been suggested to be a bit complicated; after looking at it, it does seem a bit annoying but not unassailable. We'd have to make a fairly comprehensive mapping between the values in LibreNMS's description field (and also parse out position data and other things that are shoved in there, at least for entPhysical) and the Manufacturer (or not, depending on the device from LibreNMS) and Model in Netbox. Status seems like another can of worms: while a device has a status field, some devices from Netbox's perspective are entPhysical (inventory items) and don't have a status in the normal sense (entPhysical_state may be for this, but I don't see any examples of what sort of data may be there).

It was pointed out to me that the vendor name in entPhysical is there, so we could hypothetically check that (for inventory items only) - the devices table remains complex.

Hello, here is the sample output. There are several inconsistencies that I can see the fix for and had already attempted to mitigate (apparently not successfully), such as devices like Netbox devtype=Juniper EX4600-40F, LibreNMS devtype=Juniper Networks, Inc. ex4600-40f Ethernet Switch, kernel JUNOS 14.1X53-D45.3, Build date: 2017-07-28 01:39:39 UTC Copyright (c) 1996-2017 Juniper Networks, Inc. or Juniper EX4600, where the information is there but just not lined up the same way. Other things seem less obvious, like duplicated serial numbers and similar.

(moved to google doc)
https://docs.google.com/document/d/1ffWVMKFgevRGESa3TgLwq9VOxUAOt43jLdy_bLVGSas/edit?usp=sharing
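
For the devtype mismatches quoted above, one possible loose comparison is sketched here. It is not what the report does; the function name and the matching rule (exact token match, or a token that is a hyphen-delimited prefix of the Netbox model) are assumptions chosen only to cover the two example strings.

```python
import re


def model_matches(netbox_model, librenms_descr):
    """Loosely check whether a Netbox model name appears in a LibreNMS
    description string, ignoring case and surrounding text."""
    model = netbox_model.strip().lower()
    # Split the description into alphanumeric tokens, keeping inner hyphens.
    tokens = re.findall(r"[a-z0-9][a-z0-9-]*", librenms_descr.lower())
    # Accept an exact token ("ex4600-40f") or a family prefix ("ex4600").
    return any(tok == model or model.startswith(tok + "-") for tok in tokens)
```

With this rule, "EX4600-40F" matches both the long sysDescr-style string and the shorter "Juniper EX4600", while "MX104" does not match a part-number string such as "Juniper 750-062050". An explicit map between LibreNMS and Netbox values would be stricter, at the cost of maintaining it.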

Note that this already helped find inconsistencies:

  • Some devices had status active while they shouldn't (cr3-esams, old eqiad A row EX4500)

test_nb_inventory_in_librenms should now be all green

  • Typoed serial numbers for some PDUs
  • cr1-eqsin serial is correct in LibreNMS but not Netbox
  • Mismatched serial number - T224515

Once the one above is fixed, test_librenms_in_nb should be all green

duplicate serial numbers from LibreNMS for entPhysical: XXX seen 2 times

These are because some devices use the same serial# for several parts (e.g. midplane and routing engine).

I don't think we should check for duplicate serial numbers in LibreNMS

  • asw2-b[3|5|6|8]-eqiad serial typoed in Netbox
  • asw3-a5-eqiad has been swapped with ex4300-spare2-eqiad
  • lab-ex4200 was set to active, while it should be planned at best
  • Generally ignore everything esams related

That doesn't address everything but should clean up the output quite a bit.

  • esams should be blacklisted for now indeed.
  • test_nb_inventory_in_librenms could use some improvement -- it didn't say which device, s/n or anything to identify them as far as I can tell?
  • On the device types errors, I can't help but think that we're looking at the wrong field? e.g. take cr1-eqsin as an example: the message says Netbox devtype=Juniper MX104, LibreNMS devtype=Juniper 750-062050, but LibreNMS does know this is an MX104 (see under "Hardware" here).
  • I don't know what these "duplicate serial numbers" are, and we'd need more information to understand if these are real errors or report errors.
  • The cr1-eqsin serial change is a bit odd. Netbox used to have a record of what Juniper reports as the "midplane" serial number, not the "chassis". This was changed, but the midplane was what we had from the invoice as well -- so note that the Accounting report is now error'ing out instead.
  • asw-N-eqiad serial changes above -- these are now inconsistent with what we have from the Accounting side (so the report fails now). This needs further investigation for which one is ground truth?

Those last two points might need their own task.
As a data point, using https://entitlementsearch.juniper.net/entitlementsearch/ :
the "chassis" serial# does have support, the "midplane" one does not.
Note that there was no packing slip, so it's possible that the accounting serial was copied from Netbox.

Taking one of the problematic switches (asw2-b3-eqiad):
[...]744[...] is the serial reported by the switch, and it shows support that ended on 20-Dec-2018 and an install city of San Francisco,
while
[...]743[...] was the one previously in Netbox (and accounting), showing active support and an install city of Ashburn.
The Juniper install base report shows neither, but that's a known bug on their side.
I don't know how the serials got communicated to accounting, but so far I'd tend to trust what the device is reporting.

ayounsi mentioned this in Unknown Object (Task). May 30 2019, 4:43 PM

  • On the device types errors, I can't help but think that we're looking at the wrong field? e.g. take cr1-eqsin as an example: the message says Netbox devtype=Juniper MX104, LibreNMS devtype=Juniper 750-062050, but LibreNMS does know this is an MX104 (see under "Hardware" here).

Indeed, this seems to be the sysDescr MySQL field, while the hardware one seems more appropriate. They're far from matching the Netbox types 1:1, though.

  • I don't know what these "duplicate serial numbers" are, and we'd need more information to understand if these are real errors or report errors.

This is because some devices (e.g. SRX220) use the same serial# for several parts (e.g. midplane and routing engine), as all the parts are "fused" together.
So I don't think we should alert on duplicate serials, unless we do per model special cases.
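
If duplicate detection were ever wanted without per-model special cases, one alternative (not proposed in this thread, purely an assumption) would be to flag a serial only when it shows up under more than one LibreNMS hostname, so that fused parts within a single device stay quiet. A minimal sketch, with a hypothetical function name and (serial, hostname) tuples as input:

```python
from collections import defaultdict


def cross_device_duplicates(rows):
    """rows: iterable of (serial, hostname) pairs from entPhysical.

    Returns {serial: [hostnames]} for serials seen on more than one
    device. Repeats within one device (e.g. midplane and routing engine
    sharing a serial) are deliberately not reported.
    """
    hosts_by_serial = defaultdict(set)
    for serial, hostname in rows:
        hosts_by_serial[serial].add(hostname)
    return {
        serial: sorted(hosts)
        for serial, hosts in hosts_by_serial.items()
        if len(hosts) > 1
    }
```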

It seems like part of the challenge is identifying clustered equipment (i.e. asw stacks & pfw). In those cases, the device appears in LibreNMS as one device with the switches as FPC linecards (presumably as inventory?), while on the Netbox end they appear as separate, distinct devices. I haven't looked at this deeply, but I suppose a lot of the complexity in the report comes from there.

Netbox's modeling of switch stacks isn't generally great, but there is some support for it. In Netbox, we do have these set up as a "virtual chassis", e.g. see the page for asw-a1-codfw. The name for the virtual chassis (e.g. asw-a-codfw in this case) that would be the device's name in LibreNMS, isn't documented anywhere in Netbox right now. However, Netbox's virtual chassis have a "domain" attribute that we currently are not setting, but we can and probably should to facilitate these searches.

I assume this may make all these matches much easier. The process on the LibreNMS->Netbox direction would basically be:

  • If the LibreNMS name exists as a Netbox device, use that to do cross-checks.
  • If it doesn't, search for a virtual chassis with that domain name. If that exists, then use the LibreNMS inventory items to cross-check against the Netbox devices that belong to this virtual chassis.
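
The two steps above can be sketched as a small lookup. Everything here is hypothetical: the function name, the plain dictionaries standing in for Netbox queries, and the assumption that Netbox holds the short name while LibreNMS holds the FQDN (as in asw-a-codfw.mgmt.codfw.wmnet), so the domain part is stripped before comparing.

```python
def netbox_candidates(librenms_hostname, devices_by_name, vc_members_by_domain):
    """Return the Netbox devices to cross-check for one LibreNMS device.

    devices_by_name: dict of Netbox device name -> device record.
    vc_members_by_domain: dict of virtual chassis "domain" -> list of
    member device records (assumes the domain attribute has been set).
    """
    # Assumption: Netbox names are the first label of the LibreNMS FQDN.
    short_name = librenms_hostname.split(".", 1)[0]

    # Step 1: a standalone device with that name.
    if short_name in devices_by_name:
        return [devices_by_name[short_name]]

    # Step 2: a virtual chassis whose domain matches; its members are then
    # compared against the LibreNMS inventory items of that one device.
    return list(vc_members_by_domain.get(short_name, []))
```

An empty result means the LibreNMS device has no Netbox counterpart under either rule, which is itself something the report could flag.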

The opposite direction would be similar. It looks like LibreNMS does not list the inventory items in their hierarchy (at least not in the UI), so it'll still be a little tricky, but still way better and more accurate I think.

I agree that, from the perspective of more closely modelling the devices between the various tools, setting the domain name to the VC name is necessary. I'm not completely clear on how that would make the matching better, though. Currently the by-serial matching seems to be working correctly; the complexities are mostly in lining up vendor and model information at this point, unless I'm mistaken - and this appears to be approachable either by matching things more loosely or by creating a map between what's in LibreNMS and what's in Netbox. Separately, there are only a few inventory items which don't appear to line up, but I believe that's because they are builtin, so they are left out of the LibreNMS query.

Change 510256 merged by CRusnov:
[operations/software/netbox-reports@master] Add LibreNMS parity check report

https://gerrit.wikimedia.org/r/510256