
Triage and resolve all outstanding Netbox report errors
Open · Normal · Public · 0 Story Points

Description

Over the past few months, we've been working on setting up various consistency checks (so-called "reports") in Netbox. The goal has been to ensure that the data we have there is free from errors, typos etc. that will inevitably creep in over time.

By a rough estimate, the number of Netbox changes made in various stages to address the bulk of these errors is in the multiple thousands. Only a few errors remain now, in about a dozen broad categories. The reports themselves have been fine-tuned to avoid needlessly reporting non-issues (e.g. warning about serial numbers for storage bins) or known noise (e.g. all of esams, temporarily).

There are only a few outstanding failures right now, and I'm filing this task so that we can triage them and resolve them, one way or another.

PuppetDB
This checks the Netbox data against the data collected by Puppet from the (online) hosts. Puppet in turn uses Facter to gather facts, which come from SMBIOS/DMI and include manufacturer, model name, serial number, etc. By extension, it also performs Status field checks where that makes sense (e.g. if a host is Offline but present in PuppetDB, that's an error). Tons of (already resolved but very real) issues have been identified with this; these remain (see the sketch after the list for what this kind of check looks like):

  • wmf7622 (missing physical device in PuppetDB: state Failed in Netbox): requires either a redefinition of Failed (in workflows, documentation and the Netbox report), or a status change for that single host (cf. T222922).
  • ms-be10NN/ms-be20NN (unexpected state for physical device: Decommissioning in Netbox): cf. T221068
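
For illustration, here's a minimal sketch of what this kind of check can look like as a Netbox custom report (a Report subclass with test_* methods, which is the standard Netbox report API). The PuppetDB URL is a placeholder and the real report's logic is likely more involved:

```lang=python
# Hedged sketch of a PuppetDB consistency check, written as a Netbox custom
# report. The PuppetDB URL is a placeholder; the production report's query
# logic may differ.
import requests

from dcim.models import Device
from extras.reports import Report

# Hypothetical endpoint: PuppetDB's v4 query API, asking for the
# "serialnumber" fact that Facter extracts from SMBIOS/DMI.
PUPPETDB_URL = "https://puppetdb.example.org/pdb/query/v4/facts/serialnumber"


class PuppetDB(Report):
    description = "Compare Netbox data against facts collected via PuppetDB"

    def test_serials(self):
        # Map short hostname -> serial number as reported by Facter.
        facts = {
            fact["certname"].split(".")[0]: fact["value"]
            for fact in requests.get(PUPPETDB_URL).json()
        }
        for device in Device.objects.filter(name__in=facts):
            if device.serial == facts[device.name]:
                self.log_success(device)
            else:
                self.log_failure(device, "serial does not match PuppetDB")
```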

Coherence
This performs various "coherency" checks, such as checking for malformed asset tags, duplicate serials, missing fields, status mismatches etc. It has also found a ton of real issues, most of which have been resolved (a sketch of one such test follows the list). Remaining:

  • Tons of "no rack defined for status Decommissioning device" errors: I think most (if not all) are legitimate errors, for offlined/inventoried hosts that are in storage. Low-hanging fruit?
  • msw1-eqsin (missing serial): legitimate error
  • ps1-*-eqsin (missing serial and missing asset tag): legitimate error, see T211368 but also T223443 which may make it moot.
  • msw-*-eqiad (missing serial): legitimate error
  • scs-a1-codfw (missing serial): legitimate error but relatively hard to fix, cf. T221984
  • A bunch of "rack defined for status Offline device" errors for codfw: I think these are legitimate but need further investigation? (T223468)
  • labvirt1010/1011 (rack defined for status Offline device: eqiad-B3): I think these are actually equipment that was returned to the leasing company, so a legitimate error?
  • cablemgmt-eqiad-* (missing asset tag): @Cmjohnson had thoughts/ideas about this (replacing those cablemgmt soon?)
  • labstore1003-arrayN (missing asset tag): legitimate error (but also note that these are about to be decom'ed, cf. T187456)
  • ps1-a3/4-sdtpa (missing asset tag): legitimate error
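
To illustrate the shape of one such test, here's a hedged sketch of a duplicate-serial check; the names and exact logic are illustrative, not the production report:

```lang=python
# Hedged sketch of a single "coherence" test: flag devices that share a
# serial number. Illustrates the shape of such a test, not the real code.
from django.db.models import Count

from dcim.models import Device
from extras.reports import Report


class Coherence(Report):
    description = "Various self-consistency checks on the Netbox data"

    def test_duplicate_serials(self):
        # Find serial numbers that appear on more than one device.
        duplicates = (
            Device.objects.exclude(serial="")
            .values("serial")
            .annotate(count=Count("pk"))
            .filter(count__gt=1)
            .values_list("serial", flat=True)
        )
        for device in Device.objects.filter(serial__in=list(duplicates)):
            self.log_failure(device, "duplicate serial: %s" % device.serial)
```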

Accounting
This checks the Netbox data against Finance's spreadsheet in Google Sheets, which lists all the assets as they appear on invoices (serial numbers, PO # == procurement task). This has found real errors on both ends so far; a sketch of how such a cross-check can work follows the list.

  • mw2266/mw2280 swapped asset tags: Finance is aware, pending a change from them
  • Juniper equipment @ eqsin not present in Accounting: Finance is aware, pending a change from them
  • flerovium not present in Accounting: Finance is aware, pending a change from them
  • ms-be2047 not present in Accounting, and a device not present in Netbox: motherboard swap on our end (T209921); I've sent the packing slip for the replacement from that task to Finance, so pending a change from them.
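
Here's a rough sketch of how such a cross-check can work, assuming (purely for illustration) that the spreadsheet has been exported to a local CSV with a "serial" column; the actual integration with Google Sheets presumably differs:

```lang=python
# Hedged sketch of the Accounting cross-check. The CSV path and column
# name are placeholders, not the actual setup.
import csv

from dcim.models import Device
from extras.reports import Report

ACCOUNTING_CSV = "/srv/netbox/accounting.csv"  # hypothetical export location


class Accounting(Report):
    description = "Compare Netbox assets against Finance's spreadsheet"

    def test_presence_in_accounting(self):
        with open(ACCOUNTING_CSV) as fh:
            known_serials = {row["serial"] for row in csv.DictReader(fh)}
        for device in Device.objects.exclude(serial=""):
            if device.serial in known_serials:
                self.log_success(device)
            else:
                self.log_failure(device, "not present in Accounting")
```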

ManagementConsole
This checks the connectivity of console ports on networking equipment and flags devices for which no console connection is documented.
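
For illustration, a minimal sketch of such a check; the device role slug is a placeholder and the real selection logic may differ:

```lang=python
# Hedged sketch of the ManagementConsole check: every piece of networking
# equipment should have at least one console port with a documented cable.
from dcim.models import Device
from extras.reports import Report


class ManagementConsole(Report):
    description = "Check that network devices have a connected console port"

    def test_console_connectivity(self):
        # "network" is a hypothetical device role slug, for illustration only.
        for device in Device.objects.filter(device_role__slug="network"):
            if device.consoleports.filter(cable__isnull=False).exists():
                self.log_success(device)
            else:
                self.log_failure(device, "no console connection documented")
```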

More tests in these reports, as well as entirely new reports, are coming over time (e.g. T221507 is a big one). We have a lot of flexibility here! If the DC-Ops team can think of more ideas or fixes to the existing ones, please do reach out and/or file tasks tagged netbox and we'll figure it out :)

Final thing: we've prepared a new Icinga alert that fires if a report fails. The idea is to set it up to regularly check those reports and let us know when errors crop up, so that we can identify these issues in a timely manner. Its deployment is probably pending the resolution of this task, as it wouldn't make sense otherwise.
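
For illustration, a minimal sketch of what such an Icinga check could look like, polling the Netbox reports API; the URL, token, and response handling are assumptions, and exit codes follow the usual Nagios/Icinga plugin convention:

```lang=python
#!/usr/bin/env python3
# Hedged sketch of an Icinga plugin that polls the Netbox reports API and
# goes CRITICAL if any report's last run failed. URL, token, and the exact
# response shape are assumptions, not the actual deployment.
import sys

import requests

NETBOX_REPORTS_API = "https://netbox.example.org/api/extras/reports/"  # placeholder
TOKEN = "..."  # placeholder API token


def main():
    resp = requests.get(
        NETBOX_REPORTS_API, headers={"Authorization": "Token %s" % TOKEN}
    )
    resp.raise_for_status()
    data = resp.json()
    # Handle both a paginated and a plain-list response shape.
    reports = data["results"] if isinstance(data, dict) else data
    failed = [
        r["name"] for r in reports if r.get("result") and r["result"].get("failed")
    ]
    if failed:
        print("CRITICAL: failing reports: %s" % ", ".join(failed))
        return 2
    print("OK: all reports passing")
    return 0


if __name__ == "__main__":
    sys.exit(main())
```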

Event Timeline

faidon created this task. May 16 2019, 1:34 PM
faidon triaged this task as Normal priority.

ManagementConsole
(needs further research, @ayounsi knows more about this one)

They're all waiting for the following tasks: T218734 - T208788 - T208734 - T211998 - T172459
Plus esams.

ayounsi updated the task description. May 16 2019, 4:59 PM
faidon updated the task description. May 17 2019, 11:03 AM

Change 510944 had a related patch set uploaded (by Ayounsi; owner: Ayounsi):
[operations/software/netbox-reports@master] Exclude esams from management report

https://gerrit.wikimedia.org/r/510944

Change 510944 merged by Faidon Liambotis:
[operations/software/netbox-reports@master] Exclude esams from management report

https://gerrit.wikimedia.org/r/510944

faidon updated the task description. May 17 2019, 4:17 PM
faidon updated the task description. May 22 2019, 7:25 PM
Cmjohnson moved this task from Backlog to Not urgent on the ops-eqiad board. Tue, May 28, 2:53 PM