
Netbox Reports: General Cleanup and Improvement
Closed, Resolved (Public)

Description

We shall fix some issues with extant Netbox reports:

  • Figure out a clean way to omit the single KVM device we have from the PuppetDB report
  • Probably drop warnings from PuppetDB report
  • Blacklist sites esams and knams (for now) from Coherence reports
  • Blacklist certain roles (at least Cable management, Storage bin, Optical device; possibly others) from serial number checks
  • Add check for devices with Status: Offline, that have row/rack assigned
  • Add check for devices with Status not in (Offline, Planned) but with no row/rack assigned (Update: there are 43 devices that are Status: Inventory but clearly in storage; not sure how we want to handle this right now)
  • Make asset tag and task checking a bit smarter (e.g. catch cases like ticket: T2). Should probably check for WMF\d{4,} and T\d{5,} or something (see the sketch below).
  • Change log_info and log lines to log_success if they are in fact successes, so we can have some nice green lines even if they are summaries.
  • Message improvements (inconsistent trailing period, "missing purchase date" vs. "serial missing" vs. "bad procurement ticket: None", s/bad/malformed/ etc.)
  • Add a PuppetDB sub-report that checks the device type vs. the productname fact. Same for manufacturer but that has extra gotchas (Dell's is self-reported "Dell Inc."; HP was renamed to "HPE" at some point).
  • Filter out ADMIN_down Ganeti VMs from the PuppetDB report.

The plan is to enable an alert for these reports once we get them to only error on actual errors.
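
For illustration, a minimal sketch of what the stricter format check could look like as a Netbox report, using the stock extras.reports.Report API. The class name, the "ticket" custom field name, and the custom-field accessor are assumptions about our setup, not the actual report code:

  import re

  from dcim.models import Device
  from extras.reports import Report

  # Patterns proposed above: WMF asset tags and Phabricator task IDs.
  ASSET_TAG_RE = re.compile(r"^WMF\d{4,}$")
  TICKET_RE = re.compile(r"^T\d{5,}$")


  def procurement_ticket(device):
      # Hypothetical accessor: the ticket lives in a custom field, and how custom
      # fields are read differs between Netbox versions.
      return (getattr(device, "cf", None) or {}).get("ticket")


  class AssetFormats(Report):
      description = "Check that asset tags and procurement tickets are well-formed"

      def test_asset_tag_format(self):
          for device in Device.objects.all():
              if device.asset_tag and ASSET_TAG_RE.match(device.asset_tag):
                  self.log_success(device)
              else:
                  self.log_failure(device, "malformed asset tag: {}".format(device.asset_tag))

      def test_procurement_ticket_format(self):
          for device in Device.objects.all():
              ticket = procurement_ticket(device)
              if ticket and TICKET_RE.match(ticket):
                  self.log_success(device)
              else:
                  self.log_failure(device, "malformed procurement ticket: {}".format(ticket))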

Event Timeline

OK, so, after the efforts in the past few days, we're in much better shape! The PuppetDB report seems to be (almost?) entirely indicative of real issues and is actionable now - I will involve DC Ops to start fixing the cases that are known to be real errors, and we'll see if there are any false positives (I know of at least one, that is tough to handle!).

The Coherence report in its current state is not super useful. Issues to fix:

  • Blacklist sites esams and knams for now; these are known to be wildly inconsistent (or should I say incoherent? :) and they are unfortunately not actionable right now. We can re-enable once we do a big cleanup project there, hopefully in a few months.
  • Blacklist certain roles, which are known to miss data and we accept that's OK. I can think of at least: Cable management, Storage bin, Optical device. Possibly others.
  • We should also probably exclude Status: Offline. If the asset is offline and unracked, we probably don't care at that point.
  • We should probably do a s/^wmf/WMF/ in asset tags, and s/^t/T/ in tasks for now, until that report becomes actionable and DC Ops starts monitoring this report.
  • Asset tag and task checking needs to be a bit smarter. There is a device with task set to "T2" that doesn't alert right now. Should probably check for WMF\d{4,} and T\d{6} or something.
  • Nitpicking, but the log messages need improvement:
    • inconsistent trailing period
    • "missing purchase date" vs. "serial missing" vs. "bad procurement ticket: None"
    • s/bad/malformed/ or something like that
    • (from another report) "only unconnected console ports are present" - what does that mean?
  • We need a new method to check for devices with Status: Offline that have a row/rack assigned. I'm sure there are plenty of those now.

Finally, food for thought: with these coherence issues fixed up, we can be reasonably sure that we have data that looks to be formatted properly, but we can't be sure whether it's bogus (e.g. an asset tag, Phabricator task or purchase date typo). Ideas on how to detect these:

  • Assets with the same procurement task should probably have the same purchase date. I wonder if that's the case in the wild though; I know that in some cases we had multiple shipments for the same order and may have entered different invoicing dates. Needs to be checked before it gets into the report. (See the sketch after this list.)
  • <insert crazy idea about checking whether the Phabricator task is marked with #procurement>
  • ...?
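
For illustration, a rough sketch of the first idea, logged as a warning rather than a failure because of the multiple-shipments caveat above. The "ticket" and "purchase_date" custom field names and the accessor are assumptions:

  from collections import defaultdict

  from dcim.models import Device
  from extras.reports import Report


  def custom_field(device, name):
      # Hypothetical accessor; how custom fields are read differs between Netbox versions.
      return (getattr(device, "cf", None) or {}).get(name)


  class PurchaseDateConsistency(Report):
      description = "Devices sharing a procurement ticket should share a purchase date"

      def test_purchase_date_per_ticket(self):
          dates = defaultdict(set)      # ticket -> set of purchase dates seen
          devices = defaultdict(list)   # ticket -> devices referencing it
          for device in Device.objects.all():
              ticket = custom_field(device, "ticket")
              date = custom_field(device, "purchase_date")
              if ticket and date:
                  dates[ticket].add(date)
                  devices[ticket].append(device)
          for ticket, seen in dates.items():
              if len(seen) > 1:
                  for device in devices[ticket]:
                      self.log_warning(device, "{} has {} distinct purchase dates".format(ticket, len(seen)))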

OK, so, after the efforts in the past few days, we're in much better shape! The PuppetDB report seems to be (almost?) entirely indicative of real issues and is actionable now - I will involve DC Ops to start fixing the cases that are known to be real errors, and we'll see if there are any false positives (I know of at least one, that is tough to handle!).

The Coherence report in its current state is not super useful. Issues to fix:

  • Blacklist sites esams and knams for now; these are known to be wildly inconsistent (or should I say incoherent? :) and they are unfortunately not actionable right now. We can re-enable once we do a big cleanup project there, hopefully in a few months.
  • Blacklist certain roles, which are known to miss data and we accept that's OK. I can think of at least: Cable management, Storage bin, Optical device. Possibly others.

ack

  • We should also probably exclude Status: Offline. If the asset is offline and unracked, we probably don't care at that point.

ack

  • We should probably do a s/^wmf/WMF/ in asset tags, and s/^t/T/ in tasks for now, until that report becomes actionable and DC Ops starts monitoring this report.

Reasonable.

  • Asset tag and task checking needs to be a bit smarter. There is a device with task set to "T2" that doesn't alert right now. Should probably check for WMF\d{4,} and T\d{6} or something.

ack

  • Nitpicking, but the log messages need improvement:
    • inconsistent trailing period
    • "missing purchase date" vs. "serial missing" vs. "bad procurement ticket: None"
    • s/bad/malformed/ or something like that

+1

  • (from another report) "only unconnected console ports are present" - what does that mean?

The situation where no console ports are defined is distinguished from one where console ports exist but have no connections on them. To be honest this is unnecessary complexity, since for our purposes both situations are equivalent.

  • We need a new method to check for devices with Status: Offline that have a row/rack assigned. I'm sure there are plenty of those now.

Reasonable, straightforward.

Finally, food for thought: with these coherence issues fixed up, we can be reasonably sure that we have data that looks to be formatted properly, but we can't be sure whether it's bogus (e.g. an asset tag, Phabricator task or purchase date typo). Ideas on how to detect these:

  • Assets with the same procurement task should probably have the same purchase date. I wonder if that's the case in the wild though; I know that in some cases we had multiple shipments for the same order and may have entered different invoicing dates. Needs to be checked before it gets into the report.

Interesting, I'll think about that one.

  • <insert crazy idea about checking whether the Phabricator task is marked with #procurement>

hoo boy.

I forgot another one, the opposite of this:

We need a new method to check for devices with Status: Offline that have a row/rack assigned. I'm sure there are plenty of those now.

i.e. alert on any devices with status not in (offline, planned) but with no row/rack assigned :)
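
For illustration, a rough sketch covering both directions (Offline but racked, and not Offline/Planned but unracked), assuming the stock Report API and that the status labels render as "Offline"/"Planned":

  from dcim.models import Device
  from extras.reports import Report

  # Statuses that are allowed to have no rack assigned.
  UNRACKED_OK = {"offline", "planned"}


  class RackCoherence(Report):
      description = "Cross-check device status against rack assignment"

      def test_offline_but_racked(self):
          # Devices marked Offline should not still hold a rack position.
          for device in Device.objects.exclude(rack=None):
              if device.get_status_display().lower() == "offline":
                  self.log_failure(device, "status is Offline but a rack is assigned")
              else:
                  self.log_success(device)

      def test_active_but_unracked(self):
          # The opposite direction: anything not Offline/Planned should be racked.
          for device in Device.objects.filter(rack=None):
              if device.get_status_display().lower() not in UNRACKED_OK:
                  self.log_failure(device, "status is {} but no rack is assigned".format(device.get_status_display()))
              else:
                  self.log_success(device)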

I forgot another one, the opposite of this:

We need a new method to check for devices with Status: Offline that have a row/rack assigned. I'm sure there are plenty of those now.

i.e. alert on any devices with status not in (offline, planned) but with no row/rack assigned :)

That makes sense and should be pretty straightforward. Do you want this in the coherence checks?

That makes sense and should be pretty straightforward. Do you want this in the coherence checks?

Yup, that seems to fit there!

Thanks for refiguring the checklist :)

Change 504367 had a related patch set uploaded (by CRusnov; owner: CRusnov):
[operations/software/netbox-reports@master] coherence report: General improvements and rack checks

https://gerrit.wikimedia.org/r/504367

Can we have the https://netbox.wikimedia.org/extras/reports/puppetdb.PuppetDB/ report, for test_netbox_in_puppetdb, also list the state of the server next to the hostname? Having that in the report would help prioritize the kind of correction needed.

staged items not in site.pp are bad, but active items are worse, etc...

robh requests that the status show up in test_netbox_in_puppetdb

exclude esams from console report

OK, a few more comments:

For the Coherence report, I did a big audit and all devices (offline and online) should have purchase date + ticket + asset tag now. Exceptions are esams/knams, plus real errors that should be fixed. So:

  • the by-role blacklist (Cable management/Storage bin/Optical device) should not apply to those three fields; in other words, it should apply only to the serial number, since these assets do not have one.
  • As a direct consequence of the above, the previously mentioned blacklisting of Status: Offline should not be applied anymore.

For the PuppetDB report:

  • I wonder if we should exclude VMs that are ADMIN_down from the Ganeti<->Netbox sync (not just the report). The PuppetDB report has only 2 VMs outstanding right now across all checks (yay!), and one of them is d-i-test, which is that way by design. I'm on the fence myself.
  • We should add another check that compares the device type against facter's productname. It should match in all cases :) We should probably also do the same for Netbox's manufacturer vs. PuppetDB's manufacturer fact, although note that a) facter self-reports Dell as "Dell Inc.", so we'd need to mangle that, and b) HP renamed its products to HPE at some point, which is not reflected in Netbox. (Rough sketch below.)
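
A rough sketch of the productname check. How the facts get into the report is left as a placeholder (puppetdb_facts() is hypothetical and stands for however the existing PuppetDB report fetches its data); the alias table only covers the Dell quirk noted above:

  from dcim.models import Device
  from extras.reports import Report

  # Vendor-name quirks noted above; HP vs. HPE would need extra handling.
  MANUFACTURER_ALIASES = {
      "Dell Inc.": "Dell",
  }


  def puppetdb_facts():
      # Hypothetical placeholder: should return {hostname: {"productname": ..., "manufacturer": ...}},
      # fetched however the existing PuppetDB report gets its facts.
      return {}


  class ProductNameCheck(Report):
      description = "Compare Netbox device types against facter's productname"

      def test_device_type_vs_productname(self):
          facts = puppetdb_facts()
          for device in Device.objects.filter(name__in=facts):
              productname = facts[device.name].get("productname", "")
              # Substring match, to tolerate e.g. "R640" vs. "PowerEdge R640".
              if device.device_type.model in productname:
                  self.log_success(device)
              else:
                  self.log_failure(device, "device type {} does not match productname {}".format(
                      device.device_type.model, productname))

      def test_manufacturer_vs_fact(self):
          facts = puppetdb_facts()
          for device in Device.objects.filter(name__in=facts):
              reported = facts[device.name].get("manufacturer", "")
              reported = MANUFACTURER_ALIASES.get(reported, reported)
              if device.device_type.manufacturer.name == reported:
                  self.log_success(device)
              else:
                  self.log_failure(device, "manufacturer {} does not match fact {}".format(
                      device.device_type.manufacturer.name, reported))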

Make asset tag and task checking a bit smarter (e.g. catch cases like ticket: T2). Should probably check for WMF\d{4,} and T\d{5,} or something.

This is relevant to prevent entering wrong data in the first place: https://github.com/digitalocean/netbox/issues/609
Opened in 2016 though.

For the PuppetDB report:

  • I wonder if we should exclude VMs that are ADMIN_down from the Ganeti<->Netbox sync (not just the report). The PuppetDB report has only 2 VMs outstanding right now across all checks (yay!), and one of them is d-i-test, which is that way by design. I'm on the fence myself.

Excluding them from the sync (as a missing machine) would indeed prevent them from showing up in the report. Perhaps a better way would be to use that to set the machine's status instead, so the machine would still be present, just with a status that we could exclude.

  • We should add another check that compares the device type against facter's productname. It should match in all cases :) We should probably also do the same for Netbox's manufacturer vs. PuppetDB's manufacturer fact, although note that a) facter self-reports Dell as "Dell Inc.", so we'd need to mangle that, and b) HP renamed its products to HPE at some point, which is not reflected in Netbox.

After a little bit of digging: this would be an additional field to ship from the PuppetDB microapi? Facter doesn't provide a separate API, does it?

Change 506001 had a related patch set uploaded (by CRusnov; owner: CRusnov):
[operations/software/netbox-reports@master] Minor improvements to PuppetDB report

https://gerrit.wikimedia.org/r/506001

Change 506008 had a related patch set uploaded (by CRusnov; owner: CRusnov):
[operations/software/netbox-reports@master] Cleanups to the oldhardware report

https://gerrit.wikimedia.org/r/506008

Change 506017 had a related patch set uploaded (by CRusnov; owner: CRusnov):
[operations/software/netbox-reports@master] Minor improvements to management console report

https://gerrit.wikimedia.org/r/506017

For the PuppetDB report:

  • I wonder if we should exclude VMs that are ADMIN_down from the Ganeti<->Netbox sync (not just the report). The PuppetDB report has only 2 VMs outstanding right now across all checks (yay!), and one of them is d-i-test, which is that way by design. I'm on the fence myself.

Excluding them from the sync (as a missing machine) would indeed prevent them from showing up in the report. Perhaps a better way would be to use that to set the machine's status instead, so the machine would still be present, just with a status that we could exclude.

Unfortunately there is no Status field for VMs. We could always add it as a custom field, with our own options for it, to be maintained separately... I'm legitimately unsure whether that's something we should do or not.

  • We should add another check that compares the device type against facter's productname. It should match in all cases :) We should probably also do the same for Netbox's manufacturer vs. PuppetDB's manufacturer fact, although note that a) facter self-reports Dell as "Dell Inc.", so we'd need to mangle that, and b) HP renamed its products to HPE at some point, which is not reflected in Netbox.

After a little bit of digging: this would be an additional field to ship from the PuppetDB microapi? Facter doesn't provide a separate API, does it?

Correct :)

I found (and corrected) two devices yesterday that had a purchase date of 2020-MM-DD. Let's add a simple check for "purchase date is in the future" to catch and avoid those :)
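
A minimal sketch of such a check, with the "purchase_date" custom field name and accessor as assumptions (and assuming the field comes back as a date object):

  import datetime

  from dcim.models import Device
  from extras.reports import Report


  class PurchaseDateSanity(Report):
      description = "Flag purchase dates that lie in the future"

      def test_purchase_date_not_in_future(self):
          today = datetime.date.today()
          for device in Device.objects.all():
              # Hypothetical custom-field accessor, as in the earlier sketches.
              date = (getattr(device, "cf", None) or {}).get("purchase_date")
              if date and date > today:
                  self.log_failure(device, "purchase date {} is in the future".format(date))
              elif date:
                  self.log_success(device)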

Change 504367 merged by CRusnov:
[operations/software/netbox-reports@master] coherence report: General improvements and rack checks

https://gerrit.wikimedia.org/r/504367

crusnov updated the task description.

Coherence report quality pass is deployed.

(Not sure if I should be piling on this never-ending task!)

The oldhardware report isn't super useful right now. It does what it was designed to do, but due to various factors (the refresh rate varies per type of device; sometimes we go over 5 years; sometimes the replacement is already there and about to be executed, etc.) it's not super useful as a tool for DC Ops to be alerted on and to triage. We typically do this audit once a year during budgeting anyway, but at that point we're looking one year ahead. We also track this regularly as part of the budget execution, so having this as a report isn't of particular help to me as budget owner, nor, I think, to DC Ops.

(Oh, and the "null purchase date" portion of it is now redundant -- the Coherence check already alerts us of this.)

One thing that /could/ be useful for this report though, would be to report on hardware > 4 or 4.5 years that is Status: Planned. Basically: once a spare device is sufficiently old that it would not make sense to provision anything new on it, to alert us so that we can transition it to Status: Inventory and eventually Offline. Does that make sense?
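
If we go that route, a rough sketch of the idea; the 4.5-year cutoff comes from the suggestion above, while the "Planned" status label, the "purchase_date" custom field name and its accessor are assumptions:

  import datetime

  from dcim.models import Device
  from extras.reports import Report

  # Roughly 4.5 years, per the suggestion above.
  SPARE_AGE_LIMIT = datetime.timedelta(days=int(4.5 * 365))


  class AgingSpares(Report):
      description = "Planned devices old enough that they should move to Inventory/Offline"

      def test_old_planned_devices(self):
          today = datetime.date.today()
          for device in Device.objects.all():
              if device.get_status_display().lower() != "planned":
                  continue
              # Hypothetical custom-field accessor; assumes a date object is returned.
              purchased = (getattr(device, "cf", None) or {}).get("purchase_date")
              if purchased and today - purchased > SPARE_AGE_LIMIT:
                  self.log_warning(device, "Planned device purchased {} is over 4.5 years old".format(purchased))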

  • We should add another check that compares the device type against facter's productname. It should match in all cases :) We should probably also do the same for Netbox's manufacturer vs. PuppetDB's manufacturer fact, although note that a) facter self-reports Dell as "Dell Inc.", so we'd need to mangle that, and b) HP renamed its products to HPE at some point, which is not reflected in Netbox.

Given how unique models are across vendors, I'm leaning towards not adding a manufacturer check -- more trouble than it's worth.

As for model mismatches: cloudvirt1026-cloudvirt1030 are good examples of an error there (they're R640, but documented as R630).

Change 506001 merged by CRusnov:
[operations/software/netbox-reports@master] Minor improvements to PuppetDB report

https://gerrit.wikimedia.org/r/506001

Change 506017 merged by CRusnov:
[operations/software/netbox-reports@master] Minor improvements to management console report

https://gerrit.wikimedia.org/r/506017

Change 506008 merged by CRusnov:
[operations/software/netbox-reports@master] Cleanups to the oldhardware report

https://gerrit.wikimedia.org/r/506008

Just a note: admin_down does not seem to indicate anything in particular about the machines that would be useful to record in Netbox, as far as I can tell? It seems to reflect the *desired* state. To clarify: is there any situation where it would not match the op_state within a short period of time? AFAICT it is used to tell Ganeti to bring the machine down or up, but I may be incorrect here. I have implemented mirroring the op_state, but if we truly do need an extra field for the admin_state, that would be useful to know.

Just a note: admin_down does not seem to indicate anything in particular about the machines that would be useful to record in Netbox, as far as I can tell? It seems to reflect the *desired* state. To clarify: is there any situation where it would not match the op_state within a short period of time? AFAICT it is used to tell Ganeti to bring the machine down or up, but I may be incorrect here. I have implemented mirroring the op_state, but if we truly do need an extra field for the admin_state, that would be useful to know.

You are correct that admin_state is the desired state, while oper_state is the operating one; basically, if your desired state is "this VM should be up" and for some reason (e.g. the host had an unscheduled reboot, or QEMU crashed etc.) the VM is down, then admin != oper, and a cronjob that runs the ganeti "watcher" will execute and fix things up (= start the VM). Until that happens, gnt-instance list will list the VM with a status of "ERROR_down", meaning "it's down, but it shouldn't be".

For the purposes of Netbox, I think what we need is the admin state, not the operating one, i.e. what we've configured Ganeti to do, rather than what has actually happened due to an error. The equivalent for a physical host is that we expect a Status: Offline host to be powered down, and a Status: Active host to be powered up, but we don't really track whether we've shut down or powercycled a host manually.

Does that make sense and do you agree?
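
To illustrate the distinction in code, a minimal sketch assuming the Ganeti RAPI bulk instance listing exposes admin_state as "up"/"down"; the function name is made up:

  def vm_state_for_netbox(instance):
      # Pick the state to record for a VM from one Ganeti RAPI bulk-list entry.
      # Keyed off admin_state (the desired state) rather than the operational
      # status, so a crashed VM (ERROR_down) is still recorded as up until
      # someone deliberately shuts it down (ADMIN_down).
      return "up" if instance.get("admin_state") == "up" else "down"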

Just a note: admin_down does not seem to indicate anything in particular about the machines that would be useful to record in Netbox, as far as I can tell? It seems to reflect the *desired* state. To clarify: is there any situation where it would not match the op_state within a short period of time? AFAICT it is used to tell Ganeti to bring the machine down or up, but I may be incorrect here. I have implemented mirroring the op_state, but if we truly do need an extra field for the admin_state, that would be useful to know.

You are correct that admin_state is the desired state, while oper_state is the operating one; basically, if your desired state is "this VM should be up" and for some reason (e.g. the host had an unscheduled reboot, or QEMU crashed etc.) the VM is down, then admin != oper, and a cronjob that runs the ganeti "watcher" will execute and fix things up (= start the VM). Until that happens, gnt-instance list will list the VM with a status of "ERROR_down", meaning "it's down, but it shouldn't be".

For the purposes of Netbox, I think what we need is the admin state, not the operating one, i.e. what we've configured Ganeti to do, rather than what has actually happened due to an error. The equivalent for a physical host is that we expect a Status: Offline host to be powered down, and a Status: Active host to be powered up, but we don't really track whether we've shut down or powercycled a host manually.

Does that make sense and do you agree?

Yes, I agree; thank you for the additional context.

Mentioned in SAL (#wikimedia-operations) [2019-05-21T15:01:10Z] <crusnov@deploy1001> Started deploy [netbox/deploy@3091b51]: deploy new version of ganeti-netbox sync - T220422

Mentioned in SAL (#wikimedia-operations) [2019-05-21T15:02:05Z] <crusnov@deploy1001> Finished deploy [netbox/deploy@3091b51]: deploy new version of ganeti-netbox sync - T220422 (duration: 00m 55s)

crusnov updated the task description.

Merged and deployed the change, which uses admin_state instead.