
Netbox report accounting icinga alert
Closed, ResolvedPublic

Description

https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=netbox1001&service=Netbox+report+accounting

The alert Netbox report accounting has been alerting as critical for 41 days.

ACKing it and referencing this task.

Event Timeline

ayounsi created this task.
wiki_willy added a project: ops-eqiad.
wiki_willy added a subscriber: Jclark-ctr.

Hey @Jclark-ctr - per our conversation from last Thursday, can you work on fixing the following Netbox errors for eqiad when you go onsite this week?

https://netbox.wikimedia.org/extras/reports/accounting.Accounting/
https://netbox.wikimedia.org/extras/reports/coherence.Rack/

Thanks,
Willy

@wiki_willy fixed accounting report

Coherence report: the only item remaining for eqiad is flerovium-array2; will have to check its U position on site tomorrow.

Fixed error in Netbox for flerovium-array2. @Jclark-ctr - once you have msw-a2-eqiad added into Julianne's spreadsheet (at the top in line 8) and fix the duplicate cable labels on https://netbox.wikimedia.org/dcim/cables/1585/ and https://netbox.wikimedia.org/dcim/cables/1587/ , then you can close out this request. Thanks, Willy

@wiki_willy updated cable id numbers...please verify and resolve this

Looks good now @Cmjohnson. Resolving task

Alerting for 6 days.
Right now says: backup1002 backup1002-array not present in accounting.

@ayounsi - I think the alert is being triggered from the Finance spreadsheet:

https://docs.google.com/spreadsheets/d/11xbHX7lRzglFYc85kvmtOjssOfm3tCkxH7cYr7ImFbk/edit#gid=0

Once they populate it with the purchases from May, the alert should go away.

Thanks,
Willy

Re-opening this task to ACK the alert in Icinga; it has been cluttering the active alert list for 64 days :)

https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=netbox1001&service=Netbox+report+accounting

@ayounsi - what exactly needs to be done by dc-ops? When I look at the Netbox accounting report, everything appears to be alerting because the accounting team hasn't updated their spreadsheet since June. The three at the top alerting with s/n = x are there because the accounting team still lists some RMA'd PDUs on their spreadsheet which we've already returned (which is why they're not in Netbox). The only one that looks legitimate is probably the mx204 router with s/n = SV5533, which I don't recall alerting when this task was initially closed in June.

I re-opened it instead of opening a new one to keep context, and it's better to discuss it than to leave that long-standing alert.

I think it depends on what we have control over; I don't know how the accounting part works. Some thoughts:

  • Maybe DC Ops should ask accounting what's going on and what we should expect? That report used to be green most of the time, so maybe something (a process, etc.) changed.
  • Maybe we should edit the accounting report to ignore items that are too recent?
  • Maybe we shouldn't have an Icinga check for that report at all?

Yes, my issue with this accounting report is that while it's useful for comparing discrepancies between Netbox and the accounting spreadsheet, it's constantly red: we keep adding new installs into Netbox, while the accounting report takes 2+ months to catch up. It's been like this for 2-3 quarters now. If we could ignore the bottom portion of the report so that it doesn't trigger alerts (maybe code it yellow, like the cable report), then we could focus on just the Netbox errors that are controllable within dc-ops.

Broadly speaking:

  • We shouldn't have outstanding alerts open (or even acknowledged) for more than a few days. If there is an alert, it means there is an abnormal condition that requires fixing. If the issues require a significant amount of work to address, then a task should be created and the alert acknowledged with the task referenced in the comment while it's getting fixed. I'd expect the DC Ops teams to be primary for such alerts and act on them, but everyone in SRE is also expected to triage alerts, reach out to owners, and file tasks about them (like @ayounsi did here).
  • If there are frequent false positives, that is something we should fix. We probably need one or more separate tasks describing the conditions under which an alert is triggered erroneously, so that we can fix them. I'd expect the DC Ops team to file those tasks, and I/F to change the report to meet the adjusted needs.
  • The test_missing_assets_from_accounting report already ignores (and has always ignored) discrepancies for items whose purchase date is within the last 90 days. This is configurable and we can tune it to some other value; it was picked as long enough for accounting to process invoices, but short enough that the purchase hasn't yet fallen out of memory (or vendor engagement ended, team changes happened, etc.). If there is a persistent backlog in Finance of more than 90 days, it'd be good to know so we can adjust.
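
The grace-period filter described in that last bullet can be sketched roughly as follows. This is a minimal illustration, not the actual report code: the function name, the data shapes, and the ACCOUNTING_GRACE_DAYS constant are all assumptions for the example.

```python
from datetime import date, timedelta

# Assumed tunable, mirroring the 90-day window mentioned above.
ACCOUNTING_GRACE_DAYS = 90

def missing_from_accounting(netbox_assets, accounting_serials, today=None):
    """Return serials present in Netbox but absent from accounting,
    skipping items still inside the grace window.

    netbox_assets: iterable of (serial, purchase_date) tuples.
    accounting_serials: set of serials the spreadsheet knows about.
    """
    today = today or date.today()
    cutoff = today - timedelta(days=ACCOUNTING_GRACE_DAYS)
    return [
        serial
        for serial, purchased in netbox_assets
        if purchased <= cutoff and serial not in accounting_serials
    ]
```

With this shape, an asset purchased three weeks ago would be silently skipped even if accounting hasn't recorded it yet, while an older unrecorded asset would still be reported.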

Now, spot-checking this, it doesn't seem that the root cause is (just) a lack of recent updates from the AP team:

  • There is a mismatch for cr3-eqsin's serial number between what accounting thinks and what we have in Netbox. If we have the wrong serial this could bite us in the future in renewal contracts etc., so we'd need to resolve this ambiguity and figure out what the right serial is, so the alert here is correct.
  • The items marked with "x" seem to be missing S/Ns for assets we have returned (though they also seem to reference the wrong task?). If we don't have these S/Ns anymore, we should leave the fields empty, which will clear the report. In any case, this isn't something the finance team will fix if we don't act on it, so this alert doesn't feel like a false positive either.
  • The rest seem to be msws and mr1, which are special in that they probably don't meet the capitalization threshold. I think the last time this occurred, a new section was built into the sheet at the top, but it's not clear to me whether that is waiting on Finance or on the DC Ops team to fill out. (So I'm not sure if this is lag from Accounting, or intentional on their end.)

Hope this helps!

For the last couple of bullets, we (dc-ops) own that, but we typically make the change after Accounting has updated their spreadsheet (when we move the <$1k assets to the top of the sheet, along with fixing any other new discrepancies), which is why it has been alerting for 60+ days. cr3-eqsin from the first bullet is valid though, and is fixed now.

In general, I haven't been a big fan of how the Netbox errors are reported. An onsite engineer could install a bunch of new hardware one day, not have enough time to check the Netbox reports before they leave, and then a week goes by before their next trip onsite... where they end up prioritizing other new tasks over fixing the error. Or if they're already at home updating the Netbox entries but have to be onsite to verify a mismatch, it also gets pushed to the back burner, as other priorities pop up during their next site visit.

These past few weeks, I've also seen Netbox alerts get generated because the Fundraising team changed the status of one of their hosts, because an SRE closed out a decom task without having it go through dc-ops, or because a host got depooled by an SRE without its status being changed, so I don't think dc-ops is the only cause of these Netbox errors.

I'm just thinking out loud here, but to keep each individual accountable (whether it's dc-ops or someone else) for any Netbox errors created, would it be possible to notify the user who made the change directly? Either via an autogenerated email or an autogenerated task assigned to them? It would also be a nice feature if the Netbox report had an additional column displaying the username of the individual who last updated the failed line item. If something pops up to get each individual's attention (and stays on their radar until fixed), I'm hoping there's a small adjustment we can make to everyone's lives easier - easier for everyone who gets spammed with alerts, and easier for the people who need to fix the errors.

In the meantime, I'm trying to get hold of Julianne to find out the latest word on the frequency of their accounting spreadsheet, so hopefully I'll have an answer on that soon. Thanks, Willy

In general, I haven't been a big fan of how the Netbox errors are reported. An onsite engineer could install a bunch of new hardware one day, not have enough time to check the Netbox reports before they leave, and then a week goes by before their next trip onsite... where they end up prioritizing other new tasks over fixing the error. Or if they're already at home updating the Netbox entries but have to be onsite to verify a mismatch, it also gets pushed to the back burner, as other priorities pop up during their next site visit.

Just to make sure I understand this: what would be the ideal time for these alerts, then? If the techs don't fix these on the day of the change, nor at home, nor on the next trip… that doesn't leave a ton of room :)

These past few weeks, I've also seen Netbox alerts get generated because the Fundraising team changed the status of one of their hosts, because an SRE closed out a decom task without having it go through dc-ops, or because a host got depooled by an SRE without its status being changed, so I don't think dc-ops is the only cause of these Netbox errors.

I'm just thinking out loud here, but to keep each individual accountable (whether it's dc-ops or someone else) for any Netbox errors created, would it be possible to notify the user who made the change directly? Either via an autogenerated email or an autogenerated task assigned to them? It would also be a nice feature if the Netbox report had an additional column displaying the username of the individual who last updated the failed line item. If something pops up to get each individual's attention (and stays on their radar until fixed), I'm hoping there's a small adjustment we can make to everyone's lives easier - easier for everyone who gets spammed with alerts, and easier for the people who need to fix the errors. In the meantime, I'm trying to get hold of Julianne to find out the latest word on the frequency of their accounting spreadsheet, so hopefully I'll have an answer on that soon.

I really appreciate you trying to come up with solutions and thinking out loud -- this is great! The idea around "last change" is interesting, but it's going to prove hard to implement: besides Netbox missing that feature, inconsistencies are often by design checks against external systems, which have their own state (e.g. changes in the accounting spreadsheet, or runtime state in LibreNMS or the Puppet database, etc.).

It's definitely the case that DC Ops are not the only source of inconsistencies here! But I wouldn't want to make "clean reports" a shared accountability either. I'd like DC Ops to act as primary here, doing the first-line triage and either fixing things directly or raising them with the appropriate party in a task --whether an SRE team, a Tech team, Finance, etc.-- and escalating as necessary if it doesn't get fixed promptly. That way, the team could maintain the big picture, start identifying patterns of erroneous actions and false positives, and fix issues as it comes across them. Perhaps even come up with new tests and reports to perform! Does that make sense?

To your first question, I was hoping there could be some type of autogenerated task that assigns each DC engineer a Phabricator task by data center site. The idea is that it would assign ownership by attaching a name to each specific alert, and be visible as a constant reminder in that individual's queue every day. Currently, there's some ambiguity where an individual isn't clear whether they or someone else (accounting, the service owner, etc.) is responsible for fixing certain errors on the Netbox report. So individuals may fix what they believe is theirs, and the rest ends up in limbo for a while. But if there's an open task showing exactly which part of the error report they're responsible for fixing, I think this would provide more clarity, thus shortening the turnaround time.

From my understanding now, none of this can be automated (or would at least be very complex to implement, whether it's displaying the username of the last change or autogenerating tasks for individuals), so we'll start taking care of it manually. Though not ideal, we can create the subtasks ourselves to divide up the errors, pinpoint who made the change, and then go about resolving from there. I think we can do this for all the reports except puppetdb.VirtualMachines. puppetdb.PhysicalHosts is a bit tough as well, since it fluctuates on and off quite a bit as we (and other SREs) install new hosts, so I almost wonder if there should be a delay set for the alerts under this report bucket.

If possible though, I would like the ability to pull some data around what we're doing. For example, being able to go back and pull the number of Netbox errors per week, per month, and per site, to see how we're trending - are we improving, what are the numbers, which users are creating the most errors, what's the average time to fix, etc. I think the trends from this data would tell a more complete story vs. the general perception that the Netbox reports are constantly in a failed state.
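
The kind of trending described above boils down to a simple aggregation once failure records exist somewhere. A minimal sketch, assuming a hypothetical log of (day, site) failure records (the record shape and function name are illustrative, not an existing tool):

```python
from collections import Counter
from datetime import date

def errors_per_week_per_site(records):
    """Aggregate failure records into counts keyed by
    (iso_year, iso_week, site).

    records: iterable of (day, site) tuples, where day is a datetime.date.
    """
    counts = Counter()
    for day, site in records:
        iso_year, iso_week, _ = day.isocalendar()
        counts[(iso_year, iso_week, site)] += 1
    return dict(counts)
```

The same idea extends to per-month or per-user breakdowns by changing the grouping key.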

there could be some type of auto-generated task that assigns each DC engineer a Phabricator task

That's something we've wanted for a while; see for example T225140. But as Faidon mentioned, it's not always easy to figure out who caused the issue/mismatch, due to the various systems at play. For example, the person who last changed the device state (visible in the changelog tab) might not be the person who removed a host from Puppet. But maybe it would be close enough?
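
The "close enough" heuristic mentioned above could be sketched as follows: given changelog entries for the failing objects, pick the user behind the most recent change per object. This is a hypothetical illustration; the field names (`changed_object_id`, `time`, `user`) are assumptions modeled loosely on Netbox's changelog, not a real integration.

```python
def last_editor_per_object(changes):
    """Map each object id to the username of its most recent change.

    changes: iterable of dicts with "changed_object_id", "time"
    (ISO 8601 string, so lexical sort == chronological sort), and "user".
    """
    latest = {}
    for change in sorted(changes, key=lambda c: c["time"]):
        # Later entries overwrite earlier ones, so the final value
        # for each id is the most recent editor.
        latest[change["changed_object_id"]] = change["user"]
    return latest
```

As noted, the last editor in Netbox isn't always the person who caused the mismatch (e.g. a Puppet-side removal), so this would only ever be a best-effort hint.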

If possible though, I would like the ability to pull some data around what we're doing.

I opened T262898 for this.

Thanks @ayounsi, much appreciated for creating T262898 - glad the Netbox graphs won't be too hard to generate. Also, as a follow-up on my earlier action item: I chatted with Julianne, and it looks like she's behind on updating last month's data in their spreadsheet. The reason it wasn't done for July was that we didn't have any orders that month, which will probably be the case more often going forward as we tighten up the timeline around procurement orders. So with this info, we'll start fixing these Accounting errors as part of our regular subtasks for Netbox errors. @Jclark-ctr says he'll finish up everything listed in this accounting report by end of the week. Thanks, Willy

Resolved the Netbox error alerts. Closing task.