Netbox Alert Cleanups
Closed, Resolved (Public)

Description

  • Fix check messages so they don't get mangled (this is already fixed in the netbox split patch).
  • Add information to the check_url (it's in Wikitech, but we could also link the report results).
  • Make a dcops contact group, because most of these alerts are actionable by dcops.

Event Timeline

crusnov added a project: User-crusnov.

I am not sure this is related, but we get many alerts like:

  • PROBLEM - Check the Netbox report puppetdb for fail status. on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL
  • PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed

If those are actionable by dc-ops, or if we are getting more false positives than we should, we must fix that.

The first one is T237803

I have downtimed some of the alerts, but the downtime will expire in a couple of hours.

Change 550741 had a related patch set uploaded (by CRusnov; owner: CRusnov):
[operations/software/netbox-deploy@master] ganeti-sync: Add retries to api calls

https://gerrit.wikimedia.org/r/550741

Change 550741 merged by CRusnov:
[operations/software/netbox-deploy@master] ganeti-sync: Add retries to api calls

https://gerrit.wikimedia.org/r/550741
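For context, a minimal sketch of what adding retries to API calls can look like. This is not the code from change 550741; `call_with_retries` and all of its parameters are hypothetical names chosen for illustration, and the exponential backoff is an assumption rather than something the patch is known to do.

```python
import time


def call_with_retries(func, attempts=3, delay=1.0, backoff=2.0,
                      exceptions=(Exception,), sleep=time.sleep):
    """Call func(), retrying on the given exceptions.

    Hypothetical helper for illustration only: waits `delay` seconds after
    the first failure and multiplies the wait by `backoff` after each
    subsequent failure. The last failure is re-raised to the caller.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except exceptions:
            # Out of attempts: let the caller (and the alerting) see the error.
            if attempt == attempts:
                raise
            sleep(delay)
            delay *= backoff
```

A caller would wrap a single API request, for example `call_with_retries(lambda: fetch_instances(), exceptions=(ConnectionError,))`, where `fetch_instances` stands in for whatever function performs the request. Restricting `exceptions` to transient errors keeps genuine failures visible instead of retrying them.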

What's the latest here? Please keep the task updated :)

crusnov updated the task description.

  • Contact group is in place and working.
  • All reports alert to dcops channel.
  • The check URL now points to the report URL.