Netbox Alert Cleanups
Closed, Resolved · Public

Description

  • Fix check messages so they don't get so mangled (this is fixed in the netbox split patch).
  • Add information to the check_url (it's in Wikitech, but we could also link the report results).
  • Make a dcops contact group, because most of these alerts are actionable by dcops.

Details

Related Gerrit Patches:
operations/software/netbox-deploy : master ganeti-sync: Add retries to api calls

Event Timeline

crusnov created this task. Jun 3 2019, 10:45 PM
Restricted Application added a subscriber: Aklapper. Jun 3 2019, 10:45 PM
crusnov triaged this task as Medium priority. Jun 3 2019, 10:45 PM
crusnov added a project: User-crusnov.
crusnov updated the task description. Jun 3 2019, 10:52 PM
crusnov moved this task from Backlog to In Progress on the User-crusnov board. Jun 17 2019, 11:13 PM
crusnov moved this task from In Progress to Pending on the User-crusnov board. Jul 30 2019, 10:50 PM
crusnov updated the task description. Aug 30 2019, 3:10 PM
crusnov updated the task description. Aug 30 2019, 3:15 PM
jijiki added a subscriber: jijiki. Nov 13 2019, 11:10 AM

I am not sure this is related, but we get many alerts like these:

  • PROBLEM - Check the Netbox report puppetdb for fail status. on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL
  • PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed

If those are actionable by dc-ops, or if we are getting more false positives than we should, we must fix that.

The first one is T237803

I have downtimed some of the alerts, but the downtime will expire in a couple of hours.

Change 550741 had a related patch set uploaded (by CRusnov; owner: CRusnov):
[operations/software/netbox-deploy@master] ganeti-sync: Add retries to api calls

https://gerrit.wikimedia.org/r/550741

Change 550741 merged by CRusnov:
[operations/software/netbox-deploy@master] ganeti-sync: Add retries to api calls

https://gerrit.wikimedia.org/r/550741
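
For readers unfamiliar with the idea behind "Add retries to api calls", here is a minimal sketch of a retry wrapper with exponential backoff. It is not taken from change 550741; the helper name `call_with_retries` and all of its parameters are hypothetical, and the real patch may handle retries quite differently.

```python
import time


def call_with_retries(func, attempts=3, delay=1.0, backoff=2.0,
                      exceptions=(Exception,), sleep=time.sleep):
    """Call func(); if it raises one of `exceptions`, wait and try again.

    Hypothetical helper for illustration only, not code from the patch.
    """
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    wait = delay
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except exceptions:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            sleep(wait)      # pause before the next try
            wait *= backoff  # grow the pause each time
```

A caller would wrap a flaky API request in a zero-argument function or lambda and pass it in, narrowing `exceptions` to the transient errors it expects so that genuine bugs still fail fast.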

What's the latest here? Please keep the task updated :)

crusnov closed this task as Resolved. Nov 20 2019, 3:23 AM
crusnov updated the task description.

  • The contact group is in place and working.
  • All reports alert to the dcops channel.
  • The check URL now points to the report URL.