- Fix the check messages so they don't get so mangled (this is already fixed in the netbox split patch).
- Add information to the check_url (it's on Wikitech, but we could also link to the report results).
- Create a dcops contact group, because most of these alerts are actionable by dcops.
Description
Details
Subject | Repo | Branch | Lines +/-
---|---|---|---
ganeti-sync: Add retries to api calls | operations/software/netbox-deploy | master | +42 -10
Status | Assigned | Task
---|---|---
Resolved | crusnov | T221113 Netbox Reports: Create an icinga check for alerting on a set of Netbox reports
Resolved | crusnov | T224946 Netbox Alert Cleanups
Event Timeline
I am not sure this is related, but we are getting many alerts such as:
- PROBLEM - Check the Netbox report puppetdb for fail status. on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL
- PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed
If those are actionable by dc-ops, or if we are getting more false positives than we should, we must fix it.
I have downtimed some of the alerts, but the downtime will expire in a couple of hours.
Change 550741 had a related patch set uploaded (by CRusnov; owner: CRusnov):
[operations/software/netbox-deploy@master] ganeti-sync: Add retries to api calls
Change 550741 merged by CRusnov:
[operations/software/netbox-deploy@master] ganeti-sync: Add retries to api calls
- The contact group is in place and working.
- All reports alert to the dcops channel.
- The check URL now points to the report URL.