Whenever an Icinga check is generated from a threshold on a Grafana/Graphite/Prometheus(?) graph or time series, a graph of the same monitored metric must be saved in Grafana, and the direct link to that specific graph should be included in the check in such a way that ircecho also prints it on IRC. The description of the check seems a good candidate to hold the link.
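One way to enforce this requirement is to validate every metric check definition before it is deployed. The Python sketch below is purely illustrative: the `validate_dashboard_links` helper, the idea of passing the links as a list, and the accepted host are assumptions made for the example, not the actual Puppet implementation.

```python
from urllib.parse import urlparse

# Assumed Grafana host, taken from the dashboard URLs quoted in this task.
GRAFANA_HOST = "grafana.wikimedia.org"


def validate_dashboard_links(links):
    """Return the links as a list if they are valid, raise ValueError otherwise.

    Hypothetical rule: at least one link is required, and each link must be
    an https URL pointing at the Grafana host.
    """
    if not links:
        raise ValueError("a metric check must have at least one dashboard link")
    for link in links:
        parsed = urlparse(link)
        if parsed.scheme != "https" or parsed.netloc != GRAFANA_HOST:
            raise ValueError(f"not a Grafana dashboard link: {link!r}")
    return list(links)
```

A check defined without any link, or with a link to another host, would then fail early instead of reaching Icinga without a usable graph.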
Related Objects
- Mentioned In
- T184714: Puppet fail to properly refresh Icinga
Event Timeline
I've started working on this. I hoped to be able to finish it today, but the list of checks is long. I will complete it when I'm back.
Change 391235 had a related patch set uploaded (by Volans; owner: Volans):
[operations/puppet@production] Icinga: allow to set display_name
Change 391236 had a related patch set uploaded (by Volans; owner: Volans):
[operations/puppet@production] Metric alarms: add link to the Grafana dashboard
Change 391237 had a related patch set uploaded (by Volans; owner: Volans):
[operations/puppet@production] Icinga notification: use display_name in messages
Change 391238 had a related patch set uploaded (by Volans; owner: Volans):
[operations/puppet@production] Metric alarms: make link to Grafana mandatory
Change 391525 had a related patch set uploaded (by Volans; owner: Volans):
[operations/puppet@production] Grafana: add graph to Swift dashboard
Change 391235 merged by Volans:
[operations/puppet@production] Icinga: allow to set display_name
Change 391525 merged by Volans:
[operations/puppet@production] Grafana: add graph to Swift dashboard
Change 391236 merged by Volans:
[operations/puppet@production] Metric alarms: add link to the Grafana dashboard
Change 392405 had a related patch set uploaded (by Volans; owner: Volans):
[operations/puppet@production] graphite: fix typo
Change 392408 had a related patch set uploaded (by Volans; owner: Volans):
[operations/puppet@production] icinga: convert display_name in notes_url
Change 392409 had a related patch set uploaded (by Volans; owner: Volans):
[operations/puppet@production] icinga: remove display_name
Change 392408 merged by Volans:
[operations/puppet@production] icinga: convert display_name in notes_url
Change 392409 merged by Volans:
[operations/puppet@production] icinga: remove display_name
Change 392606 had a related patch set uploaded (by Volans; owner: Volans):
[operations/puppet@production] Icinga web: add icons for multiple notes_url items
Change 392607 had a related patch set uploaded (by Volans; owner: Volans):
[operations/puppet@production] Metric alarms: convert dashboad_link to array
Change 391237 merged by Volans:
[operations/puppet@production] Icinga notification: use notes_url in messages
Change 391238 merged by Volans:
[operations/puppet@production] Metric alarms: make link to Grafana mandatory
Change 392630 had a related patch set uploaded (by Volans; owner: Volans):
[operations/puppet@production] Icinga notes_url: do not pre-encode the URLs
Change 392630 merged by Volans:
[operations/puppet@production] Icinga notes_url: do not pre-encode the URLs
Change 392631 had a related patch set uploaded (by Volans; owner: Volans):
[operations/puppet@production] Metric alarms: fix dashboard link validation
Change 392631 merged by Volans:
[operations/puppet@production] Metric alarms: fix dashboard link validation
Change 392606 merged by Volans:
[operations/puppet@production] Icinga web: add icons for multiple notes_url items
Change 392607 merged by Volans:
[operations/puppet@production] Metric alarms: convert dashboad_link to array
Change 397759 had a related patch set uploaded (by Volans; owner: Volans):
[operations/puppet@production] Varnish instance: fix child restarted check
Change 397759 merged by Volans:
[operations/puppet@production] Varnish instance: fix child restarted check
Change 397796 had a related patch set uploaded (by Volans; owner: Volans):
[operations/puppet@production] Icinga: use the env variable instead of the macro
Change 397796 merged by Volans:
[operations/puppet@production] Icinga: use the env variable instead of the macro
Change 398037 had a related patch set uploaded (by Volans; owner: Volans):
[operations/puppet@production] Icinga: fine tune settings for dashboard links
Change 398037 merged by Volans:
[operations/puppet@production] Icinga: fine tune settings for dashboard links
Change 398054 had a related patch set uploaded (by Volans; owner: Volans):
[operations/puppet@production] Icinga: escape $ not in macro
Change 398054 merged by Volans:
[operations/puppet@production] Icinga: escape $ not in macro
Mentioned in SAL (#wikimedia-operations) [2017-12-21T17:01:19Z] <volans> debugging Icinga notes_url (no side effect expected but logging it in case there will be) T170353
To summarize the current status: everything is deployed and works as expected, except for one small detail: the ampersands are removed from the dashboard URLs, making them mostly useless :/
After quite some debugging and code reading, this is my current understanding:
- On tegmen I cannot reproduce this behaviour: the URLs are parsed and printed correctly, even with the configuration set the same as on einsteinium.
- From the debug logging I don't see any difference between the two hosts, but the result is different (see below).
- The Icinga version is the same on both hosts.
- From the code and the configuration, tegmen's behaviour seems to be the correct one, but I'm not 100% sure; I need to dig a bit more.
- My next suggested test would be to make tegmen active and see whether the roles get inverted.
Here you can find the debug output of the processed macro in tegmen (where the ampersands are not removed):
[1513878369.402851] [2048.2] [pid=2339] Processing part: 'SERVICENOTESURL'
[1513878369.402853] [2048.2] [pid=2339] macros[78] (SERVICENOTESURL) match.
[1513878369.402855] [2048.2] [pid=2339] New clean options: 4
[1513878369.402857] [2048.1] [pid=2339] **** BEGIN MACRO PROCESSING ***********
[1513878369.402859] [2048.1] [pid=2339] Processing: ''https://grafana.wikimedia.org/dashboard/db/graphite-eqiad?orgId=1&panelId=21&fullscreen' 'https://grafana.wikimedia.org/dashboard/db/graphite-codfw?orgId=1&panelId=21&fullscreen''
[1513878369.402861] [2048.2] [pid=2339] Processing part: ''https://grafana.wikimedia.org/dashboard/db/graphite-eqiad?orgId=1&panelId=21&fullscreen' 'https://grafana.wikimedia.org/dashboard/db/graphite-codfw?orgId=1&panelId=21&fullscreen''
[1513878369.402863] [2048.2] [pid=2339] Not currently in macro. Running output (179): ''https://grafana.wikimedia.org/dashboard/db/graphite-eqiad?orgId=1&panelId=21&fullscreen' 'https://grafana.wikimedia.org/dashboard/db/graphite-codfw?orgId=1&panelId=21&fullscreen''
[1513878369.402865] [2048.1] [pid=2339] Done. Final output: ''https://grafana.wikimedia.org/dashboard/db/graphite-eqiad?orgId=1&panelId=21&fullscreen' 'https://grafana.wikimedia.org/dashboard/db/graphite-codfw?orgId=1&panelId=21&fullscreen''
[1513878369.402867] [2048.1] [pid=2339] **** END MACRO PROCESSING *************
[1513878369.402869] [2048.2] [pid=2339] Processed 'SERVICENOTESURL', Clean Options: 4, Free: 1, Value: ''https://grafana.wikimedia.org/dashboard/db/graphite-eqiad?orgId=1&panelId=21&fullscreen' 'https://grafana.wikimedia.org/dashboard/db/graphite-codfw?orgId=1&panelId=21&fullscreen''
[1513878369.402871] [2048.2] [pid=2339] Processed 'SERVICENOTESURL', Clean Options: 4, Free: 1
[1513878369.402873] [2048.2] [pid=2339] Cleaning options: global=3, local=4, effective=7
[1513878369.402878] [2048.2] [pid=2339] Cleaned macro. Running output (292): 'echo "PROBLEM - carbon-frontend-relay metric drops on graphite1001 is CRITICAL: TEST - IGNORE - volans %27https://grafana.wikimedia.org/dashboard/db/graphite-eqiad?orgId=1&panelId=21&fullscreen%27+%27https://grafana.wikimedia.org/dashboard/db/graphite-codfw?orgId=1&panelId=21&fullscreen%27'
[1513878369.402880] [2048.2] [pid=2339] Just finished macro. Running output (292): 'echo "PROBLEM - carbon-frontend-relay metric drops on graphite1001 is CRITICAL: TEST - IGNORE - volans %27https://grafana.wikimedia.org/dashboard/db/graphite-eqiad?orgId=1&panelId=21&fullscreen%27+%27https://grafana.wikimedia.org/dashboard/db/graphite-codfw?orgId=1&panelId=21&fullscreen%27'
And here is the same output from einsteinium, where the 4 ampersands are removed from the URLs, despite the same cleaning options:
[1513875984.379768] [2048.2] [pid=1194] Processing part: 'SERVICENOTESURL'
[1513875984.379772] [2048.2] [pid=1194] macros[78] (SERVICENOTESURL) match.
[1513875984.379776] [2048.2] [pid=1194] New clean options: 4
[1513875984.379780] [2048.1] [pid=1194] **** BEGIN MACRO PROCESSING ***********
[1513875984.379784] [2048.1] [pid=1194] Processing: ''https://grafana.wikimedia.org/dashboard/db/graphite-eqiad?orgId=1&panelId=21&fullscreen' 'https://grafana.wikimedia.org/dashboard/db/graphite-codfw?orgId=1&panelId=21&fullscreen''
[1513875984.379789] [2048.2] [pid=1194] Processing part: ''https://grafana.wikimedia.org/dashboard/db/graphite-eqiad?orgId=1&panelId=21&fullscreen' 'https://grafana.wikimedia.org/dashboard/db/graphite-codfw?orgId=1&panelId=21&fullscreen''
[1513875984.379793] [2048.2] [pid=1194] Not currently in macro. Running output (179): ''https://grafana.wikimedia.org/dashboard/db/graphite-eqiad?orgId=1&panelId=21&fullscreen' 'https://grafana.wikimedia.org/dashboard/db/graphite-codfw?orgId=1&panelId=21&fullscreen''
[1513875984.379797] [2048.1] [pid=1194] Done. Final output: ''https://grafana.wikimedia.org/dashboard/db/graphite-eqiad?orgId=1&panelId=21&fullscreen' 'https://grafana.wikimedia.org/dashboard/db/graphite-codfw?orgId=1&panelId=21&fullscreen''
[1513875984.379802] [2048.1] [pid=1194] **** END MACRO PROCESSING *************
[1513875984.379806] [2048.2] [pid=1194] Processed 'SERVICENOTESURL', Clean Options: 4, Free: 1, Value: ''https://grafana.wikimedia.org/dashboard/db/graphite-eqiad?orgId=1&panelId=21&fullscreen' 'https://grafana.wikimedia.org/dashboard/db/graphite-codfw?orgId=1&panelId=21&fullscreen''
[1513875984.379810] [2048.2] [pid=1194] Processed 'SERVICENOTESURL', Clean Options: 4, Free: 1
[1513875984.379814] [2048.2] [pid=1194] Cleaning options: global=3, local=4, effective=7
[1513875984.379822] [2048.2] [pid=1194] Cleaned macro. Running output (288): 'echo "PROBLEM - carbon-frontend-relay metric drops on graphite1001 is CRITICAL: TEST - IGNORE - volans %27https://grafana.wikimedia.org/dashboard/db/graphite-eqiad?orgId=1panelId=21fullscreen%27+%27https://grafana.wikimedia.org/dashboard/db/graphite-codfw?orgId=1panelId=21fullscreen%27'
[1513875984.379827] [2048.2] [pid=1194] Just finished macro. Running output (288): 'echo "PROBLEM - carbon-frontend-relay metric drops on graphite1001 is CRITICAL: TEST - IGNORE - volans %27https://grafana.wikimedia.org/dashboard/db/graphite-eqiad?orgId=1panelId=21fullscreen%27+%27https://grafana.wikimedia.org/dashboard/db/graphite-codfw?orgId=1panelId=21fullscreen%27'
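The two traces differ only in the final "Running output" length: 292 on tegmen versus 288 on einsteinium, which is exactly the four ampersands. A minimal Python sketch of that effect follows. Note that `strip_illegal_chars` is a hypothetical stand-in for Icinga's macro cleaning, not its actual code, and the set of characters passed to it is an assumption.

```python
def strip_illegal_chars(value, illegal_chars):
    """Drop every character listed in illegal_chars from value."""
    return "".join(ch for ch in value if ch not in illegal_chars)


# The SERVICENOTESURL value as it appears in the debug logs (179 characters).
notes_url = (
    "'https://grafana.wikimedia.org/dashboard/db/graphite-eqiad?orgId=1&panelId=21&fullscreen' "
    "'https://grafana.wikimedia.org/dashboard/db/graphite-codfw?orgId=1&panelId=21&fullscreen'"
)

# Assumption for this sketch: only '&' is treated as illegal.
cleaned = strip_illegal_chars(notes_url, "&")
print(len(notes_url) - len(cleaned))  # 4, matching 292 - 288 in the logs
```

The cleaned value contains `orgId=1panelId=21fullscreen`, the same broken query string seen in the einsteinium output.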
Change 403369 had a related patch set uploaded (by Volans; owner: Volans):
[operations/puppet@production] Temporary failover Icinga to tegmen
Change 403370 had a related patch set uploaded (by Volans; owner: Volans):
[operations/dns@master] Temporary failover Icinga to tegmen
Change 403369 merged by Volans:
[operations/puppet@production] Temporary failover Icinga to tegmen
Mentioned in SAL (#wikimedia-operations) [2018-01-10T11:07:57Z] <volans> start failovering of Icinga to tegmen - T170353
Change 403370 merged by Volans:
[operations/dns@master] Temporary failover Icinga to tegmen
Mentioned in SAL (#wikimedia-operations) [2018-01-10T11:19:31Z] <volans> Icinga failover to tegmen completed - T170353
Confirmed that it works fine on tegmen after failing over the active Icinga server to it.
The links are properly rendered and the ampersands are not dropped, as opposed to what happens on einsteinium.
Wed 12:23:28 icinga-wm| PROBLEM - carbon-frontend-relay metric drops on graphite1001 is CRITICAL: TEST - IGNORE - T170353 - volans https://grafana.wikimedia.org/dashboard/db/graphite-eqiad?orgId=1&panelId=21&fullscreen https://grafana.wikimedia.org/dashboard/db/graphite-codfw?orgId=1&panelId=21&fullscreen
Mentioned in SAL (#wikimedia-operations) [2018-01-11T12:28:21Z] <volans> Start Icinga failover back to einsteinium - T170353
Mentioned in SAL (#wikimedia-operations) [2018-01-11T12:39:27Z] <volans> Icinga failover back to einsteinium completed - T170353
TL;DR: Everything is back on einsteinium now, and everything is working. Resolving.
The only possible explanation I have right now is PEBCAK on my side: during my previous tests I might have only reloaded Icinga instead of restarting it, although I remember restarting it :( and @akosiaris telling me to do so.
But this would mean that our "refresh" via Puppet when changing the configuration is not doing the right thing, given that the configuration was changed via Puppet and apparently was not applied. A reload might be OK for service/host changes, but for deeper configuration changes a full restart seems to be needed. I'll file a task for it.
I've done a quick test of re-adding the & to the illegal_macro_output_chars setting and restarting Icinga, and the & got removed from the URL, as expected.
The only positive thing is that the failover happened at the right time, given that we needed to reboot both hosts anyway.
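Since the root cause came down to whether & was listed in illegal_macro_output_chars, a quick check of the configuration text can rule that out before debugging further. The following Python sketch is only an illustration: the helper names are hypothetical, the parsing is deliberately simplistic (plain `key=value` lines, last definition wins), and it inspects the file content rather than what the running Icinga process has actually loaded, which is exactly the gap a missed restart would leave.

```python
def illegal_macro_output_chars(config_text):
    """Return the value of illegal_macro_output_chars, or None if it is unset."""
    value = None
    for line in config_text.splitlines():
        line = line.strip()
        # Skip comments and anything that is not a key=value line.
        if line.startswith("#") or "=" not in line:
            continue
        key, _, raw = line.partition("=")
        if key.strip() == "illegal_macro_output_chars":
            value = raw  # assumption: the last definition wins
    return value


def strips_ampersand(config_text):
    """True if the configuration would strip '&' from macro output."""
    chars = illegal_macro_output_chars(config_text)
    return chars is not None and "&" in chars
```

Feeding it the content of the main Icinga configuration file from both hosts would show whether the on-disk settings really match; if they do and the behaviour still differs, a process that was reloaded but not restarted becomes the likely suspect.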