
Icinga: timeseries checks should have the link to a graph with the data
Closed, Resolved · Public

Description

Whenever an Icinga check is generated from a threshold on a Grafana/Graphite/Prometheus(?) graph or timeseries, a graph of the same monitored metric must be saved in Grafana, and the direct link to that specific graph should be included in such a way that ircecho also prints it on IRC. The description of the check seems a good candidate to hold the link.
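The requirement above could be enforced at check-definition time. What follows is only a minimal Python sketch of the idea, not the Puppet code used in the patches below: the function names `validate_dashboard_links` and `check_description` are hypothetical, and the only detail taken from this task is the Grafana host name that appears in the URLs further down.

```python
from urllib.parse import urlparse

# Host taken from the dashboard URLs quoted in this task.
GRAFANA_HOST = "grafana.wikimedia.org"


def validate_dashboard_links(links):
    """Raise ValueError unless links is a non-empty list of https Grafana URLs.

    Hypothetical helper: it only illustrates what "mandatory link" could mean.
    """
    if not links:
        raise ValueError("at least one dashboard link is required")
    for link in links:
        parsed = urlparse(link)
        if parsed.scheme != "https" or parsed.netloc != GRAFANA_HOST:
            raise ValueError(f"not a Grafana dashboard link: {link!r}")
    return links


def check_description(description, links):
    """Append the validated links to the description so they are printed with it."""
    return " ".join([description, *validate_dashboard_links(links)])
```

A check defined without any link, or with a link to another host, would fail fast instead of producing an alert that carries no graph.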

Event Timeline

Volans created this task. Jul 11 2017, 9:59 PM
Restricted Application added a subscriber: Aklapper. Jul 11 2017, 9:59 PM
greg awarded a token. Jul 11 2017, 10:17 PM
greg added a subscriber: greg.
faidon triaged this task as Normal priority. Jul 20 2017, 1:17 PM
faidon assigned this task to Volans. Jul 24 2017, 3:26 PM
faidon moved this task from Backlog to In progress on the observability board.
Volans added a comment. Aug 3 2017, 1:56 PM

I've started working on this. I had hoped to finish it today, but the list of checks is long. I will complete it when I'm back.

elukey added a subscriber: elukey. Aug 17 2017, 8:58 AM
faidon moved this task from In progress to Up next on the observability board. Sep 6 2017, 3:02 PM
faidon moved this task from Up next to Backlog on the observability board. Sep 25 2017, 4:43 PM
faidon moved this task from Backlog to Up next on the observability board. Oct 2 2017, 3:38 PM
Volans moved this task from Up next to In progress on the observability board. Oct 30 2017, 2:56 PM

Change 391235 had a related patch set uploaded (by Volans; owner: Volans):
[operations/puppet@production] Icinga: allow to set display_name

https://gerrit.wikimedia.org/r/391235

Change 391236 had a related patch set uploaded (by Volans; owner: Volans):
[operations/puppet@production] Metric alarms: add link to the Grafana dashboard

https://gerrit.wikimedia.org/r/391236

Change 391237 had a related patch set uploaded (by Volans; owner: Volans):
[operations/puppet@production] Icinga notification: use display_name in messages

https://gerrit.wikimedia.org/r/391237

Change 391238 had a related patch set uploaded (by Volans; owner: Volans):
[operations/puppet@production] Metric alarms: make link to Grafana mandatory

https://gerrit.wikimedia.org/r/391238

Change 391525 had a related patch set uploaded (by Volans; owner: Volans):
[operations/puppet@production] Grafana: add graph to Swift dashboard

https://gerrit.wikimedia.org/r/391525

Change 391235 merged by Volans:
[operations/puppet@production] Icinga: allow to set display_name

https://gerrit.wikimedia.org/r/391235

Change 391525 merged by Volans:
[operations/puppet@production] Grafana: add graph to Swift dashboard

https://gerrit.wikimedia.org/r/391525

Change 391236 merged by Volans:
[operations/puppet@production] Metric alarms: add link to the Grafana dashboard

https://gerrit.wikimedia.org/r/391236

Change 392405 had a related patch set uploaded (by Volans; owner: Volans):
[operations/puppet@production] graphite: fix typo

https://gerrit.wikimedia.org/r/392405

Change 392405 merged by Volans:
[operations/puppet@production] graphite: fix typo

https://gerrit.wikimedia.org/r/392405

Change 392408 had a related patch set uploaded (by Volans; owner: Volans):
[operations/puppet@production] icinga: convert display_name in notes_url

https://gerrit.wikimedia.org/r/392408

Change 392409 had a related patch set uploaded (by Volans; owner: Volans):
[operations/puppet@production] icinga: remove display_name

https://gerrit.wikimedia.org/r/392409

Change 392408 merged by Volans:
[operations/puppet@production] icinga: convert display_name in notes_url

https://gerrit.wikimedia.org/r/392408

Change 392409 merged by Volans:
[operations/puppet@production] icinga: remove display_name

https://gerrit.wikimedia.org/r/392409

Change 392606 had a related patch set uploaded (by Volans; owner: Volans):
[operations/puppet@production] Icinga web: add icons for multiple notes_url items

https://gerrit.wikimedia.org/r/392606

Change 392607 had a related patch set uploaded (by Volans; owner: Volans):
[operations/puppet@production] Metric alarms: convert dashboad_link to array

https://gerrit.wikimedia.org/r/392607

Change 391237 merged by Volans:
[operations/puppet@production] Icinga notification: use notes_url in messages

https://gerrit.wikimedia.org/r/391237

Change 391238 merged by Volans:
[operations/puppet@production] Metric alarms: make link to Grafana mandatory

https://gerrit.wikimedia.org/r/391238

Change 392630 had a related patch set uploaded (by Volans; owner: Volans):
[operations/puppet@production] Icinga notes_url: do not pre-encode the URLs

https://gerrit.wikimedia.org/r/392630

Change 392630 merged by Volans:
[operations/puppet@production] Icinga notes_url: do not pre-encode the URLs

https://gerrit.wikimedia.org/r/392630

Change 392631 had a related patch set uploaded (by Volans; owner: Volans):
[operations/puppet@production] Metric alarms: fix dashboard link validation

https://gerrit.wikimedia.org/r/392631

Change 392631 merged by Volans:
[operations/puppet@production] Metric alarms: fix dashboard link validation

https://gerrit.wikimedia.org/r/392631

Change 392606 merged by Volans:
[operations/puppet@production] Icinga web: add icons for multiple notes_url items

https://gerrit.wikimedia.org/r/392606

Change 392607 merged by Volans:
[operations/puppet@production] Metric alarms: convert dashboad_link to array

https://gerrit.wikimedia.org/r/392607

Change 397759 had a related patch set uploaded (by Volans; owner: Volans):
[operations/puppet@production] Varnish instance: fix child restarted check

https://gerrit.wikimedia.org/r/397759

Change 397759 merged by Volans:
[operations/puppet@production] Varnish instance: fix child restarted check

https://gerrit.wikimedia.org/r/397759

Change 397796 had a related patch set uploaded (by Volans; owner: Volans):
[operations/puppet@production] Icinga: use the env variable instead of the macro

https://gerrit.wikimedia.org/r/397796

Change 397796 merged by Volans:
[operations/puppet@production] Icinga: use the env variable instead of the macro

https://gerrit.wikimedia.org/r/397796

Change 398037 had a related patch set uploaded (by Volans; owner: Volans):
[operations/puppet@production] Icinga: fine tune settings for dashboard links

https://gerrit.wikimedia.org/r/398037

Change 398037 merged by Volans:
[operations/puppet@production] Icinga: fine tune settings for dashboard links

https://gerrit.wikimedia.org/r/398037

Change 398054 had a related patch set uploaded (by Volans; owner: Volans):
[operations/puppet@production] Icinga: escape $ not in macro

https://gerrit.wikimedia.org/r/398054

Change 398054 merged by Volans:
[operations/puppet@production] Icinga: escape $ not in macro

https://gerrit.wikimedia.org/r/398054

Mentioned in SAL (#wikimedia-operations) [2017-12-21T17:01:19Z] <volans> debugging Icinga notes_url (no side effect expected but logging it in case there will be) T170353

To summarize the current status, everything is deployed and works as expected except for one small detail: the ampersands are removed from the dashboard URLs, making them mostly useless :/

After quite some debugging and code reading, this is my current understanding:

  • on tegmen I cannot reproduce this behaviour: the URLs are parsed and printed correctly, even with the same configuration as einsteinium.
  • from the debug logging I don't see any difference between the two hosts, but the result is different (see below)
  • the Icinga version is the same on both hosts
  • from the code and the configuration, tegmen's behaviour seems to be the correct one, but I'm not 100% sure; I need to dig a bit more.
  • my suggested next test is to make tegmen active and see whether the roles get inverted.

Here is the debug output of the processed macro on tegmen (where the ampersands are not removed):

[1513878369.402851] [2048.2] [pid=2339]   Processing part: 'SERVICENOTESURL'
[1513878369.402853] [2048.2] [pid=2339]   macros[78] (SERVICENOTESURL) match.
[1513878369.402855] [2048.2] [pid=2339]   New clean options: 4
[1513878369.402857] [2048.1] [pid=2339] **** BEGIN MACRO PROCESSING ***********
[1513878369.402859] [2048.1] [pid=2339] Processing: ''https://grafana.wikimedia.org/dashboard/db/graphite-eqiad?orgId=1&panelId=21&fullscreen' 'https://grafana.wikimedia.org/dashboard/db/graphite-codfw?orgId=1&panelId=21&fullscreen''
[1513878369.402861] [2048.2] [pid=2339]   Processing part: ''https://grafana.wikimedia.org/dashboard/db/graphite-eqiad?orgId=1&panelId=21&fullscreen' 'https://grafana.wikimedia.org/dashboard/db/graphite-codfw?orgId=1&panelId=21&fullscreen''
[1513878369.402863] [2048.2] [pid=2339]   Not currently in macro.  Running output (179): ''https://grafana.wikimedia.org/dashboard/db/graphite-eqiad?orgId=1&panelId=21&fullscreen' 'https://grafana.wikimedia.org/dashboard/db/graphite-codfw?orgId=1&panelId=21&fullscreen''
[1513878369.402865] [2048.1] [pid=2339]   Done.  Final output: ''https://grafana.wikimedia.org/dashboard/db/graphite-eqiad?orgId=1&panelId=21&fullscreen' 'https://grafana.wikimedia.org/dashboard/db/graphite-codfw?orgId=1&panelId=21&fullscreen''
[1513878369.402867] [2048.1] [pid=2339] **** END MACRO PROCESSING *************
[1513878369.402869] [2048.2] [pid=2339]   Processed 'SERVICENOTESURL', Clean Options: 4, Free: 1, Value: ''https://grafana.wikimedia.org/dashboard/db/graphite-eqiad?orgId=1&panelId=21&fullscreen' 'https://grafana.wikimedia.org/dashboard/db/graphite-codfw?orgId=1&panelId=21&fullscreen''
[1513878369.402871] [2048.2] [pid=2339]   Processed 'SERVICENOTESURL', Clean Options: 4, Free: 1
[1513878369.402873] [2048.2] [pid=2339]   Cleaning options: global=3, local=4, effective=7
[1513878369.402878] [2048.2] [pid=2339]   Cleaned macro.  Running output (292): 'echo "PROBLEM - carbon-frontend-relay metric drops on graphite1001 is CRITICAL: TEST - IGNORE - volans   %27https://grafana.wikimedia.org/dashboard/db/graphite-eqiad?orgId=1&panelId=21&fullscreen%27+%27https://grafana.wikimedia.org/dashboard/db/graphite-codfw?orgId=1&panelId=21&fullscreen%27'
[1513878369.402880] [2048.2] [pid=2339]   Just finished macro.  Running output (292): 'echo "PROBLEM - carbon-frontend-relay metric drops on graphite1001 is CRITICAL: TEST - IGNORE - volans   %27https://grafana.wikimedia.org/dashboard/db/graphite-eqiad?orgId=1&panelId=21&fullscreen%27+%27https://grafana.wikimedia.org/dashboard/db/graphite-codfw?orgId=1&panelId=21&fullscreen%27'

And here is the same output from einsteinium, where the 4 ampersands are removed from the URLs despite the same cleaning options:

[1513875984.379768] [2048.2] [pid=1194]   Processing part: 'SERVICENOTESURL'
[1513875984.379772] [2048.2] [pid=1194]   macros[78] (SERVICENOTESURL) match.
[1513875984.379776] [2048.2] [pid=1194]   New clean options: 4
[1513875984.379780] [2048.1] [pid=1194] **** BEGIN MACRO PROCESSING ***********
[1513875984.379784] [2048.1] [pid=1194] Processing: ''https://grafana.wikimedia.org/dashboard/db/graphite-eqiad?orgId=1&panelId=21&fullscreen' 'https://grafana.wikimedia.org/dashboard/db/graphite-codfw?orgId=1&panelId=21&fullscreen''
[1513875984.379789] [2048.2] [pid=1194]   Processing part: ''https://grafana.wikimedia.org/dashboard/db/graphite-eqiad?orgId=1&panelId=21&fullscreen' 'https://grafana.wikimedia.org/dashboard/db/graphite-codfw?orgId=1&panelId=21&fullscreen''
[1513875984.379793] [2048.2] [pid=1194]   Not currently in macro.  Running output (179): ''https://grafana.wikimedia.org/dashboard/db/graphite-eqiad?orgId=1&panelId=21&fullscreen' 'https://grafana.wikimedia.org/dashboard/db/graphite-codfw?orgId=1&panelId=21&fullscreen''
[1513875984.379797] [2048.1] [pid=1194]   Done.  Final output: ''https://grafana.wikimedia.org/dashboard/db/graphite-eqiad?orgId=1&panelId=21&fullscreen' 'https://grafana.wikimedia.org/dashboard/db/graphite-codfw?orgId=1&panelId=21&fullscreen''
[1513875984.379802] [2048.1] [pid=1194] **** END MACRO PROCESSING *************
[1513875984.379806] [2048.2] [pid=1194]   Processed 'SERVICENOTESURL', Clean Options: 4, Free: 1, Value: ''https://grafana.wikimedia.org/dashboard/db/graphite-eqiad?orgId=1&panelId=21&fullscreen' 'https://grafana.wikimedia.org/dashboard/db/graphite-codfw?orgId=1&panelId=21&fullscreen''
[1513875984.379810] [2048.2] [pid=1194]   Processed 'SERVICENOTESURL', Clean Options: 4, Free: 1
[1513875984.379814] [2048.2] [pid=1194]   Cleaning options: global=3, local=4, effective=7
[1513875984.379822] [2048.2] [pid=1194]   Cleaned macro.  Running output (288): 'echo "PROBLEM - carbon-frontend-relay metric drops on graphite1001 is CRITICAL: TEST - IGNORE - volans   %27https://grafana.wikimedia.org/dashboard/db/graphite-eqiad?orgId=1panelId=21fullscreen%27+%27https://grafana.wikimedia.org/dashboard/db/graphite-codfw?orgId=1panelId=21fullscreen%27'
[1513875984.379827] [2048.2] [pid=1194]   Just finished macro.  Running output (288): 'echo "PROBLEM - carbon-frontend-relay metric drops on graphite1001 is CRITICAL: TEST - IGNORE - volans   %27https://grafana.wikimedia.org/dashboard/db/graphite-eqiad?orgId=1panelId=21fullscreen%27+%27https://grafana.wikimedia.org/dashboard/db/graphite-codfw?orgId=1panelId=21fullscreen%27'
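The difference between the two logs is consistent with a plain character strip: the running output is 292 characters on tegmen and 288 on einsteinium, and the two URLs contain four ampersands in total. The Python sketch below only mimics that effect on the macro value copied from the logs. It is not Icinga's implementation, and the helper name `strip_illegal_chars` is made up for illustration.

```python
# Illustrative only: mimics the effect of stripping "illegal" characters
# from a macro value. This is not Icinga's actual code.


def strip_illegal_chars(value: str, illegal_chars: str) -> str:
    """Return value with every character listed in illegal_chars removed."""
    return "".join(ch for ch in value if ch not in illegal_chars)


# SERVICENOTESURL value as it appears in both debug logs above.
NOTES_URL = (
    "'https://grafana.wikimedia.org/dashboard/db/graphite-eqiad"
    "?orgId=1&panelId=21&fullscreen' "
    "'https://grafana.wikimedia.org/dashboard/db/graphite-codfw"
    "?orgId=1&panelId=21&fullscreen'"
)

stripped = strip_illegal_chars(NOTES_URL, "&")

# Number of characters lost when "&" is treated as illegal.
removed = len(NOTES_URL) - len(stripped)
print(removed)
print(stripped)
```

With `&` in the set of stripped characters, the query strings collapse to `orgId=1panelId=21fullscreen`, which is what einsteinium printed.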

Change 403369 had a related patch set uploaded (by Volans; owner: Volans):
[operations/puppet@production] Temporary failover Icinga to tegmen

https://gerrit.wikimedia.org/r/403369

Change 403370 had a related patch set uploaded (by Volans; owner: Volans):
[operations/dns@master] Temporary failover Icinga to tegmen

https://gerrit.wikimedia.org/r/403370

Change 403369 merged by Volans:
[operations/puppet@production] Temporary failover Icinga to tegmen

https://gerrit.wikimedia.org/r/403369

Mentioned in SAL (#wikimedia-operations) [2018-01-10T11:07:57Z] <volans> start failovering of Icinga to tegmen - T170353

Change 403370 merged by Volans:
[operations/dns@master] Temporary failover Icinga to tegmen

https://gerrit.wikimedia.org/r/403370

Mentioned in SAL (#wikimedia-operations) [2018-01-10T11:19:31Z] <volans> Icinga failover to tegmen completed - T170353

Confirmed that it works fine on tegmen after failing over the active Icinga server to it.
The links are properly rendered and the ampersands are not dropped, as opposed to what happens on einsteinium.

Wed 12:23:28   icinga-wm| PROBLEM - carbon-frontend-relay metric drops on graphite1001 is CRITICAL: TEST - IGNORE - T170353
         - volans https://grafana.wikimedia.org/dashboard/db/graphite-eqiad?orgId=1&panelId=21&fullscreen
         https://grafana.wikimedia.org/dashboard/db/graphite-codfw?orgId=1&panelId=21&fullscreen

Mentioned in SAL (#wikimedia-operations) [2018-01-11T12:28:21Z] <volans> Start Icinga failover back to einsteinium - T170353

Mentioned in SAL (#wikimedia-operations) [2018-01-11T12:39:27Z] <volans> Icinga failover back to einsteinium completed - T170353

Volans closed this task as Resolved. Jan 11 2018, 12:50 PM
Volans removed a project: Patch-For-Review.

TL;DR: Everything is back to einsteinium now, and everything is working. Resolving.

The only possible explanation I have right now is PEBCAK on my side: during my previous tests I might have only reloaded Icinga instead of restarting it, although I remember restarting it :( and @akosiaris telling me to do so.
But this would mean that our "refresh" via Puppet on configuration changes is not doing the right thing, given that the configuration was changed via Puppet and apparently was not applied. A reload might be fine for service/host changes, but deeper configuration changes seem to need a full restart. I'll file a task for it.
As a quick test, I re-added the & to the illegal_macro_output_chars setting and restarted Icinga, and the ampersands were removed from the URLs, as expected.
The only positive thing is that the failover happened at the right time, since we needed to reboot both hosts anyway.
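For anyone hitting the same symptom, a quick way to rule out the configuration file itself is to check whether `&` is listed in `illegal_macro_output_chars`. The Python sketch below is a minimal example under stated assumptions: the function names and the two sample values are hypothetical, and the "last occurrence wins" rule is a simplification made here, not documented Icinga behaviour.

```python
def illegal_macro_output_chars(config_text: str) -> str:
    """Return the value of illegal_macro_output_chars, or "" if it is not set."""
    value = ""
    for line in config_text.splitlines():
        line = line.strip()
        if line.startswith("#") or "=" not in line:
            continue  # skip comments and lines that are not key=value
        key, _, val = line.partition("=")
        if key.strip() == "illegal_macro_output_chars":
            value = val  # assumption of this sketch: last occurrence wins
    return value


def strips_ampersand(config_text: str) -> bool:
    """True if '&' is among the characters that would be stripped."""
    return "&" in illegal_macro_output_chars(config_text)


# Hypothetical sample values, not copied from the production configuration.
SAMPLE_WITH = "illegal_macro_output_chars=`~$&|'\"<>\n"
SAMPLE_WITHOUT = "illegal_macro_output_chars=`~$|'\"<>\n"

print(strips_ampersand(SAMPLE_WITH))
print(strips_ampersand(SAMPLE_WITHOUT))
```

This only inspects the file on disk. As the comment above suggests, the running daemon may still hold the old value until it is fully restarted, so a clean result here does not prove that the setting is in effect.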