Improve AlertManager dashboard
Open, Needs Triage, Public

Description

A bit related to T273716.

Maybe it's because I'm too used to the Icinga dashboard:


But I find the AlertManager dashboard cluttered to the point that I can hardly use it to see what is going on in our infra:

Some changes that *might* make it better (even though I'm not a UX designer):

  • Offer a table view layout instead of a grid, similar to Icinga
  • Don't have boxes inside boxes inside boxes. For example, "alertname: Icinga/DPKG" is in a colored box, itself in the tile header. With each box being a different color, it distracts the eye from the relevant information
  • Remove duplicate information: for example, when sorted by severity, the top container is "severity: critical", but then each alert has "severity: critical" in it, wasting precious real estate
  • Remove unnecessary information: for example,
    • "severity: critical" should just be "critical"; the color and the word "critical" are enough to show that it's about severity. Showing "severity" in a tooltip on hover would be nice too
    • "alertname: Icinga/DPKG" doesn't need the "alertname: Icinga/" part. "alertname" is obvious from the location of the text (like writing "title:" on a book cover), and "Icinga" should be a tag; at this level it shouldn't matter which software the alert is coming from
    • The number of impacted hosts (e.g. "1" in the red circle) should not be displayed if there is only 1 (or fewer than 3) impacted host
  • The overview panel doesn't show everything but truncates it with "+7 more"
    • It also has redundant information: all items in the "alertname" category start with "alertname:"
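
To make the label suggestions above concrete, here is a minimal sketch of the proposed display rules. It is not how the dashboard is implemented: `DisplayLabel`, `VALUE_ONLY_LABELS`, `display_label`, `host_count_badge` and the threshold of 3 are all hypothetical names and values chosen for illustration.

```python
from typing import NamedTuple, Optional


class DisplayLabel(NamedTuple):
    text: str            # what would be shown in the tile
    tooltip: str         # label name, shown only on hover
    tag: Optional[str]   # originating system, e.g. "Icinga"


# Hypothetical setting: labels whose name is already implied by position or color.
VALUE_ONLY_LABELS = {"severity", "alertname"}


def display_label(name: str, value: str) -> DisplayLabel:
    """Apply the proposed rules to one label of an alert."""
    tag = None
    # Assumption: the originating system is the part of the alert name before "/".
    if name == "alertname" and "/" in value:
        tag, value = value.split("/", 1)
    text = value if name in VALUE_ONLY_LABELS else f"{name}: {value}"
    return DisplayLabel(text=text, tooltip=name, tag=tag)


def host_count_badge(count: int, threshold: int = 3) -> Optional[str]:
    """Return the badge text, or None when fewer than `threshold` hosts are impacted."""
    return str(count) if count >= threshold else None
```

With these rules, `display_label("alertname", "Icinga/DPKG")` would show "DPKG" with "Icinga" as a tag, and `host_count_badge(1)` would show no badge at all.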

Event Timeline

I agree with pretty much all the suggestions. Most of them even overlap with some suggestions I had made back in October via email when a preview of it was available ;)

Thank you for the feedback! I'll reply inline, adding that we're certainly lacking training/documentation on how to use the dashboard, and I've noticed that I too am very used to Icinga's UI (for better or worse). The other thing to add is that clicking on the top-left number will show you a pop-up with a summary of alerts (which can be filtered down by clicking labels)

  • Offer a table view layout instead of a grid, similar to Icinga

This is possible by setting the "minimum alert width" option to 800px (the maximum, though I suspect it could be bumped by upstream, and defaults to 800px already). Agreed that the result isn't much of a table with aligned columns though.

  • Don't have boxes inside boxes inside boxes. For example, "alertname: Icinga/DPKG" is in a colored box, itself in the tile header. With each box being a different color, it distracts the eye from the relevant information

I see what you are saying, though the boxes are either labels that can be clicked (in the alertname case) or alert groups (the encasing box for a group of alerts). I disagree about the colored box though: it is meant to draw the eye, e.g. to the alert's name

  • Remove duplicate information: for example, when sorted by severity, the top container is "severity: critical", but then each alert has "severity: critical" in it, wasting precious real estate

I agree, will raise it with upstream

  • Remove unnecessary information: for example,
    • "severity: critical" should just be "critical"; the color and the word "critical" are enough to show that it's about severity. Showing "severity" in a tooltip on hover would be nice too

I like the general idea of being able to hide label names in some cases. I don't think it is possible ATM, but I will ask upstream if they are interested in such a feature

  • "alertname: Icinga/DPKG" doesn't need the "alertname: Icinga/" part. "alertname" is obvious from the location of the text (like writing "title:" on a book cover), and "Icinga" should be a tag; at this level it shouldn't matter which software the alert is coming from

I'm not opposed to moving "icinga" to a tag, though I think for the transition phase it is important we recognize icinga alerts at a glance (e.g. in notifications too). Of course ideally we'd have no icinga alerts :)

  • The number of impacted hosts (e.g. "1" in the red circle) should not be displayed if there is only 1 (or fewer than 3) impacted host

I disagree, there's value in having the information always in the same place

  • The overview panel doesn't show everything but truncates it with "+7 more"

Yes this is on purpose to avoid showing too many alerts in a group

  • It also has redundant information: all items in the "alertname" category start with "alertname:"

Agreed. I think this is the same feature (hiding the label name in some cases) as mentioned above

Change 698507 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] alertmanager: update dashboard minimum group width to 2048

https://gerrit.wikimedia.org/r/698507

  • Offer a table view layout instead of a grid, similar to Icinga

This is possible by setting the "minimum alert width" option to 800px (the maximum, though I suspect it could be bumped by upstream, and defaults to 800px already). Agreed that the result isn't much of a table with aligned columns though.

Bumped in https://gerrit.wikimedia.org/r/c/operations/puppet/+/698507

  • Remove duplicate information: for example, when sorted by severity, the top container is "severity: critical", but then each alert has "severity: critical" in it, wasting precious real estate

I agree, will raise it with upstream

Filed as https://github.com/prymitive/karma/issues/3222

  • Remove unnecessary information: for example,
    • "severity: critical" should just be "critical"; the color and the word "critical" are enough to show that it's about severity. Showing "severity" in a tooltip on hover would be nice too

I like the general idea of being able to hide label names in some cases. I don't think it is possible ATM, but I will ask upstream if they are interested in such a feature

Filed as https://github.com/prymitive/karma/issues/3221

Change 698507 merged by Filippo Giunchedi:

[operations/puppet@production] alertmanager: update dashboard minimum group width to 2048

https://gerrit.wikimedia.org/r/698507