
Improve access to and control over incident and metrics monitoring infrastructure
Closed, DeclinedPublic

Description

We want to revamp what our monitoring infrastructure presents to our users. To do this we will:

  • Evaluate alternative web interfaces to the Icinga 1 core
  • Migrate to a web interface with improved access control over service checks and their state
  • Implement email & paging alerts for ALL service owners
  • Consolidate Graphite metrics monitoring frontends into Grafana

Event Timeline

akosiaris raised the priority of this task to Medium.
akosiaris updated the task description.
akosiaris subscribed.

Not sure if this is in scope here, but as part of T103124, we had hoped to separate Icinga notifications for the RESTBase staging/test environment. Since those hosts are not production, ideally the Services team would receive low-priority notifications of failures without spamming the admins group.

fgiunchedi subscribed.

Declining as these points are covered by the alerting roadmap. Feel free to reopen if needed!