metricsinfra: Build out default alert rules
Open, Medium, Public

Description

The default rules currently alert only on instance down and Puppet failure. I imagine we'll want to add other defaults as well, to detect general malfunctions on the instances:

  • disk space
  • mail queue length
  • NTP time sync status
  • systemd unit monitoring? at least for some critical services
  • CPU / RAM usage?
  • network usage?
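
To make the list above easier to discuss, here is a rough sketch of how such defaults could be expressed as a table of rules and checked against one instance's metrics. It is only an illustration: the rule names, metric keys (`disk_free_ratio`, `mail_queue_length`, `ntp_synced`, `failed_critical_units`), and thresholds are all hypothetical placeholders, not anything metricsinfra defines today. The items still marked with a question mark (CPU / RAM, network) are left out until we decide whether we want them.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class DefaultRule:
    """One default alert: a name, the metric it reads, and a firing condition."""
    name: str
    metric: str                      # hypothetical metric key, not a real exporter name
    fires: Callable[[float], bool]   # returns True when the alert should fire


# Thresholds below are placeholders for discussion, not agreed values.
DEFAULT_RULES: List[DefaultRule] = [
    DefaultRule("DiskSpaceLow", "disk_free_ratio", lambda v: v < 0.10),
    DefaultRule("MailQueueLong", "mail_queue_length", lambda v: v > 100),
    DefaultRule("NtpNotSynced", "ntp_synced", lambda v: v == 0),
    DefaultRule("SystemdUnitFailed", "failed_critical_units", lambda v: v > 0),
]


def firing_alerts(metrics: Dict[str, float]) -> List[str]:
    """Return the names of default rules that fire for one instance.

    Rules whose metric is missing from `metrics` are skipped rather than
    treated as firing, so an instance without a mail queue is not flagged.
    """
    firing = []
    for rule in DEFAULT_RULES:
        value = metrics.get(rule.metric)
        if value is not None and rule.fires(value):
            firing.append(rule.name)
    return firing


if __name__ == "__main__":
    sample = {
        "disk_free_ratio": 0.05,
        "mail_queue_length": 3,
        "ntp_synced": 1,
        "failed_critical_units": 2,
    }
    print(firing_alerts(sample))
```

With the sample values, only the disk space and systemd rules fire. Skipping missing metrics is a deliberate choice in this sketch; whether an absent metric should itself raise an alert is a separate question for the real rules.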