
Comprehensive monitoring / alerting for labstore* instances
Closed, Resolved, Public

Description

Should have alerts for any issues with labstore*:

  • High load
  • Network saturation
  • IO Saturation
  • NFS daemon running properly
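
As a rough illustration of the first item, a high-load check could compare the 1-minute load average against per-core thresholds and report a Nagios-style status code. This is only a minimal sketch, not the check that was deployed for this task; the names `classify_load` and `current_status` and the threshold values are assumptions made for the example.

```python
import os

# Nagios-style plugin status codes.
OK, WARNING, CRITICAL = 0, 1, 2


def classify_load(load1, cores, warn_per_core=1.0, crit_per_core=2.0):
    """Map a 1-minute load average to a status code.

    The per-core thresholds are hypothetical defaults, not values
    taken from any real labstore configuration.
    """
    if cores < 1:
        raise ValueError("cores must be >= 1")
    per_core = load1 / cores
    if per_core >= crit_per_core:
        return CRITICAL
    if per_core >= warn_per_core:
        return WARNING
    return OK


def current_status():
    """Classify the load of the local host (Unix only)."""
    load1, _, _ = os.getloadavg()
    cores = os.cpu_count() or 1
    return classify_load(load1, cores)
```

Calling `current_status()` on the host and using the result as the process exit code would make this usable as a simple active check; the same `classify_load` logic could also be applied to load values read from a metrics store.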

Event Timeline

yuvipanda raised the priority of this task to Needs Triage.
yuvipanda updated the task description.
yuvipanda added a project: Cloud-Services.
yuvipanda added subscribers: yuvipanda, coren, Andrew.

Change 201591 had a related patch set uploaded (by Yuvipanda):
labs: Add monitoring for high iowait on labstore instances

https://gerrit.wikimedia.org/r/201591

We already have checks for network saturation as well.

Change 201591 merged by Yuvipanda:
labs: Add monitoring for high iowait on labstore instances

https://gerrit.wikimedia.org/r/201591

Change 201618 had a related patch set uploaded (by Yuvipanda):
labs: Alert on high load in labstore*

https://gerrit.wikimedia.org/r/201618

There are already network saturation alerts.

We should probably make these paging as well, though.

Also, I'm wondering whether these should *all* be graphite alerts, or whether we should have active alerts as well. Hmm.
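
One way to frame that trade-off: a metric-based alert only sees the datapoints that were actually recorded, so it has to decide what to do when data is missing, whereas an active check probes the host directly. Below is a minimal sketch of the metric side, assuming a series of samples in which `None` marks a missing datapoint. The names `fraction_over` and `metric_alert` and the 50% cut-off are hypothetical and not taken from the patches above.

```python
def fraction_over(datapoints, threshold):
    """Return the fraction of non-missing datapoints above threshold.

    Returns None when every datapoint is missing, because a
    metric-based check cannot tell "healthy" from "not reporting".
    """
    values = [v for v in datapoints if v is not None]
    if not values:
        return None
    over = sum(1 for v in values if v > threshold)
    return over / len(values)


def metric_alert(datapoints, threshold, min_fraction=0.5):
    """Classify a series as OK, CRITICAL, or UNKNOWN (no data)."""
    frac = fraction_over(datapoints, threshold)
    if frac is None:
        return "UNKNOWN"
    return "CRITICAL" if frac >= min_fraction else "OK"
```

The `UNKNOWN` branch is the case an active check would cover: if the host stops reporting metrics entirely, only a direct probe can say whether it is still up.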

Change 201618 merged by Yuvipanda:
labs: Alert on high load in labstore*

https://gerrit.wikimedia.org/r/201618

yuvipanda claimed this task.

Alright, so I'm going to consider this 'done' for now. More checks as warranted.