Add monitoring and capacity planning for Nodepool
Closed, Declined
Public

Description

There is zero monitoring for Nodepool. Off the top of my head, a bare minimum would be:

Event Timeline

hashar raised the priority of this task to Needs Triage.
hashar updated the task description.
hashar added subscribers: hashar, zeljkofilipin, dduvall.

I will file subtasks eventually. For some of them I can pair with @zeljkofilipin so he can level up on Puppet :-}

Change 244171 had a related patch set uploaded (by Hashar):
nodepool: monitor nodepoold is present

https://gerrit.wikimedia.org/r/244171

hashar set Security to None.

Change 244171 merged by Andrew Bogott:
nodepool: monitor nodepoold is present

https://gerrit.wikimedia.org/r/244171
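
For readers unfamiliar with this kind of check, here is a rough sketch of what "monitor nodepoold is present" amounts to. It is not the content of the Gerrit change, which is a Puppet patch. The function name `check_process_present` and the substring match on command lines are assumptions made for illustration; the exit codes follow the usual Nagios plugin convention (0 for OK, 2 for CRITICAL).

```python
from typing import Iterable, Tuple

# Nagios plugin exit codes (standard convention).
OK, CRITICAL = 0, 2


def check_process_present(
    cmdlines: Iterable[str], name: str = "nodepoold"
) -> Tuple[int, str]:
    """Return a Nagios-style (exit_code, message) for a process-presence check.

    `cmdlines` is the list of command lines of running processes; how it is
    collected (ps, /proc, ...) is left out of this sketch.
    """
    count = sum(1 for cmd in cmdlines if name in cmd)
    if count == 0:
        return CRITICAL, f"PROCS CRITICAL: 0 processes matching '{name}'"
    return OK, f"PROCS OK: {count} process(es) matching '{name}'"
```

Feeding it the command lines of the running processes gives a status code and a one-line message that an NRPE-style wrapper could report.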

Change 244229 had a related patch set uploaded (by Hashar):
nodepool: use nrpe:: class for monitoring

https://gerrit.wikimedia.org/r/244229

Change 244229 merged by Andrew Bogott:
nodepool: use nrpe:: class for monitoring

https://gerrit.wikimedia.org/r/244229

hashar claimed this task.

I think it is good enough for now. https://grafana.wikimedia.org/dashboard/db/nodepool has much of what I wanted.

Reopening. We would still need notifications when the pool is exhausted, on server-side errors, and for leaked (or alien) instances.
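
To make the three conditions concrete, here is a minimal sketch of how they could be evaluated from a snapshot of the pool state. Everything in it is hypothetical: `PoolSnapshot`, its fields, and `evaluate_alerts` are names invented for illustration and are not part of Nodepool. It assumes a leaked or alien instance is one the cloud provider reports but Nodepool does not track.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class PoolSnapshot:
    """Hypothetical point-in-time view of the pool (not a Nodepool type)."""
    max_servers: int            # configured pool capacity
    known_instances: Set[str]   # instances Nodepool is tracking
    cloud_instances: Set[str]   # instances the cloud provider reports
    server_errors: int          # server-side errors seen since the last check


def evaluate_alerts(snap: PoolSnapshot, error_threshold: int = 0) -> List[str]:
    """Return one message per alert condition that currently holds."""
    alerts: List[str] = []
    # Pool exhausted: every slot is in use.
    if len(snap.known_instances) >= snap.max_servers:
        alerts.append("pool exhausted")
    # Server-side errors above the tolerated threshold.
    if snap.server_errors > error_threshold:
        alerts.append("server side errors")
    # Leaked/alien: present in the cloud but unknown to Nodepool.
    leaked = snap.cloud_instances - snap.known_instances
    if leaked:
        alerts.append("leaked or alien instances: " + ", ".join(sorted(leaked)))
    return alerts
```

Each returned message could then be routed to whatever notification channel is in use; the thresholds would need tuning against real pool behaviour.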

greg added a subscriber: greg.

We're migrating away (see e.g. T187797), so there is no need to do this now.