
Add monitoring and capacity planning for Nodepool
Closed, Declined
Public

Description

There is zero monitoring for Nodepool. Off the top of my head, a bare minimum would be:

Event Timeline

hashar raised the priority of this task to Needs Triage.
hashar updated the task description.
hashar added subscribers: hashar, zeljkofilipin, dduvall.

I will file subtasks eventually. For some of them I can pair with @zeljkofilipin as a Puppet level-up :-}

Change 244171 had a related patch set uploaded (by Hashar):
nodepool: monitor nodepoold is present

https://gerrit.wikimedia.org/r/244171

hashar set Security to None.

Change 244171 merged by Andrew Bogott:
nodepool: monitor nodepoold is present

https://gerrit.wikimedia.org/r/244171

Change 244229 had a related patch set uploaded (by Hashar):
nodepool: use nrpe:: class for monitoring

https://gerrit.wikimedia.org/r/244229

Change 244229 merged by Andrew Bogott:
nodepool: use nrpe:: class for monitoring

https://gerrit.wikimedia.org/r/244229

hashar claimed this task.

I think it is good enough for now. https://grafana.wikimedia.org/dashboard/db/nodepool has much of what I wanted.

Reopening. We would need notifications when the pool is exhausted, on server-side errors, and for leaked (or alien) instances.

greg subscribed.

We're migrating away (see e.g. T187797), so there is no need to do this now.