There is currently no monitoring for Nodepool. Off the top of my head, a bare minimum would be:
- Nodepool process running on labnodepool1001.eqiad.wmnet (though Puppet or systemd would restart it) - https://gerrit.wikimedia.org/r/#/c/244171/
- CPU / load / mem usage
- health of MySQL sessions (one per booted instance; they do not recover properly after a network flap)
- graphs of the pool : https://grafana.wikimedia.org/dashboard/db/nodepool
- alert when pool is exhausted
- reachability of OpenStack API as the nodepool user
- detect abnormal behavior such as snapshot or instance creation failures
- send Nodepool logs to Logstash
- review monitoring contact list
- paging?
- first-level diagnostic procedures - https://wikitech.wikimedia.org/wiki/Nodepool
- number of Nodepool-managed slaves ready to accept jobs and number of offline nodes - https://grafana.wikimedia.org/dashboard/db/nodepool
- (bug 1) - https://wikitech.wikimedia.org/wiki/Nodepool
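
For the "alert when pool is exhausted" item, the check could look roughly like the sketch below. It only shows the decision logic, using the standard Nagios plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). Everything else is an assumption made for illustration: the function name `check_pool`, the state strings (`ready`, `used`, `building`), and the `min_ready` / `max_nodes` thresholds are hypothetical, and how the list of node states is obtained from Nodepool is deliberately left out.

```python
from collections import Counter

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_pool(node_states, min_ready=1, max_nodes=10):
    """Classify pool health from a list of per-node state strings.

    node_states: e.g. ["ready", "used", "building"] (state names are
                 assumed here, not taken from the Nodepool source).
    min_ready:   hypothetical threshold, warn when fewer nodes are ready.
    max_nodes:   hypothetical pool capacity (maximum number of instances).
    Returns a (exit_code, message) tuple.
    """
    if max_nodes <= 0:
        return UNKNOWN, "UNKNOWN: max_nodes must be positive"

    counts = Counter(node_states)
    ready = counts["ready"]          # Counter returns 0 for missing keys
    total = sum(counts.values())
    summary = "ready=%d total=%d max=%d" % (ready, total, max_nodes)

    # Exhausted: nothing is ready and no capacity is left to boot more.
    if ready == 0 and total >= max_nodes:
        return CRITICAL, "CRITICAL: pool exhausted (%s)" % summary
    # Running low: some capacity may remain, but few nodes are ready.
    if ready < min_ready:
        return WARNING, "WARNING: few ready nodes (%s)" % summary
    return OK, "OK: %s" % summary


if __name__ == "__main__":
    import sys

    # Example input only; a real check would fetch the states from Nodepool.
    code, message = check_pool(["ready", "used", "used"], min_ready=1, max_nodes=4)
    print(message)
    sys.exit(code)
```

Separating "exhausted" (CRITICAL) from "few ready nodes" (WARNING) would let the second case stay non-paging while the contact list and paging policy are still under review.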