Change Details

There is zero monitoring for Nodepool. Off the top of my head, a bare minimum would be:

[x] process present on labnodepool1001.eqiad.wmnet (though Puppet or systemd restarts it) - https://gerrit.wikimedia.org/r/#/c/244171/
[ ] CPU / load / memory usage
[ ] viability of MySQL sessions (one per booted instance; they do not recover properly after a networking flap)
[x] graphs of the pool (need to restrict the metrics Nodepool sends, it is too spammy): https://grafana.wikimedia.org/dashboard/db/nodepool
[ ] alert when the pool is exhausted
[ ] reachability of the OpenStack API as the nodepool user
[ ] detect weird behavior such as snapshot/instance creation failures
[ ] send Nodepool logs to Logstash
[ ] review monitoring contact list
[ ] paging?
[x] first-level diagnostic procedures - https://wikitech.wikimedia.org/wiki/Nodepool
[x] number of Nodepool-managed slaves ready to accept jobs and number of offline nodes - https://grafana.wikimedia.org/dashboard/db/nodepool
[x] [[http://phabricator.wikimedia.org/T2001 | (bug 1)]] - https://wikitech.wikimedia.org/wiki/Nodepool
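
The "alert when pool is exhausted" item could be covered by a small Nagios-style check. The following is only a minimal sketch: it assumes the ready and total node counts are collected elsewhere (for example from the metrics behind the Grafana dashboard), and the function name `check_pool` and the `warn_ratio` threshold are hypothetical, not part of Nodepool.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def check_pool(ready, total, warn_ratio=0.25):
    """Return (exit_code, message) for a pool with `ready` of `total` nodes ready.

    `warn_ratio` is an assumed threshold: warn when fewer than this
    fraction of the pool is ready to accept jobs.
    """
    if total <= 0:
        return CRITICAL, "CRITICAL: pool has no nodes"
    if ready <= 0:
        return CRITICAL, "CRITICAL: pool exhausted (0/%d ready)" % total
    if ready / total < warn_ratio:
        return WARNING, "WARNING: pool nearly exhausted (%d/%d ready)" % (ready, total)
    return OK, "OK: %d/%d nodes ready" % (ready, total)


if __name__ == "__main__":
    import sys

    # Example values only; a real check would fetch these counts first.
    code, message = check_pool(ready=2, total=10)
    print(message)
    sys.exit(code)
```

Keeping the threshold logic in a pure function separates it from however the counts end up being fetched, so the same logic could sit behind an Icinga check or a Grafana-driven alert.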