Prometheus in codfw was running low on disk space, so I've adjusted the retention period in https://gerrit.wikimedia.org/r/#/c/342810/. However, the additional I/O bandwidth required for maintenance exceeded what was available from the spinning disks in ganeti codfw.
Separately, Prometheus in esams was found after a restart with one of its LevelDB databases corrupted; the cause is not known.
- I've band-aided prometheus2001 by giving it more memory for chunks and applying other tuning; this needs to be puppetized as well
- esams Prometheus had accumulated a large number of metrics over time, likely due to the varnish/vcl metric churn documented at T150479, and was taking a long time to recover. I've moved its metrics aside and started fresh
- The alerts on Prometheus machines should catch these conditions (e.g. "rushed mode")
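
As an illustration of the last point, here is a minimal sketch of what such a check could look like. It is not the alert actually deployed. It assumes the Prometheus 1.x local storage metrics `prometheus_local_storage_rushed_mode` and `prometheus_local_storage_persistence_urgency_score` are exposed on the server's own `/metrics` endpoint. The function names and the 0.8 threshold are hypothetical and chosen only for the example.

```python
def parse_metrics(text):
    """Parse un-labelled samples from Prometheus text exposition output.

    Returns a dict mapping metric name to float value. Comment lines,
    blank lines and labelled series are skipped, which is enough for
    the two server-level gauges checked below.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split()
        if len(parts) < 2 or "{" in parts[0]:
            continue
        try:
            samples[parts[0]] = float(parts[1])
        except ValueError:
            continue
    return samples


def storage_alerts(samples, urgency_threshold=0.8):
    """Return the list of storage conditions that should raise an alert.

    urgency_threshold is a hypothetical value, not a tuned one.
    """
    alerts = []
    # Assumed metric: 1 while the local storage is in "rushed mode".
    if samples.get("prometheus_local_storage_rushed_mode", 0.0) == 1.0:
        alerts.append("rushed_mode")
    # Assumed metric: score between 0 and 1 for persistence urgency.
    score = samples.get("prometheus_local_storage_persistence_urgency_score", 0.0)
    if score >= urgency_threshold:
        alerts.append("high_urgency")
    return alerts


# Hand-written sample input, not real output from prometheus2001.
SAMPLE = """\
# TYPE prometheus_local_storage_rushed_mode gauge
prometheus_local_storage_rushed_mode 1
# TYPE prometheus_local_storage_persistence_urgency_score gauge
prometheus_local_storage_persistence_urgency_score 0.93
"""

if __name__ == "__main__":
    print(storage_alerts(parse_metrics(SAMPLE)))
```

In practice the same two conditions would more likely be written as alerting rules evaluated by Prometheus itself or by the existing monitoring; the script form is only meant to make the conditions explicit.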