The ops channel is being spammed with CRITICAL icinga alerts from seemingly all of the prometheus-based checks in esams, due to some issue with the prometheus server instance on bast3002 (shouldn't these be UNKNOWNs in a case like this?).
Load average, iowait%, and disk read traffic all seem high on bast3002. The prometheus@ops instance had these entries in its journalctl log from around the time the problems started (though not exactly):
Apr 19 22:33:21 bast3002 prometheus@ops[9406]: time="2018-04-19T22:33:21Z" level=warning msg="Error while evaluating rule \"ALERT PrometheusReloadFailed\n\tIF prometheus_config_last_reload_successful == 0\n\tFOR 1h\n\tLABELS {severity=\\"warn\\"}\n\tANNOTATIONS {SUMMARY=\\"Prometheus {{$labels.instance}} config reload fail\\"}\": query timed out in query queue" source="manager.go:282"
Apr 19 22:33:21 bast3002 prometheus@ops[9406]: time="2018-04-19T22:33:21Z" level=warning msg="Error while evaluating rule \"ALERT InstanceDown\n\tIF up == 0\n\tFOR 3m\n\tLABELS {severity=\\"warn\\"}\n\tANNOTATIONS {SUMMARY=\\"Instance {{$labels.instance}}/{{$labels.job}} down\\"}\": query timed out in query queue" source="manager.go:282"
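For readability, the two alerting rules that were being evaluated when the queries timed out (unescaped from the log lines above, Prometheus 1.x rule syntax) are:

  ALERT PrometheusReloadFailed
    IF prometheus_config_last_reload_successful == 0
    FOR 1h
    LABELS {severity="warn"}
    ANNOTATIONS {SUMMARY="Prometheus {{$labels.instance}} config reload fail"}

  ALERT InstanceDown
    IF up == 0
    FOR 3m
    LABELS {severity="warn"}
    ANNOTATIONS {SUMMARY="Instance {{$labels.instance}}/{{$labels.job}} down"}

So the rules themselves look fine; the "query timed out in query queue" errors suggest the server was too overloaded to evaluate them in time.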
If it flaps again, I'll try restarting the prometheus service there and see what happens.
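If it comes to that, the restart would presumably be something along these lines on bast3002 (assuming the systemd unit name matches the prometheus@ops instance seen in the journal above):

  # restart the ops prometheus instance, then watch its log for recurring query-queue timeouts
  sudo systemctl restart prometheus@ops
  sudo journalctl -u prometheus@ops -f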