We have some stats collection that only happens on master nodes of the elasticsearch clusters. While reviewing some dashboards today I noticed that one hadn't been reporting data. After logging into the instance and testing the ports, I found that the local prometheus daemon was accepting connections but never responding. Restarting the daemon (prometheus-wmf-elasticsearch-exporter) restored stats collection.
It seems we should expect that daemons may get stuck at some point or another. The main issue here is that we had no way to know this daemon had stopped working correctly other than looking at the stats. Monitors like the systemd service are happy because the process is still running.
AC: Daemons are automatically restarted, and repeated failures give some sort of notice
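
As a starting point for discussion, here is a rough sketch of a liveness probe that would have caught this failure mode: it requires the exporter to actually answer an HTTP request within a deadline, rather than only accept the TCP connection. The function name, the /metrics path, and the timeout value are assumptions for illustration, not what the exporter or our monitoring uses today.

```python
import http.client
import urllib.request


def exporter_is_responsive(url: str, timeout: float = 5.0) -> bool:
    """Return True only if `url` answers an HTTP GET with 200 and a
    non-empty body within `timeout` seconds (hypothetical helper)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            # Read one byte of the body so that a daemon which sends
            # headers and then stalls is also treated as unhealthy.
            return resp.status == 200 and len(resp.read(1)) > 0
    except (OSError, http.client.HTTPException):
        # Covers refused connections, timeouts (the "accepts but never
        # responds" case), non-2xx responses, and malformed replies.
        return False
```

Something like a timer or an existing monitoring check could call this against the exporter's port, restart prometheus-wmf-elasticsearch-exporter when it returns False, and raise a notice after several consecutive failures. That wiring is left out here since it depends on how we want to deploy it.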