At 8am UTC on Aug 17, mjolnir-kafka-bulk-daemon failed on all elasticsearch / eqiad nodes. The logs indicate an HTTP "connection refused" error, probably when connecting to the local elasticsearch instance.
Mjolnir relies on systemd to restart it after transient failures, so this unit is expected to fail regularly and be restarted. It should be possible to have systemd report the unit as failed only after several consecutive restart attempts have failed.
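As a rough sketch of what that could look like, a drop-in along these lines might work. The drop-in path and all numeric values are assumptions for illustration, not the current configuration, and the existing unit may already set some of these options. With `Restart=` set, systemd keeps the unit in the auto-restart state between attempts and only marks it failed once the start limit is hit.

```ini
# Hypothetical drop-in, e.g.
# /etc/systemd/system/mjolnir-kafka-bulk-daemon.service.d/restart.conf
# All values below are illustrative and would need tuning.

[Unit]
# Allow up to 5 start attempts within a 10 minute window.
# Once exceeded, systemd stops restarting and the unit enters "failed".
StartLimitIntervalSec=600
StartLimitBurst=5

[Service]
# Restart automatically when the daemon exits with an error.
Restart=on-failure
# Wait between attempts so a briefly unavailable elasticsearch can recover.
RestartSec=30
```

Note that `StartLimitIntervalSec=` in `[Unit]` needs a reasonably recent systemd; older versions use `StartLimitInterval=` in `[Service]` instead. The interval also has to be longer than `StartLimitBurst` times `RestartSec`, otherwise the limit is never reached and the unit restarts forever without being reported as failed.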
Side note: I'm wondering why we had a transient failure across the whole cluster.