
mjolnir-kafka-bulk-daemon failed on all elastic / eqiad nodes
Closed, Resolved · Public

Description

At 8am UTC on Aug 17, mjolnir-kafka-bulk-daemon failed on all elasticsearch / eqiad nodes. The logs indicate this was an HTTP connection refused error, probably on the connection to the local elasticsearch instance.

Mjolnir relies on systemd to restart it in case of transient failures, so it is expected that this unit will fail regularly and be restarted. It should be possible to have systemd not report the unit as failed until it has failed to restart a number of times.

Side note: I'm wondering why we had a transient failure across the whole cluster.

Event Timeline

Gehel created this task. · Aug 17 2018, 8:09 AM
Restricted Application added a subscriber: Aklapper. · Aug 17 2018, 8:09 AM
Gehel triaged this task as High priority. · Aug 17 2018, 8:09 AM
Gehel added a comment. · Aug 17 2018, 8:35 AM

There seems to be some correlation with a high number of failed relocations that happened just before mjolnir failed (see [[ URL | logstash ]]). No idea whether there is any causality here.

Mentioned in SAL (#wikimedia-operations) [2018-08-17T16:24:48Z] <gehel> disabling systemd state check for elastic eqiad until T202120 is fixed

Looking at elastic1020, journalctl -u mjolnir-kafka-bulk-daemon shows:

Aug 17 16:14:14 elastic1020 systemd[1]: mjolnir-kafka-bulk-daemon.service: Main process exited, code=exited, status=1/FAILURE
Aug 17 16:14:14 elastic1020 systemd[1]: mjolnir-kafka-bulk-daemon.service: Unit entered failed state.
Aug 17 16:14:14 elastic1020 systemd[1]: mjolnir-kafka-bulk-daemon.service: Failed with result 'exit-code'.
Aug 17 16:33:53 elastic1020 systemd[1]: Started MjoLniR kafka bulk daemon.

There is a matching puppet entry in journalctl:

Aug 17 16:33:53 elastic1020 puppet-agent[9782]: (/Stage[main]/Profile::Mjolnir::Kafka_bulk_daemon/Systemd::Service[mjolnir-kafka-bulk-daemon]/Service[mjolnir-kafka-bulk-daemon]/ensure

This suggests systemd isn't restarting the service: the unit stayed in the failed state from 16:14:14 until a puppet run started it again at 16:33:53. We should set Restart=always in the systemd config to match how the service expects to run.
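
For illustration only, here is a minimal sketch of what such a setting could look like as a systemd drop-in. This is not the content of the puppet change referenced in this task. The drop-in file name and all numeric values are assumptions chosen for the example; only Restart=always comes from this discussion. The start-limit settings are one possible way to get the behaviour asked for in the description, where the unit is only reported as failed after several restart attempts in a row.

```ini
# Hypothetical drop-in, e.g.
# /etc/systemd/system/mjolnir-kafka-bulk-daemon.service.d/restart.conf
# (file name and all numeric values below are illustrative assumptions)

[Unit]
# Allow at most 5 start attempts within 300 seconds. Once this limit is
# exceeded, systemd stops retrying and the unit enters the failed state.
StartLimitIntervalSec=300
StartLimitBurst=5

[Service]
# Restart the daemon whenever the main process exits, whatever the exit status.
Restart=always
# Wait 10 seconds between restart attempts so that a short outage of the
# local elasticsearch instance does not use up the start limit immediately.
RestartSec=10
```

With settings along these lines, a transient failure leaves the unit in the auto-restart state instead of failed, and a failed-state check would only trigger once the start limit has been hit. After adding a drop-in by hand, systemctl daemon-reload is needed for systemd to pick it up.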

Change 453450 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[operations/puppet@production] Mjolnir daemons should run with Restart=always

https://gerrit.wikimedia.org/r/453450

Change 453450 merged by Gehel:
[operations/puppet@production] Mjolnir daemons should run with Restart=always

https://gerrit.wikimedia.org/r/453450

Gehel added a comment. · Aug 20 2018, 1:51 PM

Restart=always on the systemd unit should fix the immediate issue. This has been deployed. I'm keeping this task open for a few more days, until we can validate that the issue does not recur.

A full export ran over the daemon from 8/20 13:24 to 8/22 02:00 without triggering this issue again. I think it can be closed.

debt closed this task as Resolved. · Aug 24 2018, 4:00 PM