Maniphest T202120

mjolnir-kafka-bulk-daemon failed on all elastic / eqiad nodes
Closed, Resolved (Public)

Authored By

	Gehel
	Aug 17 2018, 8:09 AM

Description

At 8am UTC Aug 17, mjolnir-kafka-bulk-daemon failed on all elasticsearch / eqiad nodes. The logs indicate an HTTP connection refused error, probably when connecting to the local elasticsearch instance.

Mjolnir relies on systemd to restart it after transient failures, so this unit is expected to fail regularly and be restarted. It should be possible to have systemd not report the unit as failed until a number of restart attempts have failed.
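
A minimal sketch of how that could be expressed in a unit file, assuming a systemd version that accepts StartLimitIntervalSec= in [Unit] (older releases use StartLimitInterval= in [Service]). All values are illustrative assumptions, not the deployed configuration:

```ini
[Unit]
# Assumed values: allow up to 5 start attempts within any 10 minute window.
StartLimitIntervalSec=600
StartLimitBurst=5

[Service]
# Restart when the process exits uncleanly (e.g. a non-zero exit code).
Restart=on-failure
# Assumed delay between restart attempts.
RestartSec=30
```

With settings like these, systemd keeps retrying and the unit should only stay in the failed state once the start limit has been hit, rather than after the first transient error.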

Side note: I'm wondering why we had a transient failure across the whole cluster.

Details

	Subject	Repo	Branch	Lines +/-
	Mjolnir daemons should run with Restart=always	operations/puppet	production	+2 -0

Event Timeline

Gehel created this task. Aug 17 2018, 8:09 AM

Restricted Application added a subscriber: Aklapper. Aug 17 2018, 8:09 AM

Gehel triaged this task as High priority. Aug 17 2018, 8:09 AM

There seems to be some correlation with a high number of failed relocations that happened just before mjolnir failed (see [[ URL | logstash ]]). No idea whether there is any causality here.

Mentioned in SAL (#wikimedia-operations) [2018-08-17T16:24:48Z] <gehel> disabling systemd state check for elastic eqiad until T202120 is fixed

Looking at elastic1020, journalctl -u mjolnir-kafka-bulk-daemon shows:

Aug 17 16:14:14 elastic1020 systemd[1]: mjolnir-kafka-bulk-daemon.service: Main process exited, code=exited, status=1/FAILURE
Aug 17 16:14:14 elastic1020 systemd[1]: mjolnir-kafka-bulk-daemon.service: Unit entered failed state.
Aug 17 16:14:14 elastic1020 systemd[1]: mjolnir-kafka-bulk-daemon.service: Failed with result 'exit-code'.
Aug 17 16:33:53 elastic1020 systemd[1]: Started MjoLniR kafka bulk daemon.

There is a matching puppet log in journalctl for puppet:

Aug 17 16:33:53 elastic1020 puppet-agent[9782]: (/Stage[main]/Profile::Mjolnir::Kafka_bulk_daemon/Systemd::Service[mjolnir-kafka-bulk-daemon]/Service[mjolnir-kafka-bulk-daemon]/ensure

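The unit stayed in the failed state from 16:14:14 until the puppet run started it again at 16:33:53. This suggests systemd isn't restarting the service, and we should set Restart=always in the systemd config to match how the service expects to run.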
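
For reference, a minimal sketch of such a setting as a systemd drop-in. The drop-in path and the RestartSec value are assumptions for illustration, not the contents of any actual puppet change:

```ini
# Hypothetical drop-in: /etc/systemd/system/mjolnir-kafka-bulk-daemon.service.d/restart.conf
[Service]
# Restart the daemon whenever it exits, regardless of exit status.
Restart=always
# Assumed delay between restart attempts; avoids a tight restart loop
# while the local elasticsearch instance is still refusing connections.
RestartSec=10
```

After adding or changing a drop-in by hand, systemctl daemon-reload is needed for systemd to pick it up.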

Change 453450 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[operations/puppet@production] Mjolnir daemons should run with Restart=always

https://gerrit.wikimedia.org/r/453450

gerritbot added a project: Patch-For-Review. Aug 17 2018, 6:04 PM

EBernhardson claimed this task. Aug 17 2018, 6:06 PM

EBernhardson moved this task from Incoming to Needs review on the Discovery-Search (Current work) board.

Change 453450 merged by Gehel:
[operations/puppet@production] Mjolnir daemons should run with Restart=always

https://gerrit.wikimedia.org/r/453450

Restart=always on the systemd unit should fix the immediate issue. This has been deployed. I'm keeping this task open for a few more days, until we can validate that the issue does not recur.

A full export ran over the daemon from 8/20 13:24 to 8/22 02:00 without triggering this issue again. I think it can be closed.

EBernhardson moved this task from Needs review to Needs Reporting on the Discovery-Search (Current work) board. Aug 22 2018, 5:43 PM

debt closed this task as Resolved. Aug 24 2018, 4:00 PM