nested RemoteTransportExceptions filled the disk on elastic1036 and elastic1045 during a rolling restart
Closed, Resolved, Public

Description

It happened during the rolling restarts needed for a security upgrade (T138811).
It's still unclear what caused such a flood in the logs, but it is certainly the symptom of a deeper issue.
Created an upstream ticket: https://github.com/elastic/elasticsearch/issues/19187

If it happens again, one can truncate the log file; restarting one of the affected nodes seemed to fix the issue.
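
As a rough illustration of the truncation step, here is a minimal Python sketch. The helper name `truncate_log` and the path in the usage note are hypothetical; the actual log location on the elastic hosts is not stated in this task. It truncates the file in place rather than deleting it, so the process holding the file open keeps writing to the same file. Whether the disk space is reclaimed cleanly depends on how the writer opened the file (append mode is assumed here).

```python
import os


def truncate_log(path: str) -> int:
    """Truncate the file at `path` to zero bytes in place.

    Returns the size in bytes the file had before truncation.
    The file itself is kept (same inode), so a process that still
    has it open can continue logging to it.
    """
    size_before = os.path.getsize(path)
    # Shrink the file to 0 bytes without removing or recreating it.
    os.truncate(path, 0)
    return size_before
```

Example use, with a placeholder path: `truncate_log("/path/to/elasticsearch.log")`. Copying the file elsewhere first is advisable if the contents are still needed for debugging.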

FTR: I'll upload the logs to terbium under a folder named with this ticket id.

dcausse created this task. Jun 30 2016, 1:35 PM
Restricted Application added a project: Discovery. Jun 30 2016, 1:35 PM
Restricted Application added subscribers: Zppix, Aklapper.
dcausse moved this task from Needs triage to Ops on the Discovery board. Jun 30 2016, 2:40 PM
dcausse moved this task from Needs triage to Later on the Discovery-Search board.

This is planned to be fixed upstream in Elasticsearch 2.4.0. In the meantime, we could mitigate the issue with T130590.

Mentioned in SAL (#wikimedia-operations) [2017-02-05T23:41:16Z] <gehel> truncating elasticsearch logs on elastic1022 - T139043

Mentioned in SAL (#wikimedia-operations) [2017-02-05T23:42:30Z] <gehel> truncating elasticsearch logs on elastic10(24|26|40) - T139043

Gehel added a comment. Feb 6 2017, 9:39 AM

Looking at [[ 2gMo9gn2p3Myxu | graphs ]] and logs, it looks like the issue this weekend caused some trouble from 2017-02-04 21:00 UTC to 2017-02-05 23:45 UTC. Logstash indicates a number of indexing failures:

Search backend error during sending {numBulk} documents to the UNKNOWN index(s) after 2:

(Note that some placeholders in the log message, such as {numBulk}, do not appear to be substituted.)

I'll start a reindex for the outage timeframe.

Mentioned in SAL (#wikimedia-operations) [2017-02-06T09:40:07Z] <gehel> elasticsearch - reindexing from 2017-02-04T20:00:00Z to 2017-02-05T23:59:00Z - T139043

Mentioned in SAL (#wikimedia-operations) [2017-02-09T09:52:46Z] <gehel> cleaning up logs on elastic20(01|16) - T139043

Mentioned in SAL (#wikimedia-operations) [2017-02-11T09:09:28Z] <gehel> cleanup logs on elastic20(01|25) - T139043

Gehel closed this task as Resolved. Mon, Nov 27, 3:14 PM
Gehel claimed this task.

This should have been fixed as of Elasticsearch 2.4.0 (we are now at 5.5), and we have not seen the issue again. Closing.