nested RemoteTransportExceptions filled the disk on elastic1036 and elastic1045 during a rolling restart
Open, Needs Triage · Public

Description

It happened during the rolling restarts needed for a security upgrade (T138811).
It's still unclear what caused such a flood in the logs, but it's certainly the symptom of a deeper issue.
Created a ticket upstream: https://github.com/elastic/elasticsearch/issues/19187

If it happens again, one can truncate the log file; restarting one of the affected nodes seemed to fix the issue.
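
For reference, a minimal sketch of the truncation step in Python. This is not the procedure that was actually run on the cluster; the helper name `truncate_log` is hypothetical, and it assumes the process writing the log opened it in append mode, so that it keeps writing at the new end of the file after truncation.

```python
import os


def truncate_log(path: str) -> int:
    """Truncate the file at `path` in place and return the number of bytes freed."""
    freed = os.path.getsize(path)
    # "r+" opens the existing file without replacing it, so the inode stays
    # the same and a process that still holds the file open is not left
    # writing to a deleted file (which would keep the disk space allocated).
    with open(path, "r+") as f:
        f.truncate(0)
    return freed
```

It would be called with the path of the flooded log file, for example `truncate_log("/path/to/elasticsearch.log")` (a placeholder path). Truncating in place rather than deleting the file is the point here: deleting a file that a running process still has open does not release the space until that process closes it.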

FTR: I'll upload the logs to terbium under a folder named with this ticket id.

dcausse created this task. Jun 30 2016, 1:35 PM
Restricted Application added a project: Discovery. Jun 30 2016, 1:35 PM
Restricted Application added subscribers: Zppix, Aklapper.
dcausse moved this task from Needs triage to Ops on the Discovery board. Jun 30 2016, 2:40 PM
dcausse moved this task from Needs triage to Later on the Discovery-Search board.

This is planned to be fixed upstream in elasticsearch 2.4.0. In the meantime, we could mitigate the issue with T130590.

Mentioned in SAL (#wikimedia-operations) [2017-02-05T23:41:16Z] <gehel> truncating elasticsearch logs on elastic1022 - T139043

Mentioned in SAL (#wikimedia-operations) [2017-02-05T23:42:30Z] <gehel> truncating elasticsearch logs on elastic10(24|26|40) - T139043

Gehel added a comment. Feb 6 2017, 9:39 AM

Looking at [[ 2gMo9gn2p3Myxu | graphs ]] and logs, it looks like the issue this weekend caused trouble from 2017-02-04 21:00 UTC to 2017-02-05 23:45 UTC. Logstash shows a number of indexing failures:

Search backend error during sending {numBulk} documents to the UNKNOWN index(s) after 2:

(Note that some variable substitution in the log message does not seem to occur.)

I'll start a reindex for the outage timeframe.

Mentioned in SAL (#wikimedia-operations) [2017-02-06T09:40:07Z] <gehel> elasticsearch - reindexing from 2017-02-04T20:00:00Z to 2017-02-05T23:59:00Z - T139043

Mentioned in SAL (#wikimedia-operations) [2017-02-09T09:52:46Z] <gehel> cleaning up logs on elastic20(01|16) - T139043

Mentioned in SAL (#wikimedia-operations) [2017-02-11T09:09:28Z] <gehel> cleanup logs on elastic20(01|25) - T139043