Page MenuHomePhabricator

Elasticsearch restarts are failing in the logstash cluster
Closed, ResolvedPublic

Description

Restarts of the elasticsearch service running on the logstash cluster are failing (it happened on logstah100[45] so far). The service command doesn't return since the elasticsearch service fails to shutdown. Guillaume filed a bug upstream at https://github.com/elastic/elasticsearch/issues/19829

Recent rolling restarts of elasticsearch on the search cluster went fine, so this is likely caused by a different workload on the logstash cluster. This didn't happen in the past when logstash* was still running Elastic 1.7.x

Event Timeline

Uploaded diagnostic dump after reviewing it with @dcausse to ensure no confidential info is present:

I could reproduce the issue with a shard from logstash1004 (logstash-2016.07.16).

This is 77068993msec to merge doc values (21 hours) writing nearly no data.
I suspect that we have many fields with a single value causing elastic to loop over all the docs and writing nothing.

I'll keep this data on my local disk just in case we can send it to elastic.

IFD 0 [2016-08-09T14:48:51.656Z; main]: init: current segments file is "segments_2y"; deletionPolicy=org.apache.lucene.index.KeepOnlyLastCommitDeletionPolicy@5884a914
IFD 0 [2016-08-09T14:48:51.679Z; main]: init: load commit "segments_2y"
IFD 0 [2016-08-09T14:48:51.684Z; main]: now checkpoint "_9jf(4.10.4):C4914053 _ded(4.10.4):C3404992 _a2w(4.10.4):c68285 _jy7(4.10.4):C4618409 _gq2(4.10.4):c493901 _l8h(4.10.4):c474784 _kly(4.10.4):c532248 _l6a(4.10.4):c6992 _lbk(4.10.4):c44750 _l7e(4.10.4):c6174 _l7y(4.10.4):c6492 _lah(4.10.4):c4800 _l93(4.10.4):c4111 _la8(4.10.4):c3262 _lbv(4.10.4):c2875 _lc4(4.10.4):c4382 _lcg(4.10.4):c3754 _lbg(4.10.4):c512 _lbr(4.10.4):c785 _lco(4.10.4):c4579 _lc0(4.10.4):c710 _lcz(4.10.4):c644 _lc7(4.10.4):c640 _lcy(4.10.4):c1 _ld0(5.5.0):c1 _ld1(5.5.0):c1" [26 segments ; isCommit = false]
IFD 0 [2016-08-09T14:48:51.685Z; main]: 0 msec to checkpoint
IW 0 [2016-08-09T14:48:51.691Z; main]: init: create=false
IW 0 [2016-08-09T14:48:51.691Z; main]: 
dir=MMapDirectory@/plat/logstash/idx2 lockFactory=org.apache.lucene.store.NativeFSLockFactory@54e041a4
index=_9jf(4.10.4):C4914053 _ded(4.10.4):C3404992 _a2w(4.10.4):c68285 _jy7(4.10.4):C4618409 _gq2(4.10.4):c493901 _l8h(4.10.4):c474784 _kly(4.10.4):c532248 _l6a(4.10.4):c6992 _lbk(4.10.4):c44750 _l7e(4.10.4):c6174 _l7y(4.10.4):c6492 _lah(4.10.4):c4800 _l93(4.10.4):c4111 _la8(4.10.4):c3262 _lbv(4.10.4):c2875 _lc4(4.10.4):c4382 _lcg(4.10.4):c3754 _lbg(4.10.4):c512 _lbr(4.10.4):c785 _lco(4.10.4):c4579 _lc0(4.10.4):c710 _lcz(4.10.4):c644 _lc7(4.10.4):c640 _lcy(4.10.4):c1 _ld0(5.5.0):c1 _ld1(5.5.0):c1
version=5.5.0
analyzer=org.apache.lucene.analysis.standard.StandardAnalyzer
ramBufferSizeMB=16.0
maxBufferedDocs=-1
maxBufferedDeleteTerms=-1
mergedSegmentWarmer=null
delPolicy=org.apache.lucene.index.KeepOnlyLastCommitDeletionPolicy
commit=null
openMode=CREATE_OR_APPEND
similarity=org.apache.lucene.search.similarities.DefaultSimilarity
mergeScheduler=ConcurrentMergeScheduler: maxThreadCount=-1, maxMergeCount=-1, ioThrottle=true
default WRITE_LOCK_TIMEOUT=0
writeLockTimeout=0
codec=Lucene54
infoStream=org.apache.lucene.util.PrintStreamInfoStream
mergePolicy=[TieredMergePolicy: maxMergeAtOnce=10, maxMergeAtOnceExplicit=30, maxMergedSegmentMB=5120.0, floorSegmentMB=2.0, forceMergeDeletesPctAllowed=10.0, segmentsPerTier=10.0, maxCFSSegmentSizeMB=8.796093022207999E12, noCFSRatio=0.1
indexerThreadPool=org.apache.lucene.index.DocumentsWriterPerThreadPool@2c78324b
readerPooling=false
perThreadHardLimitMB=1945
useCompoundFile=true
commitOnClose=true
writer=org.apache.lucene.index.IndexWriter@79defdc

IW 0 [2016-08-09T14:48:51.691Z; main]: MMapDirectory.UNMAP_SUPPORTED=true
IW 0 [2016-08-09T14:48:51.692Z; main]: forceMerge: index now _9jf(4.10.4):C4914053 _ded(4.10.4):C3404992 _a2w(4.10.4):c68285 _jy7(4.10.4):C4618409 _gq2(4.10.4):c493901 _l8h(4.10.4):c474784 _kly(4.10.4):c532248 _l6a(4.10.4):c6992 _lbk(4.10.4):c44750 _l7e(4.10.4):c6174 _l7y(4.10.4):c6492 _lah(4.10.4):c4800 _l93(4.10.4):c4111 _la8(4.10.4):c3262 _lbv(4.10.4):c2875 _lc4(4.10.4):c4382 _lcg(4.10.4):c3754 _lbg(4.10.4):c512 _lbr(4.10.4):c785 _lco(4.10.4):c4579 _lc0(4.10.4):c710 _lcz(4.10.4):c644 _lc7(4.10.4):c640 _lcy(4.10.4):c1 _ld0(5.5.0):c1 _ld1(5.5.0):c1
IW 0 [2016-08-09T14:48:51.692Z; main]: now flush at forceMerge
IW 0 [2016-08-09T14:48:51.692Z; main]:   start flush: applyAllDeletes=true
IW 0 [2016-08-09T14:48:51.692Z; main]:   index before flush _9jf(4.10.4):C4914053 _ded(4.10.4):C3404992 _a2w(4.10.4):c68285 _jy7(4.10.4):C4618409 _gq2(4.10.4):c493901 _l8h(4.10.4):c474784 _kly(4.10.4):c532248 _l6a(4.10.4):c6992 _lbk(4.10.4):c44750 _l7e(4.10.4):c6174 _l7y(4.10.4):c6492 _lah(4.10.4):c4800 _l93(4.10.4):c4111 _la8(4.10.4):c3262 _lbv(4.10.4):c2875 _lc4(4.10.4):c4382 _lcg(4.10.4):c3754 _lbg(4.10.4):c512 _lbr(4.10.4):c785 _lco(4.10.4):c4579 _lc0(4.10.4):c710 _lcz(4.10.4):c644 _lc7(4.10.4):c640 _lcy(4.10.4):c1 _ld0(5.5.0):c1 _ld1(5.5.0):c1
DW 0 [2016-08-09T14:48:51.692Z; main]: startFullFlush
DW 0 [2016-08-09T14:48:51.693Z; main]: main finishFullFlush success=true
IW 0 [2016-08-09T14:48:51.693Z; main]: apply all deletes during flush
IW 0 [2016-08-09T14:48:51.693Z; main]: now apply all deletes for all segments maxDoc=14602137
BD 0 [2016-08-09T14:48:51.693Z; main]: applyDeletes: open segment readers took 0 msec
BD 0 [2016-08-09T14:48:51.693Z; main]: applyDeletes: no segments; skipping
BD 0 [2016-08-09T14:48:51.694Z; main]: prune sis=segments_2y: _9jf(4.10.4):C4914053 _ded(4.10.4):C3404992 _a2w(4.10.4):c68285 _jy7(4.10.4):C4618409 _gq2(4.10.4):c493901 _l8h(4.10.4):c474784 _kly(4.10.4):c532248 _l6a(4.10.4):c6992 _lbk(4.10.4):c44750 _l7e(4.10.4):c6174 _l7y(4.10.4):c6492 _lah(4.10.4):c4800 _l93(4.10.4):c4111 _la8(4.10.4):c3262 _lbv(4.10.4):c2875 _lc4(4.10.4):c4382 _lcg(4.10.4):c3754 _lbg(4.10.4):c512 _lbr(4.10.4):c785 _lco(4.10.4):c4579 _lc0(4.10.4):c710 _lcz(4.10.4):c644 _lc7(4.10.4):c640 _lcy(4.10.4):c1 _ld0(5.5.0):c1 _ld1(5.5.0):c1 minGen=0 packetCount=0
TMP 0 [2016-08-09T14:48:51.694Z; main]: findForcedMerges maxSegmentCount=1 infos=_9jf(4.10.4):C4914053 _ded(4.10.4):C3404992 _a2w(4.10.4):c68285 _jy7(4.10.4):C4618409 _gq2(4.10.4):c493901 _l8h(4.10.4):c474784 _kly(4.10.4):c532248 _l6a(4.10.4):c6992 _lbk(4.10.4):c44750 _l7e(4.10.4):c6174 _l7y(4.10.4):c6492 _lah(4.10.4):c4800 _l93(4.10.4):c4111 _la8(4.10.4):c3262 _lbv(4.10.4):c2875 _lc4(4.10.4):c4382 _lcg(4.10.4):c3754 _lbg(4.10.4):c512 _lbr(4.10.4):c785 _lco(4.10.4):c4579 _lc0(4.10.4):c710 _lcz(4.10.4):c644 _lc7(4.10.4):c640 _lcy(4.10.4):c1 _ld0(5.5.0):c1 _ld1(5.5.0):c1 segmentsToMerge={_a2w(4.10.4):c68285=true, _l8h(4.10.4):c474784=true, _lcy(4.10.4):c1=true, _lc4(4.10.4):c4382=true, _l7e(4.10.4):c6174=true, _l7y(4.10.4):c6492=true, _lcg(4.10.4):c3754=true, _lco(4.10.4):c4579=true, _ld1(5.5.0):c1=true, _lc0(4.10.4):c710=true, _lbv(4.10.4):c2875=true, _ld0(5.5.0):c1=true, _9jf(4.10.4):C4914053=true, _lbk(4.10.4):c44750=true, _lbr(4.10.4):c785=true, _kly(4.10.4):c532248=true, _jy7(4.10.4):C4618409=true, _gq2(4.10.4):c493901=true, _lah(4.10.4):c4800=true, _l93(4.10.4):c4111=true, _lbg(4.10.4):c512=true, _l6a(4.10.4):c6992=true, _lcz(4.10.4):c644=true, _la8(4.10.4):c3262=true, _lc7(4.10.4):c640=true, _ded(4.10.4):C3404992=true}
TMP 0 [2016-08-09T14:48:51.696Z; main]: eligible=[_ded(4.10.4):C3404992, _jy7(4.10.4):C4618409, _9jf(4.10.4):C4914053, _kly(4.10.4):c532248, _l8h(4.10.4):c474784, _gq2(4.10.4):c493901, _a2w(4.10.4):c68285, _lbk(4.10.4):c44750, _l7y(4.10.4):c6492, _l6a(4.10.4):c6992, _l7e(4.10.4):c6174, _lah(4.10.4):c4800, _lco(4.10.4):c4579, _lc4(4.10.4):c4382, _lcg(4.10.4):c3754, _l93(4.10.4):c4111, _la8(4.10.4):c3262, _lbv(4.10.4):c2875, _lbr(4.10.4):c785, _lbg(4.10.4):c512, _lc7(4.10.4):c640, _lc0(4.10.4):c710, _lcz(4.10.4):c644, _lcy(4.10.4):c1, _ld0(5.5.0):c1, _ld1(5.5.0):c1]
TMP 0 [2016-08-09T14:48:51.696Z; main]: forceMergeRunning=false
TMP 0 [2016-08-09T14:48:51.697Z; main]: add final merge=_ded(4.10.4):C3404992 _jy7(4.10.4):C4618409 _9jf(4.10.4):C4914053 _kly(4.10.4):c532248 _l8h(4.10.4):c474784 _gq2(4.10.4):c493901 _a2w(4.10.4):c68285 _lbk(4.10.4):c44750 _l7y(4.10.4):c6492 _l6a(4.10.4):c6992 _l7e(4.10.4):c6174 _lah(4.10.4):c4800 _lco(4.10.4):c4579 _lc4(4.10.4):c4382 _lcg(4.10.4):c3754 _l93(4.10.4):c4111 _la8(4.10.4):c3262 _lbv(4.10.4):c2875 _lbr(4.10.4):c785 _lbg(4.10.4):c512 _lc7(4.10.4):c640 _lc0(4.10.4):c710 _lcz(4.10.4):c644 _lcy(4.10.4):c1 _ld0(5.5.0):c1 _ld1(5.5.0):c1
IW 0 [2016-08-09T14:48:51.697Z; main]: add merge to pendingMerges: _ded(4.10.4):C3404992 _jy7(4.10.4):C4618409 _9jf(4.10.4):C4914053 _kly(4.10.4):c532248 _l8h(4.10.4):c474784 _gq2(4.10.4):c493901 _a2w(4.10.4):c68285 _lbk(4.10.4):c44750 _l7y(4.10.4):c6492 _l6a(4.10.4):c6992 _l7e(4.10.4):c6174 _lah(4.10.4):c4800 _lco(4.10.4):c4579 _lc4(4.10.4):c4382 _lcg(4.10.4):c3754 _l93(4.10.4):c4111 _la8(4.10.4):c3262 _lbv(4.10.4):c2875 _lbr(4.10.4):c785 _lbg(4.10.4):c512 _lc7(4.10.4):c640 _lc0(4.10.4):c710 _lcz(4.10.4):c644 _lcy(4.10.4):c1 _ld0(5.5.0):c1 _ld1(5.5.0):c1 [total 1 pending]
IW 0 [2016-08-09T14:48:51.697Z; main]: registerMerge merging= []
IW 0 [2016-08-09T14:48:51.697Z; main]: registerMerge info=_ded(4.10.4):C3404992
IW 0 [2016-08-09T14:48:51.697Z; main]: registerMerge info=_jy7(4.10.4):C4618409
IW 0 [2016-08-09T14:48:51.697Z; main]: registerMerge info=_9jf(4.10.4):C4914053
IW 0 [2016-08-09T14:48:51.697Z; main]: registerMerge info=_kly(4.10.4):c532248
IW 0 [2016-08-09T14:48:51.697Z; main]: registerMerge info=_l8h(4.10.4):c474784
IW 0 [2016-08-09T14:48:51.697Z; main]: registerMerge info=_gq2(4.10.4):c493901
IW 0 [2016-08-09T14:48:51.697Z; main]: registerMerge info=_a2w(4.10.4):c68285
IW 0 [2016-08-09T14:48:51.697Z; main]: registerMerge info=_lbk(4.10.4):c44750
IW 0 [2016-08-09T14:48:51.697Z; main]: registerMerge info=_l7y(4.10.4):c6492
IW 0 [2016-08-09T14:48:51.697Z; main]: registerMerge info=_l6a(4.10.4):c6992
IW 0 [2016-08-09T14:48:51.697Z; main]: registerMerge info=_l7e(4.10.4):c6174
IW 0 [2016-08-09T14:48:51.697Z; main]: registerMerge info=_lah(4.10.4):c4800
IW 0 [2016-08-09T14:48:51.697Z; main]: registerMerge info=_lco(4.10.4):c4579
IW 0 [2016-08-09T14:48:51.697Z; main]: registerMerge info=_lc4(4.10.4):c4382
IW 0 [2016-08-09T14:48:51.697Z; main]: registerMerge info=_lcg(4.10.4):c3754
IW 0 [2016-08-09T14:48:51.697Z; main]: registerMerge info=_l93(4.10.4):c4111
IW 0 [2016-08-09T14:48:51.697Z; main]: registerMerge info=_la8(4.10.4):c3262
IW 0 [2016-08-09T14:48:51.698Z; main]: registerMerge info=_lbv(4.10.4):c2875
IW 0 [2016-08-09T14:48:51.698Z; main]: registerMerge info=_lbr(4.10.4):c785
IW 0 [2016-08-09T14:48:51.698Z; main]: registerMerge info=_lbg(4.10.4):c512
IW 0 [2016-08-09T14:48:51.698Z; main]: registerMerge info=_lc7(4.10.4):c640
IW 0 [2016-08-09T14:48:51.698Z; main]: registerMerge info=_lc0(4.10.4):c710
IW 0 [2016-08-09T14:48:51.698Z; main]: registerMerge info=_lcz(4.10.4):c644
IW 0 [2016-08-09T14:48:51.698Z; main]: registerMerge info=_lcy(4.10.4):c1
IW 0 [2016-08-09T14:48:51.698Z; main]: registerMerge info=_ld0(5.5.0):c1
IW 0 [2016-08-09T14:48:51.698Z; main]: registerMerge info=_ld1(5.5.0):c1
MS 0 [2016-08-09T14:48:51.7Z; main]: initDynamicDefaults spins=true maxThreadCount=1 maxMergeCount=6
MS 0 [2016-08-09T14:48:51.7Z; main]: now merge
MS 0 [2016-08-09T14:48:51.7Z; main]:   index: _9jf(4.10.4):C4914053 _ded(4.10.4):C3404992 _a2w(4.10.4):c68285 _jy7(4.10.4):C4618409 _gq2(4.10.4):c493901 _l8h(4.10.4):c474784 _kly(4.10.4):c532248 _l6a(4.10.4):c6992 _lbk(4.10.4):c44750 _l7e(4.10.4):c6174 _l7y(4.10.4):c6492 _lah(4.10.4):c4800 _l93(4.10.4):c4111 _la8(4.10.4):c3262 _lbv(4.10.4):c2875 _lc4(4.10.4):c4382 _lcg(4.10.4):c3754 _lbg(4.10.4):c512 _lbr(4.10.4):c785 _lco(4.10.4):c4579 _lc0(4.10.4):c710 _lcz(4.10.4):c644 _lc7(4.10.4):c640 _lcy(4.10.4):c1 _ld0(5.5.0):c1 _ld1(5.5.0):c1
MS 0 [2016-08-09T14:48:51.701Z; main]: io throttle: no merge backlog; decrease IO rate to 18.2 MB/sec
MS 0 [2016-08-09T14:48:51.701Z; main]:   consider merge _ded(4.10.4):C3404992 _jy7(4.10.4):C4618409 _9jf(4.10.4):C4914053 _kly(4.10.4):c532248 _l8h(4.10.4):c474784 _gq2(4.10.4):c493901 _a2w(4.10.4):c68285 _lbk(4.10.4):c44750 _l7y(4.10.4):c6492 _l6a(4.10.4):c6992 _l7e(4.10.4):c6174 _lah(4.10.4):c4800 _lco(4.10.4):c4579 _lc4(4.10.4):c4382 _lcg(4.10.4):c3754 _l93(4.10.4):c4111 _la8(4.10.4):c3262 _lbv(4.10.4):c2875 _lbr(4.10.4):c785 _lbg(4.10.4):c512 _lc7(4.10.4):c640 _lc0(4.10.4):c710 _lcz(4.10.4):c644 _lcy(4.10.4):c1 _ld0(5.5.0):c1 _ld1(5.5.0):c1
MS 0 [2016-08-09T14:48:51.702Z; main]:     launch new thread [Lucene Merge Thread #0]
MS 0 [2016-08-09T14:48:51.702Z; Lucene Merge Thread #0]:   merge thread: start
IW 0 [2016-08-09T14:48:51.702Z; Lucene Merge Thread #0]: now apply deletes for 26 merging segments
BD 0 [2016-08-09T14:48:51.702Z; Lucene Merge Thread #0]: applyDeletes: open segment readers took 0 msec
BD 0 [2016-08-09T14:48:51.702Z; Lucene Merge Thread #0]: applyDeletes: no segments; skipping
MS 0 [2016-08-09T14:48:51.702Z; main]: updateMergeThreads ioThrottle=true targetMBPerSec=18.2 MB/sec
merge thread Lucene Merge Thread #0 estSize=39965.2 MB (written=0.0 MB) runTime=0.0s (stopped=0.0s, paused=0.0s) rate=unlimited
  leave running at Infinity MB/sec
BD 0 [2016-08-09T14:48:51.704Z; Lucene Merge Thread #0]: prune sis=segments_2y: _9jf(4.10.4):C4914053 _ded(4.10.4):C3404992 _a2w(4.10.4):c68285 _jy7(4.10.4):C4618409 _gq2(4.10.4):c493901 _l8h(4.10.4):c474784 _kly(4.10.4):c532248 _l6a(4.10.4):c6992 _lbk(4.10.4):c44750 _l7e(4.10.4):c6174 _l7y(4.10.4):c6492 _lah(4.10.4):c4800 _l93(4.10.4):c4111 _la8(4.10.4):c3262 _lbv(4.10.4):c2875 _lc4(4.10.4):c4382 _lcg(4.10.4):c3754 _lbg(4.10.4):c512 _lbr(4.10.4):c785 _lco(4.10.4):c4579 _lc0(4.10.4):c710 _lcz(4.10.4):c644 _lc7(4.10.4):c640 _lcy(4.10.4):c1 _ld0(5.5.0):c1 _ld1(5.5.0):c1 minGen=0 packetCount=0
IW 0 [2016-08-09T14:48:51.704Z; Lucene Merge Thread #0]: merge seg=_ld2 _ded(4.10.4):C3404992 _jy7(4.10.4):C4618409 _9jf(4.10.4):C4914053 _kly(4.10.4):c532248 _l8h(4.10.4):c474784 _gq2(4.10.4):c493901 _a2w(4.10.4):c68285 _lbk(4.10.4):c44750 _l7y(4.10.4):c6492 _l6a(4.10.4):c6992 _l7e(4.10.4):c6174 _lah(4.10.4):c4800 _lco(4.10.4):c4579 _lc4(4.10.4):c4382 _lcg(4.10.4):c3754 _l93(4.10.4):c4111 _la8(4.10.4):c3262 _lbv(4.10.4):c2875 _lbr(4.10.4):c785 _lbg(4.10.4):c512 _lc7(4.10.4):c640 _lc0(4.10.4):c710 _lcz(4.10.4):c644 _lcy(4.10.4):c1 _ld0(5.5.0):c1 _ld1(5.5.0):c1
IW 0 [2016-08-09T14:48:51.704Z; Lucene Merge Thread #0]: now merge
  merge=_ded(4.10.4):C3404992 _jy7(4.10.4):C4618409 _9jf(4.10.4):C4914053 _kly(4.10.4):c532248 _l8h(4.10.4):c474784 _gq2(4.10.4):c493901 _a2w(4.10.4):c68285 _lbk(4.10.4):c44750 _l7y(4.10.4):c6492 _l6a(4.10.4):c6992 _l7e(4.10.4):c6174 _lah(4.10.4):c4800 _lco(4.10.4):c4579 _lc4(4.10.4):c4382 _lcg(4.10.4):c3754 _l93(4.10.4):c4111 _la8(4.10.4):c3262 _lbv(4.10.4):c2875 _lbr(4.10.4):c785 _lbg(4.10.4):c512 _lc7(4.10.4):c640 _lc0(4.10.4):c710 _lcz(4.10.4):c644 _lcy(4.10.4):c1 _ld0(5.5.0):c1 _ld1(5.5.0):c1
  index=_9jf(4.10.4):C4914053 _ded(4.10.4):C3404992 _a2w(4.10.4):c68285 _jy7(4.10.4):C4618409 _gq2(4.10.4):c493901 _l8h(4.10.4):c474784 _kly(4.10.4):c532248 _l6a(4.10.4):c6992 _lbk(4.10.4):c44750 _l7e(4.10.4):c6174 _l7y(4.10.4):c6492 _lah(4.10.4):c4800 _l93(4.10.4):c4111 _la8(4.10.4):c3262 _lbv(4.10.4):c2875 _lc4(4.10.4):c4382 _lcg(4.10.4):c3754 _lbg(4.10.4):c512 _lbr(4.10.4):c785 _lco(4.10.4):c4579 _lc0(4.10.4):c710 _lcz(4.10.4):c644 _lc7(4.10.4):c640 _lcy(4.10.4):c1 _ld0(5.5.0):c1 _ld1(5.5.0):c1
MS 0 [2016-08-09T14:48:51.704Z; main]:   no more merges pending; now return
IW 0 [2016-08-09T14:48:51.704Z; Lucene Merge Thread #0]: merging _ded(4.10.4):C3404992 _jy7(4.10.4):C4618409 _9jf(4.10.4):C4914053 _kly(4.10.4):c532248 _l8h(4.10.4):c474784 _gq2(4.10.4):c493901 _a2w(4.10.4):c68285 _lbk(4.10.4):c44750 _l7y(4.10.4):c6492 _l6a(4.10.4):c6992 _l7e(4.10.4):c6174 _lah(4.10.4):c4800 _lco(4.10.4):c4579 _lc4(4.10.4):c4382 _lcg(4.10.4):c3754 _l93(4.10.4):c4111 _la8(4.10.4):c3262 _lbv(4.10.4):c2875 _lbr(4.10.4):c785 _lbg(4.10.4):c512 _lc7(4.10.4):c640 _lc0(4.10.4):c710 _lcz(4.10.4):c644 _lcy(4.10.4):c1 _ld0(5.5.0):c1 _ld1(5.5.0):c1
IW 0 [2016-08-09T14:48:52.979Z; Lucene Merge Thread #0]: seg=_ded(4.10.4):C3404992 no deletes
IW 0 [2016-08-09T14:48:53.225Z; Lucene Merge Thread #0]: seg=_jy7(4.10.4):C4618409 no deletes
IW 0 [2016-08-09T14:48:53.499Z; Lucene Merge Thread #0]: seg=_9jf(4.10.4):C4914053 no deletes
IW 0 [2016-08-09T14:48:53.548Z; Lucene Merge Thread #0]: seg=_kly(4.10.4):c532248 no deletes
IW 0 [2016-08-09T14:48:53.591Z; Lucene Merge Thread #0]: seg=_l8h(4.10.4):c474784 no deletes
IW 0 [2016-08-09T14:48:53.632Z; Lucene Merge Thread #0]: seg=_gq2(4.10.4):c493901 no deletes
IW 0 [2016-08-09T14:48:54.089Z; Lucene Merge Thread #0]: seg=_a2w(4.10.4):c68285 no deletes
IW 0 [2016-08-09T14:48:54.14Z; Lucene Merge Thread #0]: seg=_lbk(4.10.4):c44750 no deletes
IW 0 [2016-08-09T14:48:54.19Z; Lucene Merge Thread #0]: seg=_l7y(4.10.4):c6492 no deletes
IW 0 [2016-08-09T14:48:54.237Z; Lucene Merge Thread #0]: seg=_l6a(4.10.4):c6992 no deletes
IW 0 [2016-08-09T14:48:54.271Z; Lucene Merge Thread #0]: seg=_l7e(4.10.4):c6174 no deletes
IW 0 [2016-08-09T14:48:54.309Z; Lucene Merge Thread #0]: seg=_lah(4.10.4):c4800 no deletes
IW 0 [2016-08-09T14:48:54.345Z; Lucene Merge Thread #0]: seg=_lco(4.10.4):c4579 no deletes
IW 0 [2016-08-09T14:48:54.377Z; Lucene Merge Thread #0]: seg=_lc4(4.10.4):c4382 no deletes
IW 0 [2016-08-09T14:48:54.412Z; Lucene Merge Thread #0]: seg=_lcg(4.10.4):c3754 no deletes
IW 0 [2016-08-09T14:48:54.451Z; Lucene Merge Thread #0]: seg=_l93(4.10.4):c4111 no deletes
IW 0 [2016-08-09T14:48:54.48Z; Lucene Merge Thread #0]: seg=_la8(4.10.4):c3262 no deletes
IW 0 [2016-08-09T14:48:54.513Z; Lucene Merge Thread #0]: seg=_lbv(4.10.4):c2875 no deletes
IW 0 [2016-08-09T14:48:54.528Z; Lucene Merge Thread #0]: seg=_lbr(4.10.4):c785 no deletes
IW 0 [2016-08-09T14:48:54.546Z; Lucene Merge Thread #0]: seg=_lbg(4.10.4):c512 no deletes
IW 0 [2016-08-09T14:48:54.561Z; Lucene Merge Thread #0]: seg=_lc7(4.10.4):c640 no deletes
IW 0 [2016-08-09T14:48:54.568Z; Lucene Merge Thread #0]: seg=_lc0(4.10.4):c710 no deletes
IW 0 [2016-08-09T14:48:54.585Z; Lucene Merge Thread #0]: seg=_lcz(4.10.4):c644 no deletes
IW 0 [2016-08-09T14:48:54.588Z; Lucene Merge Thread #0]: seg=_lcy(4.10.4):c1 no deletes
IW 0 [2016-08-09T14:48:54.603Z; Lucene Merge Thread #0]: seg=_ld0(5.5.0):c1 no deletes
IW 0 [2016-08-09T14:48:54.604Z; Lucene Merge Thread #0]: seg=_ld1(5.5.0):c1 no deletes
SM 0 [2016-08-09T14:48:55.002Z; Lucene Merge Thread #0]: merge store matchedCount=26 vs 26
SM 0 [2016-08-09T14:50:37.047Z; Lucene Merge Thread #0]: 102077 msec to merge stored fields [14602137 docs]
SM 0 [2016-08-09T14:55:51.456Z; Lucene Merge Thread #0]: 314406 msec to merge postings [14602137 docs]
SM 0 [2016-08-10T12:20:20.449Z; Lucene Merge Thread #0]: 77068993 msec to merge doc values [14602137 docs]
SM 0 [2016-08-10T12:20:20.559Z; Lucene Merge Thread #0]: 110 msec to write field infos [14602137 docs]
IW 0 [2016-08-10T12:20:20.56Z; Lucene Merge Thread #0]: merge codec=Lucene54 maxDoc=14602137; merged segment has no vectors; no norms; docValues; prox; freqs; 77485.8 sec (0.0 sec stopped, 0.0 sec paused) to merge segment [11572.12 MB, 0.15 MB/sec]
IW 0 [2016-08-10T12:20:20.563Z; Lucene Merge Thread #0]: merged segment size=11572.117 MB vs estimate=39965.235 MB
IW 0 [2016-08-10T12:20:20.563Z; Lucene Merge Thread #0]: commitMerge: _ded(4.10.4):C3404992 _jy7(4.10.4):C4618409 _9jf(4.10.4):C4914053 _kly(4.10.4):c532248 _l8h(4.10.4):c474784 _gq2(4.10.4):c493901 _a2w(4.10.4):c68285 _lbk(4.10.4):c44750 _l7y(4.10.4):c6492 _l6a(4.10.4):c6992 _l7e(4.10.4):c6174 _lah(4.10.4):c4800 _lco(4.10.4):c4579 _lc4(4.10.4):c4382 _lcg(4.10.4):c3754 _l93(4.10.4):c4111 _la8(4.10.4):c3262 _lbv(4.10.4):c2875 _lbr(4.10.4):c785 _lbg(4.10.4):c512 _lc7(4.10.4):c640 _lc0(4.10.4):c710 _lcz(4.10.4):c644 _lcy(4.10.4):c1 _ld0(5.5.0):c1 _ld1(5.5.0):c1 index=_9jf(4.10.4):C4914053 _ded(4.10.4):C3404992 _a2w(4.10.4):c68285 _jy7(4.10.4):C4618409 _gq2(4.10.4):c493901 _l8h(4.10.4):c474784 _kly(4.10.4):c532248 _l6a(4.10.4):c6992 _lbk(4.10.4):c44750 _l7e(4.10.4):c6174 _l7y(4.10.4):c6492 _lah(4.10.4):c4800 _l93(4.10.4):c4111 _la8(4.10.4):c3262 _lbv(4.10.4):c2875 _lc4(4.10.4):c4382 _lcg(4.10.4):c3754 _lbg(4.10.4):c512 _lbr(4.10.4):c785 _lco(4.10.4):c4579 _lc0(4.10.4):c710 _lcz(4.10.4):c644 _lc7(4.10.4):c640 _lcy(4.10.4):c1 _ld0(5.5.0):c1 _ld1(5.5.0):c1
IW 0 [2016-08-10T12:20:20.563Z; Lucene Merge Thread #0]: commitMergeDeletes _ded(4.10.4):C3404992 _jy7(4.10.4):C4618409 _9jf(4.10.4):C4914053 _kly(4.10.4):c532248 _l8h(4.10.4):c474784 _gq2(4.10.4):c493901 _a2w(4.10.4):c68285 _lbk(4.10.4):c44750 _l7y(4.10.4):c6492 _l6a(4.10.4):c6992 _l7e(4.10.4):c6174 _lah(4.10.4):c4800 _lco(4.10.4):c4579 _lc4(4.10.4):c4382 _lcg(4.10.4):c3754 _l93(4.10.4):c4111 _la8(4.10.4):c3262 _lbv(4.10.4):c2875 _lbr(4.10.4):c785 _lbg(4.10.4):c512 _lc7(4.10.4):c640 _lc0(4.10.4):c710 _lcz(4.10.4):c644 _lcy(4.10.4):c1 _ld0(5.5.0):c1 _ld1(5.5.0):c1
IW 0 [2016-08-10T12:20:20.566Z; Lucene Merge Thread #0]: no new deletes or field updates since merge started
IFD 0 [2016-08-10T12:20:20.689Z; Lucene Merge Thread #0]: now checkpoint "_ld2(5.5.0):C14602137" [1 segments ; isCommit = false]
IFD 0 [2016-08-10T12:20:20.689Z; Lucene Merge Thread #0]: 0 msec to checkpoint
IW 0 [2016-08-10T12:20:20.689Z; Lucene Merge Thread #0]: after commitMerge: _ld2(5.5.0):C14602137
TMP 0 [2016-08-10T12:20:20.689Z; Lucene Merge Thread #0]: findForcedMerges maxSegmentCount=1 infos=_ld2(5.5.0):C14602137 segmentsToMerge={_a2w(4.10.4):c68285=true, _l8h(4.10.4):c474784=true, _lcy(4.10.4):c1=true, _lc4(4.10.4):c4382=true, _l7e(4.10.4):c6174=true, _l7y(4.10.4):c6492=true, _lcg(4.10.4):c3754=true, _lco(4.10.4):c4579=true, _ld1(5.5.0):c1=true, _lc0(4.10.4):c710=true, _lbv(4.10.4):c2875=true, _ld2(5.5.0):C14602137=false, _ld0(5.5.0):c1=true, _9jf(4.10.4):C4914053=true, _lbk(4.10.4):c44750=true, _lbr(4.10.4):c785=true, _kly(4.10.4):c532248=true, _jy7(4.10.4):C4618409=true, _gq2(4.10.4):c493901=true, _lah(4.10.4):c4800=true, _l93(4.10.4):c4111=true, _lbg(4.10.4):c512=true, _l6a(4.10.4):c6992=true, _lcz(4.10.4):c644=true, _la8(4.10.4):c3262=true, _lc7(4.10.4):c640=true, _ded(4.10.4):C3404992=true}
TMP 0 [2016-08-10T12:20:20.689Z; Lucene Merge Thread #0]: already merged
IW 0 [2016-08-10T12:20:20.689Z; Lucene Merge Thread #0]: merge time 77488987 msec for 14602137 docs
MS 0 [2016-08-10T12:20:20.689Z; Lucene Merge Thread #0]:   merge thread: done
MS 0 [2016-08-10T12:20:20.689Z; Lucene Merge Thread #0]: now merge
MS 0 [2016-08-10T12:20:20.691Z; Lucene Merge Thread #0]:   index: _ld2(5.5.0):C14602137
MS 0 [2016-08-10T12:20:20.691Z; Lucene Merge Thread #0]:   no more merges pending; now return
MS 0 [2016-08-10T12:20:20.692Z; Lucene Merge Thread #0]: updateMergeThreads ioThrottle=true targetMBPerSec=18.2 MB/sec

Hi @dcausse is there anything else we need to do?

I took a look over the github issue and added a comment.I see the mention above that the 2016-07-16 index allows reproducing the issue. I suspect that is related to T141384, based on the question mike posted asking about sparse fields. Unfortunately today is the 16th, so that index is no longer on the server. Looking at the following day though we see that the restbase type has 33k fields. Many of these fields only show up in a single document, and would be incredibly sparse. Still a guess at this point, but seems plausible.

I have a copy of the shard on disk. But yes having many fields with a single value can explain what happens during a force merge.

I sent the NDA to elasticsearch for signing. Since we have a likely explanation of the issue, it is not as crucial to send that shard to elasticsearch.

We still want to:

  1. put in place some monitoring / warning so that we know upfront if an application starts misbehaving again
  2. find a way to cleanup old indices so that future restart of this elasticsearch cluster succeed

@dcausse, @EBernhardson: It does not look like we can remove those fields after the fact. It looks to me like our options are either delete the problematic indices or dump them, clean them and re-import. Is there a cleaner solution that I'm overlooking?

@Gehel correct that we can't remove fields after the fact, dump and reload is basically the only option elasticsearch provides.

From the response to our github issue though, i wonder if the merge itself will fix the issue? Mike said it's related to merging segments created by the lucene version used in 1.7.x, the newer lucene used in 2.3 should be compacting the doc values and avoiding the issue.

Restarts during the 2.3.4 upgrade also hang. I did not dig deep enough, but I think we do have crons to do those merges regularly, so they *should* have been compacted by now. I'll dig a bit deeper.

Checking the different segments, all have been upgraded to lucene 5.5.0 except the grafana-dashboards and kibana-int indices. I will force a merge on them, but they are not the problematic indices.

Mentioned in SAL [2016-08-24T13:53:17Z] <gehel> rolling restart of logstash nodes to validate fix to T142357

Number of fields per index is high for indices created before August 1st:

logstash-2016.07.25 : 106796
logstash-2016.07.26 : 167490
logstash-2016.07.27 : 25703
logstash-2016.07.28 : 28575
logstash-2016.07.29 : 49858
logstash-2016.07.30 : 25617
logstash-2016.07.31 : 20799
logstash-2016.08.01 : 2097
logstash-2016.08.02 : 1032

(Note that the numbers above are a rough estimate, not accounting for the fact that fields with the same name in 2 different types are stored as a single field).

Since we keep only 31 days of data, those broken indices will disappear on their own in ~7 days. Unless there is an objection, I will proactively delete them. It does not make sense to invest time in cleaning them.

Mentioned in SAL [2016-08-24T17:27:26Z] <gehel> deleting logstash indices from before august 1st (T142357)

Mentioned in SAL [2016-08-29T14:06:28Z] <gehel> restart elasticsearch on logstash1004 to validate fix for T142357

Restart of elasticsearch on logstash1004 while a force merge is in progress took ~ 5 minutes. This is a reasonable time, closing this issue.