I was tracking this directly on github but after further investigations it looks like we are at risk regarding this issue.
(ref https://github.com/o19s/elasticsearch-learning-to-rank/issues/153)
Adding a reference to phab so that we can discuss the priority of this task in regard to the risk.
In light of the analysis posted on github I see no other reasons except pure chance that we do not enter this deadlock on our production cluster. elastic 5.5.3 is affected but since we are running 5.5.2 we're only hit by a minor bug where expired entries are not evicted in time.
Running an elastic version affected by this bug (5.5.3+) could be catastrophic since all search threads will stop responding leading to all services using the _search endpoints on the cluster to be blocked (Cirrus, translation search, Phab and possibly others).
Description
Description
Event Timeline
Comment Actions
Digging further we are not affected because we are running 5.5.2 and the code involved in the deadlock was backported to 5.5.3.
5.5.2 still suffers from a cache bug but not a deadlock.
(ref https://github.com/elastic/elasticsearch/pull/26516)
This issue is not as urgent as I originally thought but should be considered as a blocker for any upgrade to a newer versions of elasticsearch.
Comment Actions
Upstream opened a bug: https://github.com/elastic/elasticsearch/issues/30428 which has a pull request now attached, and the bug is tagged to be backported to 5.6.x. Proabably we will skip 5.6 and go straight to 6.x which will be backported as well.
Comment Actions
Resolved for our purposes in search, probably should add some documentation to plugin.