I was tracking this directly on GitHub, but after further investigation it looks like we are at risk from this issue.
(ref https://github.com/o19s/elasticsearch-learning-to-rank/issues/153)
Adding a reference on Phab so that we can discuss the priority of this task given the risk.
In light of the analysis posted on GitHub, ~~I see no other reasons except pure chance that we do not enter this deadlock on our production cluster.~~ Elasticsearch 5.5.3 is affected, but since we are running 5.5.2 we are only hit by a minor bug where expired entries are not evicted in time.
Running an Elasticsearch version affected by this bug (5.5.3+) could be catastrophic: all search threads would stop responding, blocking every service that uses the _search endpoints on the cluster (Cirrus, translation search, Phab and possibly others).