
Possible deadlock in the elastic cache used by the ltr plugin
Closed, Resolved · Public

Description

I was tracking this directly on GitHub, but after further investigation it looks like we are at risk from this issue.
(ref https://github.com/o19s/elasticsearch-learning-to-rank/issues/153)
Adding a reference in Phabricator so that we can discuss the priority of this task relative to the risk.
In light of the analysis posted on GitHub, I see no reason other than pure chance that we have not hit this deadlock on our production cluster. Elasticsearch 5.5.3 is affected, but since we are running 5.5.2 we are only hit by a minor bug where expired entries are not evicted in time.
Running an Elasticsearch version affected by this bug (5.5.3+) could be catastrophic: all search threads would stop responding, blocking every service that uses the _search endpoints on the cluster (Cirrus, translation search, Phab and possibly others).

Event Timeline

Restricted Application added a subscriber: Aklapper.

Digging further, we are not affected: we are running 5.5.2, and the code involved in the deadlock was backported to 5.5.3.
5.5.2 still suffers from a cache bug, but not a deadlock.
(ref https://github.com/elastic/elasticsearch/pull/26516)
This issue is not as urgent as I originally thought, but it should be considered a blocker for any upgrade to a newer version of Elasticsearch.
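
As a rough illustration of the upgrade blocker described above, a deployment check could refuse any target version at or above 5.5.3. This is only a sketch: the names `FIRST_AFFECTED`, `parse_version` and `blocks_upgrade` are hypothetical, and no upper bound is encoded because this thread does not say which release contains the fix.

```python
from typing import Tuple

# Assumption taken from this task: 5.5.3 is the first affected release.
# No fixed release is known from the thread, so there is no upper bound.
FIRST_AFFECTED: Tuple[int, int, int] = (5, 5, 3)


def parse_version(version: str) -> Tuple[int, int, int]:
    """Parse 'major.minor.patch', ignoring a suffix such as '-SNAPSHOT'."""
    core = version.split("-", 1)[0]
    major, minor, patch = (int(part) for part in core.split("."))
    return (major, minor, patch)


def blocks_upgrade(target_version: str) -> bool:
    """Return True if the target version falls in the range reported as affected."""
    return parse_version(target_version) >= FIRST_AFFECTED
```

With these assumptions, `blocks_upgrade("5.5.2")` is false and `blocks_upgrade("5.5.3")` is true; the check would need an upper bound once a fixed release is confirmed upstream.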

Upstream opened a bug, https://github.com/elastic/elasticsearch/issues/30428, which now has a pull request attached and is tagged for backport to 5.6.x. We will probably skip 5.6 and go straight to 6.x, which will receive the fix as well.

EBjune claimed this task.
EBjune triaged this task as Medium priority.
EBjune subscribed.

Resolved for our purposes in search; we should probably add some documentation to the plugin.

Vvjjkkii renamed this task from Possible deadlock in the elastic cache used by the ltr plugin to 2hdaaaaaaa. · Jul 1 2018, 1:11 AM
Vvjjkkii reopened this task as Open.
Vvjjkkii removed EBjune as the assignee of this task.
Vvjjkkii raised the priority of this task from Medium to High.
Vvjjkkii updated the task description.
Vvjjkkii removed a subscriber: Aklapper.
CommunityTechBot renamed this task from 2hdaaaaaaa to Possible deadlock in the elastic cache used by the ltr plugin. · Jul 2 2018, 3:20 PM
CommunityTechBot closed this task as Resolved.
CommunityTechBot assigned this task to EBjune.
CommunityTechBot lowered the priority of this task from High to Medium.
CommunityTechBot updated the task description.
CommunityTechBot added a subscriber: Aklapper.