In the process of resolving {T106165} we have come up with some cluster health and robustness questions.
Generally, this relates to the long-standing problem of ES upgrades taking a long time and the failure of fast-restart. The fast-restart failure seems possibly related to some general cluster cohesion issues, as noted in https://phabricator.wikimedia.org/T108180#1517896. It looks like a symptom of one or more deeper issues. While digging into this over the past week I have been bothering @dcausse daily :) and many of the questions raised do not seem to be new, as I see them in historical [[ https://wikitech.wikimedia.org/wiki/Incident_documentation/20150615-Elasticsearch | postmortems ]]. These concerns seem to have fallen through the cracks a bit, so I am hoping to make tasks for them and link them there.
Related issues:
T76090
T102594
T90889