While resolving T106165: Upgrade production to elasticsearch 1.7.1, we have come up with some cluster health and robustness questions.
Generally, this is related to the long-standing issue of ES upgrades taking a long time and the failure of fast-restart. The fast-restart failure seems possibly related to some general cluster cohesion issues, as noted in https://phabricator.wikimedia.org/T108180#1517896, and looks like a symptom of one or more deeper issues. In digging into this over the past week I have been bothering @dcausse daily :) and many of the questions raised are not new, since I see them in historical postmortems. These concerns seem to have fallen through the cracks a bit, so I am hoping to make tasks for them and link them there.
Related issues:
It's worth noting that this is all most likely an outcome of the success of the ES deployment here and the increased load on the cluster over time :) But search did have the worst uptime of any service we reported on last quarter using our external monitoring tool. We may be at a point where our failure modes are increasingly catastrophic, since the last few outages were resolved by restarting the entire cluster.