In the process of resolving T106165: Upgrade production to elasticsearch 1.7.1, we have come up with some questions about cluster health and robustness.
Generally, this is related to the long-standing issue of ES upgrades taking a long time and the failure of fast-restart. The failure of fast-restart seems possibly related to some general cluster cohesion issues, as noted in https://phabricator.wikimedia.org/T108180#1517896, and looks like a symptom of one or more deeper issues. In digging into this over the past week I have been bothering @dcausse daily :) and it seems many of the questions raised are not new, since I see them in historical postmortems. These concerns seem to have fallen through the cracks a bit, so I am hoping to create tasks for them and link them here.
It's worth noting that this is all most likely an outcome of the success of the ES deployment here and the increased load on the cluster over time :) But search did have the worst uptime of any service we reported on last quarter using our external monitoring tool. We may be at a point where our failure modes are increasingly catastrophic, since the last few outages were resolved by restarting the entire cluster.