We had an ES Cluster outage on 19th February 2024.
This was initially caused by a GKE node restart within our GKE maintenance window.
On the 19th we were optimistic that this may fully recover without us intervening. On the 20th we decided the service may not recover without intervention.
We believe the initial issue was caused by all three k8s nodes being restarted only 1 hour apart.
Given that it takes much longer for this for an ElasticSearch node to recover by the time all three nodes had restarted we had an elasticsearch cluster that required starting from totally cold.
We increased the limits for the master and data nodes and reverted the default start-up probe: https://github.com/wmde/wbaas-deploy/pull/1436
We also doubled the memory available for master nodes: https://github.com/wmde/wbaas-deploy/commit/527a5981a64d7b2d4f89005bfe0b3bb2dcbbcc2e
We then returned to the previous custom startup probe https://github.com/wmde/wbaas-deploy/commit/8a69c2fb6eb3b0eafd4b7dd198581fc2c38db1e7