
ES Cluster Outage after GKE node update
Closed, Resolved · Public

Description

We had an ES cluster outage on 19 February 2024.

This was initially caused by a GKE node restart within our GKE maintenance window.
On the 19th we were optimistic that the cluster might recover fully without us intervening. On the 20th we decided the service would not recover without intervention.

We believe the initial issue was caused by all three k8s nodes being restarted only 1 hour apart.

Since it takes much longer than that for an Elasticsearch node to recover, by the time all three nodes had restarted we were left with an Elasticsearch cluster that had to start completely cold.

We increased the resource limits for the master and data nodes and reverted to the default start-up probe: https://github.com/wmde/wbaas-deploy/pull/1436

We also doubled the memory available for master nodes: https://github.com/wmde/wbaas-deploy/commit/527a5981a64d7b2d4f89005bfe0b3bb2dcbbcc2e
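
For illustration, the two resource changes above might look roughly like the following values-style fragment. The key names and numbers here are assumptions for the sketch, not values copied from the linked PR or commit; the exact layout depends on the Helm chart used in wbaas-deploy.

```yaml
# Illustrative sketch only: raise CPU/memory limits for the data nodes and
# double the memory for the master nodes. Names and numbers are assumptions.
master:
  resources:
    requests:
      cpu: "500m"
      memory: "2Gi"   # e.g. previously 1Gi, doubled after the outage
    limits:
      cpu: "1"
      memory: "2Gi"
  # The JVM heap should stay well below the container memory limit,
  # typically around half of it.
  esJavaOpts: "-Xms1g -Xmx1g"

data:
  resources:
    requests:
      cpu: "1"
      memory: "4Gi"
    limits:
      cpu: "2"
      memory: "4Gi"
```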

We then returned to the previous custom startup probe: https://github.com/wmde/wbaas-deploy/commit/8a69c2fb6eb3b0eafd4b7dd198581fc2c38db1e7
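
A custom startup probe of this kind mainly gives Elasticsearch more time to come up before Kubernetes starts restarting the container. A minimal sketch of such a probe on the Elasticsearch container, with illustrative thresholds rather than the actual values from the linked commit, is:

```yaml
# Sketch of a startup probe that tolerates a slow (cold) Elasticsearch start.
# With failureThreshold x periodSeconds = 30 x 10s, the container gets up to
# five minutes to open port 9200 before Kubernetes considers startup failed.
startupProbe:
  tcpSocket:
    port: 9200
  initialDelaySeconds: 10
  periodSeconds: 10
  failureThreshold: 30
```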

Event Timeline

We split Elasticsearch's master and data nodes into their own GKE node pools and configured those pools to use a blue-green upgrade strategy. That way, when a GKE node upgrade runs, only one Elasticsearch node is taken down at a time. Since our Elasticsearch shards are replicated across nodes, search should continue to operate normally even with a slightly degraded cluster.
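
On the Kubernetes side, pinning the master and data pods to their dedicated pools can be done with node selectors on the standard GKE node pool label. The pool names below are hypothetical; the blue-green upgrade strategy itself is configured on the GKE node pools (for example via Terraform) and is not shown here.

```yaml
# Illustrative: schedule Elasticsearch master and data pods onto their own
# GKE node pools via the standard GKE node pool label.
master:
  nodeSelector:
    cloud.google.com/gke-nodepool: es-master-pool   # hypothetical pool name

data:
  nodeSelector:
    cloud.google.com/gke-nodepool: es-data-pool     # hypothetical pool name
```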

Tarrow claimed this task.