ES Cluster Outage after GKE node update
Closed, Resolved · Public

Authored By: Tarrow, Feb 20 2024, 11:25 AM

Description

We had an ES Cluster outage on 19th February 2024.

This was initially caused by a GKE node restart within our GKE maintenance window.
On the 19th we were optimistic that the cluster might fully recover without us intervening. On the 20th we decided that it would probably not recover without intervention.

We believe the initial issue was caused by all three k8s nodes being restarted only 1 hour apart.

Because an Elasticsearch node takes much longer than that to recover, by the time all three k8s nodes had restarted we had an Elasticsearch cluster that had to be started from completely cold.
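The timing argument above can be illustrated with a rough sketch. It assumes a deliberately simplified model in which every node needs the same fixed recovery time after its restart; the function name and the recovery times used here are hypothetical and were not measured during this incident. In a real cluster recovery also depends on the other nodes being available, so this only shows why the one-hour gap matters.

```python
def max_nodes_down(restart_times, recovery_minutes):
    """Return the largest number of nodes that are down at the same moment.

    restart_times: minutes at which each node is restarted.
    recovery_minutes: assumed (hypothetical) time a node needs to become healthy.
    """
    events = []
    for start in restart_times:
        events.append((start, 1))                      # node goes down
        events.append((start + recovery_minutes, -1))  # node is healthy again
    # Sort by time; at equal times, count recoveries (-1) before restarts (+1).
    events.sort(key=lambda e: (e[0], e[1]))

    down = worst = 0
    for _, delta in events:
        down += delta
        worst = max(worst, down)
    return worst


# Three k8s nodes restarted one hour apart, as in this incident.
restarts = [0, 60, 120]

# Hypothetical recovery times in minutes, for illustration only.
for recovery in (30, 90, 150):
    print(recovery, max_nodes_down(restarts, recovery))
```

In this model, any recovery time longer than two hours leaves all three nodes down at once, which corresponds to the cold start described above.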

We increased the limits for the master and data nodes and reverted to the default startup probe: https://github.com/wmde/wbaas-deploy/pull/1436

We also doubled the memory available for master nodes: https://github.com/wmde/wbaas-deploy/commit/527a5981a64d7b2d4f89005bfe0b3bb2dcbbcc2e

We then returned to the previous custom startup probe: https://github.com/wmde/wbaas-deploy/commit/8a69c2fb6eb3b0eafd4b7dd198581fc2c38db1e7

Event Timeline

Tarrow created this task. Feb 20 2024, 11:25 AM

Restricted Application added a subscriber: Aklapper. Feb 20 2024, 11:25 AM

GreenReaper subscribed. Feb 21 2024, 9:38 AM

Tarrow updated the task description. Mar 7 2024, 5:26 PM

We split Elasticsearch's master and data nodes into their own GKE node pools and configured those pools to use a blue-green upgrade strategy. That way when a GKE node upgrade runs, only one Elasticsearch node will be taken down at a time. Since our Elasticsearch shards have node redundancy, search should continue to operate normally even with a slightly degraded cluster.
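The redundancy argument in the comment above can be sketched as follows. This is a toy model rather than the actual cluster layout: the node names, shard names and allocation are hypothetical, and it only checks whether each shard still has at least one copy on a live node.

```python
def unavailable_shards(allocation, down_nodes):
    """Return the shards that have no copy left on a live node.

    allocation: maps a shard name to the set of nodes holding a copy
                (primary or replica). All names here are hypothetical.
    down_nodes: nodes that are currently unavailable.
    """
    down = set(down_nodes)
    return sorted(
        shard for shard, nodes in allocation.items() if set(nodes) <= down
    )


# Hypothetical layout: every shard has two copies spread over three data nodes.
allocation = {
    "shard-0": {"es-data-0", "es-data-1"},
    "shard-1": {"es-data-1", "es-data-2"},
    "shard-2": {"es-data-2", "es-data-0"},
}

# One node down at a time, as intended with the blue-green upgrade strategy.
print(unavailable_shards(allocation, ["es-data-1"]))

# All three nodes down at once, as happened during the outage.
print(unavailable_shards(allocation, ["es-data-0", "es-data-1", "es-data-2"]))
```

With a single node down every shard keeps a live copy, so search can keep serving. With all nodes down no shard is available and the cluster has to start cold.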

Andrew-WMDE removed Andrew-WMDE as the assignee of this task. Mar 11 2024, 10:00 AM

Andrew-WMDE moved this task from Doing to In Review on the Wikibase Cloud (Kanban board Q1 2024) board.

Andrew-WMDE moved this task from In Review to Done on the Wikibase Cloud (Kanban board Q1 2024) board.

Andrew-WMDE subscribed.

Tarrow closed this task as Resolved. Mar 28 2024, 11:50 AM

Tarrow claimed this task.
