
etcd cluster reimage strategies to use with the K8s upgrade cookbook
Open, Medium, Public

Description

In the sre.k8s.upgrade-cluster cookbook there is the possibility of reimaging the etcd cluster (of a given k8s cluster) to a target OS, which in the case of 1.23 is Bullseye. We can reimage one node at a time, so the strategy that I tried first was:

  1. Stop etcd on all nodes and disable puppet.
  2. Reimage nodes one at a time.

The idea was basically to bootstrap a cluster from scratch, and I naively thought that it would work. One thing that I discovered is that the SRV record environment variables in /etc/default/etcd play a role, and when the first etcd node boots for the first time it gets very upset when it cannot find any of the other members alive and reachable for a leader election. Setting the environment variable that tags the cluster as new and commenting out all the SRV/Discovery related ones seems to work, namely the node bootstraps itself, but thinking it is a single-node cluster. The bootstrap of the rest of the nodes is not very clean either, requiring some manual restarts and hacks.
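A minimal sketch of what that /etc/default/etcd tweak looks like, assuming the stock etcd environment variable names (ETCD_INITIAL_CLUSTER_STATE, ETCD_DISCOVERY_SRV); the file Puppet actually manages may use different variables:

  # /etc/default/etcd (sketch, placeholder domain)
  # Tag the cluster as new so the first node can bootstrap on its own:
  ETCD_INITIAL_CLUSTER_STATE=new
  # Comment out the SRV/Discovery variables so etcd does not look for the
  # other (not yet reimaged) members:
  #ETCD_DISCOVERY_SRV=etcd.example.org

So I thought to try the following procedure instead: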

  1. Stop etcd on one node (while the rest is up).
  2. Reimage the node.
  3. Wait for the cluster to be healthy (see the health-check sketch after this list).
  4. Go back to step 1 with another node.
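For step 3, a quick way to check is etcdctl against the surviving members; a minimal sketch assuming the etcd v3 API, with placeholder hostnames and TLS/auth flags omitted:

  # Run against the members that are still up (hostnames are placeholders):
  etcdctl --endpoints=https://etcd1001.example.org:2379,https://etcd1002.example.org:2379 endpoint health
  etcdctl --endpoints=https://etcd1001.example.org:2379 member list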

The procedure seemed sound, but I didn't know that another problem would arise: the reimaged node didn't bootstrap correctly, since its Raft log's last commit/ID didn't match the one provided by the rest of the nodes. After reading some guides I discovered that simply removing and re-adding the member from/to the cluster is sufficient to let it bootstrap with a brand new Raft log (which then gets synced with the one provided by the rest of the cluster). For the moment adding/removing nodes is not well supported by Spicerack IIRC, but maybe we could add some support.
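In etcdctl (v3 API) terms the remove/re-add dance looks roughly like this; the member ID, names, and hostnames below are placeholders and TLS/auth flags are omitted:

  # On a healthy member, for the node being reimaged:
  etcdctl member list                                # note the ID of the old member
  etcdctl member remove 8e9e05c52164694d             # drop its stale Raft identity
  etcdctl member add etcd1001 --peer-urls=https://etcd1001.example.org:2380
  # The reimaged node must then start with initial-cluster-state=existing,
  # so it joins the running cluster and syncs a fresh Raft log from the leader.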

Does anybody else have a different experience with etcd? Is there another procedure to do it safely? We should probably add something to https://wikitech.wikimedia.org/wiki/Etcd at the end of the task.

Event Timeline


The experience above matches my own when I have to add/remove nodes from the cluster. IIRC, though, when reimaging the whole cluster, and given we don't care at all about preserving its content, reimaging all 3 nodes together works fine. We should probably retest my memory when re-imaging eqiad.


Yeah, that kind of works (I did that in staging and wikikube-codfw), but it also led to at least one "first puppet run failed" error during the reimage, as the first node to become ready will most likely fail. It's all about timing in that case, I suppose.

@JMeybohm did you have to set the cluster's state to new by any chance? I had to for the ml-etcd2* nodes, otherwise their etcd daemons didn't want to bootstrap.

Edit: yes, I see profile::etcd::v3::cluster_bootstrap: true set in various clusters. I'll add it to the docs: it needs to be set, otherwise a cluster configured with the existing state will not bootstrap.

I think that clusters shouldn't set this flag unless they are being bootstrapped, so we should fix Puppet at some point.

To summarize, I think that we have two options:

  • Add support to Spicerack for adding/removing members of an etcd ensemble, to be executed right before the reimage (in theory it should work fine). This would allow us to reimage one node at a time without too many issues, preserving the state of the cluster (see the sketch after this list).
  • Wait for a multi/parallel reimage capability in cookbooks, and always reimage the cluster by hitting all nodes at the same time. This would work and it would be quick, but it wouldn't preserve data.
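For option one, here is a hypothetical sketch of what such a Spicerack helper could look like. The function names are made up, the use of the remote API (run_sync yielding (hosts, output) tuples) is an assumption, and etcdctl TLS/auth flags are omitted:

  # Hypothetical sketch only, not an actual Spicerack API proposal.
  from spicerack.remote import RemoteHosts


  def etcd_member_id(healthy_member: RemoteHosts, member_name: str) -> str:
      """Look up the etcd member ID for a member name via 'etcdctl member list'."""
      for _hosts, output in healthy_member.run_sync("etcdctl member list"):
          for line in output.message().decode().splitlines():
              member_id, _status, name, *_rest = [field.strip() for field in line.split(",")]
              if name == member_name:
                  return member_id
      raise RuntimeError(f"member {member_name} not found in the etcd cluster")


  def replace_etcd_member(healthy_member: RemoteHosts, member_name: str, peer_url: str) -> None:
      """Remove and re-add a member so the reimaged node bootstraps a fresh Raft log."""
      member_id = etcd_member_id(healthy_member, member_name)
      healthy_member.run_sync(f"etcdctl member remove {member_id}")
      # The reimaged node then has to start with initial-cluster-state=existing
      # so it joins the running cluster and syncs the Raft log from the leader.
      healthy_member.run_sync(f"etcdctl member add {member_name} --peer-urls={peer_url}")

The cookbook would call something like replace_etcd_member() right before handing the node over to the reimage, as described in option one.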

I'd be in favor of option one since it would allow us to preserve data (if needed).


+1, this could maybe even be a separate cookbook for upgrading etcd in general (and we just call it from the k8s upgrade one), no?

Clement_Goubert moved this task from Incoming 🐫 to ⎈Kubernetes on the serviceops board.