
etcd cluster reimage strategies to use with the K8s upgrade cookbook
Open, Medium, Public

Description

In the sre.k8s.upgrade-cluster cookbook there is the possibility of reimaging the etcd cluster (of a given k8s cluster) to a target OS, which in the case of 1.23 is Bullseye. We can reimage one node at a time, so the strategy that I tried first was:

  1. Stop etcd on all nodes and disable puppet.
  2. Reimage nodes one at a time.

The idea was basically to bootstrap a cluster from scratch, and I naively thought that it would work. One thing that I discovered is that the SRV record environment variables in /etc/default/etcd play a role, and when the first etcd node boots for the first time it gets very upset when it cannot find any of the other members alive and reachable for a leader election. Setting the environment variable that tags the cluster as new and commenting out all the SRV/Discovery related ones seems to work, namely the node bootstraps itself, but thinking it is a single-node cluster. The bootstrap of the rest of the nodes is not very clean either, requiring some manual restarts and hacks.
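A minimal sketch of what that /etc/default/etcd tweak looks like, assuming the stock etcd environment variable names (ETCD_INITIAL_CLUSTER_STATE, ETCD_DISCOVERY_SRV); the file Puppet actually manages may use different variables:

  # /etc/default/etcd (sketch, placeholder domain)
  # Tag the cluster as new so the first node can bootstrap on its own:
  ETCD_INITIAL_CLUSTER_STATE=new
  # Comment out the SRV/Discovery variables so etcd does not look for the
  # other (not yet reimaged) members:
  #ETCD_DISCOVERY_SRV=etcd.example.org

So I thought to try the following procedure instead: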

  1. Stop etcd on one node (while the rest is up).
  2. Reimage the node.
  3. Wait for the cluster to be healthy (see the health-check sketch after this list).
  4. Go back to step 1 with another node.
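For step 3, a quick way to check is etcdctl against the surviving members; a minimal sketch assuming the etcd v3 API, with placeholder hostnames and TLS/auth flags omitted:

  # Run against the members that are still up (hostnames are placeholders):
  etcdctl --endpoints=https://etcd1001.example.org:2379,https://etcd1002.example.org:2379 endpoint health
  etcdctl --endpoints=https://etcd1001.example.org:2379 member list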

The procedure seemed sound, but I didn't know that another problem would arise: the reimaged node didn't bootstrap correctly, since its Raft log's last commit/ID didn't match the one provided by the rest of the nodes. After reading some guides I discovered that simply removing and re-adding the member from/to the cluster is sufficient to let it bootstrap with a brand new Raft log (which then gets synced with the one provided by the rest of the cluster). For the moment adding/removing nodes is not well supported by Spicerack IIRC, but maybe we could add some support.
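In etcdctl (v3 API) terms the remove/re-add dance looks roughly like this; the member ID, names, and hostnames below are placeholders and TLS/auth flags are omitted:

  # On a healthy member, for the node being reimaged:
  etcdctl member list                                # note the ID of the old member
  etcdctl member remove 8e9e05c52164694d             # drop its stale Raft identity
  etcdctl member add etcd1001 --peer-urls=https://etcd1001.example.org:2380
  # The reimaged node must then start with initial-cluster-state=existing,
  # so it joins the running cluster and syncs a fresh Raft log from the leader.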

Does anybody else have a different experience with etcd? Is there another procedure to do it safely? We should probably add something to https://wikitech.wikimedia.org/wiki/Etcd at the end of the task.

Event Timeline


The experience above matches my own when I have to add/remove nodes from the cluster. IIRC, though, when reimaging the whole cluster, and given we don't care at all about preserving its content, reimaging all 3 nodes together works fine. We should probably retest my memory when re-imaging eqiad.


Yeah, that kind of works (I did that in staging and wikikube-codfw), but it also led to at least one "first puppet run failed" error during the reimage, as the first node to become ready will most likely fail. It's all about timing in that case, I suppose.

@JMeybohm did you have to set the cluster's state to new by any chance? I had to for the ml-etcd2* nodes, otherwise their etcd daemons didn't want to bootstrap.

Edit: yes, I see profile::etcd::v3::cluster_bootstrap: true set in various clusters. I'll add it to the docs: it needs to be set, otherwise a cluster configured with the existing state will not bootstrap.

I think that clusters shouldn't set this flag unless they are being bootstrapped, so we should fix Puppet at some point.

To summarize, I think that we have two options:

  • Add support to Spicerack for adding/removing members of an etcd ensemble, to be executed right before the reimage (in theory it should work fine). This would allow us to reimage one node at a time without too many issues, preserving the state of the cluster (see the sketch after this list).
  • Wait for a multi/parallel reimage capability in cookbooks, and always reimage the cluster by hitting all nodes at the same time. This would work and it would be quick, but it wouldn't preserve data.
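For option one, here is a hypothetical sketch of what such a Spicerack helper could look like. The function names are made up, the use of the remote API (run_sync yielding (hosts, output) tuples) is an assumption, and etcdctl TLS/auth flags are omitted:

  # Hypothetical sketch only, not an actual Spicerack API proposal.
  from spicerack.remote import RemoteHosts


  def etcd_member_id(healthy_member: RemoteHosts, member_name: str) -> str:
      """Look up the etcd member ID for a member name via 'etcdctl member list'."""
      for _hosts, output in healthy_member.run_sync("etcdctl member list"):
          for line in output.message().decode().splitlines():
              member_id, _status, name, *_rest = [field.strip() for field in line.split(",")]
              if name == member_name:
                  return member_id
      raise RuntimeError(f"member {member_name} not found in the etcd cluster")


  def replace_etcd_member(healthy_member: RemoteHosts, member_name: str, peer_url: str) -> None:
      """Remove and re-add a member so the reimaged node bootstraps a fresh Raft log."""
      member_id = etcd_member_id(healthy_member, member_name)
      healthy_member.run_sync(f"etcdctl member remove {member_id}")
      # The reimaged node then has to start with initial-cluster-state=existing
      # so it joins the running cluster and syncs the Raft log from the leader.
      healthy_member.run_sync(f"etcdctl member add {member_name} --peer-urls={peer_url}")

The cookbook would call something like replace_etcd_member() right before handing the node over to the reimage, as described in option one.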

I'd be in favor of option one since it would allow us to preserve data (if needed).


+1, this could maybe even be a separate cookbook for upgrading etcd in general (and we just call it from the k8s upgrade one), no?

Clement_Goubert moved this task from Incoming 🐫 to ⎈Kubernetes on the serviceops board.