
Remove 2 nodes from the tools-k8s-etcd cluster
Closed, Resolved (Public)


Since redundancy isn't going to make the cluster run better, and it is (if anything) running worse at the moment than it was, can we please try removing two nodes from the cluster? Please make sure that tools-k8s-etcd-8 is among them, because that one is still using the deprecated storage driver.

Fewer nodes will speed up the cluster at the expense of redundancy, but we should survive on 3 nodes. Currently iowait rides between 7% and (no kidding) 43%. fsync is usually acceptable, but it occasionally spikes badly. I figure this is best done with a script to prevent errors and alerts. When this is done, I plan on revisiting the etcd tuning variables to see what else can be done.
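To make the redundancy trade-off concrete, here is a minimal sketch of the majority (quorum) arithmetic that etcd's Raft consensus relies on. It assumes the cluster currently has 5 voting members, which is implied by removing two to end up with three but is not stated explicitly; the function names are illustrative only and not part of any etcd tooling.

```python
def quorum(members: int) -> int:
    """Smallest majority of a cluster with `members` voting nodes."""
    if members < 1:
        raise ValueError("a cluster needs at least one member")
    return members // 2 + 1


def fault_tolerance(members: int) -> int:
    """Number of members that can fail while a majority is still reachable."""
    return members - quorum(members)


# Assumed sizes: 5 members before the change, 3 members after it.
for size in (5, 3):
    print(
        f"{size} members: quorum={quorum(size)}, "
        f"tolerates {fault_tolerance(size)} failure(s)"
    )
```

Under that assumption, going from 5 to 3 members drops the tolerated failures from 2 to 1, while each write needs acknowledgement from 2 members instead of 3.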

If I can make the cluster stop sucking, maybe I'll set up backups for it to make up for the lost redundancy. I've been too afraid to until now for fear of collapsing it.

Event Timeline

Andrew triaged this task as High priority. (Tue, Apr 13, 4:06 PM)
Andrew moved this task from Inbox to Soon! on the cloud-services-team (Kanban) board.

@dcaro If you manage to do this on Monday before I'm back, that's great. If not, can you walk me through doing it with your scripts on Tuesday in your afternoon/my morning?

I was going to use this as part of the presentation about spicerack/cookbooks for the team, and force one of you to set up and run the cookbook :)

Mentioned in SAL (#wikimedia-cloud) [2021-04-29T18:12:25Z] <bstorm> removing an etcd node via cookbook T279723

Mentioned in SAL (#wikimedia-cloud) [2021-04-29T18:23:47Z] <bstorm> removing one more etcd node via cookbook T279723

Hopefully that'll speed some responses up. It's still running awfully high iowait.

Change 683705 had a related patch set uploaded (by Bstorm; author: Bstorm):

[operations/puppet@production] toolschecker: update the etcd cluster

Change 683705 merged by Bstorm:

[operations/puppet@production] toolschecker: update the etcd cluster