Remove 2 nodes from the tools-k8s-etcd cluster
Closed, Resolved · Public

Assigned To

Authored By

	• Bstorm
	Apr 8 2021, 10:12 PM

Description

Since redundancy isn't going to make the cluster run better, and it is (if anything) running worse now than it was, can we please try removing two nodes from the cluster? Please make sure that tools-k8s-etcd-8 is among them, because that one is still using the deprecated storage driver.

Fewer nodes will speed up the cluster at the expense of redundancy, but we should survive on 3 nodes. Currently it rides at an iowait of between 7 and (no kidding) 43%. fsync latency is usually acceptable, but it occasionally rises badly. I figured this would best be done with a script to prevent errors and alerts. When this is done, I plan on revisiting the etcd tuning variables to see what else can be done.
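
For reference, a rough sketch of the quorum arithmetic behind "we should survive on 3 nodes". It assumes the cluster currently has 5 voting members (the 3 that stay plus the 2 to be removed) and the usual majority rule that etcd's Raft consensus relies on. The function names are made up for illustration; they are not part of etcd or of any cookbook.

```python
def quorum(members: int) -> int:
    """Smallest majority of a cluster with `members` voting nodes."""
    return members // 2 + 1


def failures_tolerated(members: int) -> int:
    """How many members can be lost while a majority is still reachable."""
    return members - quorum(members)


# Assumed sizes: 5 today, 4 after the first removal, 3 after the second.
for size in (5, 4, 3):
    print(
        f"{size} members: quorum {quorum(size)}, "
        f"tolerates {failures_tolerated(size)} failure(s)"
    )
```

Under those assumptions, going from 5 to 3 members drops the tolerated failures from 2 to 1, and each write has to be persisted on 2 members instead of 3 before it commits.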

If I can make the cluster stop sucking, maybe I'll set up backups for it to make up for whatever redundancy we lose. I've been too afraid to until now for fear of collapsing it.

Details

	Subject	Repo	Branch	Lines +/-
	toolschecker: update the etcd cluster	operations/puppet	production	+1 -3


Related Objects

Status	Subtype	Assigned	Task
Duplicate	BUG REPORT	• Bstorm	T266506 Getting "502 Bad Gateway" on Toolforge tools in clusters, including tools ordia and scholia
Resolved		• Bstorm	T267078 Open the ceph throttle a bit for tools-k8s-etcd server
Resolved		• Bstorm	T267966 Try to squeeze better performance out of k8s-etcd nodes
Open		None	T262350 bad failure cases for wmcs custom puppet enc
Resolved		• taavi	T267082 Rebuild Toolforge servers that should not have NFS mounted (and with affinity)
Resolved		dcaro	T279723 Remove 2 nodes from the tools-k8s-etcd cluster

Event Timeline

• Bstorm created this task. Apr 8 2021, 10:12 PM

Restricted Application added a subscriber: Aklapper. Apr 8 2021, 10:12 PM

• Bstorm added parent tasks: T267966: Try to squeeze better performance out of k8s-etcd nodes, T267082: Rebuild Toolforge servers that should not have NFS mounted (and with affinity). Apr 8 2021, 10:13 PM

Andrew triaged this task as High priority. Apr 13 2021, 4:06 PM

Andrew moved this task from Inbox to Soon! on the cloud-services-team (Kanban) board.

• Bstorm mentioned this in T280299: Upgrade Toolforge Kubernetes to latest 1.18. Apr 16 2021, 3:46 PM

@dcaro If you manage to do this on Monday before I'm back, that's great. If not, can you walk me through doing it with your scripts on Tuesday in your afternoon/my morning?

I was going to use this as part of the presentation about spicerack/cookbooks for the team, and force one of you to set up and run the cookbook :)

That works. lol

Mentioned in SAL (#wikimedia-cloud) [2021-04-29T18:12:25Z] <bstorm> removing an etcd node via cookbook T279723

Mentioned in SAL (#wikimedia-cloud) [2021-04-29T18:23:47Z] <bstorm> removing one more etcd node via cookbook T279723

Hopefully that'll speed some responses up. It's still running awfully high iowait.

Change 683705 had a related patch set uploaded (by Bstorm; author: Bstorm):

[operations/puppet@production] toolschecker: update the etcd cluster

https://gerrit.wikimedia.org/r/683705

gerritbot added a project: Patch-For-Review. Apr 29 2021, 7:08 PM

Change 683705 merged by Bstorm: