Migrate wikikube control planes to hardware nodes
Open, HighPublic

Assigned To
Authored By
JMeybohm
Dec 14 2023, 4:09 PM
Referenced Files
F48799280: etcd-benchmark-output-kubestagemaster2003.txt
Fri, Apr 26, 2:58 PM
F48799281: etcd-benchmark-output-kubestagemaster2003_isolated.txt
Fri, Apr 26, 2:58 PM
F48799282: etcd-benchmark-output-ganeti-test2003.txt
Fri, Apr 26, 2:58 PM
F48799283: etcd-benchmark-output-mw2391.txt
Fri, Apr 26, 2:58 PM
F41610964: image.png
Dec 18 2023, 1:37 PM
F41610962: image.png
Dec 18 2023, 1:37 PM
F41610960: image.png
Dec 18 2023, 1:37 PM
F41610958: image.png
Dec 18 2023, 1:37 PM

Description

Currently we run 2 control planes as well as 3 etcd nodes per DC as ganeti VMs. We have already hit IOPS limits on the etcd instances, and the control planes (currently 12GB) are scratching at the practical upper "limit" for memory on ganeti.

We should draft a plan to migrate from the 2+3 ganeti instances to 3 hardware nodes per DC (repurposing mw appservers) and co-locate a kubernetes master and an etcd server on each of them.

It should be possible to do this by adding the new control-planes/etcd nodes first and removing the ganeti ones afterwards.

In the spreadsheet at T351074: Move servers from the appserver/api cluster to kubernetes I've reserved 3 R440 nodes per DC to be used as apiservers:

  • mw2391
  • mw2331
  • mw2361
  • mw1372
  • mw1429
  • mw1436

These should be renamed during reimage because of their special role in the cluster.

I wrote documentation on how to add stacked control-planes, and how to remove them as well as etcd nodes, at: https://wikitech.wikimedia.org/wiki/Kubernetes/Clusters/Add_or_remove_control-planes
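At the etcd level, the add-then-remove approach boils down to etcd's standard member reconfiguration. A rough sketch with etcdctl (hostnames and endpoints here are hypothetical placeholders; the wikitech page above is the authoritative procedure):

```shell
# Register the new member with the existing cluster BEFORE starting
# etcd on the new hardware node (hypothetical names/URLs):
etcdctl --endpoints=https://existing-etcd.example.org:2379 \
  member add wikikube-ctrl1001 \
  --peer-urls=https://wikikube-ctrl1001.example.org:2380

# Once the new members are in and healthy, look up a ganeti-based
# member's ID and remove it:
etcdctl --endpoints=https://existing-etcd.example.org:2379 member list
etcdctl --endpoints=https://existing-etcd.example.org:2379 member remove <MEMBER_ID>
```

Doing this one member at a time keeps the cluster quorate throughout the migration.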

In preparation we should reimage the above appservers to insetup, using the same partition layout as we use for kubernetes workers.

What I totally failed to think about while doing staging is the opportunity to align wikikube control-plane names with the other clusters, which use names like ml-serve-ctrlXXXX/aux-k8s-ctrlXXXX. So maybe we could rename to wikikube-ctrlXXXX (I really don't like the k8s that dse and aux threw into the mix) to come one step closer to T336861: Fix naming confusion around main/wikikube kubernetes clusters.

Event Timeline

JMeybohm triaged this task as Medium priority.Dec 14 2023, 4:09 PM
JMeybohm created this task.
JMeybohm renamed this task from Migtate wikikube control planes to hardware nodes to Migrate wikikube control planes to hardware nodes.Dec 14 2023, 4:11 PM

I am not so sure we actually do scratch that memory limit now. Looking at kubemaster2001 last week

image.png (1×1 px, 96 KB)

and the other 3 kubernetes masters

image.png (1×1 px, 89 KB)

image.png (1×1 px, 103 KB)

image.png (1×1 px, 91 KB)

So, at the very least we have ~50% of the VM's memory capacity to spare before hitting problems again.

By experience, the upper memory limit we can handle nicely in Ganeti is ~16GB, btw. But this is mostly because applications above that number tend to both consume a lot of memory AND alter the contents of that memory faster than the migration algorithm can catch up, ending in either stuck or at least very long-running migrations.

CPU usage has also fallen to ~20% now, so we have some room to spare and can think about how and when we want to tackle this.

I am not so sure we actually do scratch that memory limit now. Looking at kubemaster2001 last week

I wasn't saying we do. There are still quite a number of nodes to come, but even with those I don't suspect we'll hit 16 or 12GB. But given the IOPS bottlenecks we saw with etcd on ganeti, we probably need to move the etcd servers to hardware, and in that case it does not make sense to not move the control-planes as well, IMHO.

Forgive me for the drive-by comment, but would it be possible to create high-IOPS tiers for Ganeti (RAID-0?), deployed in conjunction with non-DRBD VMs for services that have their own HA (such as the Kubernetes control plane)? I bring it up as I feel Ganeti is an underused resource, and using it helps to avoid some of the management overhead associated with physical machines.

Forgive me for the drive-by comment, but would it be possible to create high-IOPS tiers for Ganeti (RAID-0?), deployed in conjunction with non-DRBD VMs for services that have their own HA (such as the Kubernetes control plane)? I bring it up as I feel Ganeti is an underused resource, and using it helps to avoid some of the management overhead associated with physical machines.

Yes, probably. But there would still be overhead and potential noisy neighbors on ganeti. With etcd being very sensitive in terms of IOPS, this might still not give us the desired performance. Also, we would need to build such a ganeti system (multiple, to cover our redundancy needs).
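One way to quantify whether a given storage tier is fast enough for etcd is the fsync-latency check the etcd docs recommend: etcd is sensitive to fdatasync latency on its WAL directory, with the usual rule of thumb being p99 well under 10ms. A sketch using fio (the directory path and sizes here are illustrative; the block size approximates a typical etcd WAL write):

```shell
# Measure fdatasync latency on the disk that would back etcd's WAL.
fio --rw=write --ioengine=sync --fdatasync=1 \
    --directory=/var/lib/etcd/fio-test \
    --size=22m --bs=2300 --name=etcd-wal-check
# Inspect the fsync/fdatasync latency percentiles in the output;
# a p99 far above 10ms suggests the tier is too slow for etcd.
```

Running this on a candidate ganeti tier vs. bare hardware would make the noisy-neighbor / virtualization overhead question concrete.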

I am not so sure we actually do scratch that memory limit now. Looking at kubemaster2001 last week

I wasn't saying we do. There are still quite a number of nodes to come, but even with those I don't suspect we'll hit 16 or 12GB.

Agreed. That's my current theory as well.

But given the IOPS bottlenecks we saw with etcd on ganeti, we probably need to move the etcd servers to hardware, and in that case it does not make sense to not move the control-planes as well, IMHO.

Oh, so co-locate etcd with the rest of the control-plane. It will work of course, and may even make our puppetization simpler. I'm just not super sold on it yet.

Do we track the IOPS bottlenecks we witnessed in some task?

Do we track the IOPS bottlenecks we witnessed in some task?

I'm also curious about the IOPS issues, since I assume the majority of etcd instances out in the wild are running on VMs and shared hardware. It might not be worth the effort for this particular project, but I'm game to help build a higher-IOPS tier for Ganeti if anyone else thinks that would be helpful.

Do we track the IOPS bottlenecks we witnessed in some task?

Not in a dedicated task, no, but what triggered creating T348466: Rethink kubernetes etcd storage was investigating T348228: KubernetesAPILatency alert fires on scap deploy.

Do we track the IOPS bottlenecks we witnessed in some task?

Not in a dedicated task, no, but what triggered creating T348466: Rethink kubernetes etcd storage was investigating T348228: KubernetesAPILatency alert fires on scap deploy.

Thanks, that's the context I was missing.

I ran a couple of very basic benchmarks (commands in the attached files) against single-node etcd instances running on:

  • A mediawiki application server, ext4 on LVM on RAID1

  • An (empty) ganeti node, ext4 on LVM on RAID5

  • A ganeti instance running as the only instance on the above node, ext4 on LVM on RAID5

  • A ganeti instance running in the prod ganeti cluster (together with other instances), ext4 on LVM on RAID5

tl;dr is:
Puts with 3 clients, 100 connections (roughly k8s prod) show Requests/sec: 3963.8976 vs. 6750.9583 vs. 11895.5644 (isolated VM vs. ganeti node vs. appserver) and p99.9 latency: 0.1026s vs. 0.0599s vs. 0.0399s.
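The exact commands are in the attached files; for reference, etcd ships a `benchmark` tool (go.etcd.io/etcd/v3/tools/benchmark) and a put-load run along these lines produces the Requests/sec and latency percentile figures quoted above (endpoint, key/value sizes, and totals here are illustrative, not the exact parameters used):

```shell
# Drive put load against a single-node etcd and report throughput
# plus latency percentiles (illustrative parameters):
benchmark put \
  --endpoints=http://127.0.0.1:2379 \
  --conns=100 --clients=100 \
  --key-size=8 --val-size=256 \
  --total=100000
```

Comparing the same invocation across the four hosts is what isolates the virtualization/RAID-layout cost from etcd itself.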

hnowlan updated the task description. (Show Details)

Change #1032805 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/puppet@production] appservers: 6 appservers to insetup before reimaging

https://gerrit.wikimedia.org/r/1032805