It would be useful to do some real-world testing on the new Kubernetes cluster in toolsbeta before starting the final migration in the tools project.
The toolsbeta cluster is currently scaled to 3 control nodes, 2 worker nodes, 3 etcd servers, 1 haproxy, and 1 front proxy (dynamicproxy).
Some ideas and questions I would like to see answered:
- how many requests can the north-south proxy setup handle? i.e. front proxy (dynamicproxy) + haproxy + ingress. We have a grafana dashboard to measure this: https://grafana-labs.wikimedia.org/d/R7BPaEbWk/toolforge-ingress?refresh=1m&orgId=1 but we don't have any test tool running there yet.
- how many pods can we run on just a couple of worker nodes? How does oversubscribing memory and CPU work in this new cluster? We have a grafana dashboard to measure this: https://grafana-labs.wikimedia.org/d/toolforge-kubernetes/toolforge-kubernetes?refresh=1m&orgId=1 but we don't have any test tool running there yet to do actual tests.
- how does nginx-ingress behave when hundreds of ingress objects are being created/removed? This was answered in T239405: toolforge: new k8s: evaluate ingress controller reload behaviour
- what happens when we scale the cluster up/down? Is service interrupted in any way? Specifically when adding/removing control and worker nodes.
- the haproxy setup is not HA; we have a cold standby server. How long (and how badly) is service interrupted in case of failover?
- the frontproxy (dynamicproxy) setup is not HA; we have a cold standby server. How long (and how badly) is service interrupted in case of failover?
- estimate/test the service impact of relocating a VM to a different cloudvirt (etcd, worker, control, haproxy, frontproxy)
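For the north-south request question, until a proper test tool is deployed, a quick client-side probe could be pointed at a tool's public URL to get rough throughput numbers while watching the ingress dashboard. A minimal sketch (the `measure_throughput` helper and its defaults are made up for illustration; a real load generator such as wrk or ab would give more rigorous numbers):

```python
import concurrent.futures
import time
import urllib.request

def measure_throughput(url, total_requests=200, concurrency=20, timeout=5):
    """Fire `total_requests` GETs at `url` using `concurrency` worker threads.

    Returns a tuple (ok_count, error_count, requests_per_second).
    """
    def fetch(_):
        # Count any 2xx/3xx response as a success; timeouts and
        # connection errors count as failures.
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return 200 <= resp.status < 400
        except Exception:
            return False

    start = time.monotonic()
    with concurrent.futures.ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(fetch, range(total_requests)))
    elapsed = time.monotonic() - start
    ok = sum(results)
    return ok, total_requests - ok, total_requests / elapsed
```

Running this from a host outside the cluster exercises the full dynamicproxy + haproxy + ingress chain, so the numbers reflect the whole path rather than any single hop.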
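For the haproxy and frontproxy failover questions, one way to quantify "how long (how bad)" is to poll the service continuously during a planned failover and record the longest run of consecutive failures. A rough sketch, with a hypothetical `measure_outage` helper that takes any probe callable (the clock/sleep parameters exist only so the logic can be tested without real waiting):

```python
import time

def measure_outage(probe, duration=60.0, interval=0.5,
                   clock=time.monotonic, sleep=time.sleep):
    """Call `probe()` every `interval` seconds for `duration` seconds.

    `probe` should return True when the service responds. Returns the
    longest continuous failure window observed, in seconds.
    """
    start = clock()
    worst = 0.0
    outage_start = None
    while clock() - start < duration:
        if probe():
            if outage_start is not None:
                # Service recovered: close the current outage window.
                worst = max(worst, clock() - outage_start)
                outage_start = None
        elif outage_start is None:
            # First failure after a healthy period: open a window.
            outage_start = clock()
        sleep(interval)
    if outage_start is not None:
        # Still failing when the measurement ended.
        worst = max(worst, clock() - outage_start)
    return worst
```

In practice `probe` would be a short HTTP GET against a tool URL (much like the throughput check above); triggering the haproxy or frontproxy failover while this runs gives a concrete worst-case interruption figure instead of a guess.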