toolforge: some additional testing before final migration
Closed, Declined
Public

Description

It could be interesting to do some "real-world" testing on the new Kubernetes cluster in toolsbeta before starting the final migration in the tools project.

The toolsbeta cluster is currently scaled to 3 control nodes, 2 worker nodes, 3 etcd servers, 1 haproxy, 1 front proxy (dynamicproxy).

Some ideas and questions I would like to see answered:

  • how many requests can the north-south proxy setup handle, i.e. front proxy (dynamicproxy) + haproxy + ingress? We have a Grafana dashboard to measure this: https://grafana-labs.wikimedia.org/d/R7BPaEbWk/toolforge-ingress?refresh=1m&orgId=1 but we don't have a test tool running there yet.
  • how many pods can we run on just a couple of worker nodes? how does oversubscribing memory and CPU work in this new cluster? We have a Grafana dashboard to measure this: https://grafana-labs.wikimedia.org/d/toolforge-kubernetes/toolforge-kubernetes?refresh=1m&orgId=1 but we don't have any test tool running there yet to do actual tests.
  • how does nginx-ingress behave when hundreds of ingress objects are being created/removed? This was answered in T239405: toolforge: new k8s: evaluate ingress controller reload behaviour
  • what happens when we scale the cluster up/down? Is the service interrupted in any way? Specifically when adding/removing control and worker nodes.
  • the haproxy setup is not HA. We have a cold standby server. How long (and how badly) is the service interrupted in case of failover?
  • the frontproxy (dynamicproxy) setup is not HA. We have a cold standby server. How long (and how badly) is the service interrupted in case of failover?
  • estimate/test the service impact of relocating a VM to a different cloudvirt (etcd, worker, control, haproxy, frontproxy)
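
The failover, scaling and VM relocation questions above all come down to "how long is the service unreachable while we do X". One possible way to measure that is a small probe loop run from outside the cluster during the operation. The following is only a minimal sketch: it assumes a test tool is reachable over HTTP behind the front proxy, and the function names, the probe interval and the example URL are all hypothetical, not part of any existing Toolforge tooling.

```python
import http.client
import time
import urllib.error
import urllib.request
from typing import Callable, List, Tuple

Sample = Tuple[float, bool]  # (seconds since start, probe succeeded)


def probe_once(url: str, timeout: float = 2.0) -> bool:
    """Return True if the URL answers with an HTTP status below 500."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status < 500
    except urllib.error.HTTPError as err:
        # 4xx still means the proxy chain answered; 5xx counts as down.
        return err.code < 500
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, timeout, reset, malformed response, etc.
        return False


def collect(probe: Callable[[], bool], duration: float,
            interval: float = 0.5) -> List[Sample]:
    """Call probe() roughly every `interval` seconds for `duration` seconds."""
    samples: List[Sample] = []
    start = time.monotonic()
    while True:
        elapsed = time.monotonic() - start
        if elapsed >= duration:
            break
        samples.append((elapsed, probe()))
        time.sleep(interval)
    return samples


def outages(samples: List[Sample]) -> List[Tuple[float, float]]:
    """Collapse runs of failed probes into (start, end) windows.

    `end` is the timestamp of the first successful probe after the run,
    or the last sample's timestamp if the service never recovered.
    """
    windows: List[Tuple[float, float]] = []
    down_since = None
    for ts, ok in samples:
        if not ok and down_since is None:
            down_since = ts
        elif ok and down_since is not None:
            windows.append((down_since, ts))
            down_since = None
    if down_since is not None:
        windows.append((down_since, samples[-1][0]))
    return windows


# Hypothetical usage, started shortly before triggering the failover:
#   url = "https://example-tool.toolsbeta.example.org/"  # placeholder URL
#   samples = collect(lambda: probe_once(url), duration=300)
#   for start, end in outages(samples):
#       print(f"down from {start:.1f}s to {end:.1f}s ({end - start:.1f}s)")
```

The measured windows are only as precise as the probe interval, and a single client does not say anything about throughput, so this would complement the Grafana dashboards rather than replace them.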

Event Timeline

aborrero triaged this task as Medium priority. Nov 22 2019, 10:14 AM

Change 554041 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] dynamicproxy: add backend information to access log entries

https://gerrit.wikimedia.org/r/554041

Change 558676 had a related patch set uploaded (by BryanDavis; owner: Bryan Davis):
[operations/puppet@production] toolforge: Update toolviews.py nginx log parser

https://gerrit.wikimedia.org/r/558676

Change 558676 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] toolforge: Update toolviews.py nginx log parser

https://gerrit.wikimedia.org/r/558676

Change 554041 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] dynamicproxy: add backend information to access log entries

https://gerrit.wikimedia.org/r/554041

The new k8s cluster is fully working now.

We had no time to test/answer many of the items stated in the task description, but things look good so far anyway!