
Toolforge: k8s: ingress: consider creating ingress-specific nodes
Closed, ResolvedPublic

Description

Right now, when a user connects to a webservice running in Toolforge Kubernetes, this happens:

client --> tools front proxy --> haproxy --> random k8s worker node --> ingress pod on a random worker node --> tool webservice on a random worker node

There is extra overhead in the haproxy --> k8s worker node --> ingress pod step: because haproxy doesn't know which node the ingress pod is running on, we use a nodePort and let the ingress listen on every node of the cluster.

As of this writing, we have about 55 k8s worker nodes and only 3 ingress pods. The chance that haproxy hits a node with an ingress pod running is pretty low, so most requests need another internal Kubernetes forward to a node with a running ingress pod.
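As a rough back-of-the-envelope check of the numbers above, here is a minimal sketch. It assumes haproxy spreads requests uniformly across all worker nodes and that each ingress pod runs on a different node. Neither assumption is stated in this task, and the function name is made up for illustration.

```python
from fractions import Fraction


def direct_hit_probability(worker_nodes: int, ingress_pods: int) -> Fraction:
    """Chance that a uniformly chosen worker node hosts an ingress pod.

    Assumes one ingress pod per node at most (hypothetical simplification).
    """
    if worker_nodes <= 0:
        raise ValueError("worker_nodes must be positive")
    if not 0 <= ingress_pods <= worker_nodes:
        raise ValueError("ingress_pods must be between 0 and worker_nodes")
    return Fraction(ingress_pods, worker_nodes)


# Figures quoted in the description: about 55 workers, 3 ingress pods.
p = direct_hit_probability(55, 3)
print(f"direct hit: {float(p):.1%}, extra hop: {float(1 - p):.1%}")
```

Under those assumptions, only about 1 request in 18 would land directly on a node with an ingress pod.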

One simple way to solve this is to create ingress-specific nodes, i.e. nodes on which we only run nginx-ingress (plus related monitoring), and configure haproxy to forward to those nodes only, instead of to all worker nodes. As a side effect, the nginx-ingress pods would be under less memory pressure (they use at least 1 GB of memory, and growing).
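To illustrate the haproxy side of this proposal, here is a small sketch that renders a backend stanza listing only the dedicated ingress nodes. It is not the actual Puppet template. The backend name, the nodePort value 30000 and the helper name are hypothetical; the two node names are the VMs mentioned later in this task.

```python
def render_ingress_backend(name: str, nodes: list[str], node_port: int) -> str:
    """Render a haproxy backend that targets only the given nodes."""
    if not nodes:
        raise ValueError("at least one ingress node is required")
    lines = [
        f"backend {name}",
        "    mode http",
        "    balance roundrobin",
    ]
    for node in nodes:
        # One 'server' line per ingress node, with health checks enabled.
        lines.append(f"    server {node} {node}:{node_port} check")
    return "\n".join(lines)


# Hypothetical values: backend name and port are placeholders.
config = render_ingress_backend(
    "k8s-ingress",
    ["tools-k8s-ingress-1", "tools-k8s-ingress-2"],
    30000,
)
print(config)
```

With a backend like this, haproxy never picks a regular worker node, so the extra forward inside the cluster should no longer be needed for the first hop.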

NOTE: we haven't detected any performance impact from the current setup; this is just a possible improvement. We actually decided against this solution when we originally developed the ingress, but it may be worth revisiting now that usage is growing.

Docs: https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Networking_and_ingress

Event Timeline

aborrero triaged this task as Lowest priority. Apr 14 2020, 1:26 PM

At the 2020-04-29 WMCS meeting we decided this is something interesting to explore, together with using OpenStack server groups to ensure ingress nodes aren't on the same hypervisor.
We also agreed this is a low-priority change, so we won't work on it in the short term.

Change 604665 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] kubeadm: rename hiera key for ingress nodes

https://gerrit.wikimedia.org/r/604665

Change 604665 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] kubeadm: rename hiera key for ingress nodes

https://gerrit.wikimedia.org/r/604665

aborrero raised the priority of this task from Lowest to High.

Raising priority: @Andrew mentioned that with the new domains for VMs we should try creating new k8s nodes and see how that works. This seems like the right test.

Mentioned in SAL (#wikimedia-cloud) [2020-09-09T10:38:56Z] <arturo> created server group tools-ingress with soft anti affinity policy (T250172)

Mentioned in SAL (#wikimedia-cloud) [2020-09-09T10:42:01Z] <arturo> created VMs tools-k8s-ingress-1 and tools-k8s-ingress-2 in the tools-ingress server group (T250172)

Mentioned in SAL (#wikimedia-cloud) [2020-09-09T10:50:26Z] <arturo> created puppet prefix tools-k8s-ingress (T250172)

Change 626133 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] toolforge: k8s: run nginx-ingress on ingress dedicated nodes

https://gerrit.wikimedia.org/r/626133

Hmm, will do toolsbeta first, just in case.

Mentioned in SAL (#wikimedia-cloud) [2020-09-09T11:24:37Z] <arturo> created new puppet prefix toolsbeta-test-k8s-ingress (T250172)

Mentioned in SAL (#wikimedia-cloud) [2020-09-09T11:25:50Z] <arturo> created new server group toolsbeta-k8s-ingress (T250172)

Mentioned in SAL (#wikimedia-cloud) [2020-09-09T11:27:56Z] <arturo> created 2 VMs: toolsbeta-test-k8s-ingress-1 and toolsbeta-test-k8s-ingress-2 (T250172)

Mentioned in SAL (#wikimedia-cloud) [2020-09-10T08:59:49Z] <arturo> added toolsbeta-test-k8s-ingress-1 (and -2) to the k8s cluster (T250172)

Mentioned in SAL (#wikimedia-cloud) [2020-09-10T09:00:56Z] <arturo> tainted/labeled toolsbeta-test-k8s-ingress-1 (and -2) in the k8s cluster (T250172)

Change 626133 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] toolforge: k8s: run nginx-ingress on ingress dedicated nodes

https://gerrit.wikimedia.org/r/626133

Mentioned in SAL (#wikimedia-cloud) [2020-09-10T10:22:13Z] <arturo> enabling ingress dedicated worker nodes in the k8s cluster (T250172)