Toolforge: k8s: ingress: consider creating ingress-specific nodes
Open, Medium, Public

Description

Right now, when a user connects to a webservice running in Toolforge kubernetes, this happens:

client --> tools front proxy --> haproxy --> random k8s worker node --> ingress pod on a random worker node --> tool webservice on a random worker node

There is extra overhead in the haproxy --> k8s worker node --> ingress pod step: because haproxy doesn't know which node the ingress pod is running on, we use a nodePort and let the ingress listen on every node of the cluster.

As of this writing, we have about 55 k8s worker nodes and only 3 ingress pods. The chance that haproxy hits a node with an ingress pod running is pretty low, so most requests need another internal kubernetes forward to a node with a running ingress pod.
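
As a rough worked example of "pretty low", the figures above can be turned into a probability. This is only a sketch: it assumes haproxy picks a worker node uniformly at random and that each of the 3 ingress pods runs on a different node. The function name is made up for illustration.

```python
from fractions import Fraction


def chance_node_has_ingress(worker_nodes: int, ingress_pods: int) -> Fraction:
    """Probability that a uniformly chosen node hosts an ingress pod.

    Assumption: every ingress pod runs on a different node.
    """
    if not 0 <= ingress_pods <= worker_nodes:
        raise ValueError("need 0 <= ingress_pods <= worker_nodes")
    return Fraction(ingress_pods, worker_nodes)


p = chance_node_has_ingress(worker_nodes=55, ingress_pods=3)
print(f"{p} = {float(p):.1%}")  # 3/55 = 5.5%
```

Under those assumptions, roughly 19 out of 20 requests land on a node without an ingress pod.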

One simple way to solve this is to create ingress-specific nodes, nodes on which we only run nginx-ingress (plus related monitoring), and configure haproxy to send traffic to those nodes only, instead of to all worker nodes. As a side effect, the nginx-ingress pods would be under less memory pressure (they use at least 1 GB of memory, and growing).
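
To make the haproxy side of the proposal concrete, here is a minimal sketch that renders a backend listing only the ingress nodes. The backend name, the `example.org` domain and the port `30000` are placeholders, not the real Toolforge values; only the `backend` / `balance` / `server ... check` haproxy keywords are standard.

```python
def render_backend(name: str, nodes: list[str], node_port: int) -> str:
    """Render a haproxy backend that lists only the given nodes."""
    lines = [f"backend {name}", "    balance roundrobin"]
    for host in nodes:
        short = host.split(".")[0]  # server label: hostname without domain
        lines.append(f"    server {short} {host}:{node_port} check")
    return "\n".join(lines)


# Hypothetical values: the domain and the nodePort number are assumptions.
INGRESS_NODES = [
    "tools-k8s-ingress-1.example.org",
    "tools-k8s-ingress-2.example.org",
]

config = render_backend("k8s-ingress", INGRESS_NODES, node_port=30000)
print(config)
```

The point is only that the server list shrinks from every worker node to the handful of dedicated ingress nodes.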

NOTE: we didn't detect any performance impact from the current setup; this is just a possible improvement. Actually, we decided against this solution when we originally developed the ingress, but it may be worth revisiting now that usage is growing.

Docs: https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Networking_and_ingress

Event Timeline

aborrero triaged this task as Lowest priority. Apr 14 2020, 1:26 PM

At the 2020-04-29 WMCS meeting we decided this is something interesting to explore, together with using openstack server groups to ensure ingress nodes aren't on the same hypervisor.
We also agreed this is a low priority change, so we won't work on this in the short term.

Change 604665 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] kubeadm: rename hiera key for ingress nodes

https://gerrit.wikimedia.org/r/604665

Change 604665 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] kubeadm: rename hiera key for ingress nodes

https://gerrit.wikimedia.org/r/604665

aborrero raised the priority of this task from Lowest to High.

Raising priority: @Andrew mentioned that with the new domains for VMs we should try creating new k8s nodes and see how that works. This seems like the right test.

Mentioned in SAL (#wikimedia-cloud) [2020-09-09T10:38:56Z] <arturo> created server group tools-ingress with soft anti affinity policy (T250172)

Mentioned in SAL (#wikimedia-cloud) [2020-09-09T10:42:01Z] <arturo> created VMs tools-k8s-ingress-1 and tools-k8s-ingress-2 in the tools-ingress server group (T250172)

Mentioned in SAL (#wikimedia-cloud) [2020-09-09T10:50:26Z] <arturo> created puppet prefix tools-k8s-ingress (T250172)

Change 626133 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] toolforge: k8s: run nginx-ingress on ingress dedicated nodes

https://gerrit.wikimedia.org/r/626133

mmm will do toolsbeta first just in case.

Mentioned in SAL (#wikimedia-cloud) [2020-09-09T11:24:37Z] <arturo> created new puppet prefix toolsbeta-test-k8s-ingress (T250172)

Mentioned in SAL (#wikimedia-cloud) [2020-09-09T11:25:50Z] <arturo> created new server group toolsbeta-k8s-ingress (T250172)

Mentioned in SAL (#wikimedia-cloud) [2020-09-09T11:27:56Z] <arturo> created 2 VMs: toolsbeta-test-k8s-ingress-1 and toolsbeta-test-k8s-ingress-2 (T250172)

Mentioned in SAL (#wikimedia-cloud) [2020-09-10T08:59:49Z] <arturo> added toolsbeta-test-k8s-ingress-1 (and -2) to the k8s cluster (T250172)

Mentioned in SAL (#wikimedia-cloud) [2020-09-10T09:00:56Z] <arturo> tainted/labeled toolsbeta-test-k8s-ingress-1 (and -2) in the k8s cluster (T250172)
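
For readers unfamiliar with the taint/label step, the sketch below shows how a node taint plus a node label can reserve nodes for the ingress. The key `example.org/ingress-node` and the `NoSchedule` effect are assumptions, not the values actually used here; `nodeSelector` and `tolerations` are the standard pod spec fields. `can_schedule` is a toy check for illustration, far simpler than the real kubernetes scheduler (it only handles `Equal` tolerations).

```python
# Hypothetical key/value, used both as node label and as node taint.
KEY, VALUE = "example.org/ingress-node", "true"

ingress_node = {
    "labels": {KEY: VALUE},
    "taints": [{"key": KEY, "value": VALUE, "effect": "NoSchedule"}],
}
worker_node = {"labels": {}, "taints": []}

# Pod spec fragment for nginx-ingress: pinned to labeled nodes, tolerates the taint.
ingress_pod = {
    "nodeSelector": {KEY: VALUE},
    "tolerations": [
        {"key": KEY, "operator": "Equal", "value": VALUE, "effect": "NoSchedule"}
    ],
}
tool_pod = {}  # an ordinary tool webservice: no selector, no tolerations


def tolerates(toleration: dict, taint: dict) -> bool:
    """True if an 'Equal' toleration matches the taint exactly."""
    return (
        toleration.get("key") == taint["key"]
        and toleration.get("value") == taint.get("value")
        and toleration.get("effect") == taint["effect"]
    )


def can_schedule(pod: dict, node: dict) -> bool:
    """Toy model: labels must satisfy nodeSelector, NoSchedule taints must be tolerated."""
    selector = pod.get("nodeSelector", {})
    if any(node["labels"].get(k) != v for k, v in selector.items()):
        return False
    return all(
        any(tolerates(t, taint) for t in pod.get("tolerations", []))
        for taint in node["taints"]
        if taint["effect"] == "NoSchedule"
    )


for pod_name, pod in (("ingress", ingress_pod), ("tool", tool_pod)):
    for node_name, node in (("ingress node", ingress_node), ("worker node", worker_node)):
        print(f"{pod_name} pod on {node_name}: {can_schedule(pod, node)}")
```

The taint keeps tool pods off the ingress nodes, while the node selector keeps the ingress pods off the regular workers.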

Change 626133 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] toolforge: k8s: run nginx-ingress on ingress dedicated nodes

https://gerrit.wikimedia.org/r/626133

Mentioned in SAL (#wikimedia-cloud) [2020-09-10T10:22:13Z] <arturo> enabling ingress dedicated worker nodes in the k8s cluster (T250172)

I'm not sure that this works the way it is expected to. If I'm understanding correctly, the hope is that:
client --> tools front proxy --> haproxy --> random k8s worker node --> ingress pod on a random worker node --> tool webservice on a random worker node
Will become:
client --> tools front proxy --> haproxy --> random k8s ingress node/pod on that node --> tool webservice on a random worker node
Thus cutting out a network hop. But it is my understanding that we end up with:
client --> tools front proxy --> haproxy --> random k8s ingress node --> ingress pod on a random ingress node (the same node only 1/3 of the time) --> tool webservice on a random worker node

I was running a test that seems to confirm this, though if anyone wants to look at it with me that would be great.
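
The 1/3 figure can be checked by enumeration. This sketch assumes the behaviour described above: haproxy picks an entry node uniformly at random, and the nodePort service then picks a backend ingress pod uniformly at random regardless of which node received the traffic. Node names are placeholders.

```python
from fractions import Fraction
from itertools import product


def same_node_fraction(entry_nodes: list[str], pod_nodes: list[str]) -> Fraction:
    """Fraction of (entry node, chosen pod) pairs where the pod runs on the entry node."""
    pairs = list(product(entry_nodes, pod_nodes))
    hits = sum(1 for entry, pod_node in pairs if entry == pod_node)
    return Fraction(hits, len(pairs))


# Dedicated layout: 3 ingress nodes, one ingress pod on each.
ingress = ["ingress-1", "ingress-2", "ingress-3"]
print(same_node_fraction(ingress, ingress))  # 1/3

# Original layout: 55 workers as entry points, ingress pods on 3 of them.
workers = [f"worker-{i}" for i in range(1, 56)]
print(same_node_fraction(workers, workers[:3]))  # 1/55
```

Under this model the dedicated nodes raise the share of requests served without an extra hop from 1/55 to 1/3, but two out of three requests are still forwarded to another node.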

I think the problem you are describing is 100% legit, especially now that we have scaled up the number of ingress nodes/pods.

In summary:

  • this optimization doesn't actually optimize anything anymore.
  • during the last few k8s upgrades we did, it was mentioned that the extra handling that ingress nodes need is complex.

The only value I can think of right now is our ability to ensure that ingress pods always run on different cloudvirt hypervisors. I think we do soft-anti-affinity for k8s-worker VMs (because they outnumber cloudvirts) and hard anti-affinity for k8s-ingress.
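
The difference between the two policies can be shown with a toy placement function. This is not the OpenStack scheduler, only an illustration of why hard anti-affinity cannot work once the VMs in a group outnumber the hypervisors. The policy strings match the OpenStack server group policy names; the hypervisor and VM names are placeholders.

```python
from collections import Counter


def place(vms: list[str], hypervisors: list[str], policy: str) -> dict[str, str]:
    """Toy placement: put each VM on the hypervisor hosting the fewest group members.

    "anti-affinity" refuses to share a hypervisor; "soft-anti-affinity"
    spreads VMs out as far as possible but doubles up when it has to.
    """
    if policy not in ("anti-affinity", "soft-anti-affinity"):
        raise ValueError(f"unknown policy: {policy}")
    load = Counter()  # group members per hypervisor
    placement = {}
    for vm in vms:
        target = min(hypervisors, key=lambda h: load[h])
        if policy == "anti-affinity" and load[target] > 0:
            raise RuntimeError(f"no free hypervisor for {vm}")
        placement[vm] = target
        load[target] += 1
    return placement


hypervisors = ["cloudvirt-a", "cloudvirt-b", "cloudvirt-c"]
ingress_vms = ["ingress-1", "ingress-2", "ingress-3"]
worker_vms = [f"worker-{i}" for i in range(1, 6)]

print(place(ingress_vms, hypervisors, "anti-affinity"))      # one VM per hypervisor
print(place(worker_vms, hypervisors, "soft-anti-affinity"))  # some hypervisors host two
```

With five workers and three hypervisors, calling `place(worker_vms, hypervisors, "anti-affinity")` raises, which is why the larger worker group can only use the soft policy.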

I'm fine if we decide to drop/revert this setup and therefore simplify our upgrading process.

aborrero lowered the priority of this task from High to Medium. Oct 4 2021, 10:30 AM
aborrero updated Other Assignee, added: mdipietro.

Cool, I'll dig in more on how we deploy the base VMs. But I believe you're right, and also believe that Brooke agrees, that the network traffic is enough to justify this setup. Thanks for the info.