
toolforge: latest k8s worker nodes have networking issues
Closed, Resolved (Public)

Description

As part of T329357: Toolforge: scale down grid engine nodes, scale up k8s workers (mid February 2023) we created 3 new worker nodes.

All of them have networking problems. Are they the first nodes created using the automated cookbook? If so, it may indicate some missing step somewhere.

Event Timeline

Mentioned in SAL (#wikimedia-cloud) [2023-02-13T13:15:52Z] <arturo> cordoned & drained k8s workers 4 to 7 to force workload to relocate to 8 (T329378)

A new worker created in toolsbeta appears to work, so I can't reproduce the problem there.

The next thing I would try is to schedule (by hand?) a webservice pod on one of the newer worker nodes and debug why it doesn't work.
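
One way to place a pod by hand is to set `spec.nodeName` in the pod manifest, which bypasses the scheduler and binds the pod directly to the named node. The following is only a sketch: the pod name, container image, and `sleep` command are hypothetical placeholders, not what a real Toolforge webservice pod runs, and the node name is one of the new workers.

```python
import json


def debug_pod_manifest(node_name, pod_name="net-debug",
                       image="example/debug-image:latest"):
    """Build a Pod manifest pinned to a single node via spec.nodeName.

    pod_name and image are hypothetical placeholders for illustration.
    """
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": pod_name},
        "spec": {
            # nodeName binds the pod to this node without going through the scheduler
            "nodeName": node_name,
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": "debug",
                    "image": image,
                    # Keep the container alive long enough to exec into it
                    "command": ["sleep", "3600"],
                }
            ],
        },
    }


manifest = debug_pod_manifest("tools-k8s-worker-80")
print(json.dumps(manifest, indent=2))
```

The printed JSON could be piped to `kubectl apply -f -`, after which network tests can be run from inside the pod with `kubectl exec`.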

This turned out to be a security group issue. The cookbook gave the workers the 'tools-k8s-full-connectivity' security group, but it should have given them 'tools-new-k8s-full-connectivity'. I've fixed those manually, but I'm leaving the task open to fix the cookbook in one way or another.
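
For catching this kind of mismatch in the future, a small audit along these lines might help. It is a rough sketch, not the cookbook's logic: the inventory is hard-coded, the 'default' group and the tools-k8s-worker-79 entry are hypothetical, and a real check would have to pull each instance's security groups from OpenStack.

```python
EXPECTED_GROUP = "tools-new-k8s-full-connectivity"


def workers_missing_group(inventory, expected=EXPECTED_GROUP):
    """Return the sorted names of workers that lack the expected security group.

    inventory maps a worker name to the set of security group names attached to it.
    """
    return sorted(
        name for name, groups in inventory.items() if expected not in groups
    )


# Hypothetical inventory: worker-79 and the 'default' group are made up for
# illustration; workers 80-82 mirror the state described in this task.
inventory = {
    "tools-k8s-worker-79": {"default", "tools-new-k8s-full-connectivity"},
    "tools-k8s-worker-80": {"default", "tools-k8s-full-connectivity"},
    "tools-k8s-worker-81": {"default", "tools-k8s-full-connectivity"},
    "tools-k8s-worker-82": {"default", "tools-k8s-full-connectivity"},
}

for name in workers_missing_group(inventory):
    print(f"{name}: missing {EXPECTED_GROUP}")
```

Run against the sample inventory, this flags workers 80 to 82, the three nodes that had to be fixed manually.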

Mentioned in SAL (#wikimedia-cloud) [2023-02-19T09:16:26Z] <taavi> uncordon tools-k8s-worker-[80-82] after fixing security groups T329378

taavi claimed this task.

The cookbook was fixed.