
[lima-kilo] when using "--ha", some containers are not restarting after restarting the VM
Open, Low · Public · BUG REPORT

Description

Steps to replicate the issue (include links if applicable):

  • Create a new lima-kilo VM with ./start-devenv.sh --ha
  • Wait for Ansible to provision everything
  • Stop the VM with limactl stop lima-kilo, then restart it with limactl start lima-kilo

What happens?:

  • Some of the Docker containers managed by Kind are not restarted
  • This leads to kubectl not working at all:
fran@lima-kilo:/Users/fran/wmf/lima-kilo$ kubectl get pods
The connection to the server 127.0.0.1:35861 was refused - did you specify the right host or port?
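A quick way to confirm whether the "connection refused" is simply the API server container being down is to compare the server address in the kubeconfig with the host port kind actually published. This is a diagnostic sketch; the jsonpath query and `docker ps` format strings are standard, but the `control-plane` name filter is an assumption based on this setup:

```shell
# Where does kubectl think the API server is?
kubectl config view --minify -o jsonpath='{.clusters[0].cluster.server}'; echo

# Which host port did kind actually publish for the control-plane container(s)?
docker ps --format '{{.Names}}\t{{.Ports}}' | grep -i control-plane
```

If the first command prints a `127.0.0.1:<port>` that does not appear in the second command's output, the container backing that port is not running.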

What should have happened instead?:

  • All containers should come back up after restarting the VM

Other information

Missing containers can be listed with docker ps -f "status=exited".
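The list-then-restart step can be scripted in one loop. This is a sketch; the `toolforge` name filter is an assumption based on the container names seen in this report:

```shell
# Collect the names of kind containers that exited instead of restarting
exited=$(docker ps -a -f "status=exited" -f "name=toolforge" --format '{{.Names}}')

# Try restarting each one
for name in $exited; do
  docker start "$name"
done
```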

Manually restarting them with docker start {containername} does not fix all issues, because then I get:

fran@lima-kilo:/Users/fran/wmf/lima-kilo$ kubectl get pods
Unable to connect to the server: EOF

Details

Related Changes in GitLab:
Title: Create single-node clusters by default
Reference: repos/cloud/toolforge/lima-kilo!223
Author: fnegri
Source Branch: no-ha
Dest Branch: main

Event Timeline

fnegri updated the task description.
fnegri added a subscriber: aborrero.

After restarting all the kind containers, haproxy still has trouble connecting to the k8s control plane:

fran@lima-kilo:~$ docker logs toolforge-external-load-balancer

[...]

[WARNING] 028/183133 (8) : Server kube-apiservers/toolforge-control-plane is DOWN, reason: Layer4 connection problem, info: "SSL handshake failure", check duration: 0ms. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
fnegri changed the task status from Open to In Progress. Jan 29 2025, 6:41 PM
fnegri claimed this task.
fnegri triaged this task as High priority.

Sometimes haproxy is able to connect to some nodes but not all:

[WARNING] 029/110904 (8) : Server kube-apiservers/toolforge-control-plane2 is DOWN, reason: Layer4 connection problem, info: "SSL handshake failure", check duration: 0ms. 2 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
[WARNING] 029/110905 (8) : Server kube-apiservers/toolforge-control-plane3 is DOWN, reason: Layer4 connection problem, info: "SSL handshake failure", check duration: 0ms. 1 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
[WARNING] 029/110906 (8) : Server kube-apiservers/toolforge-control-plane is DOWN, reason: Layer4 connection problem, info: "SSL handshake failure", check duration: 0ms. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
[ALERT] 029/110906 (8) : backend 'kube-apiservers' has no server available!
[WARNING] 029/110938 (8) : Server kube-apiservers/toolforge-control-plane is UP, reason: Layer7 check passed, code: 200, check duration: 4ms. 1 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.
[WARNING] 029/110954 (8) : Server kube-apiservers/toolforge-control-plane2 is UP, reason: Layer7 check passed, code: 200, check duration: 2ms. 2 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.
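To watch these backend state transitions as they happen, the load balancer's logs can be followed and filtered. A minimal sketch, assuming the container name from this setup and that the kind external load balancer logs haproxy output to stdout/stderr:

```shell
# Follow the load balancer logs and surface only backend state changes
docker logs -f toolforge-external-load-balancer 2>&1 \
  | grep -E 'DOWN|UP|no server available'
```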

Kubelet logs from the container that is detected as DOWN:

fran@lima-kilo:/Users/fran$ docker exec -it toolforge-control-plane3 journalctl -u kubelet -f
Jan 30 11:15:26 toolforge-control-plane3 kubelet[206]: I0130 11:15:26.217807     206 scope.go:117] "RemoveContainer" containerID="d9f7da697acdc4a1f2fc01f657cd21da92802a88b0bde3436cd0dbc9d9b9740c"
Jan 30 11:15:26 toolforge-control-plane3 kubelet[206]: E0130 11:15:26.218323     206 pod_workers.go:1300] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"kube-apiserver\" with CrashLoopBackOff: \"back-off 2m40s restarting failed container=kube-apiserver pod=kube-apiserver-toolforge-control-plane3_kube-system(dd38563e56e20e9d65cf806b36edfc6c)\"" pod="kube-system/kube-apiserver-toolforge-control-plane3" podUID="dd38563e56e20e9d65cf806b36edfc6c"
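Since the kube-apiserver pod is in CrashLoopBackOff, the next step is to read the crashed container's own logs from inside the kind node. Kind node images ship crictl, so this sketch (node and pod names taken from this report) should get at the failure reason:

```shell
# List all kube-apiserver containers on the node, including exited ones
docker exec toolforge-control-plane3 crictl ps -a --name kube-apiserver

# Grab the most recent kube-apiserver container ID and dump its logs
CID=$(docker exec toolforge-control-plane3 crictl ps -a --name kube-apiserver -q | head -n1)
docker exec toolforge-control-plane3 crictl logs "$CID"
```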

I was trying to reproduce this, but found it impossible to create a deployment using the default invocation of ./start-devenv.sh as currently documented. Due to a variety of undetermined problems, I could only bootstrap lima-kilo with ./start-devenv.sh --no-ha --no-cache.

Doing that, I was able to stop/start the lima VM multiple times without problems: all containers were OK and I could interact with the k8s API after the restart.

I can reproduce the problem with --no-cache, so I don't think the cache is the problem.

I cannot reproduce the problem with --no-ha, the single toolforge-control-plane is restarted correctly after limactl stop and limactl start.

It looks like the multi-control-plane setup is a bit flaky. Using --no-ha runs a single control plane and does not install haproxy at all.

Related upstream issue: https://github.com/kubernetes-sigs/kind/issues/2045:

I don't recommend HA control plane for development (or even multi-node) unless you have realllly specific testing needs (e.g. developing an HA control plane component).

I would consider making --no-ha the default option until the multi-control-plane setup is working more reliably.

fnegri renamed this task from [lima-kilo] some containers are not restarting when restarting the VM to [lima-kilo] when using "--ha", some containers are not restarting after restarting the VM. Feb 3 2025, 4:42 PM
fnegri closed this task as Resolved.
fnegri lowered the priority of this task from High to Low.
fnegri updated the task description.

With the patch above, non-HA becomes the default setup, so this issue becomes less urgent: it will now only happen when explicitly passing --ha to ./start-devenv.sh.