Toolforge k8s: Migrate workers to Containerd and Bookworm
Closed, Resolved (Public)

Description

To upgrade to Kubernetes 1.24, we need to migrate the Toolforge workers to Containerd. We need Debian 12 to get a new enough Containerd version.

toolsbeta

  • control
  • worker
  • ingress

tools

  • control
  • worker
  • ingress

Event Timeline

For this, we currently rely on docker settings to manage log length in containerd, much like prod does. We will want to find an equivalent later, because some tools are otherwise very good at filling worker nodes (an old problem around here, see T148487). logrotate can handle it, but docker was quite good at it, with fewer failures while waiting for a logrotate run (yes, people regularly crashed k8s nodes between logrotate runs, typically using Java).

Whatever you use to solve that problem (some containerd setting or podman thing?), just know that our users can certainly outsmart logrotate by mistake.

Oh yeah, please don't remove docker-ce from the repos unless you also account for Harbor's use of it. It's running in docker compose and currently using our kubeadm components to do it.

I was planning on just using what Debian packages: https://packages.debian.org/bullseye/docker.io and https://apt-browser.toolforge.org/buster-wikimedia/thirdparty/kubeadm-k8s-1-20/ both seem recent enough. Good to know Harbor needs this though, thanks!

Beware: Kubernetes 1.24 requires containerd v1.6.4+ or v1.5.11+, while Bullseye repositories have 1.4.13. Bookworm (bullseye + 1) will ship with 1.6, or we might need to use third-party packages which I'd really rather not do.
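
The version constraint above can be checked mechanically. This is a minimal sketch, assuming plain `MAJOR.MINOR.PATCH` version strings (Debian package suffixes such as `~ds1-1` are not handled). The helper names `parse_version` and `supports_k8s_1_24` are hypothetical, and the rule encoded is only the one quoted in this comment:

```python
def parse_version(version: str) -> tuple[int, int, int]:
    """Parse 'v1.6.4' or '1.6.4' into ints (assumes MAJOR.MINOR.PATCH)."""
    major, minor, patch = version.lstrip("v").split(".")[:3]
    return int(major), int(minor), int(patch)


def supports_k8s_1_24(containerd_version: str) -> bool:
    """Apply the rule quoted above: containerd v1.6.4+ or v1.5.11+."""
    parsed = parse_version(containerd_version)
    if parsed[:2] == (1, 5):
        # On the 1.5 series, only 1.5.11 and later qualify.
        return parsed >= (1, 5, 11)
    # Everything else must be at least 1.6.4.
    return parsed >= (1, 6, 4)
```

With the versions mentioned here, `supports_k8s_1_24("1.4.13")` (Bullseye) is false, while any 1.6 release from 1.6.4 onwards passes.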

taavi renamed this task from Toolforge k8s: Migrate from Docker to Containerd to Toolforge k8s: Migrate workers to Containerd and Bookworm. Oct 18 2023, 1:04 PM
taavi raised the priority of this task from Low to Medium.
taavi updated the task description.
taavi removed a subscriber: Bstorm.

Change 967875 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] aptrepo: Import kubeadm 1.23 for bookworm

https://gerrit.wikimedia.org/r/967875

Change 968618 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] P:wmcs::kubeadm: rely on iptables-nft on bookworm

https://gerrit.wikimedia.org/r/968618

Change 967875 merged by Majavah:

[operations/puppet@production] aptrepo: Import kubeadm 1.23 for bookworm

https://gerrit.wikimedia.org/r/967875

Mentioned in SAL (#wikimedia-operations) [2023-10-25T10:02:03Z] <taavi> import kubernetes 1.23 packages for debian bookworm T284656

Change 968623 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] P:wmcs::kubeadm: install containerd on bookworm

https://gerrit.wikimedia.org/r/968623

Change 968618 merged by Majavah:

[operations/puppet@production] P:wmcs::kubeadm: rely on iptables-nft on bookworm

https://gerrit.wikimedia.org/r/968618

Change 968623 merged by Majavah:

[operations/puppet@production] P:wmcs::kubeadm: install containerd on bookworm

https://gerrit.wikimedia.org/r/968623

Change 968634 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] kubeadm: only install containerd.io with docker

https://gerrit.wikimedia.org/r/968634

Change 968635 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] kubeadm: containerd: install br_netfilter kmod

https://gerrit.wikimedia.org/r/968635

Change 968647 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] kubeadm: add required config for containerd

https://gerrit.wikimedia.org/r/968647

The above patches make it possible to provision a new host on bookworm. There are a few issues, however:

  • cadvisor does not work
  • the extra volume for /var/lib/docker has not been ported yet
  • I haven't checked if the log file max size still works

Change 968634 merged by Majavah:

[operations/puppet@production] kubeadm: only install containerd.io with docker

https://gerrit.wikimedia.org/r/968634

Change 968635 merged by Majavah:

[operations/puppet@production] kubeadm: containerd: add kernel modules and config

https://gerrit.wikimedia.org/r/968635

Change 968647 merged by Majavah:

[operations/puppet@production] kubeadm: add required config for containerd

https://gerrit.wikimedia.org/r/968647

dcaro changed the task status from Open to In Progress. Jan 18 2024, 5:06 PM

> cadvisor does not work

Fixed with the upgrade.

> I haven't checked if the log file max size still works

According to the Kubernetes docs, "containerLogMaxSize is a quantity defining the maximum size of the container log file before it is rotated. For example: "5Mi" or "256Ki". If DynamicKubeletConfig (deprecated; default off) is on, when dynamically updating this field, consider that it may trigger log rotation. Default: "10Mi"". So I think we're fine.
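
To get a feel for what that default means on disk, here is a rough sketch that converts such a quantity to bytes and multiplies it by the number of rotated files kept. It assumes integer values with binary suffixes only, and it takes `containerLogMaxFiles` (default 5 in the kubelet configuration docs, not mentioned in this task) as the file count. The function names are hypothetical, and the result is only an approximate bound, since a fast writer may overshoot before rotation happens:

```python
# Binary suffixes used by Kubernetes quantities such as "5Mi" or "256Ki".
_BINARY_SUFFIXES = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3}


def quantity_to_bytes(quantity: str) -> int:
    """Convert a quantity like '10Mi' to bytes (integer values only)."""
    for suffix, factor in _BINARY_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)  # no suffix: treat as a plain byte count


def approx_log_bytes_per_container(max_size: str = "10Mi", max_files: int = 5) -> int:
    """Approximate on-disk log footprint of one container after rotation."""
    return quantity_to_bytes(max_size) * max_files
```

Under those assumed defaults, one container would keep roughly 50 MiB of logs, so the per-node total scales with the number of containers scheduled on it.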

So that leaves the extra volume. It seems that Containerd spreads what Docker stores in /var/lib/docker across a few different places, so we need to work around that somehow.

Change 992633 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] P:wmcs::kubeadm: worker: support containerd separate volume

https://gerrit.wikimedia.org/r/992633

Change 992633 merged by Majavah:

[operations/puppet@production] P:wmcs::kubeadm: worker: support containerd separate volume

https://gerrit.wikimedia.org/r/992633

Change 992923 had a related patch set uploaded (by Majavah; author: Majavah):

[cloud/wmcs-cookbooks@main] toolforge: add_k8s_node: Add support for containerd

https://gerrit.wikimedia.org/r/992923

Change 992926 had a related patch set uploaded (by Majavah; author: Majavah):

[cloud/wmcs-cookbooks@main] toolforge: add_k8s_node: Allow passing --network

https://gerrit.wikimedia.org/r/992926

Change 992923 merged by jenkins-bot:

[cloud/wmcs-cookbooks@main] toolforge: add_k8s_node: Add support for containerd

https://gerrit.wikimedia.org/r/992923

Change 992926 merged by jenkins-bot:

[cloud/wmcs-cookbooks@main] toolforge: add_k8s_node: Allow passing --network

https://gerrit.wikimedia.org/r/992926

Mentioned in SAL (#wikimedia-cloud-feed) [2024-02-22T09:29:38Z] <aborrero@cloudcumin1001> START - Cookbook wmcs.toolforge.add_k8s_node for a control role in the tools cluster (T284656)

Running cookbook:

aborrero@cloudcumin1001:~ $ sudo cookbook wmcs.toolforge.add_k8s_node --cluster-name tools --task-id T284656 --role control

If this were the first bookworm control node, we would add the argument --image debian-12.0-bookworm, but since it is not, the cookbook will use the last node image.
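
The choice described above (pass `--image` only for the first node on a new release) can be sketched as a small command builder. This is an illustration only: the function name `add_node_command` is hypothetical, it just assembles the argument list shown in this comment, and it does not run anything:

```python
from typing import Optional


def add_node_command(
    cluster: str, task_id: str, role: str, image: Optional[str] = None
) -> list[str]:
    """Build the argv for the add_k8s_node cookbook invocation shown above."""
    cmd = [
        "sudo", "cookbook", "wmcs.toolforge.add_k8s_node",
        "--cluster-name", cluster,
        "--task-id", task_id,
        "--role", role,
    ]
    if image is not None:
        # Only needed for the first node on a new image; otherwise the
        # cookbook reuses the image of the last node, as noted above.
        cmd += ["--image", image]
    return cmd
```

For example, `add_node_command("tools", "T284656", "control")` reproduces the command above, and passing `image="debian-12.0-bookworm"` appends the extra argument.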

Mentioned in SAL (#wikimedia-cloud-feed) [2024-02-22T11:23:02Z] <aborrero@cloudcumin1001> START - Cookbook wmcs.toolforge.remove_k8s_node (T284656)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-02-22T11:23:50Z] <aborrero@cloudcumin1001> END (PASS) - Cookbook wmcs.toolforge.remove_k8s_node (exit_code=0) (T284656)

I plan to upgrade the last control node next Monday, 2024-02-26.

Change 1005766 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[cloud/wmcs-cookbooks@main] inventory: refresh tools k8s control nodes

https://gerrit.wikimedia.org/r/1005766

Change 1005766 merged by Arturo Borrero Gonzalez:

[cloud/wmcs-cookbooks@main] inventory: refresh tools k8s control nodes

https://gerrit.wikimedia.org/r/1005766

Change 1005954 had a related patch set uploaded (by Majavah; author: Majavah):

[cloud/wmcs-cookbooks@main] toolforge: k8s: Support containerd as container runtime

https://gerrit.wikimedia.org/r/1005954

Change 1005954 merged by jenkins-bot:

[cloud/wmcs-cookbooks@main] toolforge: k8s: Support containerd as container runtime

https://gerrit.wikimedia.org/r/1005954

Mentioned in SAL (#wikimedia-cloud-feed) [2024-02-26T09:26:11Z] <aborrero@cloudcumin1001> START - Cookbook wmcs.toolforge.add_k8s_node for a control role in the tools cluster (T284656)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-02-26T09:53:59Z] <aborrero@cloudcumin1001> START - Cookbook wmcs.toolforge.remove_k8s_node (T284656)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-02-26T09:54:45Z] <aborrero@cloudcumin1001> END (PASS) - Cookbook wmcs.toolforge.remove_k8s_node (exit_code=0) (T284656)

aborrero closed this task as Resolved. Feb 26 2024, 10:52 AM (edited)
aborrero updated the task description.

This is done:

aborrero@tools-sgebastion-11:~$ kubectl sudo get nodes -o wide | grep control
tools-k8s-control-7       Ready    control-plane,master   5d1h    v1.23.17   172.16.0.144   <none>        Debian GNU/Linux 12 (bookworm)   6.1.0-18-cloud-amd64   containerd://1.6.20
tools-k8s-control-8       Ready    control-plane,master   4d1h    v1.23.17   172.16.5.194   <none>        Debian GNU/Linux 12 (bookworm)   6.1.0-18-cloud-amd64   containerd://1.6.20
tools-k8s-control-9       Ready    control-plane,master   77m     v1.23.17   172.16.3.135   <none>        Debian GNU/Linux 12 (bookworm)   6.1.0-18-cloud-amd64   containerd://1.6.20
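
A listing like the one above can also be checked by script. The following is a minimal sketch, assuming header-less `kubectl get nodes -o wide` output in which the node name is the first column and the container runtime is the last. The function name `nodes_not_migrated` is hypothetical:

```python
SAMPLE = """\
tools-k8s-control-7       Ready    control-plane,master   5d1h    v1.23.17   172.16.0.144   <none>        Debian GNU/Linux 12 (bookworm)   6.1.0-18-cloud-amd64   containerd://1.6.20
tools-k8s-control-8       Ready    control-plane,master   4d1h    v1.23.17   172.16.5.194   <none>        Debian GNU/Linux 12 (bookworm)   6.1.0-18-cloud-amd64   containerd://1.6.20
tools-k8s-control-9       Ready    control-plane,master   77m     v1.23.17   172.16.3.135   <none>        Debian GNU/Linux 12 (bookworm)   6.1.0-18-cloud-amd64   containerd://1.6.20
"""


def nodes_not_migrated(listing: str) -> list[str]:
    """Return names of nodes that are not on containerd and bookworm."""
    pending = []
    for line in listing.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        name, runtime = fields[0], fields[-1]
        # The OS image column contains spaces, so match it on the whole line.
        if not runtime.startswith("containerd://") or "(bookworm)" not in line:
            pending.append(name)
    return pending
```

For the three control nodes listed above, `nodes_not_migrated(SAMPLE)` returns an empty list.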

Change 1011138 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] kubeadm: Drop buster support

https://gerrit.wikimedia.org/r/1011138

Change 1011138 merged by Majavah:

[operations/puppet@production] kubeadm: Drop buster support

https://gerrit.wikimedia.org/r/1011138