Migration to containerd and away from docker
Open, High, Public

Authored By

	akosiaris
	Apr 12 2024, 1:44 PM

Description

Per T269684, we need to move away from docker. In February 2024, the serviceops team announced the results of its evaluation of the candidate replacement engines. The results and criteria are documented in Kubernetes/CRE. The chosen container runtime engine is containerd. This task describes the plan for the migration and tracks the migration process itself.

Plan

containerd upgrade

We'll probably need a new profile, profile::containerd or similar.
Create proper cgroups config for containerd (https://kubernetes.io/docs/setup/production-environment/container-runtimes/#containerd)
Handle pulling of restricted images with containerd (provide authentication credentials etc)
Test integration with dragonfly/dfget

For the actual upgrade

Run some workers (4 in codfw as a start) with bookworm to surface potential OS-related issues:
1. wikikube-worker2085 (R440)
2. wikikube-worker2086 (R440)
3. wikikube-worker2088 (Supermicro)
4. wikikube-worker2089 (R450)
Create puppetization for the configuration required by kubernetes
Reimage some nodes with bookworm + containerd (>=1.6)
Upgrade all clusters to the newer containerd, rolling-reimage of nodes
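
As a rough illustration of the rolling part of the last step, the sketch below splits a node list into fixed-size batches so that only a few nodes are out of the cluster at a time. The function name and the batch size are hypothetical; the real cookbooks may group nodes differently:

```python
from typing import Iterator, List, Sequence


def reimage_batches(nodes: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive batches of at most batch_size nodes."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(nodes), batch_size):
        yield list(nodes[start:start + batch_size])


if __name__ == "__main__":
    # The four codfw workers named in this task.
    workers = [
        "wikikube-worker2085",
        "wikikube-worker2086",
        "wikikube-worker2088",
        "wikikube-worker2089",
    ]
    for batch in reimage_batches(workers, 2):
        # Each batch would be depooled, reimaged and repooled before the next one starts.
        print(batch)
```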

nerdctl

Docker has a relatively user-friendly CLI; containerd doesn't. The ctr tool it ships with is a lower-level, albeit useful, tool. nerdctl is a CLI released by the containerd project that is compatible with the docker CLI.

Package nerdctl, probably utilizing our Upstream binaries policy to avoid the onus of having to build every single dependency.
Use puppet to install the package and populate a nerdctl configuration file, /etc/nerdctl/nerdctl.toml, so that nerdctl defaults to the k8s.io namespace.
Test and approve.
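
The namespace default matters because containers created through the CRI live in the k8s.io containerd namespace, while nerdctl looks at the "default" namespace unless told otherwise. A minimal sketch of the file content puppet could render, assuming nothing else needs to be set; the function name and the header comment are illustrative:

```python
def render_nerdctl_toml(namespace: str = "k8s.io") -> str:
    """Render a minimal nerdctl.toml that only sets the default namespace."""
    lines = [
        "# Hypothetical content for /etc/nerdctl/nerdctl.toml",
        f'namespace = "{namespace}"',
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_nerdctl_toml(), end="")
```

With that in place, a plain nerdctl ps should list the kubelet-managed containers without passing --namespace k8s.io each time.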

crictl

The Kubernetes project builds crictl (cri-tools, https://github.com/kubernetes-sigs/cri-tools/tree/master) to interact with a CRI the way the kubelet would. In my initial tests, nerdctl did not completely honor all containerd configuration (like the registry mirrors and authentication we require for dragonfly), so I decided to also package crictl and have it installed on all nodes.

Kubelet (the above are a prereq)

Amend puppet to put the following 2 kubelet parameters behind a feature flag:

--container-runtime-endpoint=unix:///run/containerd/containerd.sock 
--container-runtime=remote
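
The feature flag could work roughly as sketched below: the two parameters are only added to the kubelet command line when containerd is selected. This is an illustration of the logic, not the puppet code; the function and parameter names are hypothetical:

```python
from typing import List

# The two parameters listed above, verbatim.
CONTAINERD_ARGS = [
    "--container-runtime-endpoint=unix:///run/containerd/containerd.sock",
    "--container-runtime=remote",
]


def kubelet_runtime_args(use_containerd: bool) -> List[str]:
    """Return the extra kubelet arguments for the selected container runtime."""
    if not use_containerd:
        # Flag off: the kubelet keeps its existing docker setup.
        return []
    return list(CONTAINERD_ARGS)


if __name__ == "__main__":
    print(" ".join(kubelet_runtime_args(use_containerd=True)))
```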

Metrics

Replace kubelet_docker_operations_* with kubelet_runtime_operations_*
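
For dashboards and alert rules, the rename can be done mechanically on the query strings. A small sketch, assuming the queries only differ by the metric name prefix (labels and suffixes should be double-checked per query); the function name is illustrative:

```python
import re

# Match the old prefix only at the start of a metric name.
_OLD_PREFIX = re.compile(r"\bkubelet_docker_operations_")


def migrate_metric_names(expr: str) -> str:
    """Rewrite kubelet_docker_operations_* to kubelet_runtime_operations_* in a query."""
    return _OLD_PREFIX.sub("kubelet_runtime_operations_", expr)


if __name__ == "__main__":
    print(migrate_metric_names("sum(rate(kubelet_docker_operations_errors_total[5m]))"))
```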

Log processing

Log parsing does not work properly on containerd nodes. Logs that usually carry the k8s_docker_log_field_parsed tag no longer have it:

T377132: containerd logs are not properly parsed during ingestion to logstash
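
One difference that commonly trips up log pipelines after such a migration is the on-disk format: docker's json-file driver writes one JSON object per line, while containerd writes the CRI format "<timestamp> <stream> <P|F> <message>". Whether this is the exact cause tracked in T377132 is an assumption here. A minimal sketch of parsing a CRI log line, with illustrative names:

```python
from typing import NamedTuple, Optional


class CriLogLine(NamedTuple):
    timestamp: str
    stream: str    # "stdout" or "stderr"
    partial: bool  # True when the runtime split a long line (tag "P")
    message: str


def parse_cri_line(line: str) -> Optional[CriLogLine]:
    """Parse one CRI-format log line; return None if it does not match."""
    parts = line.rstrip("\n").split(" ", 3)
    if len(parts) < 3:
        return None
    timestamp, stream, tag = parts[0], parts[1], parts[2]
    if stream not in ("stdout", "stderr") or tag not in ("P", "F"):
        return None
    message = parts[3] if len(parts) == 4 else ""
    return CriLogLine(timestamp, stream, tag == "P", message)


if __name__ == "__main__":
    sample = '2024-10-18T12:00:00.123456789Z stdout F {"level":"info","msg":"hello"}'
    print(parse_cri_line(sample))
```

A docker json-file line such as {"log":"...","stream":"stdout","time":"..."} does not match this shape, so a parser written for one format will not tag lines written in the other.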

Things to do after all k8s nodes have been migrated off of docker

Remove puppet classes no longer in use (if there are any)
Ensure all profile::docker::engine related hiera keys are gone (as well as profile::kubernetes::node::docker_kubernetes_user_password)

How to migrate to containerd

https://wikitech.wikimedia.org/wiki/Kubernetes/Administration/containerd_migration

Details

Subject	Repo	Branch	Lines +/-
wikikube: Default to containerd partition layout	operations/puppet	production	+30 -42
k8s.reimage-stacked-control-plane: Ask for the management password early	operations/cookbooks	master	+4 -0
k8s.reimage-stacked-control-plane: Wait for 3m after depooling	operations/cookbooks	master	+6 -0
k8s.reimage-stacked-control-plane: Add --force-dhcp-tftp	operations/cookbooks	master	+25 -8
wikikube-staging: Remove obsolete docker hiera config	operations/puppet	production	+0 -9
Add a cookbook to roll-reimage stacked k8s control planes	operations/cookbooks	master	+307 -0
k8s.pool-depool-node: Add support for multiple nodes	operations/cookbooks	master	+161 -106
Migrate wikikube-worker208[5689] to containerd	operations/puppet	production	+5 -16
etcd::v3: Don't set trusted-ca-file if client-cert-auth is false	operations/puppet	production	+8 -0
etcd::v3: Ensure trusted-ca-file is not set on first puppet run with 3.4	operations/puppet	production	+2 -2
wikikube: Remove explicit container_runtime config	operations/puppet	production	+0 -2
wikikube: Prepare clusters for containerd workers	operations/puppet	production	+24 -9
containerd: Remove container log line length limit	operations/puppet	production	+4 -0
dragonfly::dfdaemon: Refactor docker integration	operations/puppet	production	+27 -23
dragonfly::dfdaemon: Enable by default when profile is included	operations/puppet	production	+1 -5
Remove role kubernetes::staging::worker_containerd	labs/private	master	+0 -1
cumin/aliases: Merge worker_containerd back to worker role	operations/puppet	production	+4 -4
kubernetes: Create profile::kubernetes::container_runtime	operations/puppet	production	+56 -82
Merge worker_containerd back to worker role	labs/private	master	+2 -1
cumin/aliases: Remove P{O:kubernetes::staging::worker}	operations/puppet	production	+1 -1
Migrate kubestage1004 to containerd	operations/puppet	production	+1 -53
Migrate kubestage1003 to containerd	operations/puppet	production	+7 -3
k8s/kubelet: Remove absent containerd specific systemd override	operations/puppet	production	+0 -7
k8s/kubelet: Make kubelet.service depend on container runtime	operations/puppet	production	+1 -1
k8s/kubelet: Make kubelet.service depend on container runtime	operations/puppet	production	+14 -3
Remove kubelet systemd unit dependency to docker.service	operations/debs/kubernetes	v1.23	+10 -3
wikikube-staging-codfw: Migrate kubestage2002 to containerd	operations/puppet	production	+1 -1
wikikube-staging-codfw: Migrate kubestage2002 to containerd	operations/puppet	production	+1 -6
kubernetes/staging: Add role master_stacked_containerd	operations/puppet	production	+60 -0
containerd: Enable unprivileged icmp and binding to ports < 1024	operations/puppet	production	+15 -0
cumin/aliases: Add containerd roles to wikikube aliases	operations/puppet	production	+4 -4
kubelet: Remove --pod-infra-container-image when using containerd	operations/puppet	production	+4 -3
kubelet/containerd: Fix registry authentication	operations/puppet	production	+17 -5
kubernetes::worker_containerd: Fix registry_auth hiera key	labs/private	master	+2 -2
kubelet/containerd: Fix runc config and kubelet systemd unit	operations/puppet	production	+12 -5
Initial commit of containerd puppet code	operations/puppet	production	+317 -8
kubernetes::worker_containerd: Fix registry_auth hiera key	labs/private	master	+2 -2
kubernetes::worker_containerd: Add registry 'secrets'	labs/private	master	+2 -0
Rename kubernets2009,2010,2035,2054, reimage to bookworm	operations/puppet	production	+10 -10

Related Objects

Status	Assigned	Task
Open	None	T341984 Update Kubernetes clusters to >1.25
Open	JMeybohm	T269684 [EPIC] Docker deprecation as a container runtime enginer for kubernetes.
Open	JMeybohm	T362408 Migration to containerd and away from docker
Resolved	JMeybohm	T375488 prometheus node exporter filesystem metrics exclude /var/lib/docker and /var/lib/kubelet
Resolved	JMeybohm	T377132 containerd logs are not properly parsed during ingestion to logstash
In Progress	kamila	T377857 Cookbook to roll-reimage k8s nodes
Open	Stevemunene	T377875 Migrate dse-k8s cluster from docker to containerd
Open	None	T377876 Migrate wikikube-eqiad to containerd
Declined	VRiley-WMF	T379622 wikikube-ctrl1001.eqiad.wmnet: The CMOS battery has reached the end of its usable life or has failed.
Resolved	JMeybohm	T379629 wikikube-ctrl1001.eqiad.wmnet fails PXE boot
Open	None	T379717 wikikube-ctrl1002 and wikikube-ctrl1003: Switch network cable from port 2 to port 1 on the 10G NIC
Resolved	None	T381268 Relabel eqiad kubernetes nodes
Resolved	VRiley-WMF	T381504 Relabel eqiad kubernetes nodes
Open	None	T381676 Comm Error: backplane 0 when reimaging wikikube-worker1057
Open	None	T381770 Comm Error: backplane 0 when reimaging wikikube-worker1069
Open	None	T381789 Comm Error: backplane 0 when reimaging wikikube-worker1073
Open	Jclark-ctr	T381878 Comm Error: backplane 0 when reimaging wikikube-worker1081
Resolved	akosiaris	T379790 Reimage wikikube-worker1290 in eqiad as a replacement for wikikube-ctrl1001
Open	None	T377877 Migrate wikikube-codfw to containerd
Resolved	JMeybohm	T379719 wikikube-ctrl2002: Switch network cable from port 2 to port 1 on the 10G NIC
Resolved	Jhancock.wm	T381967 Relabel codfw kubernetes nodes
Resolved	Jhancock.wm	T382420 Comm Error: backplane 0 when reimaging wikikube-worker2190
Resolved	Jhancock.wm	T382422 Relabel codfw kubernetes nodes
Resolved	elukey	T378345 Migrate the AUX K8s cluster to containerd

Event Timeline

Maintenance_bot removed a project: Patch-For-Review. (Oct 8 2024, 11:30 AM)

Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1002 for host kubestage2001.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host kubestage2001.codfw.wmnet with OS bookworm completed:

kubestage2001 (PASS)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bookworm OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202410081201_jayme_2952057_kubestage2001.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB

Change #1078677 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] Migrate kubestage1003 to containerd

https://gerrit.wikimedia.org/r/1078677

Change #1078678 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] Migrate kubestage1004 to containerd

https://gerrit.wikimedia.org/r/1078678

There are some hardware refreshes planned which should go Bookworm + containerd right away:

{T376171}
{T376185}
{T376170}

Change #1078677 merged by JMeybohm:

[operations/puppet@production] Migrate kubestage1003 to containerd

https://gerrit.wikimedia.org/r/1078677

Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1002 for host kubestage1003.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host kubestage1003.eqiad.wmnet with OS bookworm completed:

kubestage1003 (PASS)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bookworm OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202410100933_jayme_3362138_kubestage1003.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB

Change #1078678 merged by JMeybohm:

[operations/puppet@production] Migrate kubestage1004 to containerd

https://gerrit.wikimedia.org/r/1078678

Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1002 for host kubestage1004.eqiad.wmnet with OS bookworm

In T362408#10216554, @JMeybohm wrote:

There are some hardware refreshes planned which should go Bookworm + containerd right away:

{T376171}

{T376185}

{T376170}

As well as expansions:

{T376307}
{T376665}

Maintenance_bot removed a project: Patch-For-Review. (Oct 10 2024, 12:31 PM)

Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host kubestage1004.eqiad.wmnet with OS bookworm completed:

kubestage1004 (PASS)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bookworm OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202410101219_jayme_3392864_kubestage1004.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB

Change #1079276 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] cumin/aliases: Remove P{O:kubernetes::staging::worker}

https://gerrit.wikimedia.org/r/1079276

gerritbot added a project: Patch-For-Review. (Oct 10 2024, 1:01 PM)

Change #1079276 merged by JMeybohm:

[operations/puppet@production] cumin/aliases: Remove P{O:kubernetes::staging::worker}

https://gerrit.wikimedia.org/r/1079276

Maintenance_bot removed a project: Patch-For-Review. (Oct 10 2024, 1:30 PM)

Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1002 for host kubestage1003.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host kubestage1003.eqiad.wmnet with OS bookworm completed:

kubestage1003 (PASS)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bookworm OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202410101623_jayme_3430289_kubestage1003.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB

Change #1079935 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] kubernetes: Create profile::kubernetes::container_runtime

https://gerrit.wikimedia.org/r/1079935

gerritbot added a project: Patch-For-Review. (Oct 14 2024, 8:47 AM)

Change #1079955 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[labs/private@master] Merge worker_containerd values back into worker

https://gerrit.wikimedia.org/r/1079955

Change #1079955 merged by JMeybohm:

[labs/private@master] Merge worker_containerd back to worker role

https://gerrit.wikimedia.org/r/1079955

JMeybohm mentioned this in rLPRI68b0f377be2e: Merge worker_containerd back to worker role. (Oct 14 2024, 10:03 AM)

Change #1079935 merged by JMeybohm:

[operations/puppet@production] kubernetes: Create profile::kubernetes::container_runtime

https://gerrit.wikimedia.org/r/1079935

Change #1079960 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] cumin/aliases: Merge worker_containerd back to worker role

https://gerrit.wikimedia.org/r/1079960

Change #1079960 merged by JMeybohm:

[operations/puppet@production] cumin/aliases: Merge worker_containerd back to worker role

https://gerrit.wikimedia.org/r/1079960

Change #1079961 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[labs/private@master] Remove role kubernetes::staging::worker_containerd

https://gerrit.wikimedia.org/r/1079961

Change #1079961 merged by JMeybohm:

[labs/private@master] Remove role kubernetes::staging::worker_containerd

https://gerrit.wikimedia.org/r/1079961

JMeybohm mentioned this in rLPRI45c64ec1774d: Remove role kubernetes::staging::worker_containerd. (Oct 14 2024, 10:17 AM)

Change #1079970 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] wikikube: Prepare clusters for containerd workers

https://gerrit.wikimedia.org/r/1079970

JMeybohm updated the task description. (Oct 14 2024, 11:25 AM)

Change #1080038 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] dragonfly::dfdaemon: Enable by default when profile is included

https://gerrit.wikimedia.org/r/1080038

Change #1080042 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] dragonfly::dfdaemon: Refactor docker integration

https://gerrit.wikimedia.org/r/1080042

Change #1080071 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] containerd: Remove container log line length limit

https://gerrit.wikimedia.org/r/1080071

Change #1080038 merged by JMeybohm:

[operations/puppet@production] dragonfly::dfdaemon: Enable by default when profile is included

https://gerrit.wikimedia.org/r/1080038

Change #1080042 merged by JMeybohm:

[operations/puppet@production] dragonfly::dfdaemon: Refactor docker integration

https://gerrit.wikimedia.org/r/1080042

Change #1080071 merged by JMeybohm:

[operations/puppet@production] containerd: Remove container log line length limit

https://gerrit.wikimedia.org/r/1080071

Change #1079970 merged by JMeybohm:

[operations/puppet@production] wikikube: Prepare clusters for containerd workers

https://gerrit.wikimedia.org/r/1079970

Change #1080554 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] wikikube-staging: Migrate control planes to containerd

https://gerrit.wikimedia.org/r/1080554

Change #1080554 merged by JMeybohm:

[operations/puppet@production] wikikube: Remove explicit container_runtime config

https://gerrit.wikimedia.org/r/1080554

Change #992629 had a related patch set uploaded (by JMeybohm; author: Mxmxchere):

[operations/puppet@production] etcd::v3: Don't set trusted-ca-file if client-cert-auth is false

https://gerrit.wikimedia.org/r/992629

Change #992629 merged by JMeybohm:

[operations/puppet@production] etcd::v3: Don't set trusted-ca-file if client-cert-auth is false

https://gerrit.wikimedia.org/r/992629

Change #1081224 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] etcd::v3: Ensure trusted-ca-file is not set on first puppet run with 3.4

https://gerrit.wikimedia.org/r/1081224

Change #1081377 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/cookbooks@master] Add a cookbook to roll-reimage stacked k8s control planes

https://gerrit.wikimedia.org/r/1081377

Change #1081224 merged by JMeybohm:

[operations/puppet@production] etcd::v3: Ensure trusted-ca-file is not set on first puppet run with 3.4

https://gerrit.wikimedia.org/r/1081224

Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1002 for host kubestagemaster2005.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host kubestagemaster2005.codfw.wmnet with OS bookworm completed:

kubestagemaster2005 (PASS)
- Disabled Puppet
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via gnt-instance
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Set boot media to disk
- Host up (new fresh bookworm OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202410181121_jayme_1191485_kubestagemaster2005.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB

JMeybohm updated the task description. (Oct 18 2024, 2:37 PM)

Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1002 for host kubestagemaster2003.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host kubestagemaster2003.codfw.wmnet with OS bookworm completed:

kubestagemaster2003 (PASS)
- Disabled Puppet
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via gnt-instance
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Set boot media to disk
- Host up (new fresh bookworm OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202410181502_jayme_1220197_kubestagemaster2003.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1002 for host kubestagemaster2004.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host kubestagemaster2004.codfw.wmnet with OS bookworm completed:

kubestagemaster2003 (PASS)
- Disabled Puppet
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via gnt-instance
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Set boot media to disk
- Host up (new fresh bookworm OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202410181502_jayme_1220197_kubestagemaster2003.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB

kubestagemaster2004 (PASS)
- Disabled Puppet
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via gnt-instance
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Set boot media to disk
- Host up (new fresh bookworm OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202410181546_jayme_1220197_kubestagemaster2004.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1002 for host kubestagemaster2005.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host kubestagemaster2005.codfw.wmnet with OS bookworm completed:

kubestagemaster2003 (PASS)
- Disabled Puppet
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via gnt-instance
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Set boot media to disk
- Host up (new fresh bookworm OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202410181502_jayme_1220197_kubestagemaster2003.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB

kubestagemaster2004 (PASS)
- Disabled Puppet
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via gnt-instance
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Set boot media to disk
- Host up (new fresh bookworm OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202410181546_jayme_1220197_kubestagemaster2004.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB

kubestagemaster2005 (PASS)
- Disabled Puppet
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via gnt-instance
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Set boot media to disk
- Host up (new fresh bookworm OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202410181632_jayme_1220197_kubestagemaster2005.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB

JMeybohm closed subtask T377132: containerd logs are not properly parsed during ingestion to logstash as Resolved. (Oct 18 2024, 7:47 PM)

Change #1081910 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] Migrate wikikube-worker208[5689] to containerd

https://gerrit.wikimedia.org/r/1081910

Cookbook cookbooks.sre.k8s.reimage-stacked-control-plane started by jayme@cumin1002 Reimaging k8s control planes of cluster staging-eqiad: containerd migration completed:

kubestagemaster1003 (PASS)
- Disabled Puppet
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via gnt-instance
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Set boot media to disk
- Host up (new fresh bookworm OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202410210826_jayme_1708770_kubestagemaster1003.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.k8s.reimage-stacked-control-plane started by jayme@cumin1002 Reimaging k8s control planes of cluster staging-eqiad: containerd migration completed:

kubestagemaster1003 (PASS)
- Disabled Puppet
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via gnt-instance
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Set boot media to disk
- Host up (new fresh bookworm OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202410210826_jayme_1708770_kubestagemaster1003.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB

kubestagemaster1004 (PASS)
- Disabled Puppet
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via gnt-instance
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Set boot media to disk
- Host up (new fresh bookworm OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202410210906_jayme_1708770_kubestagemaster1004.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB

kubestagemaster1005 (PASS)
- Disabled Puppet
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via gnt-instance
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Set boot media to disk
- Host up (new fresh bookworm OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202410210946_jayme_1708770_kubestagemaster1005.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB

Change #1082191 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/cookbooks@master] k8s.pool-depool-node: Add support for multiple nodes

https://gerrit.wikimedia.org/r/1082191

Change #1081910 merged by JMeybohm:

[operations/puppet@production] Migrate wikikube-worker208[5689] to containerd

https://gerrit.wikimedia.org/r/1081910
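
The host pattern in the change title above, `wikikube-worker208[5689]`, reads as a character class: the last digit may be 5, 6, 8 or 9, which matches the four codfw workers named in the plan. A minimal sketch to check that reading, assuming the pattern is interpreted as a regular expression (the `.codfw.wmnet` suffix is taken from the reimage comments; the candidate list is purely illustrative):

```python
import re

# Pattern from the change title, anchored and extended with the domain
# suffix seen in the reimage comments. Treating it as a regex is an assumption.
PATTERN = re.compile(r"^wikikube-worker208[5689]\.codfw\.wmnet$")

# Illustrative candidates: every worker number from 2080 to 2089.
candidates = [f"wikikube-worker{n}.codfw.wmnet" for n in range(2080, 2090)]

# Keep only the hosts selected by the character class.
matched = [host for host in candidates if PATTERN.match(host)]
for host in matched:
    print(host)
```

Only the four workers listed in the plan should be selected; 2087 is deliberately excluded by the character class.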

Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1002 for host wikikube-worker2085.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1002 for host wikikube-worker2086.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1002 for host wikikube-worker2088.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1002 for host wikikube-worker2089.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host wikikube-worker2088.codfw.wmnet with OS bookworm completed:

wikikube-worker2088 (PASS)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bookworm OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202410221227_jayme_1935630_wikikube-worker2088.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host wikikube-worker2085.codfw.wmnet with OS bookworm completed:

wikikube-worker2085 (PASS)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bookworm OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202410221231_jayme_1935361_wikikube-worker2085.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host wikikube-worker2086.codfw.wmnet with OS bookworm completed:

wikikube-worker2086 (PASS)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bookworm OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202410221234_jayme_1935371_wikikube-worker2086.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host wikikube-worker2089.codfw.wmnet with OS bookworm completed:

wikikube-worker2089 (PASS)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bookworm OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202410221237_jayme_1935895_wikikube-worker2089.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB
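
Each reimage comment above points at a log file whose name appears to follow `<timestamp>_<user>_<number>_<host>.out`. A small sketch for splitting such a name when collecting the logs of a batch of reimages; the field names are hypothetical, the meaning of the numeric field is an assumption (it looks like a process ID, but the comments do not say), and the timezone of the timestamp is not stated:

```python
import re
from datetime import datetime
from typing import NamedTuple


class ReimageLog(NamedTuple):
    started: datetime  # parsed as a naive datetime; timezone is not stated
    user: str
    run_id: int        # assumed to be a process ID
    host: str


# <YYYYmmddHHMM>_<user>_<number>_<host>.out, as seen in the comments above.
LOG_NAME = re.compile(r"^(\d{12})_([^_]+)_(\d+)_(.+)\.out$")


def parse_log_name(path: str) -> ReimageLog:
    """Split a reimage log path into its visible components."""
    name = path.rsplit("/", 1)[-1]
    match = LOG_NAME.match(name)
    if match is None:
        raise ValueError(f"unrecognised reimage log name: {name}")
    stamp, user, run_id, host = match.groups()
    return ReimageLog(datetime.strptime(stamp, "%Y%m%d%H%M"), user, int(run_id), host)


log = parse_log_name(
    "/var/log/spicerack/sre/hosts/reimage/"
    "202410221237_jayme_1935895_wikikube-worker2089.out"
)
print(log.host, log.started.isoformat())
```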

JMeybohm updated the task description. Oct 22 2024, 3:59 PM

BTullis subscribed. Oct 22 2024, 4:15 PM

Change #1082191 merged by jenkins-bot:

[operations/cookbooks@master] k8s.pool-depool-node: Add support for multiple nodes

https://gerrit.wikimedia.org/r/1082191

jijiki moved this task from Incoming 🐫 to Doing 😎 on the serviceops board. Oct 23 2024, 12:06 PM

Change #1081377 merged by jenkins-bot:

[operations/cookbooks@master] Add a cookbook to roll-reimage stacked k8s control planes

https://gerrit.wikimedia.org/r/1081377

Maintenance_bot removed a project: Patch-For-Review. Oct 23 2024, 1:30 PM

elukey closed subtask T378345: Migrate the AUX K8s cluster to containerd as Resolved. Nov 6 2024, 8:18 AM

Change #1090433 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] wikikube-staging: Remove obsolete docker hiera config

https://gerrit.wikimedia.org/r/1090433

gerritbot added a project: Patch-For-Review. Nov 12 2024, 9:38 AM

Change #1090433 merged by JMeybohm:

[operations/puppet@production] wikikube-staging: Remove obsolete docker hiera config

https://gerrit.wikimedia.org/r/1090433

Maintenance_bot removed a project: Patch-For-Review. Nov 12 2024, 1:31 PM

Change #1090806 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/cookbooks@master] k8s.reimage-stacked-control-plane: Add --force-dhcp-tftp

https://gerrit.wikimedia.org/r/1090806

gerritbot added a project: Patch-For-Review. Nov 13 2024, 9:01 AM

Change #1090806 merged by jenkins-bot:

[operations/cookbooks@master] k8s.reimage-stacked-control-plane: Add --force-dhcp-tftp

https://gerrit.wikimedia.org/r/1090806

kamila changed the status of subtask T377857: Cookbook to roll-reimage k8s nodes from Open to In Progress. Nov 13 2024, 5:22 PM

Change #1091185 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/cookbooks@master] k8s.reimage-stacked-control-plane: Wait for 3m after depooling

https://gerrit.wikimedia.org/r/1091185

Change #1091185 merged by jenkins-bot:

[operations/cookbooks@master] k8s.reimage-stacked-control-plane: Wait for 3m after depooling

https://gerrit.wikimedia.org/r/1091185
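
The change above adds a three-minute pause after depooling a control plane, before it is reimaged. The cookbook's actual code is not shown in this task, so the following is only a sketch of that ordering under stated assumptions: `roll_reimage` and its `depool`, `reimage` and `pool` callables are hypothetical stand-ins, and repooling after the reimage is assumed rather than confirmed. The `sleep` parameter is injectable so the example runs without waiting:

```python
import time
from typing import Callable, Iterable, List


def roll_reimage(
    hosts: Iterable[str],
    depool: Callable[[str], None],
    reimage: Callable[[str], None],
    pool: Callable[[str], None],
    settle_seconds: float = 180.0,  # "3m" from the change title
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Process hosts strictly one at a time: depool, wait, reimage, repool."""
    done: List[str] = []
    for host in hosts:
        depool(host)
        sleep(settle_seconds)  # let in-flight traffic drain before going down
        reimage(host)
        pool(host)  # assumption: the host is repooled once it is back
        done.append(host)
    return done


# Dry run with recording fakes instead of real actions.
events: List[str] = []
finished = roll_reimage(
    ["kubestagemaster1003", "kubestagemaster1004"],
    depool=lambda h: events.append(f"depool {h}"),
    reimage=lambda h: events.append(f"reimage {h}"),
    pool=lambda h: events.append(f"pool {h}"),
    sleep=lambda s: events.append(f"wait {int(s)}s"),
)
print("\n".join(events))
```

Handling one host at a time keeps the remaining control planes serving while each node is down, which is the point of a rolling reimage.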

Change #1091202 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/cookbooks@master] k8s.reimage-stacked-control-plane: Ask for the management password early

https://gerrit.wikimedia.org/r/1091202

Change #1091202 merged by jenkins-bot:

[operations/cookbooks@master] k8s.reimage-stacked-control-plane: Ask for the management password early

https://gerrit.wikimedia.org/r/1091202

Change #1094383 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] wikikube: Default to containerd partition layout

https://gerrit.wikimedia.org/r/1094383

Change #1094383 merged by Clément Goubert:

[operations/puppet@production] wikikube: Default to containerd partition layout

https://gerrit.wikimedia.org/r/1094383

dcausse subscribed. Tue, Nov 26, 5:16 PM

bking subscribed. Tue, Nov 26, 5:28 PM

Maintenance_bot removed a project: Patch-For-Review. Tue, Nov 26, 5:31 PM
