Migration to containerd and away from docker
Open, HighPublic

Description

Per T269684 we need to move away from docker. In February 2024, the serviceops team announced the results of its evaluation of the candidate replacement engines. Results and criteria have been documented in Kubernetes/CRE. The chosen container runtime engine was containerd. This task describes the plan for the migration and tracks the migration process itself.

Plan

containerd upgrade

  1. We'll probably need a new profile, profile::containerd or similar.
  2. Create proper cgroups config for containerd (https://kubernetes.io/docs/setup/production-environment/container-runtimes/#containerd)
  3. Handle pulling of restricted images with containerd (provide authentication credentials etc)
  4. Test integration with dragonfly/dfget

For the actual upgrade

  1. Run some workers (4 in codfw as a start) with bookworm, to surface potential OS related issues
    1. wikikube-worker2085 (R440)
    2. wikikube-worker2086 (R440)
    3. wikikube-worker2088 (Supermicro)
    4. wikikube-worker2089 (R450)
  2. Create puppetization for the configuration required by kubernetes
  3. Reimage some nodes with bookworm + containerd (>=1.6)
  4. Upgrade all clusters to the newer containerd, rolling-reimage of nodes
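The rolling reimage in step 4 could be sketched roughly like this. It only illustrates splitting the nodes into batches so that just a few are out of service at a time; the function name and the batch size are assumptions made for illustration, and the actual drain/reimage/uncordon work is left to the reimage cookbooks.

```python
from typing import Iterator, List


def rolling_batches(nodes: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive batches of at most batch_size nodes."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(nodes), batch_size):
        yield nodes[start:start + batch_size]


# The four initial codfw test workers listed in step 1.
workers = [
    "wikikube-worker2085",
    "wikikube-worker2086",
    "wikikube-worker2088",
    "wikikube-worker2089",
]

for batch in rolling_batches(workers, batch_size=2):
    # Hypothetical placeholder: a real run would drain, reimage and
    # uncordon every node of the batch before moving on.
    print("would reimage:", ", ".join(batch))
```

A small batch size keeps the capacity loss per step low, at the cost of a longer overall rollout.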

nerdctl

Docker has a relatively user-friendly CLI; containerd doesn't. The ctr tool it ships with is a lower-level, albeit useful, tool. nerdctl is a CLI released by the containerd project that is compatible with the docker CLI.

  1. Package nerdctl, probably utilizing our Upstream binaries policy to avoid the onus of having to build every single dependency
  2. Use puppet to install the package and populate a nerdctl configuration file /etc/nerdctl/nerdctl.toml to default to namespace k8s.io
  3. Test and approve.
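As a rough idea of what the puppetized file from step 2 would contain, here is a minimal sketch that renders it. The helper name is made up for illustration; the only setting shown is the default namespace, and anything else we may end up needing in nerdctl.toml is not covered.

```python
def render_nerdctl_toml(namespace: str = "k8s.io") -> str:
    """Return minimal nerdctl.toml content that sets the default namespace."""
    if not namespace or '"' in namespace:
        raise ValueError("namespace must be a non-empty string without quotes")
    return f'namespace = "{namespace}"\n'


# Content intended for /etc/nerdctl/nerdctl.toml
print(render_nerdctl_toml(), end="")
```

With that default in place, nerdctl should list the containers created by kubelet without an explicit `--namespace k8s.io` on every call.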

crictl

Kubernetes builds crictl (cri-tools, https://github.com/kubernetes-sigs/cri-tools/tree/master) to interact with a CRI the way kubelet would. In my initial tests, nerdctl did not completely honor all containerd configuration (like the registry mirrors and authentication we require for dragonfly), so I decided to also package crictl and have it installed on all nodes.

Kubelet (the above are a prereq)

  1. Amend puppet to put the following 2 kubelet parameters behind a feature flag:
--container-runtime-endpoint=unix:///run/containerd/containerd.sock
--container-runtime=remote
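
The feature flag logic could look roughly like the sketch below. It is written in Python only to show the idea; the real change would live in puppet, and the function and parameter names are hypothetical.

```python
from typing import List

CONTAINERD_SOCKET = "unix:///run/containerd/containerd.sock"


def kubelet_runtime_args(use_containerd: bool) -> List[str]:
    """Return the extra kubelet arguments for the selected runtime."""
    if not use_containerd:
        # Assumption: nodes still on docker need no extra runtime arguments.
        return []
    return [
        f"--container-runtime-endpoint={CONTAINERD_SOCKET}",
        "--container-runtime=remote",
    ]


print(" ".join(kubelet_runtime_args(use_containerd=True)))
```

Keeping both arguments behind a single flag means a node is either fully on containerd or fully on docker, never half configured.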

Metrics

  • Replace kubelet_docker_operations_* with kubelet_runtime_operations_*
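
For dashboards and alerts with many queries, the rename could be done mechanically along these lines. This is a sketch that only rewrites the metric name prefix; the sample query is made up, and label names and values may differ between the two metric families, so every rewritten query still needs a manual check.

```python
import re

# Matches any metric of the kubelet_docker_operations_* family.
_DOCKER_METRIC = re.compile(r"\bkubelet_docker_operations_(\w+)")


def migrate_metric_names(expr: str) -> str:
    """Rewrite kubelet_docker_operations_* to kubelet_runtime_operations_*."""
    return _DOCKER_METRIC.sub(r"kubelet_runtime_operations_\1", expr)


# Hypothetical example query, not taken from an existing dashboard.
query = 'sum(rate(kubelet_docker_operations_errors_total{instance="node1"}[5m]))'
print(migrate_metric_names(query))
```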

Log processing

Parsing of logs does not work properly with containerd nodes. Logs that usually have the k8s_docker_log_field_parsed tag don't have it anymore:

T377132: containerd logs are not properly parsed during ingestion to logstash
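
For context, the on-disk log format differs between the two runtimes: docker's json-file driver writes one JSON object per line, while containerd writes the CRI format (timestamp, stream, a P/F tag for partial or full lines, then the message). The sketch below parses both. It only illustrates the format difference and is not the actual ingestion pipeline; the root cause is tracked in T377132, and the function name is made up.

```python
import json
from typing import Dict


def parse_container_log_line(line: str) -> Dict[str, str]:
    """Parse one line in docker json-file or CRI log format."""
    line = line.rstrip("\n")
    if line.startswith("{"):
        # docker json-file: {"log": ..., "stream": ..., "time": ...}
        record = json.loads(line)
        return {
            "time": record["time"],
            "stream": record["stream"],
            "message": record["log"].rstrip("\n"),
        }
    # CRI: "<timestamp> <stream> <P|F> <message>"
    parts = line.split(" ", 3)
    if len(parts) < 3:
        raise ValueError(f"not a CRI log line: {line!r}")
    message = parts[3] if len(parts) > 3 else ""
    return {"time": parts[0], "stream": parts[1], "tag": parts[2], "message": message}
```

Note that with CRI the application's own JSON ends up in the message part of the line, so a parser expecting the whole line to be JSON will no longer match.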

Things to do after all k8s nodes have been migrated off of docker

  1. Remove puppet classes no longer in use (if there are any)
  2. Ensure all profile::docker::engine related hiera keys are gone (as well as profile::kubernetes::node::docker_kubernetes_user_password)
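
Checking step 2 could be scripted along these lines. This is a simple line-based sketch rather than a real YAML parser, and the function name as well as the sample keys in the usage example are hypothetical.

```python
from typing import Iterable, List

STALE_PREFIX = "profile::docker::engine"
STALE_KEYS = {"profile::kubernetes::node::docker_kubernetes_user_password"}


def find_stale_keys(lines: Iterable[str]) -> List[str]:
    """Return docker related hiera keys still present in the given lines."""
    hits = []
    for line in lines:
        stripped = line.strip()
        if stripped.startswith("#") or ":" not in stripped:
            continue
        # The key is everything before the first ": " (or a trailing ":").
        key = stripped.split(": ", 1)[0].rstrip(":")
        if key.startswith(STALE_PREFIX) or key in STALE_KEYS:
            hits.append(key)
    return hits


# Hypothetical hiera content used only to demonstrate the check.
sample = [
    "profile::docker::engine::version: '1.0'",
    "# profile::docker::engine::old: x",
    "profile::kubernetes::node::docker_kubernetes_user_password: secret",
    "profile::containerd::foo: bar",
]
print(find_stale_keys(sample))
```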

How to migrate to containerd

https://wikitech.wikimedia.org/wiki/Kubernetes/Administration/containerd_migration

Details

Repo                        Branch      Lines +/-
operations/puppet           production  +30 -42
operations/cookbooks        master      +4 -0
operations/cookbooks        master      +6 -0
operations/cookbooks        master      +25 -8
operations/puppet           production  +0 -9
operations/cookbooks        master      +307 -0
operations/cookbooks        master      +161 -106
operations/puppet           production  +5 -16
operations/puppet           production  +8 -0
operations/puppet           production  +2 -2
operations/puppet           production  +0 -2
operations/puppet           production  +24 -9
operations/puppet           production  +4 -0
operations/puppet           production  +27 -23
operations/puppet           production  +1 -5
labs/private                master      +0 -1
operations/puppet           production  +4 -4
operations/puppet           production  +56 -82
labs/private                master      +2 -1
operations/puppet           production  +1 -1
operations/puppet           production  +1 -53
operations/puppet           production  +7 -3
operations/puppet           production  +0 -7
operations/puppet           production  +1 -1
operations/puppet           production  +14 -3
operations/debs/kubernetes  v1.23       +10 -3
operations/puppet           production  +1 -1
operations/puppet           production  +1 -6
operations/puppet           production  +60 -0
operations/puppet           production  +15 -0
operations/puppet           production  +4 -4
operations/puppet           production  +4 -3
operations/puppet           production  +17 -5
labs/private                master      +2 -2
operations/puppet           production  +12 -5
operations/puppet           production  +317 -8
labs/private                master      +2 -2
labs/private                master      +2 -0
operations/puppet           production  +10 -10

Event Timeline

Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1002 for host kubestage2002.codfw.wmnet with OS bookworm

Change #1078451 merged by JMeybohm:

[operations/puppet@production] k8s/kubelet: Remove absent containerd specific systemd override

https://gerrit.wikimedia.org/r/1078451

Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host kubestage2002.codfw.wmnet with OS bookworm completed:

  • kubestage2002 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202410081109_jayme_2939229_kubestage2002.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1002 for host kubestage2001.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host kubestage2001.codfw.wmnet with OS bookworm completed:

  • kubestage2001 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202410081201_jayme_2952057_kubestage2001.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Change #1078677 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] Migrate kubestage1003 to containerd

https://gerrit.wikimedia.org/r/1078677

Change #1078678 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] Migrate kubestage1004 to containerd

https://gerrit.wikimedia.org/r/1078678

Change #1078677 merged by JMeybohm:

[operations/puppet@production] Migrate kubestage1003 to containerd

https://gerrit.wikimedia.org/r/1078677

Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1002 for host kubestage1003.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host kubestage1003.eqiad.wmnet with OS bookworm completed:

  • kubestage1003 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202410100933_jayme_3362138_kubestage1003.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Change #1078678 merged by JMeybohm:

[operations/puppet@production] Migrate kubestage1004 to containerd

https://gerrit.wikimedia.org/r/1078678

Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1002 for host kubestage1004.eqiad.wmnet with OS bookworm

There are some hardware refreshes planned which should go Bookworm + containerd right away:

  • {T376171}
  • {T376185}
  • {T376170}

As well as expansions:

  • {T376307}
  • {T376665}

Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host kubestage1004.eqiad.wmnet with OS bookworm completed:

  • kubestage1004 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202410101219_jayme_3392864_kubestage1004.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Change #1079276 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] cumin/aliases: Remove P{O:kubernetes::staging::worker}

https://gerrit.wikimedia.org/r/1079276

Change #1079276 merged by JMeybohm:

[operations/puppet@production] cumin/aliases: Remove P{O:kubernetes::staging::worker}

https://gerrit.wikimedia.org/r/1079276

Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1002 for host kubestage1003.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host kubestage1003.eqiad.wmnet with OS bookworm completed:

  • kubestage1003 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202410101623_jayme_3430289_kubestage1003.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Change #1079935 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] kubernetes: Create profile::kubernetes::container_runtime

https://gerrit.wikimedia.org/r/1079935

Change #1079955 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[labs/private@master] Merge worker_containerd values back into worker

https://gerrit.wikimedia.org/r/1079955

Change #1079955 merged by JMeybohm:

[labs/private@master] Merge worker_containerd back to worker role

https://gerrit.wikimedia.org/r/1079955

Change #1079935 merged by JMeybohm:

[operations/puppet@production] kubernetes: Create profile::kubernetes::container_runtime

https://gerrit.wikimedia.org/r/1079935

Change #1079960 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] cumin/aliases: Merge worker_containerd back to worker role

https://gerrit.wikimedia.org/r/1079960

Change #1079960 merged by JMeybohm:

[operations/puppet@production] cumin/aliases: Merge worker_containerd back to worker role

https://gerrit.wikimedia.org/r/1079960

Change #1079961 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[labs/private@master] Remove role kubernetes::staging::worker_containerd

https://gerrit.wikimedia.org/r/1079961

Change #1079961 merged by JMeybohm:

[labs/private@master] Remove role kubernetes::staging::worker_containerd

https://gerrit.wikimedia.org/r/1079961

Change #1079970 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] wikikube: Prepare clusters for containerd workers

https://gerrit.wikimedia.org/r/1079970

Change #1080038 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] dragonfly::dfdaemon: Enable by default when profile is included

https://gerrit.wikimedia.org/r/1080038

Change #1080042 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] dragonfly::dfdaemon: Refactor docker integration

https://gerrit.wikimedia.org/r/1080042

Change #1080071 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] containerd: Remove container log line length limit

https://gerrit.wikimedia.org/r/1080071

Change #1080038 merged by JMeybohm:

[operations/puppet@production] dragonfly::dfdaemon: Enable by default when profile is included

https://gerrit.wikimedia.org/r/1080038

Change #1080042 merged by JMeybohm:

[operations/puppet@production] dragonfly::dfdaemon: Refactor docker integration

https://gerrit.wikimedia.org/r/1080042

Change #1080071 merged by JMeybohm:

[operations/puppet@production] containerd: Remove container log line length limit

https://gerrit.wikimedia.org/r/1080071

Change #1079970 merged by JMeybohm:

[operations/puppet@production] wikikube: Prepare clusters for containerd workers

https://gerrit.wikimedia.org/r/1079970

Change #1080554 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] wikikube-staging: Migrate control planes to containerd

https://gerrit.wikimedia.org/r/1080554

Change #1080554 merged by JMeybohm:

[operations/puppet@production] wikikube: Remove explicit container_runtime config

https://gerrit.wikimedia.org/r/1080554

Change #992629 had a related patch set uploaded (by JMeybohm; author: Mxmxchere):

[operations/puppet@production] etcd::v3: Don't set trusted-ca-file if client-cert-auth is false

https://gerrit.wikimedia.org/r/992629

Change #992629 merged by JMeybohm:

[operations/puppet@production] etcd::v3: Don't set trusted-ca-file if client-cert-auth is false

https://gerrit.wikimedia.org/r/992629

Change #1081224 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] etcd::v3: Ensure trusted-ca-file is not set on first puppet run with 3.4

https://gerrit.wikimedia.org/r/1081224

Change #1081377 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/cookbooks@master] Add a cookbook to roll-reimage stacked k8s control planes

https://gerrit.wikimedia.org/r/1081377

Change #1081224 merged by JMeybohm:

[operations/puppet@production] etcd::v3: Ensure trusted-ca-file is not set on first puppet run with 3.4

https://gerrit.wikimedia.org/r/1081224

Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1002 for host kubestagemaster2005.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host kubestagemaster2005.codfw.wmnet with OS bookworm completed:

  • kubestagemaster2005 (PASS)
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Set boot media to disk
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202410181121_jayme_1191485_kubestagemaster2005.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1002 for host kubestagemaster2003.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host kubestagemaster2003.codfw.wmnet with OS bookworm completed:

  • kubestagemaster2003 (PASS)
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Set boot media to disk
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202410181502_jayme_1220197_kubestagemaster2003.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1002 for host kubestagemaster2004.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host kubestagemaster2004.codfw.wmnet with OS bookworm completed:

  • kubestagemaster2004 (PASS)
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Set boot media to disk
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202410181546_jayme_1220197_kubestagemaster2004.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1002 for host kubestagemaster2005.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host kubestagemaster2005.codfw.wmnet with OS bookworm completed:

  • kubestagemaster2005 (PASS)
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Set boot media to disk
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202410181632_jayme_1220197_kubestagemaster2005.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Change #1081910 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] Migrate wikikube-worker208[5689] to containerd

https://gerrit.wikimedia.org/r/1081910

Cookbook cookbooks.sre.k8s.reimage-stacked-control-plane started by jayme@cumin1002 Reimaging k8s control planes of cluster staging-eqiad: containerd migration completed:

  • kubestagemaster1003 (PASS)
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Set boot media to disk
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202410210826_jayme_1708770_kubestagemaster1003.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.k8s.reimage-stacked-control-plane started by jayme@cumin1002 Reimaging k8s control planes of cluster staging-eqiad: containerd migration completed:

  • kubestagemaster1004 (PASS)
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Set boot media to disk
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202410210906_jayme_1708770_kubestagemaster1004.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.k8s.reimage-stacked-control-plane started by jayme@cumin1002 Reimaging k8s control planes of cluster staging-eqiad: containerd migration completed:

  • kubestagemaster1005 (PASS)
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Set boot media to disk
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202410210946_jayme_1708770_kubestagemaster1005.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
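
The three "First Puppet run" log names above can be used to estimate how far apart the control-plane reimages ran. The following is a minimal sketch, assuming the leading field of each file name is a `YYYYmmddHHMM` timestamp followed by user, a numeric run identifier, and short hostname. That field layout is inferred from the names in this log, not taken from Spicerack documentation, and `parse_log_name` is a hypothetical helper.

```python
from datetime import datetime

# Log file names copied from the cookbook output above.
log_names = [
    "202410210826_jayme_1708770_kubestagemaster1003.out",
    "202410210906_jayme_1708770_kubestagemaster1004.out",
    "202410210946_jayme_1708770_kubestagemaster1005.out",
]


def parse_log_name(name):
    """Split a log name into (timestamp, user, run_id, host).

    Assumed layout: <YYYYmmddHHMM>_<user>_<run id>_<host>.out
    """
    stem = name.rsplit(".", 1)[0]  # drop the ".out" extension
    stamp, user, run_id, host = stem.split("_", 3)
    return datetime.strptime(stamp, "%Y%m%d%H%M"), user, run_id, host


runs = [parse_log_name(n) for n in log_names]

# Minutes between consecutive timestamps in the file names.
gaps = [(b[0] - a[0]).total_seconds() / 60 for a, b in zip(runs, runs[1:])]

for (stamp, _user, _run_id, host) in runs:
    print(host, stamp.isoformat())
print("gaps in minutes:", gaps)
```

Under that assumption, the timestamps in the names are 40 minutes apart for each consecutive pair of hosts.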

Change #1082191 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/cookbooks@master] k8s.pool-depool-node: Add support for multiple nodes

https://gerrit.wikimedia.org/r/1082191

Change #1081910 merged by JMeybohm:

[operations/puppet@production] Migrate wikikube-worker208[5689] to containerd

https://gerrit.wikimedia.org/r/1081910
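
The host expression `wikikube-worker208[5689]` in the change title is compact. The sketch below assumes it is read as a regular-expression character class (one of the digits 5, 6, 8 or 9), which matches the four workers named in the plan. The candidate range 2080-2089 is chosen purely for illustration.

```python
import re

# Pattern taken from the change title; [5689] is treated as a regex
# character class (an assumption about how the title is meant to be read).
pattern = re.compile(r"wikikube-worker208[5689]")

# Illustrative candidate hostnames, not an inventory of real hosts.
candidates = [f"wikikube-worker{n}" for n in range(2080, 2090)]

matched = [host for host in candidates if pattern.fullmatch(host)]
print(matched)
# ['wikikube-worker2085', 'wikikube-worker2086', 'wikikube-worker2088', 'wikikube-worker2089']
```

Note that range-style node-set tools may interpret square brackets differently, so the regex reading should be confirmed against the patch itself.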

Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1002 for host wikikube-worker2085.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1002 for host wikikube-worker2086.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1002 for host wikikube-worker2088.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1002 for host wikikube-worker2089.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host wikikube-worker2088.codfw.wmnet with OS bookworm completed:

  • wikikube-worker2088 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202410221227_jayme_1935630_wikikube-worker2088.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host wikikube-worker2085.codfw.wmnet with OS bookworm completed:

  • wikikube-worker2085 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202410221231_jayme_1935361_wikikube-worker2085.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host wikikube-worker2086.codfw.wmnet with OS bookworm completed:

  • wikikube-worker2086 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202410221234_jayme_1935371_wikikube-worker2086.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host wikikube-worker2089.codfw.wmnet with OS bookworm completed:

  • wikikube-worker2089 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202410221237_jayme_1935895_wikikube-worker2089.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Change #1082191 merged by jenkins-bot:

[operations/cookbooks@master] k8s.pool-depool-node: Add support for multiple nodes

https://gerrit.wikimedia.org/r/1082191

Change #1081377 merged by jenkins-bot:

[operations/cookbooks@master] Add a cookbook to roll-reimage stacked k8s control planes

https://gerrit.wikimedia.org/r/1081377

Change #1090433 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] wikikube-staging: Remove obsolete docker hiera config

https://gerrit.wikimedia.org/r/1090433

Change #1090433 merged by JMeybohm:

[operations/puppet@production] wikikube-staging: Remove obsolete docker hiera config

https://gerrit.wikimedia.org/r/1090433

Change #1090806 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/cookbooks@master] k8s.reimage-stacked-control-plane: Add --force-dhcp-tftp

https://gerrit.wikimedia.org/r/1090806

Change #1090806 merged by jenkins-bot:

[operations/cookbooks@master] k8s.reimage-stacked-control-plane: Add --force-dhcp-tftp

https://gerrit.wikimedia.org/r/1090806

Change #1091185 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/cookbooks@master] k8s.reimage-stacked-control-plane: Wait for 3m after depooling

https://gerrit.wikimedia.org/r/1091185

Change #1091185 merged by jenkins-bot:

[operations/cookbooks@master] k8s.reimage-stacked-control-plane: Wait for 3m after depooling

https://gerrit.wikimedia.org/r/1091185

Change #1091202 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/cookbooks@master] k8s.reimage-stacked-control-plane: Ask for the management password early

https://gerrit.wikimedia.org/r/1091202

Change #1091202 merged by jenkins-bot:

[operations/cookbooks@master] k8s.reimage-stacked-control-plane: Ask for the management password early

https://gerrit.wikimedia.org/r/1091202

Change #1094383 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] wikikube: Default to containerd partition layout

https://gerrit.wikimedia.org/r/1094383