setup/install kubernetes20[1(89)|2(012)]
Closed, Resolved (Public)

Description

New nodes kubernetes20[1(89)|2(012)] have been handed over by DC-Ops and need to be set up and added to the cluster.

We should set them up with bullseye + overlayfs (T300744) as extra capacity first and decommission kubernetes200[1-4] (which these are replacements for) after the cluster has been completely migrated.

https://wikitech.wikimedia.org/wiki/Kubernetes/Clusters/Add_or_remove_nodes
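The host shorthand used in the title and description can be expanded mechanically. The following is a minimal sketch, assuming the shorthand kubernetes20[1(89)|2(012)] is read as the regular expression below and that all hosts live under codfw.wmnet (as the reimage messages in the timeline suggest). The function names are hypothetical and not part of any existing tooling.

```python
import re
from typing import List

# Assumption: the task's shorthand kubernetes20[1(89)|2(012)] corresponds
# to this regular expression (hosts 2018, 2019, 2020, 2021, 2022).
NEW_NODE_RE = re.compile(r"kubernetes20(1[89]|2[012])")


def expand_new_nodes(domain: str = "codfw.wmnet") -> List[str]:
    """Return the FQDNs of the new nodes, in ascending order."""
    candidates = (f"kubernetes20{i:02d}" for i in range(100))
    return [f"{name}.{domain}" for name in candidates if NEW_NODE_RE.fullmatch(name)]


def expand_old_nodes(domain: str = "codfw.wmnet") -> List[str]:
    """Return the FQDNs of kubernetes200[1-4], the nodes to be decommissioned."""
    return [f"kubernetes200{i}.{domain}" for i in range(1, 5)]


if __name__ == "__main__":
    print("new:", expand_new_nodes())
    print("old:", expand_old_nodes())
```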

Event Timeline

JMeybohm added a parent task: Unknown Object (Task). Feb 21 2022, 11:07 AM

Change 766588 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] install_server: set new partman recipe for new k8s nodes

https://gerrit.wikimedia.org/r/766588

Change 766588 merged by Elukey:

[operations/puppet@production] install_server: set new partman recipe for new k8s nodes

https://gerrit.wikimedia.org/r/766588

elukey renamed this task from setup/install kubernetes20[19|2(012)] to setup/install kubernetes20[1(89)|2(012)]. Feb 28 2022, 1:40 PM
elukey updated the task description.

Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1001 for host kubernetes2018.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1001 for host kubernetes2018.codfw.wmnet with OS bullseye completed:

  • kubernetes2018 (PASS)
    • Downtimed on Icinga
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202202281350_elukey_16234_kubernetes2018.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1001 for host kubernetes2019.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1001 for host kubernetes2019.codfw.wmnet with OS bullseye completed:

  • kubernetes2019 (PASS)
    • Downtimed on Icinga
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202202281420_elukey_12515_kubernetes2019.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1001 for host kubernetes2020.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1001 for host kubernetes2020.codfw.wmnet with OS bullseye completed:

  • kubernetes2020 (FAIL)
    • Downtimed on Icinga
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202202281450_elukey_7945_kubernetes2020.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Failed to get Netbox script results, try manually: https://netbox.wikimedia.org/api/extras/job-results/2585730/

Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1001 for host kubernetes2020.codfw.wmnet with OS bullseye executed with errors:

  • kubernetes2020 (FAIL)
    • Downtimed on Icinga
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202202281450_elukey_7945_kubernetes2020.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Failed to get Netbox script results, try manually: https://netbox.wikimedia.org/api/extras/job-results/2585730/
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1001 for host kubernetes2021.codfw.wmnet with OS bullseye

Change 766808 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] Enable overlayfs for kubernetes20[18-22]

https://gerrit.wikimedia.org/r/766808

Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1001 for host kubernetes2021.codfw.wmnet with OS bullseye completed:

  • kubernetes2021 (PASS)
    • Downtimed on Icinga
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202202281550_elukey_30787_kubernetes2021.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1001 for host kubernetes2022.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1001 for host kubernetes2022.codfw.wmnet with OS bullseye completed:

  • kubernetes2022 (PASS)
    • Downtimed on Icinga
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202202281654_elukey_25173_kubernetes2022.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Change 766808 merged by Elukey:

[operations/puppet@production] Enable overlayfs for kubernetes20[18-22]

https://gerrit.wikimedia.org/r/766808

All nodes are running Bullseye with the new partition layout for overlayfs. I have also enabled overlayfs via Puppet, manually added systemd.unified_cgroup_hierarchy=0 to GRUB's settings, and rebooted. (The setting is applied by profile::kubernetes::node, which is convenient for a pre-existing worker but less so for a new one.)
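After the reboot, one way to confirm that the kernel parameter took effect is to look for it on the running kernel's command line. This is a minimal sketch, assuming a Linux host where /proc/cmdline is readable. The helper names are hypothetical; only the parameter string comes from the comment above.

```python
from pathlib import Path

# Parameter mentioned in the comment above; it keeps systemd on the
# legacy (v1) cgroup hierarchy.
CGROUP_V1_FLAG = "systemd.unified_cgroup_hierarchy=0"


def has_cgroup_v1_flag(cmdline: str) -> bool:
    """Return True if the flag appears as its own token in a kernel command line."""
    return CGROUP_V1_FLAG in cmdline.split()


def running_kernel_has_flag(path: str = "/proc/cmdline") -> bool:
    """Check the command line of the currently running kernel."""
    return has_cgroup_v1_flag(Path(path).read_text())


if __name__ == "__main__":
    print("cgroup v1 flag present:", running_kernel_has_flag())
```

Matching whole tokens rather than substrings avoids a false positive on a value such as systemd.unified_cgroup_hierarchy=01.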

The hosts seem ready to be added to the k8s codfw cluster, @JMeybohm lemme know how you want to proceed :)

❤️
From my POV I say you can continue with adding the new nodes to the cluster.

Change 767465 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] Add kubernetes2018 to wikikube codfw

https://gerrit.wikimedia.org/r/767465

Change 767468 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/homer/public@master] Add BGP config for kubernetes2018

https://gerrit.wikimedia.org/r/767468

Change 767465 merged by Elukey:

[operations/puppet@production] Add kubernetes2018 to wikikube codfw

https://gerrit.wikimedia.org/r/767465

Change 767468 merged by Elukey:

[operations/homer/public@master] Add BGP config for kubernetes2018

https://gerrit.wikimedia.org/r/767468

Change 767482 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] Add kubernetes20[19-22] to wikikube codfw

https://gerrit.wikimedia.org/r/767482

Change 767485 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/homer/public@master] Add BGP config for kubernetes20[19-22] in wikikube codfw

https://gerrit.wikimedia.org/r/767485

Status:

  • We have added kubernetes2018 to the wikikube codfw cluster; it is currently serving traffic.
  • The remaining four nodes will be added after testing kubernetes2018 for a bit. The code reviews are ready.

Change 767482 merged by Elukey:

[operations/puppet@production] Add kubernetes20[19-22] to wikikube codfw

https://gerrit.wikimedia.org/r/767482

The first Puppet run always seems to end with:

Notice: /Stage[main]/Profile::Kubernetes::Node/K8s::Kubeconfig[/etc/kubernetes/kubelet_config]/File[/etc/kubernetes/kubelet_config]/ensure: defined content as '{md5}a17b9fafdfa37dd7305196acdef3119d'
Info: /Stage[main]/Profile::Kubernetes::Node/K8s::Kubeconfig[/etc/kubernetes/kubelet_config]/File[/etc/kubernetes/kubelet_config]: Scheduling refresh of Service[kubelet]
Error: Systemd start for kubelet failed!
journalctl log for kubelet:
-- Journal begins at Mon 2022-02-28 15:03:26 UTC, ends at Thu 2022-03-03 09:27:40 UTC. --
-- No entries --

Error: /Stage[main]/K8s::Kubelet/Service[kubelet]/ensure: change from 'stopped' to 'running' failed: Systemd start for kubelet failed!
journalctl log for kubelet:
-- Journal begins at Mon 2022-02-28 15:03:26 UTC, ends at Thu 2022-03-03 09:27:40 UTC. --
-- No entries --

Notice: /Stage[main]/K8s::Kubelet/Service[kubelet]: Triggered 'refresh' from 3 events
Notice: /Stage[main]/Profile::Kubernetes::Node/Base::Expose_puppet_certs[/etc/kubernetes]/File[/etc/kubernetes/ssl]/ensure: created
Notice: /Stage[main]/Profile::Kubernetes::Node/Base::Expose_puppet_certs[/etc/kubernetes]/File[/etc/kubernetes/ssl/cert.pem]/ensure: defined content as '{md5}d344c0d386beca69a0cc66131557b40c'
Notice: /Stage[main]/Profile::Kubernetes::Node/Base::Expose_puppet_certs[/etc/kubernetes]/File[/etc/kubernetes/ssl/server.key]/ensure: defined content as '{md5}2122f5e303b21582ae3fb8bb7ea0b25b'
Notice: /Stage[main]/Profile::Kubernetes::Node/K8s::Kubeconfig[/etc/kubernetes/kubeproxy_config]/File[/etc/kubernetes/kubeproxy_config]/ensure: defined content as '{md5}bc04bfb1a07d81795aa6169157d33dc7'
Info: /Stage[main]/Profile::Kubernetes::Node/K8s::Kubeconfig[/etc/kubernetes/kubeproxy_config]/File[/etc/kubernetes/kubeproxy_config]: Scheduling refresh of Service[kube-proxy]
Error: Systemd start for kube-proxy failed!
journalctl log for kube-proxy:
-- Journal begins at Mon 2022-02-28 15:03:26 UTC, ends at Thu 2022-03-03 09:27:40 UTC. --
-- No entries --

Error: /Stage[main]/K8s::Proxy/Service[kube-proxy]/ensure: change from 'stopped' to 'running' failed: Systemd start for kube-proxy failed!
journalctl log for kube-proxy:
-- Journal begins at Mon 2022-02-28 15:03:26 UTC, ends at Thu 2022-03-03 09:27:40 UTC. --
-- No entries --

Notice: /Stage[main]/K8s::Proxy/Service[kube-proxy]: Triggered 'refresh' from 3 events

I can't find logs indicating what's wrong, but a subsequent puppet run fixes it.
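To see at a glance which units failed on a first run, the Puppet output can be scanned for the "Systemd start for ... failed!" lines shown above. This is a minimal sketch that only matches the message format visible in this log; the function name is hypothetical and other Puppet error formats are not handled.

```python
import re
from typing import List

# Matches only the standalone error line seen in the log above, e.g.
# "Error: Systemd start for kubelet failed!"
FAILED_START_RE = re.compile(r"^Error: Systemd start for (?P<unit>\S+) failed!$")


def failed_units(puppet_output: str) -> List[str]:
    """Return the unit names that failed to start, in first-seen order."""
    seen: List[str] = []
    for line in puppet_output.splitlines():
        match = FAILED_START_RE.match(line.strip())
        if match and match.group("unit") not in seen:
            seen.append(match.group("unit"))
    return seen


if __name__ == "__main__":
    sample = (
        "Error: Systemd start for kubelet failed!\n"
        "Notice: /Stage[main]/K8s::Kubelet/Service[kubelet]: Triggered 'refresh' from 3 events\n"
        "Error: Systemd start for kube-proxy failed!\n"
    )
    print(failed_units(sample))
```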

Change 767485 merged by Elukey:

[operations/homer/public@master] Add BGP config for kubernetes20[19-22] in wikikube codfw

https://gerrit.wikimedia.org/r/767485

All new nodes up and running!

I didn't spot anything weird: all BGP sessions seem to be up, and pods are being scheduled on the new nodes (for the moment only calico/istio).

Mentioned in SAL (#wikimedia-operations) [2022-03-03T10:18:56Z] <elukey> kubectl cordon kubernetes200[1-4] to avoid scheduling pods on nodes that will be decommed during the next weeks - T302208
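The cordon step logged above could be scripted roughly as follows. This is a sketch only: it builds the kubectl cordon invocations without running them, and it assumes that the Kubernetes node names are the hosts' FQDNs under codfw.wmnet, which the log entry does not state. The function name is hypothetical.

```python
from typing import List

# Assumption: node names in the cluster are the FQDNs of kubernetes200[1-4].
OLD_NODES = [f"kubernetes200{i}.codfw.wmnet" for i in range(1, 5)]


def cordon_commands(nodes: List[str]) -> List[List[str]]:
    """Build one 'kubectl cordon <node>' argument list per node."""
    return [["kubectl", "cordon", node] for node in nodes]


if __name__ == "__main__":
    # Print the commands for review; each list could be passed to
    # subprocess.run() on a host with cluster-admin credentials.
    for cmd in cordon_commands(OLD_NODES):
        print(" ".join(cmd))
```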

JMeybohm claimed this task.

Parent and decom tasks created/updated; closing this.

Change 773467 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] Set bullseye + overlayfs for kubernetes1017

https://gerrit.wikimedia.org/r/773467

Change 773467 merged by Elukey:

[operations/puppet@production] Set bullseye + overlayfs for kubernetes1017

https://gerrit.wikimedia.org/r/773467