
setup/install kubernetes10[18-22]
Closed, Resolved (Public)

Description

New nodes kubernetes10[18-22] have been handed over by DC-Ops and need to be set up and added to the cluster.

These are replacements for kubernetes100[1-4], so those need to be decommissioned afterwards.

https://wikitech.wikimedia.org/wiki/Kubernetes/Clusters/Add_or_remove_nodes

Event Timeline

@akosiaris we could postpone this a bit and image the nodes with bullseye + overlayfs directly (T300744) to not lose capacity when something goes sideways. AIUI we will be having replacements racked in codfw soon as well. We could do the same with them.

akosiaris changed the task status from Open to Stalled. Feb 16 2022, 1:26 PM

@akosiaris we could postpone this a bit and image the nodes with bullseye + overlayfs directly (T300744) to not lose capacity when something goes sideways. AIUI we will be having replacements racked in codfw soon as well. We could do the same with them.

Yes, I don't see why not. Ping me when I can proceed with this.

@akosiaris in theory we can have bullseye + overlay nodes by simply adding this per-host hiera config:

# See https://phabricator.wikimedia.org/T300744
profile::base::overlayfs: true
profile::docker::engine::force_default_docker_storage: true
profile::docker::storage::physical_volumes: ~
profile::docker::storage::vg_to_remove: ~
profile::docker::engine::packagename: "docker.io"

We are currently checking that everything looks good on the ml-serve bullseye nodes (ml-serve200[5-8]); once done you are free to go :)

@akosiaris in theory we can have bullseye + overlay nodes by simply adding this per-host hiera config:

# See https://phabricator.wikimedia.org/T300744
profile::base::overlayfs: true
profile::docker::engine::force_default_docker_storage: true
profile::docker::storage::physical_volumes: ~
profile::docker::storage::vg_to_remove: ~

We are currently checking that everything looks good on the ml-serve bullseye nodes (ml-serve200[5-8]); once done you are free to go :)

Super! Can't wait for it :-)

@akosiaris both staging clusters are on bullseye with overlay; I have updated the hiera settings after some rounds of reimaging. I am currently reimaging all ml-serve nodes with the following procedure:

  1. disable puppet on the target node
  2. merge a puppet change with the above per-host hiera config
  3. drain the node + depool from kubesvc
  4. kick off the reimage with bullseye

Then once the host is up and running, uncordon/pool/etc. For new nodes it is easier; maybe we could try to add one with bullseye + overlay and check how it goes. @JMeybohm thoughts?
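The four steps above can be sketched as a dry-run shell script. Everything here is illustrative: the commands are printed rather than executed, and the node name, cookbook flags, and depool helpers are assumptions, not the verbatim tooling used.

```shell
# Dry-run sketch of the per-node reimage procedure. The run() wrapper only
# prints each command; every invocation below (node name, cookbook flags,
# depool helper names) is an illustrative assumption.
run() { echo "+ $*"; }

node="kubernetes1018.eqiad.wmnet"   # example node

run disable-puppet "T293728: reimage to bullseye"   # 1. disable puppet on the target node
# 2. merge the per-host hiera change in operations/puppet (manual review step)
run kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data   # 3. drain the node
run depool-from-kubesvc "$node"     # 3. depool (placeholder helper name)
run cookbook sre.hosts.reimage --os bullseye "$node"   # 4. kick off the reimage (flags assumed)

# Once the host is back up and running:
run kubectl uncordon "$node"
run pool-to-kubesvc "$node"         # placeholder helper name
```

The dry-run wrapper makes the ordering of the steps explicit without touching a live cluster; on real hosts each `run` would be the actual command.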

Then once the host is up and running, uncordon/pool/etc. For new nodes it is easier; maybe we could try to add one with bullseye + overlay and check how it goes. @JMeybohm thoughts?

Yes. I would start with codfw (T302208) but that is probably more like personal preference.

akosiaris changed the task status from Stalled to Open. Feb 24 2022, 8:13 AM

Cool, moving back to Open; I'll tend to this early next week.

Important note - we had to modify grub's config to add systemd.unified_cgroup_hierarchy=0 (the kubelets do not support the new cgroup hierarchy in 1.16) so a reboot is needed (the one done by the reimage cookbook is enough).
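The grub change above can be sketched like this, run against a scratch copy rather than the real /etc/default/grub. The file path and variable name follow the Debian defaults; on the real hosts the edit would be followed by update-grub and a reboot (which the reimage cookbook already performs).

```shell
# Hedged sketch: append systemd.unified_cgroup_hierarchy=0 to the kernel
# command line so the host keeps the legacy cgroup v1 hierarchy (kubelet
# 1.16 does not support the unified hierarchy). Operates on a scratch file
# to stay runnable anywhere; real hosts use /etc/default/grub.
grub_file=./grub.example
printf 'GRUB_CMDLINE_LINUX_DEFAULT="quiet"\n' > "$grub_file"

# Add the flag only if it is not already present (idempotent).
grep -q 'systemd.unified_cgroup_hierarchy=0' "$grub_file" || \
  sed -i 's/^GRUB_CMDLINE_LINUX_DEFAULT="\(.*\)"/GRUB_CMDLINE_LINUX_DEFAULT="\1 systemd.unified_cgroup_hierarchy=0"/' "$grub_file"

cat "$grub_file"
# prints: GRUB_CMDLINE_LINUX_DEFAULT="quiet systemd.unified_cgroup_hierarchy=0"
```

The `grep || sed` guard keeps the edit idempotent, so re-running Puppet or the script does not duplicate the flag.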

elukey renamed this task from setup/install kubernetes10[18-21] to setup/install kubernetes10[18-22]. Feb 28 2022, 10:34 AM
elukey updated the task description.

Change 766588 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] install_server: set new partman recipe for new k8s nodes

https://gerrit.wikimedia.org/r/766588

Change 766588 merged by Elukey:

[operations/puppet@production] install_server: set new partman recipe for new k8s nodes

https://gerrit.wikimedia.org/r/766588

Change 771564 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[operations/homer/public@master] Add kubernetes1018-1022 as BGP neighbors

https://gerrit.wikimedia.org/r/771564

Change 771598 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[operations/puppet@production] Add kubernetes1018-1022

https://gerrit.wikimedia.org/r/771598

Change 771564 merged by jenkins-bot:

[operations/homer/public@master] Add kubernetes1018-1022 as BGP neighbors

https://gerrit.wikimedia.org/r/771564

Important note - we had to modify grub's config to add systemd.unified_cgroup_hierarchy=0 (the kubelets do not support the new cgroup hierarchy in 1.16) so a reboot is needed (the one done by the reimage cookbook is enough).

4/5 nodes are on stretch; I'll reimage all of them anyway as part of the process. I guess that should do it.

Change 771598 merged by Alexandros Kosiaris:

[operations/puppet@production] Add kubernetes1018-1022

https://gerrit.wikimedia.org/r/771598

Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1001 for host kubernetes1018.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1001 for host kubernetes1019.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1001 for host kubernetes1020.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1001 for host kubernetes1021.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1001 for host kubernetes1022.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1001 for host kubernetes1018.eqiad.wmnet with OS bullseye executed with errors:

  • kubernetes1018 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1001 for host kubernetes1018.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1001 for host kubernetes1019.eqiad.wmnet with OS bullseye executed with errors:

  • kubernetes1019 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1001 for host kubernetes1019.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1001 for host kubernetes1021.eqiad.wmnet with OS bullseye executed with errors:

  • kubernetes1021 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1001 for host kubernetes1021.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1001 for host kubernetes1022.eqiad.wmnet with OS bullseye executed with errors:

  • kubernetes1022 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1001 for host kubernetes1022.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1001 for host kubernetes1020.eqiad.wmnet with OS bullseye executed with errors:

  • kubernetes1020 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1001 for host kubernetes1020.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1001 for host kubernetes1018.eqiad.wmnet with OS bullseye completed:

  • kubernetes1018 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202203171705_akosiaris_1754508_kubernetes1018.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1001 for host kubernetes1020.eqiad.wmnet with OS bullseye completed:

  • kubernetes1020 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202203171709_akosiaris_1754946_kubernetes1020.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1001 for host kubernetes1019.eqiad.wmnet with OS bullseye completed:

  • kubernetes1019 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202203171707_akosiaris_1754672_kubernetes1019.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1001 for host kubernetes1021.eqiad.wmnet with OS bullseye completed:

  • kubernetes1021 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202203171708_akosiaris_1754826_kubernetes1021.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1001 for host kubernetes1022.eqiad.wmnet with OS bullseye completed:

  • kubernetes1022 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202203171708_akosiaris_1754867_kubernetes1022.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2022-03-17T18:18:25Z] <akosiaris> cordon kubernetes10{18..22} T293728

Mentioned in SAL (#wikimedia-operations) [2022-03-18T09:37:44Z] <akosiaris> pool kubernetes1018-1022 in pybal. T293728

Mentioned in SAL (#wikimedia-operations) [2022-03-18T09:42:00Z] <akosiaris> uncordon kubernetes1018-1022. T293728. Nodes are live, ready to receive workloads and traffic.

akosiaris changed the status of subtask T303044: decommission kubernetes100[1-4] from Stalled to Open.