
Move codfw thumbor hosts to kubernetes cluster
Closed, Resolved · Public

Description

Thumbor's bare-metal days are over, and it is time for the thumbor servers that are still young enough to be reused to join our Kubernetes cluster.

DC-Ops ops-codfw

  • Physically re-label thumbor2005 to kubernetes2055
  • Physically re-label thumbor2006 to kubernetes2056

serviceops

  • Remove from puppet
  • Update Netbox
  • Follow the usual new server process

Event Timeline

Jhancock.wm subscribed.

Servers have been physically relabeled.

Mentioned in SAL (#wikimedia-operations) [2023-08-21T14:14:21Z] <hnowlan@cumin1001> START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: hnowlan: updating records for reuse of thumbor servers for k8s nodes T343996 T343993 - hnowlan@cumin1001"

Mentioned in SAL (#wikimedia-operations) [2023-08-21T14:15:10Z] <hnowlan@cumin1001> END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: hnowlan: updating records for reuse of thumbor servers for k8s nodes T343996 T343993 - hnowlan@cumin1001"

Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin2002 for host kubernetes2055.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin2002 for host kubernetes2056.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin2002 for host kubernetes2055.codfw.wmnet with OS bullseye executed with errors:

  • kubernetes2055 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin2002 for host kubernetes2056.codfw.wmnet with OS bullseye executed with errors:

  • kubernetes2056 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • The reimage failed, see the cookbook logs for the details

Mentioned in SAL (#wikimedia-operations) [2023-08-22T14:14:43Z] <hnowlan@cumin1001> START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "refreshing kubernetes205[56] kubernetes105[78] status T343996 T343993 - hnowlan@cumin1001"

Mentioned in SAL (#wikimedia-operations) [2023-08-22T14:15:48Z] <hnowlan@cumin1001> END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "refreshing kubernetes205[56] kubernetes105[78] status T343996 T343993 - hnowlan@cumin1001"
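
The host pattern in the two SAL entries above, `kubernetes205[56] kubernetes105[78]`, reads as a shell-style character class: each character inside the brackets yields one hostname. A minimal Python sketch of that reading follows. `expand_hosts` is a hypothetical helper written for illustration; it is not part of the cookbooks or of Cumin, and it assumes every bracket group lists single-character alternatives only (no ranges such as `[5-6]`).

```python
from __future__ import annotations

import re

# Matches the first bracket group, e.g. "[56]" in "kubernetes205[56]".
_BRACKET = re.compile(r"\[([^\]]+)\]")


def expand_hosts(pattern: str) -> list[str]:
    """Expand each character in a bracket group into its own hostname.

    Hypothetical helper: treats "[56]" as the alternatives "5" and "6".
    """
    match = _BRACKET.search(pattern)
    if match is None:
        # No bracket group left, so the pattern is already a plain hostname.
        return [pattern]
    prefix = pattern[: match.start()]
    suffix = pattern[match.end():]
    hosts = []
    for char in match.group(1):
        # Recurse so that any later bracket group in the suffix is expanded too.
        hosts.extend(expand_hosts(prefix + char + suffix))
    return hosts


if __name__ == "__main__":
    for pattern in "kubernetes205[56] kubernetes105[78]".split():
        print(pattern, "->", expand_hosts(pattern))
```

Under this reading, the first pattern covers kubernetes2055 and kubernetes2056, the two hosts reimaged in this task.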

Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin2002 for host kubernetes2055.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin2002 for host kubernetes2056.codfw.wmnet with OS bullseye

Change 951522 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/puppet@production] kubernetes: add new kubernetes nodes to calico

https://gerrit.wikimedia.org/r/951522

Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin2002 for host kubernetes2055.codfw.wmnet with OS bullseye completed:

  • kubernetes2055 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Unable to disable Puppet, the host may have been unreachable
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202308221446_hnowlan_394988_kubernetes2055.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin2002 for host kubernetes2056.codfw.wmnet with OS bullseye completed:

  • kubernetes2056 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Unable to disable Puppet, the host may have been unreachable
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202308221449_hnowlan_398030_kubernetes2056.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
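
Comparing the step lists of the first (FAIL) and second (WARN) reimage runs shows how far the first attempt got: its last reported step was the NOOP Puppet run, and it never reported the step that follows it in the later run. The sketch below makes that comparison in Python using step names copied from the logs in this task (the log path in the "First Puppet run" step is shortened). `next_step_after` is a hypothetical helper, and the result only locates where the first run stopped; the actual cause is in the cookbook logs, which are not reproduced here.

```python
from __future__ import annotations

# Last step reported by the failed runs before "The reimage failed".
LAST_STEP_OF_FAILED_RUN = (
    "Run Puppet in NOOP mode to populate exported resources in PuppetDB"
)

# Steps reported by the later runs that completed with WARN.
COMPLETED_RUN_STEPS = [
    "Downtimed on Icinga/Alertmanager",
    "Unable to disable Puppet, the host may have been unreachable",
    "Removed from Puppet and PuppetDB if present",
    "Deleted any existing Puppet certificate",
    "Removed from Debmonitor if present",
    "Forced PXE for next reboot",
    "Host rebooted via IPMI",
    "Host up (Debian installer)",
    "Checked BIOS boot parameters are back to normal",
    "Host up (new fresh bullseye OS)",
    "Generated Puppet certificate",
    "Signed new Puppet certificate",
    "Run Puppet in NOOP mode to populate exported resources in PuppetDB",
    "Found Nagios_host resource for this host in PuppetDB",
    "Downtimed the new host on Icinga/Alertmanager",
    "First Puppet run completed and logged",  # log path shortened here
    "configmaster.wikimedia.org updated with the host new SSH public key",
    "Rebooted",
    "Automatic Puppet run was successful",
    "Forced a re-check of all Icinga services for the host",
    "Icinga status is optimal",
    "Icinga downtime removed",
    "Updated Netbox data from PuppetDB",
]


def next_step_after(last_step: str, reference: list[str]) -> str | None:
    """Return the step that follows last_step in reference, or None at the end."""
    index = reference.index(last_step)  # raises ValueError if the step is unknown
    if index + 1 < len(reference):
        return reference[index + 1]
    return None


if __name__ == "__main__":
    print(next_step_after(LAST_STEP_OF_FAILED_RUN, COMPLETED_RUN_STEPS))
```

The step printed is the first one the failed runs never reached, which narrows down where to look in the cookbook logs.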

Change 951524 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/puppet@production] kubernetes: add new nodes

https://gerrit.wikimedia.org/r/951524

Change 951522 merged by Hnowlan:

[operations/puppet@production] kubernetes: add new kubernetes nodes to calico

https://gerrit.wikimedia.org/r/951522

Change 951524 merged by Hnowlan:

[operations/puppet@production] kubernetes: add new nodes

https://gerrit.wikimedia.org/r/951524

hnowlan claimed this task.
hnowlan updated the task description.