
Move eqiad thumbor hosts to kubernetes cluster
Closed, Resolved (Public)

Description

Thumbor's bare-metal days are over, and it is time for the newer thumbor servers to join our kubernetes cluster.

DC-Ops ops-eqiad

  • Physically re-label thumbor1005 to kubernetes1057
  • Physically re-label thumbor1006 to kubernetes1058

serviceops

  • Remove from puppet
  • Update netbox
  • Follow the usual new server process

Event Timeline

jijiki renamed this task from "Move eqiad thumbor hosts to kubernetes cluster thumbor1005, thumbor1006 to kubernetes1057 and kubernetes1058" to "Move eqiad thumbor hosts to kubernetes cluster". Aug 10 2023, 3:29 PM
jijiki created this task.

Mentioned in SAL (#wikimedia-operations) [2023-08-21T14:14:21Z] <hnowlan@cumin1001> START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: hnowlan: updating records for reuse of thumbor servers for k8s nodes T343996 T343993 - hnowlan@cumin1001"

Mentioned in SAL (#wikimedia-operations) [2023-08-21T14:15:10Z] <hnowlan@cumin1001> END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: hnowlan: updating records for reuse of thumbor servers for k8s nodes T343996 T343993 - hnowlan@cumin1001"

Change 951130 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/puppet@production] site: change kubernetes::worker regex to include kubernetes105[78]

https://gerrit.wikimedia.org/r/951130

Change 951130 merged by Hnowlan:

[operations/puppet@production] site: change kubernetes::worker regex to include ex-thumbor hosts

https://gerrit.wikimedia.org/r/951130
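
For readers unfamiliar with the hostname shorthand in the patch title, here is a minimal Python sketch of what a pattern like kubernetes105[78] matches. The pattern below is an assumption made for illustration: the actual expression in the puppet repository is not shown in this task and probably covers the other worker nodes as well.

```python
import re

# Hypothetical pattern built from the patch title; NOT the real site.pp regex.
NEW_WORKER_RE = re.compile(r"kubernetes105[78]\.eqiad\.wmnet")


def is_new_worker(fqdn: str) -> bool:
    """Return True if fqdn is one of the two re-labelled ex-thumbor hosts."""
    # fullmatch anchors the pattern at both ends of the string.
    return NEW_WORKER_RE.fullmatch(fqdn) is not None


if __name__ == "__main__":
    for host in (
        "kubernetes1057.eqiad.wmnet",
        "kubernetes1058.eqiad.wmnet",
        "kubernetes1059.eqiad.wmnet",
        "thumbor1005.eqiad.wmnet",
    ):
        print(host, is_new_worker(host))
```

The character class [78] accepts exactly one of the two digits, so only kubernetes1057 and kubernetes1058 match; the old thumbor names do not.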

Change 951439 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/homer/public@master] sites: add new kubernetes hosts

https://gerrit.wikimedia.org/r/951439

Change 951442 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/puppet@production] install_server: configure disks for new kubernetes hosts

https://gerrit.wikimedia.org/r/951442

Change 951439 merged by jenkins-bot:

[operations/homer/public@master] sites: add new kubernetes hosts

https://gerrit.wikimedia.org/r/951439

Change 951442 merged by Hnowlan:

[operations/puppet@production] install_server: configure disks for new kubernetes hosts

https://gerrit.wikimedia.org/r/951442

Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin1001 for host kubernetes1057.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin1001 for host kubernetes1058.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin1001 for host kubernetes1058.eqiad.wmnet with OS bullseye executed with errors:

  • kubernetes1058 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin1001 for host kubernetes1057.eqiad.wmnet with OS bullseye executed with errors:

  • kubernetes1057 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • The reimage failed, see the cookbook logs for the details

Mentioned in SAL (#wikimedia-operations) [2023-08-22T14:14:43Z] <hnowlan@cumin1001> START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "refreshing kubernetes205[56] kubernetes105[78] status T343996 T343993 - hnowlan@cumin1001"

Mentioned in SAL (#wikimedia-operations) [2023-08-22T14:15:48Z] <hnowlan@cumin1001> END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "refreshing kubernetes205[56] kubernetes105[78] status T343996 T343993 - hnowlan@cumin1001"
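
The log messages above abbreviate host groups as kubernetes205[56] and kubernetes105[78]. As a rough illustration, the following Python sketch expands that bracket shorthand into individual host names. The expand helper is hypothetical and only handles simple character sets such as [78]; it does not support ranges like [1-3], and it is not the tooling that produced these log lines.

```python
import re
from itertools import product


def expand(pattern: str) -> list[str]:
    """Expand simple bracket sets, e.g. 'host10[12]' -> ['host101', 'host102']."""
    # re.split with one capture group alternates literal text (even indices)
    # and the contents of each bracket set (odd indices).
    parts = re.split(r"\[([^\]]+)\]", pattern)
    choices = [list(p) if i % 2 else [p] for i, p in enumerate(parts)]
    return ["".join(combo) for combo in product(*choices)]


if __name__ == "__main__":
    for group in ("kubernetes205[56]", "kubernetes105[78]"):
        print(group, "->", expand(group))
```

Read this way, the refresh covered four hosts: kubernetes2055, kubernetes2056, kubernetes1057 and kubernetes1058.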

Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin1001 for host kubernetes1057.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin1001 for host kubernetes1058.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin1001 for host kubernetes1057.eqiad.wmnet with OS bullseye completed:

  • kubernetes1057 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Unable to disable Puppet, the host may have been unreachable
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202308221433_hnowlan_858379_kubernetes1057.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin1001 for host kubernetes1058.eqiad.wmnet with OS bullseye completed:

  • kubernetes1058 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202308221436_hnowlan_858601_kubernetes1058.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
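
One way to read the reimage summaries above is to compare the steps a failed run reports against the steps of a successful run, which hints at where the failed run stopped. The Python sketch below does this with the step names copied from this task; the first_missing_step helper is hypothetical, the successful list is truncated to its first twelve steps, and the result is only a hint, since the cookbook logs remain the authoritative source for the failure.

```python
from typing import Optional

# Steps reported by the first (failed) kubernetes1057 run.
FAILED_1057 = [
    "Removed from Puppet and PuppetDB if present",
    "Deleted any existing Puppet certificate",
    "Removed from Debmonitor if present",
    "Forced PXE for next reboot",
    "Host rebooted via IPMI",
    "Host up (Debian installer)",
    "Checked BIOS boot parameters are back to normal",
    "Host up (new fresh bullseye OS)",
    "Generated Puppet certificate",
    "Signed new Puppet certificate",
    "Run Puppet in NOOP mode to populate exported resources in PuppetDB",
]

# First twelve steps of the later successful kubernetes1058 run.
PASSED_PREFIX = FAILED_1057 + [
    "Found Nagios_host resource for this host in PuppetDB",
]


def first_missing_step(partial: list[str], complete: list[str]) -> Optional[str]:
    """Return the first step of `complete` that `partial` did not report."""
    for i, step in enumerate(complete):
        if i >= len(partial) or partial[i] != step:
            return step
    return None


if __name__ == "__main__":
    print(first_missing_step(FAILED_1057, PASSED_PREFIX))
    # prints: Found Nagios_host resource for this host in PuppetDB
```

Applied to the first failed kubernetes1058 run, which reports only five steps, the same comparison returns "Host up (Debian installer)", suggesting that host did not come back up into the installer after the IPMI reboot.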

Change 951522 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/puppet@production] kubernetes: add new kubernetes nodes to calico

https://gerrit.wikimedia.org/r/951522

Change 951524 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/puppet@production] kubernetes: add new nodes

https://gerrit.wikimedia.org/r/951524

Change 951522 merged by Hnowlan:

[operations/puppet@production] kubernetes: add new kubernetes nodes to calico

https://gerrit.wikimedia.org/r/951522

Change 951524 merged by Hnowlan:

[operations/puppet@production] kubernetes: add new nodes

https://gerrit.wikimedia.org/r/951524

Updated physical labeling as requested.

Jclark-ctr updated the task description.