[ceph] Upgrade hosts to bullseye
Closed, Resolved (Public)

Description

This was blocked by the lack of packages; T309786 unblocks it.

codfw

  • cloudcephmon2004-dev
  • cloudcephmon2005-dev
  • cloudcephmon2006-dev
  • cloudcephosd2001-dev (replaced with new hardware)
  • cloudcephosd2002-dev (replaced with new hardware)
  • cloudcephosd2003-dev (replaced with new hardware)

eqiad

  • cloudcephosd1006
  • cloudcephosd1007
  • cloudcephosd1008
  • cloudcephosd1009
  • cloudcephosd1011
  • cloudcephosd1012
  • cloudcephosd1013
  • cloudcephosd1014
  • cloudcephosd1015
  • cloudcephosd1016
  • cloudcephosd1017
  • cloudcephosd1018
  • cloudcephosd1019
  • cloudcephosd1020
  • cloudcephosd1022
  • cloudcephosd1023
  • cloudcephosd1024
  • cloudcephmon1001 -- replaced with cloudcephmon1004
  • cloudcephmon1002 -- replaced with cloudcephmon1005
  • cloudcephmon1003 -- replaced with cloudcephmon1006

Event Timeline

Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1002 for host cloudcephosd1013.eqiad.wmnet with OS bullseye executed with errors:

  • cloudcephosd1013 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console cloudcephosd1013.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1002 for host cloudcephosd1013.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1002 for host cloudcephosd1013.eqiad.wmnet with OS bullseye executed with errors:

  • cloudcephosd1013 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console cloudcephosd1013.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1002 for host cloudcephosd1013.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1002 for host cloudcephosd1013.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1002 for host cloudcephosd1013.eqiad.wmnet with OS bullseye executed with errors:

  • cloudcephosd1013 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console cloudcephosd1013.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1002 for host cloudcephosd1013.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1002 for host cloudcephosd1013.eqiad.wmnet with OS bullseye executed with errors:

  • cloudcephosd1013 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console cloudcephosd1013.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1002 for host cloudcephosd1013.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1002 for host cloudcephosd1013.eqiad.wmnet with OS bullseye executed with errors:

  • cloudcephosd1013 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console cloudcephosd1013.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1002 for host cloudcephosd1013.eqiad.wmnet with OS bullseye executed with errors:

  • cloudcephosd1013 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console cloudcephosd1013.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1002 for host cloudcephosd1013.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1002 for host cloudcephosd1013.eqiad.wmnet with OS bullseye executed with errors:

  • cloudcephosd1013 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console cloudcephosd1013.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1002 for host cloudcephosd1013.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1002 for host cloudcephosd1013.eqiad.wmnet with OS bullseye executed with errors:

  • cloudcephosd1013 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console cloudcephosd1013.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1002 for host cloudcephosd1013.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1002 for host cloudcephosd1013.eqiad.wmnet with OS bullseye executed with errors:

  • cloudcephosd1013 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console cloudcephosd1013.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1002 for host cloudcephosd1013.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1002 for host cloudcephosd1013.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1002 for host cloudcephosd1013.eqiad.wmnet with OS bullseye completed:

  • cloudcephosd1013 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202501271413_andrew_2581557_cloudcephosd1013.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Aklapper changed the task status from In Progress to Open. Mar 22 2025, 7:23 AM
Aklapper subscribed.

Resetting task status from "In Progress" to "Open" as this task has been "in progress" for more than two years.

This is still ongoing work :/

Mentioned in SAL (#wikimedia-cloud-feed) [2025-06-12T02:19:02Z] <andrew@cloudcumin1001> START - Cookbook wmcs.ceph.osd.drain_node (T309789)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-06-12T05:23:14Z] <andrew@cloudcumin1001> END (PASS) - Cookbook wmcs.ceph.osd.drain_node (exit_code=0) (T309789)

Change #1156302 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] cloudcephosd1015 -> puppet 7

https://gerrit.wikimedia.org/r/1156302

Change #1156303 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] cloudcephosd1016 -> puppet 7

https://gerrit.wikimedia.org/r/1156303

Change #1156304 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] cloudcephosd1017 -> puppet 7

https://gerrit.wikimedia.org/r/1156304

Change #1156305 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] cloudcephosd1018 -> puppet 7

https://gerrit.wikimedia.org/r/1156305

Change #1156306 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] cloudcephosd1019 -> puppet 7

https://gerrit.wikimedia.org/r/1156306

Change #1156307 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] cloudcephosd1020 -> puppet 7

https://gerrit.wikimedia.org/r/1156307

Mentioned in SAL (#wikimedia-cloud-feed) [2025-06-12T11:41:35Z] <andrew@cloudcumin1001> START - Cookbook wmcs.ceph.osd.drain_node (T309789)

Change #1156375 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] cloudcephosd1014: update nic names

https://gerrit.wikimedia.org/r/1156375

Change #1156375 merged by Andrew Bogott:

[operations/puppet@production] cloudcephosd1014: update nic names

https://gerrit.wikimedia.org/r/1156375

Change #1156302 merged by Andrew Bogott:

[operations/puppet@production] cloudcephosd1015 -> puppet 7

https://gerrit.wikimedia.org/r/1156302

cloudcephosd1015 reimage troubleshooting. Things I checked:

  • The PXE boot setting is indeed set to the 10G NIC's port 0 as expected.
  • Boot system with: sudo cookbook sre.hosts.reimage --os bookworm cloudcephosd1015
  • The system starts the reimage script, but when it sends its DHCP request the install1004 syslog shows only the following, with no DHCP accept; the host then falls through to the normal OS load:
Jun 12 16:51:29 install1004 dhcpd[60964]: DHCPDISCOVER from bc:97:e1:28:3a:10 via 10.64.20.252
Jun 12 16:51:29 install1004 dhcpd[60964]: DHCPOFFER on 10.64.20.66 to bc:97:e1:28:3a:10 via 10.64.20.252

So it appears it's sending the request and our install host is sending back the offer, but the offer is not being received by cloudcephosd1015, which then falls back to booting from disk instead.
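
The check described above (a DHCPDISCOVER and DHCPOFFER are logged, but nothing follows) can be sketched as a small log filter. This is only an illustration: the regular expression assumes the dhcpd message layout seen in the two syslog lines quoted above, and the function names are hypothetical, not part of any existing tooling.

```python
import re
from collections import defaultdict

# Assumed layout, taken from the two syslog lines quoted in this task, e.g.
# "dhcpd[60964]: DHCPOFFER on 10.64.20.66 to bc:97:e1:28:3a:10 via 10.64.20.252"
DHCP_RE = re.compile(
    r"dhcpd\[\d+\]: (DHCP[A-Z]+) .*?((?:[0-9a-f]{2}:){5}[0-9a-f]{2})",
    re.IGNORECASE,
)


def dhcp_stages_by_mac(lines):
    """Return {mac: [DHCP message types in the order they were logged]}."""
    stages = defaultdict(list)
    for line in lines:
        match = DHCP_RE.search(line)
        if match:
            msg_type, mac = match.groups()
            stages[mac.lower()].append(msg_type.upper())
    return dict(stages)


def stalled_after_offer(stages):
    """Return MACs that were sent an offer but never logged a DHCPREQUEST."""
    return sorted(
        mac
        for mac, seen in stages.items()
        if "DHCPOFFER" in seen and "DHCPREQUEST" not in seen
    )


# The two lines from the install1004 syslog quoted above.
SAMPLE = [
    "Jun 12 16:51:29 install1004 dhcpd[60964]: DHCPDISCOVER from bc:97:e1:28:3a:10 via 10.64.20.252",
    "Jun 12 16:51:29 install1004 dhcpd[60964]: DHCPOFFER on 10.64.20.66 to bc:97:e1:28:3a:10 via 10.64.20.252",
]

stages = dhcp_stages_by_mac(SAMPLE)
print(stalled_after_offer(stages))  # ['bc:97:e1:28:3a:10']
```

A MAC that appears in this list got an offer from the server's point of view but never answered it, which matches the symptom here. It does not say whether the offer was lost in transit or ignored by the client.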

Mentioned in SAL (#wikimedia-cloud-feed) [2025-06-12T17:51:05Z] <andrew@cloudcumin1001> END (PASS) - Cookbook wmcs.ceph.osd.drain_node (exit_code=0) (T309789)

> So it appears it's sending the request and our install host is sending back the offer, but the offer is not being received by cloudcephosd1015, which then falls back to booting from disk instead.

Thanks Rob. To check, I connected to the host over the mgmt console, deleted its static IP, and manually ran dhclient. That worked fine: the DHCP OFFER made it back to the host, and the host processed it and replied with a REQUEST, etc. You can see the flow in P77858.

So the issue is not network comms from the switch/port to the install server and back, nor any misconfig of the VLAN etc. I'm not sure why the OFFER would be ignored at the PXE boot stage; I don't think I've seen that one before.

Mentioned in SAL (#wikimedia-cloud-feed) [2025-06-12T19:18:00Z] <andrew@cloudcumin1001> START - Cookbook wmcs.ceph.osd.drain_node (T309789)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-06-12T19:49:02Z] <andrew@cloudcumin1001> END (PASS) - Cookbook wmcs.ceph.osd.drain_node (exit_code=0) (T309789)

Change #1156303 merged by Andrew Bogott:

[operations/puppet@production] cloudcephosd1016 -> puppet 7

https://gerrit.wikimedia.org/r/1156303

Mentioned in SAL (#wikimedia-cloud-feed) [2025-06-13T00:36:46Z] <andrew@cloudcumin1001> START - Cookbook wmcs.ceph.osd.drain_node (T309789)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-06-13T00:57:38Z] <andrew@cloudcumin1001> END (ERROR) - Cookbook wmcs.ceph.osd.drain_node (exit_code=97) (T309789)

Change #1156304 merged by Andrew Bogott:

[operations/puppet@production] cloudcephosd1017 -> puppet 7

https://gerrit.wikimedia.org/r/1156304

Mentioned in SAL (#wikimedia-cloud-feed) [2025-06-13T04:07:00Z] <andrew@cloudcumin1001> START - Cookbook wmcs.ceph.osd.drain_node (T309789)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-06-13T06:31:21Z] <andrew@cloudcumin1001> END (FAIL) - Cookbook wmcs.ceph.osd.drain_node (exit_code=99) (T309789)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-06-13T11:46:18Z] <andrew@cloudcumin1001> START - Cookbook wmcs.ceph.osd.drain_node (T309789)

Change #1156305 merged by Andrew Bogott:

[operations/puppet@production] cloudcephosd1018 -> puppet 7

https://gerrit.wikimedia.org/r/1156305

Mentioned in SAL (#wikimedia-cloud-feed) [2025-06-13T15:50:14Z] <andrew@cloudcumin1001> END (PASS) - Cookbook wmcs.ceph.osd.drain_node (exit_code=0) (T309789)

Change #1156306 merged by Andrew Bogott:

[operations/puppet@production] cloudcephosd1019 -> puppet 7

https://gerrit.wikimedia.org/r/1156306

Mentioned in SAL (#wikimedia-cloud-feed) [2025-06-13T20:13:15Z] <andrew@cloudcumin1001> START - Cookbook wmcs.ceph.osd.drain_node (T309789)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-06-14T00:17:58Z] <andrew@cloudcumin1001> END (FAIL) - Cookbook wmcs.ceph.osd.drain_node (exit_code=99) (T309789)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-06-14T01:38:44Z] <andrew@cloudcumin1001> START - Cookbook wmcs.ceph.osd.drain_node (T309789)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-06-14T03:51:35Z] <andrew@cloudcumin1001> END (ERROR) - Cookbook wmcs.ceph.osd.drain_node (exit_code=97) (T309789)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-06-14T03:55:25Z] <andrew@cloudcumin1001> START - Cookbook wmcs.ceph.osd.drain_node (T309789)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-06-14T03:56:17Z] <andrew@cloudcumin1001> END (FAIL) - Cookbook wmcs.ceph.osd.drain_node (exit_code=99) (T309789)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-06-14T03:56:47Z] <andrew@cloudcumin1001> START - Cookbook wmcs.ceph.osd.drain_node (T309789)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-06-14T03:57:47Z] <andrew@cloudcumin1001> END (PASS) - Cookbook wmcs.ceph.osd.drain_node (exit_code=0) (T309789)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-06-14T04:23:01Z] <andrew@cloudcumin1001> START - Cookbook wmcs.ceph.osd.drain_node (T309789)

Change #1156307 merged by Andrew Bogott:

[operations/puppet@production] cloudcephosd1020 -> puppet 7

https://gerrit.wikimedia.org/r/1156307

Mentioned in SAL (#wikimedia-cloud-feed) [2025-06-14T04:23:51Z] <andrew@cloudcumin1001> END (FAIL) - Cookbook wmcs.ceph.osd.drain_node (exit_code=99) (T309789)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-06-14T04:25:54Z] <andrew@cloudcumin1001> START - Cookbook wmcs.ceph.osd.drain_node (T309789)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-06-14T07:29:48Z] <andrew@cloudcumin1001> END (PASS) - Cookbook wmcs.ceph.osd.drain_node (exit_code=0) (T309789)

Change #1157505 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] cloudcephosd1022: prepare for Bullseye upgrade

https://gerrit.wikimedia.org/r/1157505

Change #1157506 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] cloudcephosd1023: prepare for Bullseye upgrade

https://gerrit.wikimedia.org/r/1157506

Change #1157507 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] cloudcephosd1024: prepare for Bullseye upgrade

https://gerrit.wikimedia.org/r/1157507

Change #1157505 merged by Andrew Bogott:

[operations/puppet@production] cloudcephosd1022: prepare for Bullseye upgrade

https://gerrit.wikimedia.org/r/1157505

Mentioned in SAL (#wikimedia-cloud-feed) [2025-06-14T13:30:57Z] <andrew@cloudcumin1001> START - Cookbook wmcs.ceph.osd.drain_node (T309789)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-06-14T18:44:34Z] <andrew@cloudcumin1001> END (PASS) - Cookbook wmcs.ceph.osd.drain_node (exit_code=0) (T309789)

Change #1157506 merged by Andrew Bogott:

[operations/puppet@production] cloudcephosd1023: prepare for Bullseye upgrade

https://gerrit.wikimedia.org/r/1157506

Change #1157507 merged by Andrew Bogott:

[operations/puppet@production] cloudcephosd1024: prepare for Bullseye upgrade

https://gerrit.wikimedia.org/r/1157507

Change #1159425 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] codfw1dev ceph: cloudcephmons -> puppet 7

https://gerrit.wikimedia.org/r/1159425

Change #1159425 merged by Andrew Bogott:

[operations/puppet@production] codfw1dev ceph: cloudcephmons -> puppet 7

https://gerrit.wikimedia.org/r/1159425