[ceph] Upgrade hosts to bullseye
Open, In Progress, High, Public

Description

This was blocked by the lack of packages; T309786 unblocks it.

codfw

  • cloudcephmon2004-dev
  • cloudcephmon2005-dev
  • cloudcephmon2006-dev
  • cloudcephosd2001-dev
  • cloudcephosd2002-dev
  • cloudcephosd2003-dev

eqiad

  • cloudcephosd1006
  • cloudcephosd1007
  • cloudcephosd1008
  • cloudcephosd1009
  • cloudcephosd1011
  • cloudcephosd1012
  • cloudcephosd1013
  • cloudcephosd1014
  • cloudcephosd1015
  • cloudcephosd1016
  • cloudcephosd1017
  • cloudcephosd1018
  • cloudcephosd1019
  • cloudcephosd1020
  • cloudcephosd1022
  • cloudcephosd1023
  • cloudcephosd1024
  • cloudcephmon1001
  • cloudcephmon1002
  • cloudcephmon1003

Event Timeline

dcaro triaged this task as High priority. Jun 2 2022, 2:38 PM
dcaro created this task.
dcaro moved this task from To refine to Today on the User-dcaro board.

Mentioned in SAL (#wikimedia-cloud-feed) [2022-06-13T09:15:09Z] <wm-bot2> Rebooting node cloudcephosd1021.eqiad.wmnet (T309789) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2022-06-13T11:03:26Z] <wm-bot2> Rebooting node cloudcephosd1021.eqiad.wmnet (T309789) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2022-06-13T11:04:19Z] <wm-bot2> Rebooting node cloudcephosd1021.eqiad.wmnet (T309789) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2022-06-13T11:05:25Z] <wm-bot2> Rebooting node cloudcephosd1021.eqiad.wmnet (T309789) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2022-06-13T11:07:48Z] <wm-bot2> Rebooting node cloudcephosd1021.eqiad.wmnet (T309789) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2022-06-13T11:08:34Z] <wm-bot2> Rebooting node cloudcephosd1021.eqiad.wmnet (T309789) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2022-06-13T11:14:52Z] <wm-bot2> Finished rebooting node cloudcephosd1021.eqiad.wmnet (T309789) - cookbook ran by dcaro@vulcanus

Change 805108 had a related patch set uploaded (by David Caro; author: David Caro):

[operations/cookbooks@wmcs] Use our own alert managing

https://gerrit.wikimedia.org/r/805108

Change 805108 merged by jenkins-bot:

[operations/cookbooks@wmcs] Use our own alert managing

https://gerrit.wikimedia.org/r/805108

dcaro lowered the priority of this task from High to Medium. Aug 26 2022, 10:28 AM
dcaro raised the priority of this task from Medium to High.
dcaro changed the task status from Open to In Progress. Jan 31 2023, 9:57 AM
dcaro moved this task from Backlog to In progress on the cloud-services-team (FY2022/2023-Q3) board.
fnegri renamed this task from ceph: upgrade hosts to bullseye to [ceph] Upgrade hosts to bullseye. Jan 22 2024, 5:34 PM
fnegri added a project: Cloud-VPS.

Mentioned in SAL (#wikimedia-cloud-feed) [2024-06-26T09:59:53Z] <wmbot~dcaro@urcuchillay> START - Cookbook wmcs.ceph.osd.depool_and_destroy (T309789)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-06-26T10:00:01Z] <wmbot~dcaro@urcuchillay> END (FAIL) - Cookbook wmcs.ceph.osd.depool_and_destroy (exit_code=99) (T309789)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-06-26T10:00:28Z] <wmbot~dcaro@urcuchillay> START - Cookbook wmcs.ceph.osd.depool_and_destroy (T309789)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-06-26T11:16:29Z] <wmbot~dcaro@urcuchillay> END (PASS) - Cookbook wmcs.ceph.osd.depool_and_destroy (exit_code=0) (T309789)

Cookbook cookbooks.sre.hosts.reimage was started by root@cumin1002 for host cloudcephosd1006.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by root@cumin1002 for host cloudcephosd1006.eqiad.wmnet with OS bullseye executed with errors:

  • cloudcephosd1006 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" cloudcephosd1006.eqiad.wmnet to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by root@cumin1002 for host cloudcephosd1006.eqiad.wmnet with OS bullseye

Mentioned in SAL (#wikimedia-cloud-feed) [2024-06-26T14:45:37Z] <wmbot~dcaro@urcuchillay> START - Cookbook wmcs.ceph.osd.bootstrap_and_add (T309789)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-06-26T14:45:43Z] <wmbot~dcaro@urcuchillay> END (FAIL) - Cookbook wmcs.ceph.osd.bootstrap_and_add (exit_code=99) (T309789)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-06-26T14:45:46Z] <wmbot~dcaro@urcuchillay> START - Cookbook wmcs.ceph.osd.bootstrap_and_add (T309789)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-06-26T14:45:59Z] <wmbot~dcaro@urcuchillay> END (FAIL) - Cookbook wmcs.ceph.osd.bootstrap_and_add (exit_code=99) (T309789)

Change #1049964 had a related patch set uploaded (by David Caro; author: David Caro):

[operations/puppet@production] cloudcephosd1006: update interface names

https://gerrit.wikimedia.org/r/1049964

Change #1049964 merged by David Caro:

[operations/puppet@production] cloudcephosd1006: update interface names

https://gerrit.wikimedia.org/r/1049964

Mentioned in SAL (#wikimedia-cloud-feed) [2024-06-26T15:09:32Z] <wmbot~dcaro@urcuchillay> START - Cookbook wmcs.ceph.osd.bootstrap_and_add (T309789)

Cookbook cookbooks.sre.hosts.reimage started by root@cumin1002 for host cloudcephosd1006.eqiad.wmnet with OS bullseye completed:

  • cloudcephosd1006 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202406261409_root_1179650_cloudcephosd1006.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-cloud-feed) [2024-06-26T15:18:00Z] <wmbot~dcaro@urcuchillay> END (FAIL) - Cookbook wmcs.ceph.osd.bootstrap_and_add (exit_code=99) (T309789)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-06-26T15:46:48Z] <wmbot~dcaro@urcuchillay> START - Cookbook wmcs.ceph.osd.bootstrap_and_add (T309789)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-06-26T19:18:17Z] <wmbot~dcaro@urcuchillay> END (FAIL) - Cookbook wmcs.ceph.osd.bootstrap_and_add (exit_code=99) (T309789)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-06-27T12:07:16Z] <wmbot~dcaro@urcuchillay> START - Cookbook wmcs.ceph.osd.depool_and_destroy (T309789)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-06-27T13:44:00Z] <wmbot~dcaro@urcuchillay> END (PASS) - Cookbook wmcs.ceph.osd.depool_and_destroy (exit_code=0) (T309789)

Cookbook cookbooks.sre.hosts.reimage was started by dcaro@cumin1002 for host cloudcephosd1007.eqiad.wmnet with OS bullseye