Put cloudcephosd10[39-41] into service
Closed, Resolved · Public

Description

Just a reminder that these are racked and installed but not part of the Ceph cluster yet:

# new cloudceph storage nodes T361366
node /^cloudcephosd10(39|4[0-1])\.eqiad\./ {
    role(insetup::wmcs)
}
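
As a quick sanity check on which hosts that node block selects, here is a minimal Python sketch. It assumes Python's `re` treats this particular pattern the same way Puppet's (Ruby) regex engine does, which holds for single-line hostnames; the helper name `matches_node_block` and the candidate host list are illustrative only.

```python
import re

# Python translation of the Puppet node regex from the snippet above.
# Assumption: for single-line FQDNs, Python and Ruby agree on this pattern.
NODE_PATTERN = re.compile(r"^cloudcephosd10(39|4[0-1])\.eqiad\.")


def matches_node_block(fqdn: str) -> bool:
    """Return True if the FQDN would be selected by the node regex."""
    return NODE_PATTERN.search(fqdn) is not None


# Hypothetical candidates: the three new hosts plus one neighbour on each side.
hosts = [f"cloudcephosd10{n}.eqiad.wmnet" for n in range(38, 43)]
matched = [h for h in hosts if matches_node_block(h)]
print(matched)
```

Only cloudcephosd1039, cloudcephosd1040 and cloudcephosd1041 should be selected; 1038 and 1042 fall outside the `(39|4[0-1])` alternation.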

Event Timeline

All three of these need reimaging to get the drive labels set up properly; right now they all have a big OSD drive assigned to the OS.

Change #1063892 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] Make cloudcephosd1039-1041 into ceph osd nodes

https://gerrit.wikimedia.org/r/1063892

These are now rebuilt with proper partitioning. They probably shouldn't be bootstrapped until T372821 is resolved.

Andrew triaged this task as Medium priority. (Aug 21 2024, 2:11 PM)

@Andrew I see this ticket is in my name. Is there something I need to do for this?

I think that was just overlooked when creating the task, xd. I'll take it.

Change #1063892 merged by David Caro:

[operations/puppet@production] Make cloudcephosd1039-1041 into ceph osd nodes

https://gerrit.wikimedia.org/r/1063892

Cookbook cookbooks.sre.hosts.reimage was started by dcaro@cumin1002 for host cloudcephosd1039.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by dcaro@cumin1002 for host cloudcephosd1039.eqiad.wmnet with OS bullseye executed with errors:

  • cloudcephosd1039 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console cloudcephosd1039.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Change #1075552 had a related patch set uploaded (by David Caro; author: David Caro):

[operations/puppet@production] cloudcephosd: don't remove_os_md

https://gerrit.wikimedia.org/r/1075552

Change #1075552 merged by David Caro:

[operations/puppet@production] cloudcephosd: don't remove_os_md

https://gerrit.wikimedia.org/r/1075552

Cookbook cookbooks.sre.hosts.reimage was started by dcaro@cumin1002 for host cloudcephosd1039.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by dcaro@cumin1002 for host cloudcephosd1039.eqiad.wmnet with OS bullseye completed:

  • cloudcephosd1039 (WARN)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202409251717_dcaro_85072_cloudcephosd1039.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • Failed to run the sre.puppet.sync-netbox-hiera cookbook, run it manually

Mentioned in SAL (#wikimedia-operations) [2024-09-26T07:40:55Z] <dcaro@cumin1002> START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Run failed when reimaging cloudcephosd1039 and asked to run manually - dcaro@cumin1002 - T372814"

Mentioned in SAL (#wikimedia-operations) [2024-09-26T07:41:01Z] <dcaro@cumin1002> END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Run failed when reimaging cloudcephosd1039 and asked to run manually - dcaro@cumin1002 - T372814"

Mentioned in SAL (#wikimedia-cloud-feed) [2024-09-26T07:42:33Z] <wmbot~dcaro@urcuchillay> START - Cookbook wmcs.ceph.osd.bootstrap_and_add (T372814)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-09-26T07:42:41Z] <wmbot~dcaro@urcuchillay> END (FAIL) - Cookbook wmcs.ceph.osd.bootstrap_and_add (exit_code=99) (T372814)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-09-26T07:45:23Z] <wmbot~dcaro@urcuchillay> START - Cookbook wmcs.ceph.osd.bootstrap_and_add (T372814)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-09-26T07:45:29Z] <wmbot~dcaro@urcuchillay> END (FAIL) - Cookbook wmcs.ceph.osd.bootstrap_and_add (exit_code=99) (T372814)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-09-26T07:46:19Z] <wmbot~dcaro@urcuchillay> START - Cookbook wmcs.ceph.osd.bootstrap_and_add (T372814)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-09-26T07:46:24Z] <wmbot~dcaro@urcuchillay> END (FAIL) - Cookbook wmcs.ceph.osd.bootstrap_and_add (exit_code=99) (T372814)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-09-26T07:47:03Z] <wmbot~dcaro@urcuchillay> START - Cookbook wmcs.ceph.osd.bootstrap_and_add (T372814)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-09-26T07:47:09Z] <wmbot~dcaro@urcuchillay> END (FAIL) - Cookbook wmcs.ceph.osd.bootstrap_and_add (exit_code=99) (T372814)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-09-26T07:48:03Z] <wmbot~dcaro@urcuchillay> START - Cookbook wmcs.ceph.osd.bootstrap_and_add (T372814)

Change #1075844 had a related patch set uploaded (by David Caro; author: David Caro):

[operations/puppet@production] cloudcephosd1040/41: force puppet 7

https://gerrit.wikimedia.org/r/1075844

Change #1075844 merged by David Caro:

[operations/puppet@production] cloudcephosd1040/41: force puppet 7

https://gerrit.wikimedia.org/r/1075844

Cookbook cookbooks.sre.hosts.reimage was started by dcaro@cumin1002 for host cloudcephosd1040.eqiad.wmnet with OS bullseye

Mentioned in SAL (#wikimedia-cloud-feed) [2024-09-26T07:59:31Z] <wmbot~dcaro@urcuchillay> END (FAIL) - Cookbook wmcs.ceph.osd.bootstrap_and_add (exit_code=99) (T372814)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-09-26T08:04:33Z] <wmbot~dcaro@urcuchillay> START - Cookbook wmcs.ceph.osd.bootstrap_and_add (T372814)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-09-26T08:04:42Z] <wmbot~dcaro@urcuchillay> END (FAIL) - Cookbook wmcs.ceph.osd.bootstrap_and_add (exit_code=99) (T372814)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-09-26T08:52:51Z] <wmbot~dcaro@urcuchillay> START - Cookbook wmcs.ceph.osd.bootstrap_and_add (T372814)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-09-26T08:53:02Z] <wmbot~dcaro@urcuchillay> END (FAIL) - Cookbook wmcs.ceph.osd.bootstrap_and_add (exit_code=99) (T372814)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-09-26T08:54:33Z] <wmbot~dcaro@urcuchillay> START - Cookbook wmcs.ceph.osd.bootstrap_and_add (T372814)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-09-26T08:54:38Z] <wmbot~dcaro@urcuchillay> END (FAIL) - Cookbook wmcs.ceph.osd.bootstrap_and_add (exit_code=99) (T372814)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-09-26T08:54:56Z] <wmbot~dcaro@urcuchillay> START - Cookbook wmcs.ceph.osd.bootstrap_and_add (T372814)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-09-26T08:55:01Z] <wmbot~dcaro@urcuchillay> END (FAIL) - Cookbook wmcs.ceph.osd.bootstrap_and_add (exit_code=99) (T372814)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-09-26T08:55:26Z] <wmbot~dcaro@urcuchillay> START - Cookbook wmcs.ceph.osd.bootstrap_and_add (T372814)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-09-26T08:55:34Z] <wmbot~dcaro@urcuchillay> END (FAIL) - Cookbook wmcs.ceph.osd.bootstrap_and_add (exit_code=99) (T372814)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-09-26T08:56:38Z] <wmbot~dcaro@urcuchillay> START - Cookbook wmcs.ceph.osd.bootstrap_and_add (T372814)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-09-26T08:56:45Z] <wmbot~dcaro@urcuchillay> END (FAIL) - Cookbook wmcs.ceph.osd.bootstrap_and_add (exit_code=99) (T372814)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-09-26T08:57:35Z] <wmbot~dcaro@urcuchillay> START - Cookbook wmcs.ceph.osd.bootstrap_and_add (T372814)

Cookbook cookbooks.sre.hosts.reimage started by dcaro@cumin1002 for host cloudcephosd1040.eqiad.wmnet with OS bullseye executed with errors:

  • cloudcephosd1040 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202409260820_dcaro_219528_cloudcephosd1040.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console cloudcephosd1040.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by dcaro@cumin1002 for host cloudcephosd1040.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by dcaro@cumin1002 for host cloudcephosd1040.eqiad.wmnet with OS bullseye completed:

  • cloudcephosd1040 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202409260921_dcaro_233044_cloudcephosd1040.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by dcaro@cumin1002 for host cloudcephosd1041.eqiad.wmnet with OS bullseye

Mentioned in SAL (#wikimedia-cloud-feed) [2024-09-26T10:11:41Z] <wmbot~dcaro@urcuchillay> END (FAIL) - Cookbook wmcs.ceph.osd.bootstrap_and_add (exit_code=99) (T372814)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-09-26T10:25:09Z] <wmbot~dcaro@urcuchillay> START - Cookbook wmcs.ceph.osd.bootstrap_and_add (T372814)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-09-26T10:25:21Z] <wmbot~dcaro@urcuchillay> END (PASS) - Cookbook wmcs.ceph.osd.bootstrap_and_add (exit_code=0) (T372814)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-09-26T10:26:35Z] <wmbot~dcaro@urcuchillay> START - Cookbook wmcs.ceph.osd.undrain_node (T372814)

Cookbook cookbooks.sre.hosts.reimage started by dcaro@cumin1002 for host cloudcephosd1041.eqiad.wmnet with OS bullseye completed:

  • cloudcephosd1041 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202409261015_dcaro_245878_cloudcephosd1041.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-cloud-feed) [2024-09-26T14:28:11Z] <wmbot~dcaro@urcuchillay> END (FAIL) - Cookbook wmcs.ceph.osd.undrain_node (exit_code=99) (T372814)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-09-26T14:28:18Z] <wmbot~dcaro@urcuchillay> START - Cookbook wmcs.ceph.osd.undrain_node (T372814)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-09-30T09:00:00Z] <wmbot~dcaro@urcuchillay> START - Cookbook wmcs.ceph.osd.bootstrap_and_add (T372814)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-09-30T09:00:07Z] <wmbot~dcaro@urcuchillay> END (FAIL) - Cookbook wmcs.ceph.osd.bootstrap_and_add (exit_code=99) (T372814)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-09-30T09:00:59Z] <wmbot~dcaro@urcuchillay> START - Cookbook wmcs.ceph.osd.bootstrap_and_add (T372814)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-09-30T13:47:39Z] <wmbot~dcaro@urcuchillay> END (PASS) - Cookbook wmcs.ceph.osd.bootstrap_and_add (exit_code=0) (T372814)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-09-30T16:35:02Z] <wmbot~dcaro@urcuchillay> START - Cookbook wmcs.ceph.osd.undrain_node (T372814)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-09-30T20:12:18Z] <wmbot~dcaro@urcuchillay> END (FAIL) - Cookbook wmcs.ceph.osd.undrain_node (exit_code=99) (T372814)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-10-01T08:10:10Z] <wmbot~dcaro@urcuchillay> START - Cookbook wmcs.ceph.osd.bootstrap_and_add (T372814)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-10-01T08:18:26Z] <wmbot~dcaro@urcuchillay> END (FAIL) - Cookbook wmcs.ceph.osd.bootstrap_and_add (exit_code=99) (T372814)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-10-01T08:23:02Z] <wmbot~dcaro@urcuchillay> START - Cookbook wmcs.ceph.osd.bootstrap_and_add (T372814)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-10-01T15:40:58Z] <wmbot~dcaro@urcuchillay> END (FAIL) - Cookbook wmcs.ceph.osd.bootstrap_and_add (exit_code=99) (T372814)
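
For anyone skimming the SAL entries above, a small Python sketch like the following can tally cookbook outcomes from the END lines. The regex and the `tally` helper are assumptions based only on the line format visible in this timeline, not on any official SAL tooling; the sample lines are copied from the entries above.

```python
import re
from collections import Counter

# Matches the "END (PASS|FAIL) - Cookbook <name> (exit_code=N)" part of a SAL line.
SAL_END = re.compile(r"END \((PASS|FAIL)\) - Cookbook (\S+) \(exit_code=(\d+)\)")


def tally(lines):
    """Count END entries per (cookbook, status); START lines are ignored."""
    counts = Counter()
    for line in lines:
        match = SAL_END.search(line)
        if match:
            status, cookbook, _exit_code = match.groups()
            counts[(cookbook, status)] += 1
    return counts


# Sample lines copied from the timeline above.
sample = [
    "[2024-09-26T07:42:33Z] <wmbot~dcaro@urcuchillay> START - Cookbook wmcs.ceph.osd.bootstrap_and_add (T372814)",
    "[2024-09-26T07:42:41Z] <wmbot~dcaro@urcuchillay> END (FAIL) - Cookbook wmcs.ceph.osd.bootstrap_and_add (exit_code=99) (T372814)",
    "[2024-09-26T10:25:21Z] <wmbot~dcaro@urcuchillay> END (PASS) - Cookbook wmcs.ceph.osd.bootstrap_and_add (exit_code=0) (T372814)",
    "[2024-09-26T14:28:11Z] <wmbot~dcaro@urcuchillay> END (FAIL) - Cookbook wmcs.ceph.osd.undrain_node (exit_code=99) (T372814)",
]

counts = tally(sample)
for (cookbook, status), n in sorted(counts.items()):
    print(f"{cookbook}: {status} x{n}")
```

A START entry without a matching END (several appear above) is simply not counted, so the tally reflects only runs that reported a result.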

Done! All three are upgraded, set up, and joined to the cluster.