
Q4:rack/setup/install cloudcephosd10[35-38]
Closed, Resolved · Public

Description

This task will track the racking, setup, and OS installation of cloudcephosd10[35-38].

Hostname / Racking / Installation Details

Hostnames: cloudcephosd10[35-38].eqiad.wmnet
Racking Proposal: Two hosts in F4, one each in C8 and D5
Networking Setup: 2 x 10G interfaces per server, connected to cloudswitches. Check other hosts (e.g. cloudcephosd1012) for switch config specifics.
Partitioning/Raid: SW raid mirror for the two smaller OS drives, other drives can be left unpartitioned for ceph management.
OS Distro: Bullseye (default unless otherwise specified)
Sub-team Technical Contact: David Caro
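The `10[35-38]` shorthand in the hostnames above covers four hosts. As a quick sketch, plain bash brace expansion produces the full FQDN list (no Wikimedia tooling involved):

```shell
# Expand the cloudcephosd10[35-38] shorthand into the four FQDNs.
for host in cloudcephosd10{35..38}.eqiad.wmnet; do
    echo "$host"
done
```

This is handy for feeding the same host list to ad-hoc loops (ping checks, ssh, etc.) without retyping each name.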

Per host setup checklist

Each host should have its own setup checklist copied and pasted into the list below.

cloudcephosd1035
  • Receive in system on procurement task T351332 & in Coupa
  • Rack system with proposed racking plan (see above) & update Netbox (include all system info plus location, state of planned)
  • Run the "Provision a server's network attributes" Netbox script - note that you must run the DNS and Provision cookbooks after completing this step
  • Immediately run the sre.dns.netbox cookbook
  • Immediately run the sre.hosts.provision cookbook
  • Run the sre.hardware.upgrade-firmware cookbook
  • Update the operations/puppet repo - this should include updates to preseed.yaml and site.pp, with roles defined by service group: https://wikitech.wikimedia.org/wiki/SRE/Dc-operations
  • Run the sre.hosts.reimage cookbook
cloudcephosd1036
  • Receive in system on procurement task T351332 & in Coupa
  • Rack system with proposed racking plan (see above) & update Netbox (include all system info plus location, state of planned)
  • Run the "Provision a server's network attributes" Netbox script - note that you must run the DNS and Provision cookbooks after completing this step
  • Immediately run the sre.dns.netbox cookbook
  • Immediately run the sre.hosts.provision cookbook
  • Run the sre.hardware.upgrade-firmware cookbook
  • Update the operations/puppet repo - this should include updates to preseed.yaml and site.pp, with roles defined by service group: https://wikitech.wikimedia.org/wiki/SRE/Dc-operations
  • Run the sre.hosts.reimage cookbook
cloudcephosd1037
  • Receive in system on procurement task T351332 & in Coupa
  • Rack system with proposed racking plan (see above) & update Netbox (include all system info plus location, state of planned)
  • Run the "Provision a server's network attributes" Netbox script - note that you must run the DNS and Provision cookbooks after completing this step
  • Immediately run the sre.dns.netbox cookbook
  • Immediately run the sre.hosts.provision cookbook
  • Run the sre.hardware.upgrade-firmware cookbook
  • Update the operations/puppet repo - this should include updates to preseed.yaml and site.pp, with roles defined by service group: https://wikitech.wikimedia.org/wiki/SRE/Dc-operations
  • Run the sre.hosts.reimage cookbook
cloudcephosd1038
  • Receive in system on procurement task T351332 & in Coupa
  • Rack system with proposed racking plan (see above) & update Netbox (include all system info plus location, state of planned)
  • Run the "Provision a server's network attributes" Netbox script - note that you must run the DNS and Provision cookbooks after completing this step
  • Immediately run the sre.dns.netbox cookbook
  • Immediately run the sre.hosts.provision cookbook
  • Run the sre.hardware.upgrade-firmware cookbook
  • Update the operations/puppet repo - this should include updates to preseed.yaml and site.pp, with roles defined by service group: https://wikitech.wikimedia.org/wiki/SRE/Dc-operations
  • Run the sre.hosts.reimage cookbook
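The four checklists above repeat the same cookbook sequence per host. As a hedged sketch (the exact flags and argument forms below are assumptions, not taken from this task - check each cookbook's --help on the cumin host before running anything), the ordering can be expressed as a small function that prints the commands for one host:

```shell
# Print the cookbook sequence from the checklist for one host.
# NOTE: flag names and argument forms are illustrative assumptions;
# verify against each cookbook's --help before running for real.
print_setup_steps() {
    local host="$1"
    echo "cookbook sre.dns.netbox 'provision ${host}'"
    echo "cookbook sre.hosts.provision ${host}"
    echo "cookbook sre.hardware.upgrade-firmware ${host}"
    # reimage is typically given the short hostname plus the target OS
    echo "cookbook sre.hosts.reimage --os bullseye ${host%%.*}"
}

for h in cloudcephosd10{35..38}.eqiad.wmnet; do
    print_setup_steps "$h"
done
```

The point of the sketch is the ordering: DNS and provision immediately after the Netbox script, firmware before reimage, puppet repo updates merged before the reimage run.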

Event Timeline

RobH mentioned this in Unknown Object (Task).Apr 24 2024, 3:20 PM
RobH added a parent task: Unknown Object (Task).
RobH moved this task from Backlog to Racking Tasks on the ops-eqiad board.
RobH unsubscribed.

cloudcephosd1035
Rack: C8
U 28
CableID: 5335
Port: 20

cloudcephosd1036
Rack: D5
U 18
CableID: 5337
Port: 18

cloudcephosd1037
Rack: F4
U 34
CableID: 20220089
Port: 41

cloudcephosd1038
Rack: F4
U 33
CableID: 20220015
Port: 2

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host cloudcephosd1035.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host cloudcephosd1035.eqiad.wmnet with OS bullseye executed with errors:

  • cloudcephosd1035 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console cloudcephosd1035.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host cloudcephosd1035.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host cloudcephosd1036.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host cloudcephosd1038.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host cloudcephosd1037.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host cloudcephosd1035.eqiad.wmnet with OS bullseye completed:

  • cloudcephosd1035 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202407012322_jclark_2236473_cloudcephosd1035.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host cloudcephosd1036.eqiad.wmnet with OS bullseye completed:

  • cloudcephosd1036 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202407012336_jclark_2236965_cloudcephosd1036.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host cloudcephosd1037.eqiad.wmnet with OS bullseye completed:

  • cloudcephosd1037 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202407012354_jclark_2238404_cloudcephosd1037.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host cloudcephosd1038.eqiad.wmnet with OS bullseye completed:

  • cloudcephosd1038 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202407012357_jclark_2238507_cloudcephosd1038.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully
Jclark-ctr updated the task description.
Jclark-ctr updated Other Assignee, added: VRiley-WMF.
Jclark-ctr added a subscriber: cmooney.

@VRiley-WMF if you can update with 2nd network connection then hand over to @cmooney


@Jclark-ctr and @cmooney I have plugged in a 2nd network cable. Here is that information:

cloudcephosd1035 - CableID 5328 : Port 42

cloudcephosd1036 - CableID 5348 : Port 21

cloudcephosd1037 - CableID 20220044 : Port 33

cloudcephosd1038 - CableID 20220013 : Port 3

Thanks guys, the second ports are now configured on the switches.

I should say the cloudcephosd1036 change has not been pushed to the switch yet - that will happen when we do a homer run after the planned reboot/upgrade (T371879)

Change #1060146 had a related patch set uploaded (by David Caro; author: David Caro):

[operations/puppet@production] ceph: add new cloudcephosd1035

https://gerrit.wikimedia.org/r/1060146

Change #1060188 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] Make cloudcephosd103[578] into ceph osd nodes

https://gerrit.wikimedia.org/r/1060188

Change #1060188 merged by Andrew Bogott:

[operations/puppet@production] Make cloudcephosd103[578] into ceph osd nodes

https://gerrit.wikimedia.org/r/1060188

Change #1060190 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] Add ceph config for cloudcephosd103[5-8]

https://gerrit.wikimedia.org/r/1060190

Change #1060190 merged by Andrew Bogott:

[operations/puppet@production] Add ceph config for cloudcephosd103[5-8]

https://gerrit.wikimedia.org/r/1060190

Change #1060337 had a related patch set uploaded (by David Caro; author: David Caro):

[operations/puppet@production] ceph.osd: move the new 103[5-8] nodes to the per-rack ip blocks

https://gerrit.wikimedia.org/r/1060337

Change #1060146 abandoned by David Caro:

[operations/puppet@production] ceph: add new cloudcephosd1035

Reason:

Superseded by https://gerrit.wikimedia.org/r/1060188

https://gerrit.wikimedia.org/r/1060146

Change #1060337 merged by David Caro:

[operations/puppet@production] ceph.osd: move the new 103[5-8] nodes to the per-rack ip blocks

https://gerrit.wikimedia.org/r/1060337

Change #1060381 had a related patch set uploaded (by David Caro; author: David Caro):

[operations/puppet@production] cloudceph.osd: remove 1036 as we are not adding it yet

https://gerrit.wikimedia.org/r/1060381

Change #1060381 merged by David Caro:

[operations/puppet@production] cloudceph.osd: remove 1036 as we are not adding it yet

https://gerrit.wikimedia.org/r/1060381

cloudcephosd1035 has one drive that was wrongly assigned as 'os raid':

sdb                                                                                                     8:16   0   3.5T  0 disk  
├─sdb1                                                                                                  8:17   0   285M  0 part  
└─sdb2                                                                                                  8:18   0   3.5T  0 part  
  └─md0                                                                                                 9:0    0 446.7G  0 raid1 
    ├─vg0-swap                                                                                        253:0    0   976M  0 lvm   [SWAP]
    ├─vg0-root                                                                                        253:1    0  74.5G  0 lvm   /
    └─vg0-srv                                                                                         253:2    0 281.9G  0 lvm   /srv

Should be replaced by:

sdi                                                                                                     8:128  0 447.1G  0 disk  
└─ceph--3bd6a2a5--a480--4053--a42c--d93ea5f3a87c-osd--block--c0123ead--e1dd--409b--9478--b93401ac8bd9 253:9    0 447.1G  0 lvm
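The problem above is that the installer built the OS RAID on a 3.5T data disk instead of one of the two ~447G SSDs; the partman fix that followed targets the two smallest disks. A rough sketch of that selection logic (the sample sizes below are illustrative stand-ins for `lsblk -dn -b -o NAME,SIZE` output, not data from this host):

```shell
# Pick the two smallest whole disks for the OS SW RAID mirror;
# the remaining (larger) disks stay unpartitioned for Ceph OSDs.
# Sample input mimics `lsblk -dn -b -o NAME,SIZE` (bytes, illustrative).
lsblk_sample='sda 480103981056
sdb 3840755982336
sdc 3840755982336
sdi 480103981056'

# Sort numerically by size and keep the two smallest devices.
echo "$lsblk_sample" | sort -k2,2n | head -n2 | awk '{print $1}'
```

This mirrors the intent of the partman recipe change: select OS disks by size rather than by device name, so a data disk enumerated early (like sdb here) can no longer end up in the mirror.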

Change #1060402 had a related patch set uploaded (by David Caro; author: David Caro):

[operations/puppet@production] parted: add a recipe to autouse the two smaller disks

https://gerrit.wikimedia.org/r/1060402

Change #1060402 merged by David Caro:

[operations/puppet@production] partman: use the same recipe for cloudcephosd than cephosd

https://gerrit.wikimedia.org/r/1060402

Cookbook cookbooks.sre.hosts.reimage was started by dcaro@cumin1002 for host cloudcephosd1037.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by dcaro@cumin1002 for host cloudcephosd1037.eqiad.wmnet with OS bullseye completed:

  • cloudcephosd1037 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202408071328_dcaro_712402_cloudcephosd1037.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Change #1060450 had a related patch set uploaded (by David Caro; author: David Caro):

[operations/puppet@production] cloudcephosd: use the new partitions on the new hosts

https://gerrit.wikimedia.org/r/1060450

Mentioned in SAL (#wikimedia-cloud-feed) [2024-08-07T14:07:39Z] <wmbot~dcaro@urcuchillay> START - Cookbook wmcs.ceph.osd.bootstrap_and_add (T363344)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-08-07T14:18:30Z] <wmbot~dcaro@urcuchillay> END (FAIL) - Cookbook wmcs.ceph.osd.bootstrap_and_add (exit_code=99) (T363344)

Change #1060450 merged by Andrew Bogott:

[operations/puppet@production] cloudcephosd: use the new partitions on the new hosts

https://gerrit.wikimedia.org/r/1060450

Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1002 for host cloudcephosd1038.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1002 for host cloudcephosd1038.eqiad.wmnet with OS bullseye executed with errors:

  • cloudcephosd1038 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console cloudcephosd1038.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1002 for host cloudcephosd1038.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1002 for host cloudcephosd1038.eqiad.wmnet with OS bullseye executed with errors:

  • cloudcephosd1038 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202408071557_andrew_732967_cloudcephosd1038.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console cloudcephosd1038.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1002 for host cloudcephosd1038.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1002 for host cloudcephosd1038.eqiad.wmnet with OS bullseye executed with errors:

  • cloudcephosd1038 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202408071814_andrew_750249_cloudcephosd1038.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console cloudcephosd1038.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Mentioned in SAL (#wikimedia-cloud-feed) [2024-08-12T11:51:12Z] <dcaro@cloudcumin1001> START - Cookbook wmcs.ceph.osd.undrain_node (T363344)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-08-12T11:51:18Z] <dcaro@cloudcumin1001> END (ERROR) - Cookbook wmcs.ceph.osd.undrain_node (exit_code=97) (T363344)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-08-12T11:51:21Z] <dcaro@cloudcumin1001> START - Cookbook wmcs.ceph.osd.undrain_node (T363344)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-08-12T11:51:29Z] <dcaro@cloudcumin1001> END (ERROR) - Cookbook wmcs.ceph.osd.undrain_node (exit_code=97) (T363344)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-08-12T11:51:51Z] <dcaro@cloudcumin1001> START - Cookbook wmcs.ceph.osd.undrain_node (T363344)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-08-12T15:44:11Z] <dcaro@cloudcumin1001> END (ERROR) - Cookbook wmcs.ceph.osd.undrain_node (exit_code=97) (T363344)

@cmooney can we get cloudcephosd1036 set up now that the switch work is done?

(meanwhile I am draining and rebuilding cloudcephosd1035 because it was built with improper drive assignments.)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-08-16T03:45:26Z] <andrew@cloudcumin1001> START - Cookbook wmcs.ceph.osd.drain_node (T363344)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-08-16T03:50:41Z] <andrew@cloudcumin1001> START - Cookbook wmcs.ceph.osd.drain_node (T363344)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-08-16T03:51:27Z] <andrew@cloudcumin1001> END (PASS) - Cookbook wmcs.ceph.osd.drain_node (exit_code=0) (T363344)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-08-16T03:52:54Z] <andrew@cloudcumin1001> START - Cookbook wmcs.ceph.osd.drain_node (T363344)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-08-16T03:53:41Z] <andrew@cloudcumin1001> END (PASS) - Cookbook wmcs.ceph.osd.drain_node (exit_code=0) (T363344)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-08-16T03:57:51Z] <andrew@cloudcumin1001> START - Cookbook wmcs.ceph.osd.drain_node (T363344)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-08-16T03:58:37Z] <andrew@cloudcumin1001> END (PASS) - Cookbook wmcs.ceph.osd.drain_node (exit_code=0) (T363344)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-08-16T04:25:33Z] <andrew@cloudcumin1001> START - Cookbook wmcs.ceph.osd.drain_node (T363344)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-08-16T04:26:20Z] <andrew@cloudcumin1001> END (PASS) - Cookbook wmcs.ceph.osd.drain_node (exit_code=0) (T363344)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-08-16T04:32:33Z] <andrew@cloudcumin1001> START - Cookbook wmcs.ceph.osd.drain_node (T363344)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-08-16T04:33:19Z] <andrew@cloudcumin1001> END (PASS) - Cookbook wmcs.ceph.osd.drain_node (exit_code=0) (T363344)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-08-16T04:42:00Z] <andrew@cloudcumin1001> START - Cookbook wmcs.ceph.osd.drain_node (T363344)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-08-16T04:42:47Z] <andrew@cloudcumin1001> END (PASS) - Cookbook wmcs.ceph.osd.drain_node (exit_code=0) (T363344)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-08-16T04:45:54Z] <andrew@cloudcumin1001> START - Cookbook wmcs.ceph.osd.drain_node (T363344)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-08-16T04:46:41Z] <andrew@cloudcumin1001> END (PASS) - Cookbook wmcs.ceph.osd.drain_node (exit_code=0) (T363344)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-08-16T05:40:30Z] <andrew@cloudcumin1001> END (PASS) - Cookbook wmcs.ceph.osd.drain_node (exit_code=0) (T363344)

Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1002 for host cloudcephosd1035.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1002 for host cloudcephosd1035.eqiad.wmnet with OS bullseye completed:

  • cloudcephosd1035 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202408161348_andrew_2376324_cloudcephosd1035.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-cloud-feed) [2024-08-16T14:28:01Z] <andrew@cloudcumin1001> END (FAIL) - Cookbook wmcs.ceph.osd.bootstrap_and_add (exit_code=99) (T363344)

@cmooney says about cloudcephosd1036:

there is no SFP in port 21 on cloudsw1-d5-eqiad, however, so maybe check with dc-ops on that

Sounds like a cable needs a wiggle

Plugged the port in and also reseated the management cable.

Change #1063861 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] Put cloudcephosd1036 into service

https://gerrit.wikimedia.org/r/1063861

Verified cable and link lights.

Change #1063861 merged by Andrew Bogott:

[operations/puppet@production] Put cloudcephosd1036 into service

https://gerrit.wikimedia.org/r/1063861