
Install new disk controllers to SM swift backends (eqiad)
Closed, Resolved, Public

Description

In T393941, we acquired 9 JBOD controllers to retrofit into the SM swift backends in eqiad. This task tracks their installation. Each node must be entirely removed from the swift rings first (and then put back with the new device locations), which will require co-ordination.

Please only replace the controller in a node marked READY in the list below.

  • ms-be1083 (done)
  • ms-be1084 (done)
  • ms-be1085 (done)
  • ms-be1086 (done)
  • ms-be1087 (done)
  • ms-be1088 (done) Return to service delayed for testing re T404356
  • ms-be1089 (done)
  • ms-be1090 (done)
  • ms-be1091 (done)
  • thanos-be1005 (done)

Event Timeline


Change #1183627 had a related patch set uploaded (by MVernon; author: MVernon):

[operations/puppet@production] swift: remove 3 drained eqiad nodes for disk controller swap

https://gerrit.wikimedia.org/r/1183627

Change #1183628 had a related patch set uploaded (by MVernon; author: MVernon):

[operations/puppet@production] swift: re-add 3 nodes, drain the next 3

https://gerrit.wikimedia.org/r/1183628

Change #1183627 merged by MVernon:

[operations/puppet@production] swift: remove 3 drained eqiad nodes for disk controller swap

https://gerrit.wikimedia.org/r/1183627

Icinga downtime and Alertmanager silence (ID=5d9cb26e-171b-4940-aeef-3b79dd0f568e) set by mvernon@cumin2002 for 2 days, 0:00:00 on 3 host(s) and their services with reason: awaiting controller swap

ms-be[1083-1085].eqiad.wmnet
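
The downtime entries in this task name hosts with a bracketed range such as `ms-be[1083-1085].eqiad.wmnet`. As a reading aid, here is a minimal Python sketch of how one such range expands into individual host names. The function name `expand_hosts` is hypothetical, and the sketch handles only a single `[START-END]` range; it is not the tooling that produced the log lines.

```python
import re


def expand_hosts(expr: str) -> list[str]:
    """Expand one 'prefix[START-END]suffix' range into individual host names.

    Illustrative helper only: supports a single numeric range and returns
    the input unchanged (as a one-element list) when no range is present.
    """
    match = re.fullmatch(r"(.*)\[(\d+)-(\d+)\](.*)", expr)
    if match is None:
        return [expr]
    prefix, start, end, suffix = match.groups()
    width = len(start)  # preserve zero padding, if the range uses any
    return [
        f"{prefix}{number:0{width}d}{suffix}"
        for number in range(int(start), int(end) + 1)
    ]


if __name__ == "__main__":
    for host in expand_hosts("ms-be[1083-1085].eqiad.wmnet"):
        print(host)
```

For the range above this yields ms-be1083, ms-be1084 and ms-be1085, matching the three nodes named in the following comment.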

@VRiley-WMF three nodes - ms-be1083 ms-be1084 ms-be1085 are now ready for disk swaps, as soon as you've some time, please. I've downtimed them for 2 days.

VRiley-WMF changed the task status from Open to In Progress. Sep 1 2025, 7:04 PM

ms-be1083 has been completed. Moving on to ms-be1084.

ms-be1084 completed. Moving on to ms-be1085.

VRiley-WMF changed the task status from In Progress to Open. Sep 1 2025, 10:40 PM
VRiley-WMF updated the task description.

ms-be1085 is completed

Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin1003 for host ms-be1083.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin1003 for host ms-be1084.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin1003 for host ms-be1085.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin1003 for host ms-be1083.eqiad.wmnet with OS bullseye executed with errors:

  • ms-be1083 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console ms-be1083.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin1003 for host ms-be1083.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin1003 for host ms-be1084.eqiad.wmnet with OS bullseye completed:

  • ms-be1084 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202509020914_mvernon_647760_ms-be1084.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin1003 for host ms-be1085.eqiad.wmnet with OS bullseye completed:

  • ms-be1085 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202509020917_mvernon_648375_ms-be1085.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin1003 for host ms-be1083.eqiad.wmnet with OS bullseye completed:

  • ms-be1083 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202509021019_mvernon_655251_ms-be1083.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Change #1183628 merged by MVernon:

[operations/puppet@production] swift: re-add 3 nodes, drain the next 3

https://gerrit.wikimedia.org/r/1183628

Hey @MatthewVernon I wanted to check back in with this ticket and see if any of these are available to commence with the swap. No rush, just wanted to check. Thanks!

Change #1190673 had a related patch set uploaded (by MVernon; author: MVernon):

[operations/puppet@production] swift: remove 3 drained eqiad nodes for disk controller swap

https://gerrit.wikimedia.org/r/1190673

Change #1190674 had a related patch set uploaded (by MVernon; author: MVernon):

[operations/puppet@production] swift: re-add 3 nodes, drain the final 2

https://gerrit.wikimedia.org/r/1190674

Change #1190673 merged by MVernon:

[operations/puppet@production] swift: remove 3 drained eqiad nodes for disk controller swap

https://gerrit.wikimedia.org/r/1190673

Icinga downtime and Alertmanager silence (ID=08fe2c84-09f5-45bc-a5f4-ae2812070ff6) set by mvernon@cumin2002 for 2 days, 0:00:00 on 3 host(s) and their services with reason: awaiting controller swap

ms-be[1086-1088].eqiad.wmnet

Hi @VRiley-WMF, yes there are - 3 nodes are now ready ms-be1086, ms-be1087, ms-be1088. Please swap when you've some time :)

Apologies for the delayed response, I was on leave last week.

Hi @VRiley-WMF do you think you'll be able to do these swaps this week, please?

Icinga downtime and Alertmanager silence (ID=f05e1660-c13c-4689-a96d-eaccf6967088) set by mvernon@cumin2002 for 4 days, 0:00:00 on 3 host(s) and their services with reason: awaiting controller swap

ms-be[1086-1088].eqiad.wmnet

Hey @MatthewVernon Yes, I am planning on doing this today. I apologize as I was out for two days last week.

VRiley-WMF changed the task status from Open to In Progress. Sep 29 2025, 5:16 PM

Starting work on ms-be1087 (will get to ms-be1086 in a bit; starting with the cage I'm currently in).

Finished updating ms-be1087, moving on to ms-be1088.

VRiley-WMF changed the task status from In Progress to Open. Sep 29 2025, 7:50 PM
VRiley-WMF updated the task description.

These are all done! Will wait for the next two. Thanks @MatthewVernon

Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-be1086.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-be1086.eqiad.wmnet with OS bullseye completed:

  • ms-be1086 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202509300909_mvernon_3766436_ms-be1086.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-be1087.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-be1087.eqiad.wmnet with OS bullseye completed:

  • ms-be1087 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202509301044_mvernon_3817060_ms-be1087.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Change #1190674 merged by MVernon:

[operations/puppet@production] swift: re-add 2 nodes, drain the final 2, leave 1 for testing

https://gerrit.wikimedia.org/r/1190674

Hey @MatthewVernon Just wanted to check in and see if the other two may be ready? Let us know, thanks!

Hi @VRiley-WMF I'm afraid not (filesystems still about 25% full, so a little way to go yet).
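
The readiness test implied here is that a node's swift filesystems must be (nearly) empty before its controller can be swapped. A minimal sketch of such a check follows, assuming hypothetical mount points and a hypothetical threshold; neither comes from this task, and this is not the check actually used. It relies only on `shutil.disk_usage` from the Python standard library.

```python
import shutil


def percent_used(total: int, used: int) -> float:
    """Return used space as a percentage of total (0.0 for an empty total)."""
    return 100.0 * used / total if total else 0.0


def undrained(mounts: list[str], threshold: float = 1.0) -> dict[str, float]:
    """Map each mount point that is still above `threshold` percent used
    to its usage percentage.

    `mounts` and `threshold` are illustrative assumptions: a freshly emptied
    filesystem may still report some usage for metadata, so pick a threshold
    that suits the filesystem in question.
    """
    still_full = {}
    for mount in mounts:
        usage = shutil.disk_usage(mount)
        pct = percent_used(usage.total, usage.used)
        if pct > threshold:
            still_full[mount] = pct
    return still_full


if __name__ == "__main__":
    # "/" stands in for the real (unspecified) swift device mount points.
    for mount, pct in undrained(["/"]).items():
        print(f"{mount}: {pct:.1f}% used, not yet drained")
```

A filesystem at about 25% full, as reported above, would be listed as not yet drained under any small threshold.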

Change #1198005 had a related patch set uploaded (by MVernon; author: MVernon):

[operations/puppet@production] swift: remove ms-be10{89,90} for controller swap

https://gerrit.wikimedia.org/r/1198005

Change #1198005 merged by MVernon:

[operations/puppet@production] swift: remove ms-be10{89,90} for controller swap

https://gerrit.wikimedia.org/r/1198005

Icinga downtime and Alertmanager silence (ID=cea00150-47a1-46ce-a142-ec46d9e47678) set by mvernon@cumin1003 for 3 days, 0:00:00 on 2 host(s) and their services with reason: awaiting controller swap

ms-be[1089-1090].eqiad.wmnet

@VRiley-WMF the last two nodes ms-be1089 and ms-be1090 are ready for controller swap, please; I've downtimed them for a couple of days.

VRiley-WMF changed the task status from Open to In Progress. Oct 22 2025, 7:38 PM

Starting on ms-be1089

ms-be1089 is completed, moving on to the next server, ms-be1090.

I'm having issues bringing ms-be1090 back up. Will continue to work on this

Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin1003 for host ms-be1089.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin1003 for host ms-be1089.eqiad.wmnet with OS bullseye completed:

  • ms-be1089 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202510231015_mvernon_3273107_ms-be1089.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Created a ticket for ms-be1090 for Supermicro to assist

Case #00061744

Attempted to swap the unit and it wouldn't power back on.
Swapped it back out with the old one, and it still won't power on.
Checked the cables and reseated them with no change.
When looking at the unit, it seems to not detect the PSU.
Support came back with the following suggestions; my response to each is noted after it.

  1. Ensure that all cables, including power and SAS cables from drives to the backplane, are properly connected. - This has been completed with no change.
  2. Check for the power button and ribbon cable issues. - This has been completed with no change.
  3. Please remove the Controller and try to power on the server. - This has been completed, no change.
  4. Please share the server health event log and sensor log. - I have attached it to the email.

Awaiting their response.

Removed all but one RAM module from the unit to see if it would boot, and it booted normally. I'm adding the remaining modules back one at a time to find out which one is causing the issue.

After reseating the RAM, it seems like everything has come back up and it's showing a healthy status. @MatthewVernon Can you please verify if everything looks good on your end for ms-be1090?

@VRiley-WMF the host is up, but it can't reach any of its spinning disks (the OS sees none, and the BMC says 0 physical disks). Could you take another look, please?

Of course, I'm looking into this now.

Hey @MatthewVernon I apologize about that. It seems the cable slipped out of the card while I was trying to diagnose the issue. It's been reseated. Please check it when possible

Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin1003 for host ms-be1090.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin1003 for host ms-be1090.eqiad.wmnet with OS bullseye completed:

  • ms-be1090 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202510310917_mvernon_1484055_ms-be1090.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Change #1200288 had a related patch set uploaded (by MVernon; author: MVernon):

[operations/puppet@production] Return ms-be10{89,90} to the rings

https://gerrit.wikimedia.org/r/1200288

MatthewVernon updated the task description.

@VRiley-WMF looks good now, thanks!

Change #1200288 merged by MVernon:

[operations/puppet@production] Return ms-be10{89,90} to the rings

https://gerrit.wikimedia.org/r/1200288