
mw2420-mw2451 do have unnecessary raid controllers (configured)
Closed, ResolvedPublic

Description

While working on T357380: Degraded RAID on mw2442 I noticed a strange-looking RAID controller config (for each disk there is one "RAID-0" virtual drive created, which is then used to build an mdadm software RAID), which turns out to be the same on all 32 hosts of that batch.

According to the procurement task T325215, those systems (Config C-1G) should not have RAID controllers at all, so I assume something went wrong during procurement as well as during provisioning, since T326362 does not request a HW-RAID config.

This is not ideal: it makes those 32 hosts different from the others, and it requires extra care/extra steps in case of disk replacements (see T357380#9575876). We should probably re-provision those hosts with the RAID controllers configured in HBA mode, or have the RAID controllers removed (if that's even possible and it makes sense to keep spares).

Opening this task to figure out what to do, or at least to keep the information around: the following steps are probably needed after a disk replacement to make the new disk appear in the OS:

# List any preserved cache entries left behind by the failed disk
megacli -GetPreservedCacheList -a0
# Discard the preserved cache for the replaced virtual drive
megacli -DiscardPreservedCache -L'disk_number' -a0
# Re-create a one-disk RAID-0 virtual drive on each unconfigured disk
megacli -CfgEachDskRaid0 WB RA Direct CachedBadBBU -a0
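The steps above can be wrapped in a small helper. This is only a sketch, not an existing script: the megacli invocations are taken verbatim from the commands above, `fix_replaced_disk` is a hypothetical name, and the `DRY_RUN` switch only prints the commands so the sketch can be exercised without the hardware.

```shell
# Sketch of the per-disk replacement steps. Assumes megacli is on PATH;
# with DRY_RUN=1 the commands are only printed, not executed.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$@"
  else
    "$@"
  fi
}

fix_replaced_disk() {
  ld="$1"  # number of the virtual drive backed by the replaced disk
  run megacli -GetPreservedCacheList -a0
  run megacli -DiscardPreservedCache -L"$ld" -a0
  run megacli -CfgEachDskRaid0 WB RA Direct CachedBadBBU -a0
}

# Dry run against virtual drive 3 (prints the three commands):
DRY_RUN=1 fix_replaced_disk 3
```

On a real host you would run it without `DRY_RUN` and with the virtual-drive number reported by the controller for the replaced disk.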

GUI procedure to switch HW-RAID controller to Enhanced HBA

  • Connect to the iDRAC HTTP interface through an SSH tunnel
  • Go to "Storage", "Virtual disks"
  • Click the dropdown for the first virtual disk, choose "Delete", then "Apply later"
  • Repeat for the second virtual disk, but choose "Apply now", then "Job queue"
  • Once you see the task as complete in the Job queue, go to "Storage", "Controllers"
  • Click the controller dropdown, choose "Edit"
  • Change "Controller Mode" to "Enhanced HBA"
  • Click "Add to pending", "At next reboot"
  • Perform a warm reboot from the UI
  • Run sudo cookbook sre.hosts.provision --no-dhcp --no-switch --no-users myhost
  • When asked if the RAID was modified, type modified and proceed
  • You can monitor the state of the RAID configuration through the job queue
  • The cookbook will probably error because the config change takes a while; just wait until the configuration step completes, then type retry
  • You can now rename and reimage the server
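The wait-then-retry step at the end of the procedure amounts to a bounded polling loop. The sketch below is generic, not part of any cookbook: `job_is_complete` is a hypothetical stand-in for checking the iDRAC job queue (here stubbed to succeed on the third poll so the loop can be exercised).

```shell
# Generic wait-then-retry sketch for "the config change takes a while".
# job_is_complete is a hypothetical stand-in for a real job-queue check;
# here it is stubbed to report completion on the third poll.
POLLS=0
job_is_complete() {
  POLLS=$((POLLS + 1))
  [ "$POLLS" -ge 3 ]
}

wait_for_job() {
  tries=0
  until job_is_complete; do
    tries=$((tries + 1))
    if [ "$tries" -ge "${MAX_TRIES:-30}" ]; then
      echo "job did not complete in time" >&2
      return 1
    fi
    sleep "${POLL_INTERVAL:-1}"
  done
  echo "job complete after $POLLS poll(s)"
}

wait_for_job
```

With a real check in place of the stub, `MAX_TRIES` and `POLL_INTERVAL` would be tuned to how long the controller-mode change actually takes.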

Hosts with HW-RAID controller left to switch to Enhanced HBA

  • mw2422 (now wikikube-worker2074)
  • mw2423 (now wikikube-worker2075)
  • mw2420 (now wikikube-worker2091)
  • mw2421 (now wikikube-worker2092)
  • mw2424 (now wikikube-worker2124)
  • mw2425 (now wikikube-worker2125)
  • mw2426 (now wikikube-worker2126)
  • mw2427 (now wikikube-worker2127)
  • mw2430 (now wikikube-worker2103)
  • mw2431 (now wikikube-worker2104)
  • mw2428 (now wikikube-worker2105)
  • mw2429 (now wikikube-worker2106)
  • mw2432 (now wikikube-worker2035)
  • mw2433 (now wikikube-worker2036)
  • mw2434 (now wikikube-worker2089)
  • mw2435 (now wikikube-worker2090)
  • mw2436
  • mw2437
  • mw2438 (now wikikube-worker2037)
  • mw2439 (now wikikube-worker2038)
  • mw2440
  • mw2441 (now wikikube-worker2039)
  • mw2442
  • mw2443
  • mw2444
  • mw2445
  • mw2446
  • mw2447
  • mw2448
  • mw2449
  • mw2450
  • mw2451

Event Timeline

JMeybohm renamed this task from mw2420-mw2451 do have unncecesarry raid controllers (configured to mw2420-mw2451 do have unncecesarry raid controllers (configured).

If you do decide you might want to reprovision these nodes as non-RAID, there is a sre.swift.convert-disks cookbook that does most of the heavy lifting (though you'd probably need to relax the host restriction a bit).

Moritz asked me about this, and I have some background. Orders placed in January 2023 via the Dell portal for standard configs included a number of hosts with RAID controllers that should not have had them.

We did not 'pay' for the RAID controllers, as we had the set Config C per-unit price, but the config was mistakenly built with a RAID controller. This has since been fixed (it was discovered when the first batch of orders landed) but requires a workaround on those hosts.

The workaround: set up each disk as its own one-disk RAID-0, and the host then operates like a RAID-less system in terms of OS partitioning and the like. We've since modified our ordering process to ensure this doesn't happen again.

I'm told there is a question on 'can we pull these raid controllers to use elsewhere' and the answer is 'no, or the host you remove it from has no controller.'

These are wired with cables from the backplane to the RAID controller; those are not the custom-length cables that would be needed to route within the chassis to an onboard SATA controller, which likely isn't even present in these hosts. I would not recommend pulling hardware from one R440 and installing it into another host, as it would break the warranty for both.

It could be possible, but it would require someone to take the time to offline a host from this small batch of affected hosts and see if the cables can reach and if an onboard controller is even present. As this was a small one-off event of Config C having RAID, I'd recommend we just leave it as is.

Please note that my understanding could be wrong; we may want to create a task for on-site to pull one of these hosts and double-check the above.

We could verify that while working on T351074: Move servers from the appserver/api cluster to kubernetes (although most of the servers have unfortunately already been moved to k8s).

We'll first try with one of the servers not yet migrated to k8s, to see if it has any performance implications, and then phase out the RAID config when the servers need reimaging for the name change anyway.

JMeybohm renamed this task from mw2420-mw2451 do have unncecesarry raid controllers (configured) to mw2420-mw2451 do have unnecessary raid controllers (configured). Feb 28 2024, 5:49 PM

@JMeybohm hello is there anything DC-ops need to do on this task?

No, all good from your end. Cheers

Change #1054531 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/cookbooks@master] sre.hosts.convert-disk: Generalize sre.swift.convert-disks

https://gerrit.wikimedia.org/r/1054531

Icinga downtime and Alertmanager silence (ID=db2972bf-cd24-4ee8-ba43-a5d1d6710956) set by cgoubert@cumin1002 for 7 days, 0:00:00 on 1 host(s) and their services with reason: RAID conversion testing

mw2432.codfw.wmnet

Change #1055237 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] kubernetes: rename 4 appservers to k8s workers

https://gerrit.wikimedia.org/r/1055237

I *tried* very hard to automate it with a cookbook, but the behavior is wildly inconsistent between runs: sometimes deleting vdisks requires a reboot, sometimes not; sometimes the vdisk deletion starts, only to fail without an explanation.

These PERC H745 controllers behave differently from the integrated controllers we usually have, which means Dell's own action URLs like ConvertToNonRAID (used by the sre.swift.convert-disks cookbook) don't work either.

It is *mostly* the virtual disk removal automation that poses problems. Using redfish.scp_dump to get the RAID configuration and switch it to Enhanced HBA seems to work relatively reliably, but it is slow and not worth automating if it's the only part we can script while still having to do the vdisk removal in the GUI.

At this point, I am modifying the cookbook every run to try and get it to work, and I probably could have configured all 30-something hosts in this task through the GUI in the time I've spent on this.

The GUI procedure is documented in the task description above.

Change #1055237 merged by Clément Goubert:

[operations/puppet@production] kubernetes: rename 4 appservers to k8s workers

https://gerrit.wikimedia.org/r/1055237

Change #1057829 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] kubernetes: reimage 1 appserver to kubernetes

https://gerrit.wikimedia.org/r/1057829

Change #1057829 merged by Clément Goubert:

[operations/puppet@production] kubernetes: reimage 1 appserver to kubernetes

https://gerrit.wikimedia.org/r/1057829

Clement_Goubert updated the task description. (Show Details)

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host wikikube-worker2092.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker2092.codfw.wmnet with OS bullseye completed:

  • wikikube-worker2092 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202410111058_cgoubert_3576371_wikikube-worker2092.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
Clement_Goubert changed the task status from Open to In Progress. Oct 11 2024, 11:28 AM
Clement_Goubert updated the task description. (Show Details)
JMeybohm claimed this task.
JMeybohm updated the task description. (Show Details)

Well, that was a pretty painful experience - thanks @Clement_Goubert for working out the procedure!