
Poweredge R730xd, R740xd, R740xd2 SSDs not visible to OS as SSDs
Open, Medium, Public

Description

Related to T308644 & T308677 (unstable ordering of drives causing installer and puppet problems), which have been delaying T279637 (the bullseye upgrade of the ms-* swift clusters), we thought it might be better to use /sys/block/DEVICE/queue/rotational (and the equivalent puppet fact disk_type) to identify the SSDs. Unfortunately, this can't currently be done, because the Dell PowerEdge R7{3,4}0* systems fail to tell the OS which disks are SSDs:

mvernon@cumin2002:~$ sudo cumin 'ms-be*' 'facter dmi.product.name ; grep -l 0 /sys/block/*/queue/rotational || true'
84 hosts will be targeted:
ms-be[2028-2069].codfw.wmnet,ms-be[1028-1033,1035-1058,1060-1071].eqiad.wmnet
Ok to proceed on 84 hosts? Enter the number of affected hosts to confirm or "q" to quit 84
===== NODE GROUP =====                                                                                                 
(14) ms-be[2051-2056].codfw.wmnet,ms-be[1051-1058].eqiad.wmnet                                                         
----- OUTPUT of 'facter dmi.produ...tational || true' -----                                                            
ProLiant DL380 Gen10                                                                                                   
/sys/block/md0/queue/rotational                                                                                        
/sys/block/md1/queue/rotational
/sys/block/sda/queue/rotational
/sys/block/sdb/queue/rotational
===== NODE GROUP =====                                                                                                 
(25) ms-be[2057-2069].codfw.wmnet,ms-be[1060-1071].eqiad.wmnet                                                         
----- OUTPUT of 'facter dmi.produ...tational || true' -----                                                            
PowerEdge R740xd2                                                                                                      
===== NODE GROUP =====                                                                                                 
(14) ms-be[2044-2050].codfw.wmnet,ms-be[1044-1050].eqiad.wmnet                                                         
----- OUTPUT of 'facter dmi.produ...tational || true' -----                                                            
PowerEdge R740xd                                                                                                       
===== NODE GROUP =====                                                                                                 
(8) ms-be[2040-2043].codfw.wmnet,ms-be[1040-1043].eqiad.wmnet                                                          
----- OUTPUT of 'facter dmi.produ...tational || true' -----                                                            
PowerEdge R730xd                                                                                                       
===== NODE GROUP =====                                                                                                 
(23) ms-be[2028-2039].codfw.wmnet,ms-be[1028-1033,1035-1039].eqiad.wmnet                                               
----- OUTPUT of 'facter dmi.produ...tational || true' -----                                                            
ProLiant DL380 Gen9                                                                                                    
/sys/block/md0/queue/rotational                                                                                        
/sys/block/md1/queue/rotational
/sys/block/sda/queue/rotational
/sys/block/sdb/queue/rotational
================
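
For context, here is what an individual affected host looks like: every block device claims to be rotational, including the SSDs. A quick per-host check (illustrative only; device names vary) would be something like:

# On an affected PowerEdge host, ROTA is 1 for every disk, SSDs included:
lsblk -d -o NAME,ROTA,MODEL
# The same information straight from sysfs (this is what the disk_type fact
# and the cumin check above are based on):
grep . /sys/block/sd*/queue/rotational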

To check whether this was a swift-specific problem, here is the same query across all the hardware of these types at WMF:

mvernon@cumin2002:~$ sudo cumin 'F:boardproductname = 072T6D or F:boardproductname = 0C2PJH or F:boardproductname = 01KPX8 or F:boardproductname = 0VNGN1' 'facter dmi.product.name ; grep -l 0 /sys/block/*/queue/rotational || true'
111 hosts will be targeted:
an-presto[1001-1005].eqiad.wmnet,an-worker[1078-1095].eqiad.wmnet,analytics[1058-1077].eqiad.wmnet,backup[2003-2008].codfw.wmnet,backup[1003-1008].eqiad.wmnet,cloudstore[1008-1009].wikimedia.org,dumpsdata[1001-1002].eqiad.wmnet,kafka-jumbo[1001-1006].eqiad.wmnet,ms-be[2040-2046,2048-2050,2057-2069].codfw.wmnet,ms-be[1040-1045,1047-1050,1060-1071].eqiad.wmnet,stat1005.eqiad.wmnet
Ok to proceed on 111 hosts? Enter the number of affected hosts to confirm or "q" to quit 111
===== NODE GROUP =====                                                                                                 
(1) stat1005.eqiad.wmnet                                                                                               
----- OUTPUT of 'facter dmi.produ...tational || true' -----                                                            
PowerEdge R730                                                                                                         
===== NODE GROUP =====                                                                                                 
(12) ms-be[2044-2046,2048-2050].codfw.wmnet,ms-be[1044-1045,1047-1050].eqiad.wmnet                                     
----- OUTPUT of 'facter dmi.produ...tational || true' -----                                                            
PowerEdge R740xd                                                                                                       
===== NODE GROUP =====                                                                                                 
(27) backup2008.codfw.wmnet,backup1008.eqiad.wmnet,ms-be[2057-2069].codfw.wmnet,ms-be[1060-1071].eqiad.wmnet           
----- OUTPUT of 'facter dmi.produ...tational || true' -----                                                            
PowerEdge R740xd2                                                                                                      
===== NODE GROUP =====                                                                                                 
(1) backup1007.eqiad.wmnet                                                                                             
----- OUTPUT of 'facter dmi.produ...tational || true' -----                                                            
PowerEdge R740xd2                                                                                                      
/sys/block/dm-0/queue/rotational                                                                                       
/sys/block/md0/queue/rotational
/sys/block/md1/queue/rotational
/sys/block/md2/queue/rotational
/sys/block/sdb/queue/rotational
/sys/block/sdc/queue/rotational
===== NODE GROUP =====                                                                                                 
(9) backup[2003-2007].codfw.wmnet,backup[1003-1006].eqiad.wmnet                                                        
----- OUTPUT of 'facter dmi.produ...tational || true' -----                                                            
PowerEdge R740xd2                                                                                                      
/sys/block/dm-0/queue/rotational                                                                                       
/sys/block/md0/queue/rotational
/sys/block/md1/queue/rotational
/sys/block/md2/queue/rotational
/sys/block/sda/queue/rotational
/sys/block/sdb/queue/rotational
===== NODE GROUP =====                                                                                                 
(61) an-presto[1001-1005].eqiad.wmnet,an-worker[1078-1095].eqiad.wmnet,analytics[1058-1077].eqiad.wmnet,cloudstore[1008-1009].wikimedia.org,dumpsdata[1001-1002].eqiad.wmnet,kafka-jumbo[1001-1006].eqiad.wmnet,ms-be[2040-2043].codfw.wmnet,ms-be[1040-1043].eqiad.wmnet
----- OUTPUT of 'facter dmi.produ...tational || true' -----                                                            
PowerEdge R730xd                                                                                                       
================

Most backup hosts are seen to have their SSDs correctly described to the OS. Inspecting the iDRAC for some of these systems shows that the SSDs are marked as non-RAID disks (rather than being used as single-member RAID-0 arrays); thus they are presented to the system as JBOD and so fully visible to the OS.
Indeed, the exceptional host backup2008 (where the SSDs are not marked as non-rotational) instead has them as single-member RAID-0 arrays. @jcrespo might be able to confirm whether this was an intended change of setup or not.

Some online docs (e.g. this RH article) suggest that for these RAID controllers the switch between RAID-0 and JBOD is lossless, which implies we could swap to that configuration for the ms-* systems, enabling us to distinguish SSDs from non-SSDs in the installer and puppet. We currently have some pre-production ms nodes of the right hardware (e.g. ms-be2069) which would let us test this theory, although it's not clear whether this migration could be done with any sort of automation or whether each host would need its settings updated by hand...

Alternatively, I don't know if udev or the relevant SCSI drivers could be bullied into understanding the drives correctly; I'm not clear enough on the details to know whether it's the RAID controller lying to the OS (since it knows the virtual disks are SSDs) or the kernel misunderstanding what it's told.
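
For completeness, the rotational flag can also be overridden from the OS side; this is the sort of udev bullying alluded to above, and is only a sketch (we did not adopt it, and the device match below is a placeholder, since a real rule would need a reliable way to pick out the SSDs):

# queue/rotational is writable by root, so the flag can be flipped at runtime
# on a device we know to be an SSD:
echo 0 > /sys/block/sda/queue/rotational

# To persist across reboots, a udev rule can set the same attribute when the
# device appears (KERNEL=="sda" is a placeholder match):
cat > /etc/udev/rules.d/60-ssd-rotational.rules <<'EOF'
ACTION=="add|change", KERNEL=="sda", ATTR{queue/rotational}="0"
EOF
udevadm control --reload-rules && udevadm trigger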

Event Timeline

Indeed, the exceptional host backup2008 (where the SSDs are not marked as non-rotational) instead has them as single-member RAID-0 arrays. @jcrespo might be able to confirm whether this was an intended change of setup or not.

Sorry, with the double negative and the lack of context, I am unable to understand this complexly-worded question.

backup2008 was intended to be exactly like the other (new) backup hosts - small SSDs for the OS and HDDs for storage - and its hardware setup was done by DC Ops. It is currently not fully set up, so it can be used for tests if helpful.

Sorry! I noticed that backup2008 is configured differently from all the other backup hosts (its SSDs are both single-member RAID-0 arrays, rather than non-RAID disks), and I was wondering whether this was intentional.
Given your comment, I think the answer is "no" :)

Host rebooted by mvernon@cumin2002 with reason: testing non-RAIDing SSDs

There are (at least!) 4 ways to configure the RAID controller - its own setup utility (hit ^r during boot), the general BIOS setup (F2 during boot), the web-iDRAC interface, and the megacli tool. AFAICT the system needs to be not running while this process is carried out (because / is on the relevant drives).

I found that using the RAID setup utility (Ctrl-R during the boot sequence) works; the process is:

  1. Reboot node
  2. Ctrl-r during boot sequence, to get into RAID controller setup
  3. The first screen is VD Mgmt; select in turn the two small disks (typically the first two), then F2 or Esc-2 for "Operations" and delete the VD (OK to confirm)
  4. Ctrl-n to move to the next screen, PD Mgmt
  5. Scroll down and select in turn the two SSDs (typically the last two), F2/Esc-2 for Operations, select "Convert to Non-RAID" (and OK to confirm)
  6. Ctrl-n to move to the next screen, Ctrl Mgmt
  7. Scroll/tab to "Select boot device", scroll down past the VDs to the first non-virtual disk, typically 00:02:24, and select that
  8. Tab to "Apply", hit Enter to apply
  9. Esc repeatedly to exit the setup system; when it says "Ctrl-Alt-Delete to reboot", do that (or simply serveraction powercycle from racadm)

This gets the SSDs to be non-RAID without harming their contents; it doesn't fix the problems with disk ordering in the installer or puppet, but if we could (semi-)automatically roll this out to all the backends, we could then look at relying on queue/rotational in future.
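
A quick way to verify the result once the host is back up (mirroring the checks above) is:

# After conversion the former RAID-0 members should appear as plain SCSI disks
# with rotational=0, and the md arrays built on them should show the same:
grep . /sys/block/sd*/queue/rotational /sys/block/md*/queue/rotational
lsblk -d -o NAME,ROTA,MODEL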

There is a 5th way, and it's via the Redfish API ;)
We do have basic support for the Redfish API in spicerack right now, and there is a plan to add support for RAID configuration, but that's not yet ready.
That said, if you have to perform this on a large number of hosts, it could still be worth looking at a one-off approach using what we already have in spicerack and doing the setup via Redfish.
Feel free to ping me offline if you want to explore this approach.
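
For reference, the standard Redfish schema already exposes the controller's own view of each drive (including MediaType), so it's easy to confirm which drives the controller knows are SSDs before attempting a conversion; the conversion itself relies on Dell-specific OEM extensions and isn't shown here. A rough sketch against the iDRAC, where the management hostname, controller FQDD and credentials are placeholders:

IDRAC=ms-be2066.mgmt.codfw.wmnet   # placeholder management address
CTRL=RAID.Integrated.1-1           # placeholder controller FQDD

# List the drives attached to the controller, then print each drive's ID and
# MediaType ("SSD" or "HDD") as reported by the controller itself:
for drive in $(curl -sk -u root:"$PASS" \
      "https://$IDRAC/redfish/v1/Systems/System.Embedded.1/Storage/$CTRL" \
      | jq -r '.Drives[]."@odata.id"'); do
  curl -sk -u root:"$PASS" "https://$IDRAC$drive" \
    | jq -r '[.Id, .MediaType, .CapacityBytes] | @tsv'
done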

Host rebooted by volans@cumin2002 with reason: Converted SSDs to non-RAID

I spoke with @MatthewVernon and he kindly gave me ms-be2066 (a pre-production host) to test the conversion of the SSD disks from RAID VDs to non-RAID disks via Redfish.

It seems pretty doable, but needs some fine-tuning.
I did try something today, but the results weren't what I expected. As a result ms-be2066 is currently unable to reboot; to prevent a reboot loop I've powered it off and downtimed it on Icinga/Alertmanager for 24h.

I'll dig into it more tomorrow morning to see what went wrong. Sorry for any trouble this causes.

ms-be2066 is back online. I've converted almost everything via Redfish in an automated way, but there is one remaining bit, setting the boot disk, for which I haven't yet found the correct way via Redfish. I'll update as soon as I have more news.

Change 801693 had a related patch set uploaded (by Volans; author: Volans):

[operations/cookbooks@master] sre.swift.convert-ssds: add new cookbook

https://gerrit.wikimedia.org/r/801693

Change 801693 merged by jenkins-bot:

[operations/cookbooks@master] sre.swift.convert-ssds: add new cookbook

https://gerrit.wikimedia.org/r/801693

Change 803238 had a related patch set uploaded (by Volans; author: Volans):

[operations/cookbooks@master] sre.swift.convert-ssds: fix typo

https://gerrit.wikimedia.org/r/803238

Change 803238 merged by jenkins-bot:

[operations/cookbooks@master] sre.swift.convert-ssds: fix typo

https://gerrit.wikimedia.org/r/803238

MatthewVernon claimed this task.

Remaining nodes done by hand during reboots for T310483:

mvernon@cumin1001:~$ sudo cumin 'ms-be*' 'cat /sys/block/md0/queue/rotational'
85 hosts will be targeted:
ms-be[2028-2069].codfw.wmnet,ms-be[1028-1033,1035-1071].eqiad.wmnet
Ok to proceed on 85 hosts? Enter the number of affected hosts to confirm or "q" to quit 85
===== NODE GROUP =====                                                                                 
(85) ms-be[2028-2069].codfw.wmnet,ms-be[1028-1033,1035-1071].eqiad.wmnet                               
----- OUTPUT of 'cat /sys/block/md0/queue/rotational' -----                                            
0                                                                                                      
================
jbond reopened this task as Open. Edited Jul 19 2022, 3:58 PM

Wanted to re-open this to understand the original problem, i.e. RAID controllers reporting disks as rotational instead of SSD. From my understanding of the task, we have fixed the swift-specific issue by reconfiguring the disks so that instead of being RAID-0 they are now configured as JBOD. However, this doesn't work for systems that actually make use of the RAID controller; see T313288, where the DB team would like to know whether they have any SSD disks.

It would be useful to confirm whether this issue affects all RAID configurations or just this specific one (all disks in their own RAID-0 set); one suspects it will affect all configurations, but we should confirm. Assuming this is a generic issue, we should try to work out why the RAID controller is lying to the kernel, or at the very least have the puppet fact indicate that the information is unreliable.
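
One way to gather evidence here (illustrative; the device node and megaraid index are placeholders) is to compare what the kernel believes about the virtual disk with what the physical disk behind the controller reports via smartctl's megaraid passthrough:

# What the kernel was told about the virtual disk:
cat /sys/block/sda/queue/rotational

# What the physical disk behind the controller says about itself; for an SSD
# the identify data reports "Solid State Device" as the rotation rate:
smartctl -d megaraid,0 -i /dev/sda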

jbond triaged this task as Medium priority.
jbond edited projects, added DC-Ops; removed SRE-swift-storage.