
Poweredge R730xd, R740xd, R740xd2 SSDs not visible to OS as SSDs
Open, Medium, Public

Description

Related to T308644 & T308677 (unstable ordering of drives causing installer and puppet problems), which have been delaying T279637 (the bullseye upgrade of the ms-* swift clusters), we thought it might be better to use /sys/block/DEVICE/queue/rotational (and the equivalent puppet fact disk_type) to identify the SSDs. Unfortunately, this can't currently be done, because the Dell PowerEdge R7{3,4}0* systems fail to tell the OS which disks are SSDs:

mvernon@cumin2002:~$ sudo cumin 'ms-be*' 'facter dmi.product.name ; grep -l 0 /sys/block/*/queue/rotational || true'
84 hosts will be targeted:
ms-be[2028-2069].codfw.wmnet,ms-be[1028-1033,1035-1058,1060-1071].eqiad.wmnet
Ok to proceed on 84 hosts? Enter the number of affected hosts to confirm or "q" to quit 84
===== NODE GROUP =====                                                                                                 
(14) ms-be[2051-2056].codfw.wmnet,ms-be[1051-1058].eqiad.wmnet                                                         
----- OUTPUT of 'facter dmi.produ...tational || true' -----                                                            
ProLiant DL380 Gen10                                                                                                   
/sys/block/md0/queue/rotational                                                                                        
/sys/block/md1/queue/rotational
/sys/block/sda/queue/rotational
/sys/block/sdb/queue/rotational
===== NODE GROUP =====                                                                                                 
(25) ms-be[2057-2069].codfw.wmnet,ms-be[1060-1071].eqiad.wmnet                                                         
----- OUTPUT of 'facter dmi.produ...tational || true' -----                                                            
PowerEdge R740xd2                                                                                                      
===== NODE GROUP =====                                                                                                 
(14) ms-be[2044-2050].codfw.wmnet,ms-be[1044-1050].eqiad.wmnet                                                         
----- OUTPUT of 'facter dmi.produ...tational || true' -----                                                            
PowerEdge R740xd                                                                                                       
===== NODE GROUP =====                                                                                                 
(8) ms-be[2040-2043].codfw.wmnet,ms-be[1040-1043].eqiad.wmnet                                                          
----- OUTPUT of 'facter dmi.produ...tational || true' -----                                                            
PowerEdge R730xd                                                                                                       
===== NODE GROUP =====                                                                                                 
(23) ms-be[2028-2039].codfw.wmnet,ms-be[1028-1033,1035-1039].eqiad.wmnet                                               
----- OUTPUT of 'facter dmi.produ...tational || true' -----                                                            
ProLiant DL380 Gen9                                                                                                    
/sys/block/md0/queue/rotational                                                                                        
/sys/block/md1/queue/rotational
/sys/block/sda/queue/rotational
/sys/block/sdb/queue/rotational
================
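
For context, here is what an individual affected host looks like: every block device claims to be rotational, including the SSDs. A quick per-host check (illustrative only; device names vary) would be something like:

# On an affected PowerEdge host, ROTA is 1 for every disk, SSDs included:
lsblk -d -o NAME,ROTA,MODEL
# The same information straight from sysfs (this is what the disk_type fact
# and the cumin check above are based on):
grep . /sys/block/sd*/queue/rotational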

To check whether this was a swift-specific problem, here is the same query across all the hardware of these types at WMF:

mvernon@cumin2002:~$ sudo cumin 'F:boardproductname = 072T6D or F:boardproductname = 0C2PJH or F:boardproductname = 01KPX8 or F:boardproductname = 0VNGN1' 'facter dmi.product.name ; grep -l 0 /sys/block/*/queue/rotational || true'
111 hosts will be targeted:
an-presto[1001-1005].eqiad.wmnet,an-worker[1078-1095].eqiad.wmnet,analytics[1058-1077].eqiad.wmnet,backup[2003-2008].codfw.wmnet,backup[1003-1008].eqiad.wmnet,cloudstore[1008-1009].wikimedia.org,dumpsdata[1001-1002].eqiad.wmnet,kafka-jumbo[1001-1006].eqiad.wmnet,ms-be[2040-2046,2048-2050,2057-2069].codfw.wmnet,ms-be[1040-1045,1047-1050,1060-1071].eqiad.wmnet,stat1005.eqiad.wmnet
Ok to proceed on 111 hosts? Enter the number of affected hosts to confirm or "q" to quit 111
===== NODE GROUP =====                                                                                                 
(1) stat1005.eqiad.wmnet                                                                                               
----- OUTPUT of 'facter dmi.produ...tational || true' -----                                                            
PowerEdge R730                                                                                                         
===== NODE GROUP =====                                                                                                 
(12) ms-be[2044-2046,2048-2050].codfw.wmnet,ms-be[1044-1045,1047-1050].eqiad.wmnet                                     
----- OUTPUT of 'facter dmi.produ...tational || true' -----                                                            
PowerEdge R740xd                                                                                                       
===== NODE GROUP =====                                                                                                 
(27) backup2008.codfw.wmnet,backup1008.eqiad.wmnet,ms-be[2057-2069].codfw.wmnet,ms-be[1060-1071].eqiad.wmnet           
----- OUTPUT of 'facter dmi.produ...tational || true' -----                                                            
PowerEdge R740xd2                                                                                                      
===== NODE GROUP =====                                                                                                 
(1) backup1007.eqiad.wmnet                                                                                             
----- OUTPUT of 'facter dmi.produ...tational || true' -----                                                            
PowerEdge R740xd2                                                                                                      
/sys/block/dm-0/queue/rotational                                                                                       
/sys/block/md0/queue/rotational
/sys/block/md1/queue/rotational
/sys/block/md2/queue/rotational
/sys/block/sdb/queue/rotational
/sys/block/sdc/queue/rotational
===== NODE GROUP =====                                                                                                 
(9) backup[2003-2007].codfw.wmnet,backup[1003-1006].eqiad.wmnet                                                        
----- OUTPUT of 'facter dmi.produ...tational || true' -----                                                            
PowerEdge R740xd2                                                                                                      
/sys/block/dm-0/queue/rotational                                                                                       
/sys/block/md0/queue/rotational
/sys/block/md1/queue/rotational
/sys/block/md2/queue/rotational
/sys/block/sda/queue/rotational
/sys/block/sdb/queue/rotational
===== NODE GROUP =====                                                                                                 
(61) an-presto[1001-1005].eqiad.wmnet,an-worker[1078-1095].eqiad.wmnet,analytics[1058-1077].eqiad.wmnet,cloudstore[1008-1009].wikimedia.org,dumpsdata[1001-1002].eqiad.wmnet,kafka-jumbo[1001-1006].eqiad.wmnet,ms-be[2040-2043].codfw.wmnet,ms-be[1040-1043].eqiad.wmnet
----- OUTPUT of 'facter dmi.produ...tational || true' -----                                                            
PowerEdge R730xd                                                                                                       
================

Most backup hosts are seen to have their SSDs correctly described to the OS. Inspecting the iDRAC for some of these systems shows that the SSDs are marked as non-RAID disks (rather than being used as single-member RAID-0 arrays); thus they are presented to the system as JBOD and so fully visible to the OS.
Indeed, the exceptional host backup2008 (where the SSDs are not marked as non-rotational) instead has them as single-member RAID-0 arrays. @jcrespo might be able to confirm whether this was an intended change of setup or not.

Some online docs (e.g. this RH article) suggest that for these RAID controllers the switch between RAID-0 and JBOD is lossless, which implies we could swap to that configuration for the ms-* systems, enabling us to distinguish SSDs from non-SSDs in the installer and puppet. We currently have some pre-production ms nodes of the right hardware (e.g. ms-be2069) which would let us test this theory, although it's not clear whether this migration could be done with any sort of automation or whether each host would need its settings updated by hand...

Alternatively, I don't know if udev or the relevant SCSI drivers could be bullied into understanding the drives correctly; I'm not clear enough on the details to know whether it's the RAID controller lying to the OS (since it knows the virtual disks are SSDs) or the kernel misunderstanding what it's told.
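
For completeness, the rotational flag can also be overridden from the OS side; this is the sort of udev bullying alluded to above, and is only a sketch (we did not adopt it, and the device match below is a placeholder, since a real rule would need a reliable way to pick out the SSDs):

# queue/rotational is writable by root, so the flag can be flipped at runtime
# on a device we know to be an SSD:
echo 0 > /sys/block/sda/queue/rotational

# To persist across reboots, a udev rule can set the same attribute when the
# device appears (KERNEL=="sda" is a placeholder match):
cat > /etc/udev/rules.d/60-ssd-rotational.rules <<'EOF'
ACTION=="add|change", KERNEL=="sda", ATTR{queue/rotational}="0"
EOF
udevadm control --reload-rules && udevadm trigger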

Event Timeline

Indeed, the exceptional host backup2008 (where the SSDs are not marked as non-rotational) instead has them as single-member RAID-0 arrays. @jcrespo might be able to confirm whether this was an intended change of setup or not.

Sorry, with the double negative and the lack of context, I am unable to understand this complexly-worded question.

backup2008 was intended to be exactly like the other (new) backup hosts - small SSDs for the OS and HDDs for storage - and its hardware setup was done by DC Ops. It is currently not fully set up, so it can be used for tests if helpful.

Sorry! I noticed that backup2008 is configured differently from all the other backup hosts (its SSDs are both single-member RAID-0 arrays, rather than non-RAID disks), and I was wondering whether this was intentional.
Given your comment, I think the answer is "no" :)

Host rebooted by mvernon@cumin2002 with reason: testing non-RAIDing SSDs

There are (at least!) 4 ways to configure the RAID controller - its own setup utility (hit ^r during boot), the general BIOS setup (F2 during boot), the web-iDRAC interface, and the megacli tool. AFAICT the system needs to be not running while this process is carried out (because / is on the relevant drives).

I found that using the RAID setup utility (Ctrl-R during the boot sequence) works; the process is:

  1. Reboot node
  2. Ctrl-r during boot sequence, to get into RAID controller setup
  3. The first screen is VD Mgmt; select in turn the two small disks (typically the first two), then F2 or Esc-2 for "Operations" and delete the VD (OK to confirm)
  4. Ctrl-n to move to the next screen, PD Mgmt
  5. Scroll down and select in turn the two SSDs (typically the last two), F2/Esc-2 for Operations, select "Convert to Non-RAID" (and OK to confirm)
  6. Ctrl-n to move to the next screen, Ctrl Mgmt
  7. Scroll/tab to "Select boot device", scroll down past the VDs to the first non-virtual disk, typically 00:02:24, and select that
  8. Tab to "Apply", hit Enter to apply
  9. Esc repeatedly to exit the setup system; when it says "Ctrl-Alt-Delete to reboot", do that (or simply serveraction powercycle from racadm)

This gets the SSDs to be non-RAID without harming their contents; it doesn't fix the problems with disk ordering in the installer or puppet, but if we could (semi-)automatically roll this out to all the backends, we could then look at relying on queue/rotational in future.
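
A quick way to verify the result once the host is back up (mirroring the checks above) is:

# After conversion the former RAID-0 members should appear as plain SCSI disks
# with rotational=0, and the md arrays built on them should show the same:
grep . /sys/block/sd*/queue/rotational /sys/block/md*/queue/rotational
lsblk -d -o NAME,ROTA,MODEL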

There is a 5th way, and it's via the Redfish API ;)
We do have basic support for the Redfish API in spicerack right now, and there is a plan to add support for RAID configuration, but that's not yet ready.
That said, if you have to perform this on a large number of hosts, it could still be worth looking at a one-off approach using what we already have in spicerack and doing the setup via Redfish.
Feel free to ping me offline if you want to explore this approach.
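
For reference, the standard Redfish schema already exposes the controller's own view of each drive (including MediaType), so it's easy to confirm which drives the controller knows are SSDs before attempting a conversion; the conversion itself relies on Dell-specific OEM extensions and isn't shown here. A rough sketch against the iDRAC, where the management hostname, controller FQDD and credentials are placeholders:

IDRAC=ms-be2066.mgmt.codfw.wmnet   # placeholder management address
CTRL=RAID.Integrated.1-1           # placeholder controller FQDD

# List the drives attached to the controller, then print each drive's ID and
# MediaType ("SSD" or "HDD") as reported by the controller itself:
for drive in $(curl -sk -u root:"$PASS" \
      "https://$IDRAC/redfish/v1/Systems/System.Embedded.1/Storage/$CTRL" \
      | jq -r '.Drives[]."@odata.id"'); do
  curl -sk -u root:"$PASS" "https://$IDRAC$drive" \
    | jq -r '[.Id, .MediaType, .CapacityBytes] | @tsv'
done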

Host rebooted by volans@cumin2002 with reason: Converted SSDs to non-RAID

I spoke with @MatthewVernon and he kindly gave me ms-be2066 (a pre-production host) to test the conversion of the SSD disks from RAID VDs to non-RAID disks via Redfish.

It seems pretty doable, but needs some fine-tuning.
I did try something today, but the results weren't what I expected. As a result ms-be2066 is currently unable to reboot; to prevent a reboot loop I've powered it off and downtimed it on Icinga/Alertmanager for 24h.

I'll dig into it more tomorrow morning to see what went wrong. Sorry for any trouble this causes.

ms-be2066 is back online. I've converted almost everything via Redfish in an automated way, but there is one remaining bit, setting the boot disk, for which I haven't yet found the correct way via Redfish. I'll update as soon as I have more news.

Change 801693 had a related patch set uploaded (by Volans; author: Volans):

[operations/cookbooks@master] sre.swift.convert-ssds: add new cookbook

https://gerrit.wikimedia.org/r/801693

Change 801693 merged by jenkins-bot:

[operations/cookbooks@master] sre.swift.convert-ssds: add new cookbook

https://gerrit.wikimedia.org/r/801693

Change 803238 had a related patch set uploaded (by Volans; author: Volans):

[operations/cookbooks@master] sre.swift.convert-ssds: fix typo

https://gerrit.wikimedia.org/r/803238

Change 803238 merged by jenkins-bot:

[operations/cookbooks@master] sre.swift.convert-ssds: fix typo

https://gerrit.wikimedia.org/r/803238

MatthewVernon claimed this task.

Remaining nodes done by hand during reboots for T310483:

mvernon@cumin1001:~$ sudo cumin 'ms-be*' 'cat /sys/block/md0/queue/rotational'
85 hosts will be targeted:
ms-be[2028-2069].codfw.wmnet,ms-be[1028-1033,1035-1071].eqiad.wmnet
Ok to proceed on 85 hosts? Enter the number of affected hosts to confirm or "q" to quit 85
===== NODE GROUP =====                                                                                 
(85) ms-be[2028-2069].codfw.wmnet,ms-be[1028-1033,1035-1071].eqiad.wmnet                               
----- OUTPUT of 'cat /sys/block/md0/queue/rotational' -----                                            
0                                                                                                      
================
jbond reopened this task as Open. Edited Jul 19 2022, 3:58 PM

Wanted to re-open this to understand the original problem, i.e. RAID controllers reporting disks as rotational instead of SSD. From my understanding of the task, we have fixed the swift-specific issue by reconfiguring the disks so that instead of being RAID-0 they are now configured as JBOD. However, this doesn't work for systems that actually make use of the RAID controller; see T313288, where the DB team would like to know whether they have any SSD disks.

It would be useful to confirm whether this issue affects all RAID configurations or just this specific one (all disks in their own RAID-0 set); one suspects it will affect all configurations, but we should confirm. Assuming this is a generic issue, we should try to work out why the RAID controller is lying to the kernel, or at the very least have the puppet fact indicate that the information is unreliable.
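
One way to gather evidence here (illustrative; the device node and megaraid index are placeholders) is to compare what the kernel believes about the virtual disk with what the physical disk behind the controller reports via smartctl's megaraid passthrough:

# What the kernel was told about the virtual disk:
cat /sys/block/sda/queue/rotational

# What the physical disk behind the controller says about itself; for an SSD
# the identify data reports "Solid State Device" as the rotation rate:
smartctl -d megaraid,0 -i /dev/sda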

jbond triaged this task as Medium priority.
jbond edited projects, added DC-Ops; removed SRE-swift-storage.