Initial Issue Report
It seems that the hardware differs from that of the hosts we used for the POC, and the current kernel/drivers are unable to detect the drives properly.
Kernel rotational setting (1 means the kernel treats the SSD as a rotational disk):
root@cloudcephosd1012:~# cat /sys/block/sdb/queue/rotational
1
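To compare every drive on a host at once rather than one device at a time, the same sysfs attribute can be read in a loop. This is a minimal sketch, not part of the original report; the function name `list_rotational` and its optional sysfs-root argument are assumptions added for illustration and testing.

```shell
#!/bin/sh
# Print "<device> <rotational flag>" for every block device.
# The optional first argument (default: /sys) is a hypothetical knob that
# lets the function be pointed at a fake sysfs tree for testing.
list_rotational() {
    sysfs_root="${1:-/sys}"
    for f in "$sysfs_root"/block/*/queue/rotational; do
        # Skip unmatched globs and unreadable entries.
        [ -r "$f" ] || continue
        dev="${f%/queue/rotational}"   # drop the attribute path
        dev="${dev##*/}"               # keep only the device name
        printf '%s %s\n' "$dev" "$(cat "$f")"
    done
}
```

Calling `list_rotational` with no argument reads the live /sys tree; an SSD that is detected correctly is expected to show 0.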
Smartctl is able to show more info, though not complete:
# smartctl -a /dev/sdb
smartctl 6.6 2017-11-05 r4594 [x86_64-linux-4.19.0-10-amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

Smartctl open device: /dev/sdb failed: DELL or MegaRaid controller, please try adding '-d megaraid,N'

root@cloudcephosd1012:~# smartctl -a /dev/sdb -d megaraid,[
smartctl 6.6 2017-11-05 r4594 [x86_64-linux-4.19.0-10-amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

/dev/sdb: Unknown device type 'megaraid,['
=======> VALID ARGUMENTS ARE: ata, scsi, nvme[,NSID], sat[,auto][,N][+TYPE], usbcypress[,X], usbjmicron[,p][,x][,N], usbprolific, usbsunplus, intelliprop,N[+TYPE], marvell, areca,N/E, 3ware,N, hpt,L/M/N, megaraid,N, aacraid,H,L,ID, cciss,N, auto, test <=======

Use smartctl -h to get a usage summary

root@cloudcephosd1012:~# smartctl -a /dev/sdb -d megaraid,0
smartctl 6.6 2017-11-05 r4594 [x86_64-linux-4.19.0-10-amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     MTFDDAK240TCB
Serial Number:    200225EE619E
LU WWN Device Id: 5 00a075 125ee619e
Add. Product Id:  DELL(tm)
Firmware Version: D0DE012
User Capacity:    240,057,409,536 bytes [240 GB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    Solid State Device
Form Factor:      2.5 inches
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Wed Nov 25 13:33:49 2020 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
...
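The field that contradicts the kernel's rotational flag is "Rotation Rate". A small filter can pull just that value out of smartctl's output. This is a sketch under assumptions: the helper name `rotation_rate` is hypothetical, and it only relies on the field layout shown in the transcript above.

```shell
#!/bin/sh
# Read smartctl output on stdin and print only the value of the
# "Rotation Rate:" field (prints nothing if the field is absent).
rotation_rate() {
    sed -n 's/^Rotation Rate:[[:space:]]*//p'
}
```

On the affected host this could be used as `smartctl -i -d megaraid,0 /dev/sdb | rotation_rate`, where `megaraid,0` is the device index that worked in the transcript.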
hdparm shows only a very small subset of capabilities:
root@cloudcephosd1012:~# hdparm -I /dev/sdb

/dev/sdb:
SG_IO: bad/missing sense data, sb[]:  70 00 05 00 00 00 00 0d 00 00 00 00 20 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

ATA device, with non-removable media
Standards:
        Likely used: 1
Configuration:
        Logical         max     current
        cylinders       0       0
        heads           0       0
        sectors/track   0       0
        --
        Logical/Physical Sector size:           512 bytes
        device size with M = 1024*1024:           0 MBytes
        device size with M = 1000*1000:           0 MBytes
        cache/buffer size  = unknown
Capabilities:
        IORDY not likely
        Cannot perform double-word IO
        R/W multiple sector transfer: not supported
        DMA: not supported
        PIO: pio0
For example, it is missing the TRIM support that is reported on cloudcephosd1003:
# hdparm -I /dev/sdc | grep TRIM
           *    Data Set Management TRIM supported (limit 4 blocks)
           *    Deterministic read ZEROs after TRIM
This considerably limits our ability to use the SSDs in a performant and long-lasting way.
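Besides hdparm, the kernel exposes its own view of discard (TRIM) support in sysfs: a `discard_max_bytes` value of 0 means the block layer will not issue discards to that device. The check below is a hedged sketch, not something run in the original investigation; the function name `discard_supported` and its optional sysfs-root argument are assumptions for illustration.

```shell
#!/bin/sh
# Return 0 if the kernel reports discard (TRIM) support for a device,
# 1 if it reports none, and 2 if the sysfs attribute cannot be read.
# The second argument (default: /sys) is a hypothetical testing knob.
discard_supported() {
    dev="$1"
    sysfs_root="${2:-/sys}"
    f="$sysfs_root/block/$dev/queue/discard_max_bytes"
    [ -r "$f" ] || return 2
    # A value of 0 means discards are not supported on this device.
    [ "$(cat "$f")" -gt 0 ]
}
```

For example, `discard_supported sdb && echo "sdb: TRIM visible to the kernel"`. `lsblk --discard` shows the same information in tabular form.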
Solution
The original hosts from a previous order were cloudcephosd100[1-3]. They were all set up with every disk in non-raid mode, presenting as JBOD, and with a software raid1 mirror written to the two smaller SSDs. It appears that if the SSDs are put into a hw raid behind the controller, their advanced SSD trim functions are not accessible. The fix was to take cloudcephosd1015, convert all its disks to non-raid, update netboot so that ALL eqiad cloudcephosd hosts use software raid, and reimage the host.
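After a reimage, one way to confirm that the software raid1 mirror exists is to look for raid1 arrays in /proc/mdstat. The following is a minimal sketch that assumes the standard mdstat line layout (`mdN : active raid1 ...`); the function name `raid1_arrays` and the optional file argument are hypothetical.

```shell
#!/bin/sh
# Print the name of every md array whose level is raid1.
# The optional argument (default: /proc/mdstat) lets the function be
# pointed at a saved copy of the file for testing.
raid1_arrays() {
    mdstat="${1:-/proc/mdstat}"
    awk '$2 == ":" {
        # Array lines look like "md0 : active raid1 sdb2[1] sda2[0]";
        # scan the remaining fields for the raid level.
        for (i = 3; i <= NF; i++)
            if ($i == "raid1") { print $1; break }
    }' "$mdstat"
}
```

An empty result on a reimaged host would suggest the mirror was not created.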
cloudcephosd1004:
- - cloud-services-team depools host from service
- - set up an ssh tunnel into cumin, so you can pull up the host's https mgmt. If you do this over ssh only, it takes a bit longer to convert all the disks.
- - ensure the system is powered up, as the controller cannot access the disks otherwise. The host must not be in the BIOS, or the controller cannot make changes.
- - access https://cloudcephosd1004.mgmt.eqiad.wmnet and log in to do the following steps:
- - Configuration > Storage Configuration > Controller Configuration > Reset Configuration
- - Configuration > Storage Configuration > Physical Disk Configuration > Drop down next to each disk, convert to non-raid
- - Commit all changes via Apply Now, then use the pop-up to watch the progress in Job Queue. If you don't see progress, the host may be powered down or in the BIOS; just reboot it and it'll apply the changes.
- - reimage the host with wmf-auto-reimage-host
- - cloud-services-team returns host to service
cloudcephosd1005:
- - cloud-services-team depools host from service
- - set up an ssh tunnel into cumin, so you can pull up the host's https mgmt. If you do this over ssh only, it takes a bit longer to convert all the disks.
- - ensure the system is powered up, as the controller cannot access the disks otherwise. The host must not be in the BIOS, or the controller cannot make changes.
- - access https://cloudcephosd1005.mgmt.eqiad.wmnet and log in to do the following steps:
- - Configuration > Storage Configuration > Controller Configuration > Reset Configuration
- - Configuration > Storage Configuration > Physical Disk Configuration > Drop down next to each disk, convert to non-raid
- - Commit all changes via Apply Now, then use the pop-up to watch the progress in Job Queue. If you don't see progress, the host may be powered down or in the BIOS; just reboot it and it'll apply the changes.
- - reimage the host with wmf-auto-reimage-host
- - cloud-services-team returns host to service
cloudcephosd1006:
- - cloud-services-team depools host from service
- - set up an ssh tunnel into cumin, so you can pull up the host's https mgmt. If you do this over ssh only, it takes a bit longer to convert all the disks.
- - ensure the system is powered up, as the controller cannot access the disks otherwise. The host must not be in the BIOS, or the controller cannot make changes.
- - access https://cloudcephosd1006.mgmt.eqiad.wmnet and log in to do the following steps:
- - Configuration > Storage Configuration > Controller Configuration > Reset Configuration
- - Configuration > Storage Configuration > Physical Disk Configuration > Drop down next to each disk, convert to non-raid
- - Commit all changes via Apply Now, then use the pop-up to watch the progress in Job Queue. If you don't see progress, the host may be powered down or in the BIOS; just reboot it and it'll apply the changes.
- - reimage the host with wmf-auto-reimage-host
- - cloud-services-team returns host to service
cloudcephosd1007:
- - cloud-services-team depools host from service
- - set up an ssh tunnel into cumin, so you can pull up the host's https mgmt. If you do this over ssh only, it takes a bit longer to convert all the disks.
- - ensure the system is powered up, as the controller cannot access the disks otherwise. The host must not be in the BIOS, or the controller cannot make changes.
- - access https://cloudcephosd1007.mgmt.eqiad.wmnet and log in to do the following steps:
- - Configuration > Storage Configuration > Controller Configuration > Reset Configuration
- - Configuration > Storage Configuration > Physical Disk Configuration > Drop down next to each disk, convert to non-raid
- - Commit all changes via Apply Now, then use the pop-up to watch the progress in Job Queue. If you don't see progress, the host may be powered down or in the BIOS; just reboot it and it'll apply the changes.
- - reimage the host with wmf-auto-reimage-host
- - cloud-services-team returns host to service
cloudcephosd1008:
- - cloud-services-team depools host from service
- - set up an ssh tunnel into cumin, so you can pull up the host's https mgmt. If you do this over ssh only, it takes a bit longer to convert all the disks.
- - ensure the system is powered up, as the controller cannot access the disks otherwise. The host must not be in the BIOS, or the controller cannot make changes.
- - access https://cloudcephosd1008.mgmt.eqiad.wmnet and log in to do the following steps:
- - Configuration > Storage Configuration > Controller Configuration > Reset Configuration
- - Configuration > Storage Configuration > Physical Disk Configuration > Drop down next to each disk, convert to non-raid
- - Commit all changes via Apply Now, then use the pop-up to watch the progress in Job Queue. If you don't see progress, the host may be powered down or in the BIOS; just reboot it and it'll apply the changes.
- - reimage the host with wmf-auto-reimage-host
- - cloud-services-team returns host to service
cloudcephosd1009:
- - cloud-services-team depools host from service
- - set up an ssh tunnel into cumin, so you can pull up the host's https mgmt. If you do this over ssh only, it takes a bit longer to convert all the disks.
- - ensure the system is powered up, as the controller cannot access the disks otherwise. The host must not be in the BIOS, or the controller cannot make changes.
- - access https://cloudcephosd1009.mgmt.eqiad.wmnet and log in to do the following steps:
- - Configuration > Storage Configuration > Controller Configuration > Reset Configuration
- - Configuration > Storage Configuration > Physical Disk Configuration > Drop down next to each disk, convert to non-raid
- - Commit all changes via Apply Now, then use the pop-up to watch the progress in Job Queue. If you don't see progress, the host may be powered down or in the BIOS; just reboot it and it'll apply the changes.
- - reimage the host with wmf-auto-reimage-host
- - cloud-services-team returns host to service
cloudcephosd1010:
- - cloud-services-team depools host from service
- - set up an ssh tunnel into cumin, so you can pull up the host's https mgmt. If you do this over ssh only, it takes a bit longer to convert all the disks.
- - ensure the system is powered up, as the controller cannot access the disks otherwise. The host must not be in the BIOS, or the controller cannot make changes.
- - access https://cloudcephosd1010.mgmt.eqiad.wmnet and log in to do the following steps:
- - Configuration > Storage Configuration > Controller Configuration > Reset Configuration
- - Configuration > Storage Configuration > Physical Disk Configuration > Drop down next to each disk, convert to non-raid
- - Commit all changes via Apply Now, then use the pop-up to watch the progress in Job Queue. If you don't see progress, the host may be powered down or in the BIOS; just reboot it and it'll apply the changes.
- - reimage the host with wmf-auto-reimage-host
- - cloud-services-team returns host to service
cloudcephosd1011:
- - cloud-services-team depools host from service
- - set up an ssh tunnel into cumin, so you can pull up the host's https mgmt. If you do this over ssh only, it takes a bit longer to convert all the disks.
- - ensure the system is powered up, as the controller cannot access the disks otherwise. The host must not be in the BIOS, or the controller cannot make changes.
- - access https://cloudcephosd1011.mgmt.eqiad.wmnet and log in to do the following steps:
- - Configuration > Storage Configuration > Controller Configuration > Reset Configuration
- - Configuration > Storage Configuration > Physical Disk Configuration > Drop down next to each disk, convert to non-raid
- - Commit all changes via Apply Now, then use the pop-up to watch the progress in Job Queue. If you don't see progress, the host may be powered down or in the BIOS; just reboot it and it'll apply the changes.
- - reimage the host with wmf-auto-reimage-host
- - cloud-services-team returns host to service
cloudcephosd1012:
- - cloud-services-team depools host from service
- - set up an ssh tunnel into cumin, so you can pull up the host's https mgmt. If you do this over ssh only, it takes a bit longer to convert all the disks.
- - ensure the system is powered up, as the controller cannot access the disks otherwise. The host must not be in the BIOS, or the controller cannot make changes.
- - access https://cloudcephosd1012.mgmt.eqiad.wmnet and log in to do the following steps:
- - Configuration > Storage Configuration > Controller Configuration > Reset Configuration
- - Configuration > Storage Configuration > Physical Disk Configuration > Drop down next to each disk, convert to non-raid
- - Commit all changes via Apply Now, then use the pop-up to watch the progress in Job Queue. If you don't see progress, the host may be powered down or in the BIOS; just reboot it and it'll apply the changes.
- - reimage the host with wmf-auto-reimage-host
- - cloud-services-team returns host to service
cloudcephosd1013:
- - cloud-services-team depools host from service
- - set up an ssh tunnel into cumin, so you can pull up the host's https mgmt. If you do this over ssh only, it takes a bit longer to convert all the disks.
- - ensure the system is powered up, as the controller cannot access the disks otherwise. The host must not be in the BIOS, or the controller cannot make changes.
- - access https://cloudcephosd1013.mgmt.eqiad.wmnet and log in to do the following steps:
- - Configuration > Storage Configuration > Controller Configuration > Reset Configuration
- - Configuration > Storage Configuration > Physical Disk Configuration > Drop down next to each disk, convert to non-raid
- - Commit all changes via Apply Now, then use the pop-up to watch the progress in Job Queue. If you don't see progress, the host may be powered down or in the BIOS; just reboot it and it'll apply the changes.
- - reimage the host with wmf-auto-reimage-host
- - cloud-services-team returns host to service
cloudcephosd1014:
- - cloud-services-team depools host from service
- - set up an ssh tunnel into cumin, so you can pull up the host's https mgmt. If you do this over ssh only, it takes a bit longer to convert all the disks. (ssh -D 8080 cumin1001.eqiad.wmnet or ssh -L 8000:cloudcephosd1014.mgmt.eqiad.wmnet:443 cumin2001.codfw.wmnet)
- - ensure the system is powered up, as the controller cannot access the disks otherwise. The host must not be in the BIOS, or the controller cannot make changes.
- - access https://cloudcephosd1014.mgmt.eqiad.wmnet and log in to do the following steps:
- - Configuration > Storage Configuration > Controller Configuration > Reset Configuration
- - Configuration > Storage Configuration > Physical Disk Configuration > Drop down next to each disk, convert to non-raid
- - Commit all changes via Apply Now, then use the pop-up to watch the progress in Job Queue. If you don't see progress, the host may be powered down or in the BIOS; just reboot it and it'll apply the changes.
- - reimage the host with wmf-auto-reimage-host
- - cloud-services-team returns host to service
cloudcephosd1015:
- - cloud-services-team depools host from service
- - set up an ssh tunnel into cumin, so you can pull up the host's https mgmt. If you do this over ssh only, it takes a bit longer to convert all the disks.
- - ensure the system is powered up, as the controller cannot access the disks otherwise. The host must not be in the BIOS, or the controller cannot make changes.
- - access https://cloudcephosd1015.mgmt.eqiad.wmnet and log in to do the following steps:
- - Configuration > Storage Configuration > Controller Configuration > Reset Configuration
- - Configuration > Storage Configuration > Physical Disk Configuration > Drop down next to each disk, convert to non-raid
- - Commit all changes via Apply Now, then use the pop-up to watch the progress in Job Queue. If you don't see progress, the host may be powered down or in the BIOS; just reboot it and it'll apply the changes.
- - reimage the host with wmf-auto-reimage-host
- - cloud-services-team returns host to service
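Since the tunnel step is identical for every host in the checklists above apart from the hostname, the per-host ssh command can be generated rather than retyped. This is a sketch only: the helper names `tunnel_cmd` and `all_tunnel_cmds` are hypothetical, and the local port 8000 and the cumin host are simply copied from the example given in the cloudcephosd1014 checklist.

```shell
#!/bin/sh
# Print the ssh port-forward command for one host's mgmt interface.
# Port 8000 and the default cumin host come from the cloudcephosd1014
# example; the second argument overrides the cumin host.
tunnel_cmd() {
    host="$1"
    cumin="${2:-cumin2001.codfw.wmnet}"
    printf 'ssh -L 8000:%s.mgmt.eqiad.wmnet:443 %s\n' "$host" "$cumin"
}

# Print the tunnel command for cloudcephosd1004 through cloudcephosd1015.
all_tunnel_cmds() {
    n=1004
    while [ "$n" -le 1015 ]; do
        tunnel_cmd "cloudcephosd$n"
        n=$((n + 1))
    done
}
```

With a tunnel of this form open, the mgmt interface is reached through the forwarded local port (https://localhost:8000) instead of the mgmt hostname directly.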