T268746 [ceph] cloudcephosd1004-1015 think that their hard drives are HDD when they are SSD

	Subject	Repo	Branch	Lines +/-
	cloudcephosd update was not correct	operations/puppet	production	+1 -1
	swapping new cloudcephmon eqiad hosts to partition same as existing	operations/puppet	production	+1 -3

		Status	Subtype	Assigned	Task
		Resolved		dcaro	T268722 Ceph eqiad cluster: osd.44 failing to start
		Resolved		Andrew	T268746 [ceph] cloudcephosd1004-1015 think that their hard drives are HDD when they are SSD

dcaro created this task. Nov 25 2020, 1:40 PM

Restricted Application added a subscriber: Aklapper. Nov 25 2020, 1:40 PM

dcaro renamed this task from "[ceph] cloud1004-1015 think that their hard drives are HDD when they are SSD" to "[ceph] cloudcephosd1004-1015 think that their hard drives are HDD when they are SSD". Nov 25 2020, 1:41 PM

dcaro added a parent task: T268722: Ceph eqiad cluster: osd.44 failing to start.

dcaro removed dcaro as the assignee of this task. Nov 25 2020, 2:53 PM

Andrew subscribed. Nov 25 2020, 3:08 PM

related tasks for this hardware: T251619, T242133

<_dcaro> David Caro hmmm... the new servers for ceph (in codfw) have the same brand of disks (a bit smaller size), but they are detected correctly, I think it might be the RAID controller on the other ones that's messing things up

Mentioned in SAL (#wikimedia-cloud) [2020-11-30T18:12:18Z] <andrewbogott> removing all osds from cloudcephosd1015 in order to investigate T268746

I've moved the workload off of cloudcephosd1015.eqiad.wmnet so we can experiment. For starters @RobH is going to upgrade the firmware (including the raid controller), boot back to the OS, and then we'll see what it looks like. If we need to reinstall the OS for it to re-detect the drives that's also fine.

Troubleshooting:

updated idrac and bios to the newest firmware versions, 4.22.00.53 & 2.9.3
raid bios was already at the latest release, 25.5.6.0009
after all updates, the SSD still reports:

robh@cloudcephosd1015:~$ sudo hdparm -I /dev/sdb 

/dev/sdb:
SG_IO: bad/missing sense data, sb[]:  70 00 05 00 00 00 00 0d 00 00 00 00 20 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

this is not correct: the cloudcephosd hosts from the purchase before this one are identical in config, yet hdparm detects all the features of their SSDs correctly (example: cloudcephosd1002 versus cloudcephosd1015).
it turns out that the disks in cloudcephosd100[1-3] are all non-raid disks, while the new batch was set in raid mode. converted all disks to non-raid and am updating netboot to reimage without hw raid (sw raid setup), to see if the SSDs are then detected correctly.
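A quick cross-check that does not depend on hdparm is the kernel's own view of each drive: Linux exposes /sys/block/<device>/queue/rotational, which reads 1 for rotational media and 0 otherwise. The sketch below is only an illustration and was not part of the troubleshooting on these hosts; the function name rotational_flags is hypothetical, and the sysfs root is a parameter so it can be pointed at a test directory.

```python
from pathlib import Path


def rotational_flags(sys_block="/sys/block"):
    """Return {device_name: True if the kernel reports the device as rotational}.

    Hypothetical helper: reads <sys_block>/<dev>/queue/rotational, which the
    Linux kernel exposes as "1" (rotational) or "0" (non-rotational).
    Entries without that attribute are skipped.
    """
    flags = {}
    for dev in sorted(Path(sys_block).iterdir()):
        attr = dev / "queue" / "rotational"
        if attr.is_file():
            flags[dev.name] = attr.read_text().strip() == "1"
    return flags
```

Calling rotational_flags() on a good host (for example cloudcephosd1002) and on an affected one would give a side-by-side comparison. If the hosts really treat the SSDs as HDDs, this flag is one place where that may show up.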

Change 644578 had a related patch set uploaded (by RobH; owner: RobH):
[operations/puppet@production] swapping new cloudcephmon eqiad hosts to partition same as existing

https://gerrit.wikimedia.org/r/644578

gerritbot added a project: Patch-For-Review. Dec 1 2020, 6:20 PM

Change 644578 merged by RobH:
[operations/puppet@production] swapping new cloudcephmon eqiad hosts to partition same as existing

https://gerrit.wikimedia.org/r/644578

Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts:

cloudcephosd1015.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202012011832_robh_7493_cloudcephosd1015_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['cloudcephosd1015.eqiad.wmnet']

Of which those FAILED:

['cloudcephosd1015.eqiad.wmnet']

Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts:

cloudcephosd1015.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202012011838_robh_14418_cloudcephosd1015_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['cloudcephosd1015.eqiad.wmnet']

Of which those FAILED:

['cloudcephosd1015.eqiad.wmnet']

Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts:

cloudcephosd1015.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202012011839_robh_15155_cloudcephosd1015_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['cloudcephosd1015.eqiad.wmnet']

Of which those FAILED:

['cloudcephosd1015.eqiad.wmnet']

Change 644593 had a related patch set uploaded (by RobH; owner: RobH):
[operations/puppet@production] cloudcephosd update was not correct

https://gerrit.wikimedia.org/r/644593

Change 644593 merged by RobH:
[operations/puppet@production] cloudcephosd update was not correct

https://gerrit.wikimedia.org/r/644593

Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts:

cloudcephosd1015.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202012011857_robh_31173_cloudcephosd1015_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['cloudcephosd1015.eqiad.wmnet']

Of which those FAILED:

['cloudcephosd1015.eqiad.wmnet']

Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts:

cloudcephosd1015.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202012011858_robh_31547_cloudcephosd1015_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['cloudcephosd1015.eqiad.wmnet']

Of which those FAILED:

['cloudcephosd1015.eqiad.wmnet']

Maintenance_bot removed a project: Patch-For-Review. Dec 1 2020, 7:10 PM

Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts:

cloudcephosd1015.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202012011919_robh_20552_cloudcephosd1015_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['cloudcephosd1015.eqiad.wmnet']

and were ALL successful.

detailing the fix in this comment, and then will copy this into the task description.

The original hosts from a previous order were cloudcephosd100[1-3]. These were all set up with ALL disks in non-raid mode, presenting as JBOD, and with a software raid1 mirror written to the two smaller SSDs. It appears that if the SSDs are put into a hw raid behind the controller, their advanced SSD trim functions are not accessible. The fix was to take cloudcephosd1015, convert all its disks to non-raid, update netboot so that ALL eqiad cloudcephosd hosts use software raid, and reimage the host.

The checklist for this is as follows, using cloudcephosd1004 as an example:

cloudcephosd1004:

- cloud-services-team depools host from service
- set up an ssh tunnel into cumin, so you can pull up the host's https mgmt interface. If you do this via ssh only, it takes a bit longer to convert all the disks.
- ensure the system is powered up, as the controller cannot access the disks otherwise. The host must not be sitting in the BIOS either, or the controller cannot make changes.
- access https://cloudcephosd1004.mgmt.eqiad.wmnet and log in to do the following steps:
- Configuration > Storage Configuration > Controller Configuration > Reset Configuration
- Configuration > Storage Configuration > Physical Disk Configuration > Drop down next to each disk, convert to non-raid
- Commit all changes via Apply Now, then use the pop-up to watch the progress in Job Queue. If you don't see progress, the host may be powered down or in the BIOS; just reboot it and it'll apply the changes.
- reimage the host with wmf-auto-reimage-host
- cloud-services-team returns host to service

RobH triaged this task as Medium priority. Dec 1 2020, 7:58 PM

RobH updated the task description.

Mentioned in SAL (#wikimedia-cloud) [2020-12-01T20:06:51Z] <andrewbogott> removing all osds on cloudcephosd1014 for rebuild, T268746

Andrew is going to attempt all steps to fix cloudcephosd1014. So reassigning to him for that. If there are issues, I'm around to assist via irc.

RobH unsubscribed. Dec 1 2020, 8:07 PM

nskaggs subscribed. Dec 1 2020, 8:20 PM

Andrew updated the task description. Dec 1 2020, 8:38 PM

Andrew updated the task description. Dec 1 2020, 8:51 PM

Andrew updated the task description. Dec 1 2020, 9:27 PM

root@cloudcephosd1014:~# hdparm -I /dev/sdc

/dev/sdc:

ATA device, with non-removable media
Model Number: MTFDDAK1T9TDN
Serial Number: 19472511BD26
Firmware Revision: D1DF003
Transport: Serial, ATA8-AST, SATA 1.0a, SATA II Extensions, SATA Rev 2.5, SATA Rev 2.6, SATA Rev 3.0
Standards:
Used: unknown (minor revision code 0x006d)
Supported: 10 9 8 7 6 5
Likely used: 10
Configuration:
Logical max current
cylinders 16383 0
heads 16 0
sectors/track 63 0

LBA user addressable sectors: 268435455
LBA48 user addressable sectors: 3750748848
Logical Sector size: 512 bytes
Physical Sector size: 4096 bytes
Logical Sector-0 offset: 0 bytes
device size with M = 1024*1024: 1831420 MBytes
device size with M = 1000*1000: 1920383 MBytes (1920 GB)
cache/buffer size = unknown
Form Factor: 2.5 inch
Nominal Media Rotation Rate: Solid State Device
Capabilities:
LBA, IORDY(can be disabled)
Queue depth: 32
Standby timer values: spec'd by Standard, with device specific minimum
R/W multiple sector transfer: Max = 16 Current = 16
Advanced power management level: 254
DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6

	     Cycle time: min=120ns recommended=120ns

PIO: pio0 pio1 pio2 pio3 pio4

	     Cycle time: no flow control=120ns  IORDY flow control=120ns

Commands/features:
Enabled Supported:

SMART feature set
Power Management feature set
Write cache
Look-ahead
WRITE_BUFFER command
READ_BUFFER command
NOP cmd
DOWNLOAD_MICROCODE
Advanced Power Management feature set
48-bit Address feature set
Mandatory FLUSH_CACHE
FLUSH_CACHE_EXT
SMART error logging
SMART self-test
General Purpose Logging feature set
64-bit World wide name
IDLE_IMMEDIATE with UNLOAD

	    	Write-Read-Verify feature set
	   *	WRITE_UNCORRECTABLE_EXT command
	   *	{READ,WRITE}_DMA_EXT_GPL commands
	   *	Segmented DOWNLOAD_MICROCODE
	    	unknown 119[6]
	    	unknown 119[8]
	   *	Gen1 signaling speed (1.5Gb/s)
	   *	Gen2 signaling speed (3.0Gb/s)
	   *	Gen3 signaling speed (6.0Gb/s)
	   *	Native Command Queueing (NCQ)
	   *	Phy event counters
	   *	NCQ priority information
	   *	READ_LOG_DMA_EXT equivalent to READ_LOG_EXT
	   *	DMA Setup Auto-Activate optimization
	   *	Software settings preservation
	   *	SMART Command Transport (SCT) feature set
	   *	SCT Write Same (AC2)
	   *	SCT Error Recovery Control (AC3)
	   *	SCT Features Control (AC4)
	   *	SCT Data Tables (AC5)
	   *	SANITIZE_ANTIFREEZE_LOCK_EXT command
	   *	SANITIZE feature set
	   *	CRYPTO_SCRAMBLE_EXT command
	   *	BLOCK_ERASE_EXT command
	   *	DOWNLOAD MICROCODE DMA command
	   *	WRITE BUFFER DMA command
	   *	READ BUFFER DMA command
	   *	Data Set Management TRIM supported (limit 8 blocks)
	   *	Deterministic read ZEROs after TRIM

Logical Unit WWN Device Identifier: 500a07512511bd26
NAA : 5
IEEE OUI : 00a075
Unique ID : 12511bd26
Checksum: correct
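For the remaining hosts, the same check can be scripted rather than read by eye. The sketch below pulls the few relevant facts out of hdparm -I text: whether the identify data came back at all, what rotation rate is reported, and whether TRIM is marked as supported. It is a hypothetical helper (summarize_hdparm is not an existing tool) and only matches the line formats visible in the outputs pasted in this task.

```python
import re


def summarize_hdparm(output):
    """Extract a few SSD-related facts from `hdparm -I` text output.

    Hypothetical helper; it only recognises the line formats seen in the
    outputs quoted in this task.
    """
    rate = re.search(r"Nominal Media Rotation Rate:\s*(.+)", output)
    trim = re.search(
        r"^\s*\*\s*Data Set Management TRIM supported", output, re.MULTILINE
    )
    return {
        # The failing host printed this SG_IO error instead of identify data.
        "identify_ok": "SG_IO: bad/missing sense data" not in output,
        "rotation_rate": rate.group(1).strip() if rate else None,
        "solid_state": bool(rate) and "Solid State" in rate.group(1),
        # A leading "*" in hdparm's feature list marks an enabled feature.
        "trim": trim is not None,
    }
```

Feeding it the text captured from sudo hdparm -I /dev/sdX on each rebuilt host gives one small dictionary per drive that can be compared across hosts.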

Andrew updated the task description. Dec 1 2020, 10:13 PM

Andrew updated the task description. Dec 2 2020, 4:47 AM

Mentioned in SAL (#wikimedia-cloud) [2020-12-02T15:08:42Z] <andrewbogott> removing all osds on cloudcephosd1012 for rebuild, T268746

Andrew updated the task description. Dec 2 2020, 6:46 PM

Mentioned in SAL (#wikimedia-cloud) [2020-12-02T20:03:56Z] <andrewbogott> removing all osds on cloudcephosd1010 for rebuild, T268746

Andrew updated the task description. Dec 2 2020, 9:15 PM

Mentioned in SAL (#wikimedia-cloud) [2020-12-03T02:55:22Z] <andrewbogott> removing all osds on cloudcephosd1009 for rebuild, T268746

Andrew updated the task description. Dec 3 2020, 3:59 AM

Mentioned in SAL (#wikimedia-cloud) [2020-12-03T13:24:15Z] <andrewbogott> removing all osds on cloudcephosd1008 for rebuild, T268746

Andrew updated the task description. Dec 3 2020, 2:10 PM

Andrew updated the task description. Dec 3 2020, 6:16 PM

Mentioned in SAL (#wikimedia-cloud) [2020-12-03T19:51:46Z] <andrewbogott> removing all osds on cloudcephosd1006 for rebuild, T268746

Mentioned in SAL (#wikimedia-cloud) [2020-12-03T21:45:48Z] <andrewbogott> removing all osds on cloudcephosd1005 for rebuild, T268746

Mentioned in SAL (#wikimedia-cloud) [2020-12-03T23:21:32Z] <andrewbogott> removing all osds on cloudcephosd1004 for rebuild, T268746

Andrew updated the task description. Dec 4 2020, 12:26 AM

Everything is rebuilt as JBOD and put back in service. Looks ok so far!
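A possible follow-up check on the Ceph side is to confirm that no OSD is still classed as something other than ssd. The sketch below assumes the JSON layout of ceph osd tree -f json (a top-level "nodes" list whose entries of type "osd" carry a "device_class" field); the function name osds_not_ssd is hypothetical, and this check was not part of the work recorded in this task.

```python
import json


def osds_not_ssd(osd_tree_json):
    """Return the names of OSDs whose device_class is not "ssd".

    Assumes the JSON layout of `ceph osd tree -f json`: a top-level "nodes"
    list in which entries of type "osd" have a "device_class" field.
    """
    tree = json.loads(osd_tree_json)
    return [
        node["name"]
        for node in tree.get("nodes", [])
        if node.get("type") == "osd" and node.get("device_class") != "ssd"
    ]
```

An empty list would suggest every OSD is classed as ssd; any names returned would be the ones worth a second look.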

dcaro awarded a token. Dec 4 2020, 9:07 AM

Status: Closed, Resolved (Public)