
Ceph eqiad cluster: osd.44 failing to start
Closed, Resolved, Public

Description

After a change in the autoscale setting, the cluster started adapting to a new pg_num and reporting slow operations on osd.44.

The cluster stabilized in HEALTH_WARN with some PGs unable to be allocated and osd.44 misbehaving.

Tried restarting the osd.44 service on cloudcephosd1005 and ended up with the service down due to:

● ceph-osd@44.service - Ceph object storage daemon osd.44
   Loaded: loaded (/lib/systemd/system/ceph-osd@.service; enabled-runtime; vendor preset: enabled)
   Active: active (running) since Wed 2020-11-25 08:37:24 UTC; 5min ago
  Process: 7686 ExecStartPre=/usr/lib/ceph/ceph-osd-prestart.sh --cluster ${CLUSTER} --id 44 (code=exited, status=0/SUCCESS)
 Main PID: 7690 (ceph-osd)
    Tasks: 59
   Memory: 1.7G
   CGroup: /system.slice/system-ceph\x2dosd.slice/ceph-osd@44.service
           └─7690 /usr/bin/ceph-osd -f --cluster ceph --id 44 --setuser ceph --setgroup ceph

Nov 25 08:37:24 cloudcephosd1005 systemd[1]: Starting Ceph object storage daemon osd.44...
Nov 25 08:37:24 cloudcephosd1005 systemd[1]: Started Ceph object storage daemon osd.44.
Nov 25 08:37:30 cloudcephosd1005 ceph-osd[7690]: 2020-11-25 08:37:30.314 7f56c8a01c80 -1 osd.44 106484 log_to_monitors {default=true}
Nov 25 08:37:30 cloudcephosd1005 ceph-osd[7690]: 2020-11-25 08:37:30.322 7f56c8a01c80 -1 osd.44 106484 mon_cmd_maybe_osd_create fail: 'osd.44 has already bound to class 'ssd', can not reset class to 'hdd'; use 'ceph osd crush rm-device-class <id>' to remove old class first': (16) Device or resource busy
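
For reference, the restart and the status/journal output above were gathered with the usual systemd tooling; this is a sketch of the presumed commands, not an exact transcript:

# restart the single OSD daemon and check its state
systemctl restart ceph-osd@44.service
systemctl status ceph-osd@44.service
# follow the recent journal entries for that unit
journalctl -u ceph-osd@44.service --since "10 minutes ago"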

The hdd class does not really exist in the cluster (afaics):

root@cloudcephosd1005:/var/lib/ceph/osd/ceph-44# ceph osd crush class ls
[
    "ssd"
]

And osd.44 is already in the ssd class:

root@cloudcephosd1005:/var/lib/ceph/osd/ceph-44# ceph osd crush get-device-class osd.44
ssd

Tried removing the class and re-adding it for that osd, with no change:

root@cloudcephosd1005:/var/lib/ceph/osd/ceph-44# ceph osd crush rm-device-class osd.44
done removing class of osd(s): 44

root@cloudcephosd1005:/var/lib/ceph/osd/ceph-44# ceph osd crush set-device-class ssd osd.44
set osd(s) 44 to class 'ssd'

Event Timeline

Mentioned in SAL (#wikimedia-cloud) [2020-11-25T08:45:42Z] <_dcaro> Tried resetting the class for osd.44 to ssd, no luck, the cluster is in noout/norebalance to avoid data shuffling (opened T268722)

Mentioned in SAL (#wikimedia-cloud) [2020-11-25T08:54:29Z] <_dcaro> Unsetting noup/nodown to allow re-shuffling of the pgs that osd.44 had, will try to rebuild it (T268722)
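
For reference, the cluster flags mentioned in the SAL entries above are toggled with the standard Ceph CLI; a sketch, not the exact commands that were run:

# keep OSDs from being marked out and stop data movement while debugging
ceph osd set noout
ceph osd set norebalance
# allow OSDs to change up/down state again (undo noup/nodown)
ceph osd unset noup
ceph osd unset nodown
# once the cluster is healthy again, clear the remaining flags
ceph osd unset noout
ceph osd unset norebalance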

Mentioned in SAL (#wikimedia-cloud) [2020-11-25T09:31:19Z] <_dcaro> The OSD seems to be up and running actually, though there's that misleading log, will leave it see if the cluster comes fully healthy (T268722)

dcaro triaged this task as High priority. Nov 25 2020, 9:57 AM

It looks like our drives are reporting the wrong rotational flag:

# cat /sys/block/sdd/queue/rotational
1

(ceph checks that here https://github.com/ceph/ceph/blob/25ac1528419371686740412616145703810a561f/src/common/blkdev.cc#L222)
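
To see what the kernel reports for every drive on the host, a quick loop over sysfs works (a sketch; 1 means rotational/HDD, 0 means non-rotational/SSD):

# list the rotational flag for all sd* block devices
for dev in /sys/block/sd*/queue/rotational; do
    echo "$dev: $(cat "$dev")"
done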

So on startup, since the OSD has this option enabled (the default):

# ceph daemon osd.44 config show | jq ".osd_class_update_on_start"
"true"

It tries to register the osd with the detected class (hdd) and fails, though it continues and eventually comes up anyway.
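
That behaviour can be disabled so the OSD keeps whatever class was set manually; a sketch, assuming the centralized config store is in use (not what was done here):

# stop OSDs from re-registering their device class on every start
ceph config set osd osd_class_update_on_start false
# equivalent ceph.conf setting, under [osd]:
#   osd_class_update_on_start = false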

The rotational flag can be manually overridden (https://lwn.net/Articles/408428/, https://www.mail-archive.com/ceph-users@ceph.io/msg07631.html); that should also improve how the OS handles the SSDs.
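
A minimal sketch of such an override via udev, assuming plain sd* devices (the rule file name and the device match are illustrative only):

# /etc/udev/rules.d/60-rotational-override.rules (hypothetical file name)
# force the kernel to treat these devices as non-rotational (SSD)
ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/rotational}="0"

# or as a one-off per device, lost on reboot:
echo 0 > /sys/block/sdd/queue/rotational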

The error is gone now that the servers have been rebuilt and the SSDs are detected correctly :)

\o/