Status | Subtype | Assigned | Task
---|---|---|---
Open | | dcaro | T334240 [cloudceph] Slow operations - tracking task
Resolved | | taavi | T348634 ceph slow ops 2023-10-11
In Progress | | dcaro | T348643 cloudcephosd1021-1034: hard drive sector errors increasing
Resolved | | taavi | T349425 CephSlowOps Ceph cluster in has slow ops, which might be blocking some writes
Resolved | | taavi | T352570 CephSlowOps Ceph cluster in eqiad has slow ops, which might be blocking some writes
In Progress | | dcaro | T348716 [ceph] export number of bad sectors per-disk
Open | | dcaro | T349694 [ceph] Enable disk failure prediction
Open | | None | T348633 [api-gateway] add alert for uptime
Event Timeline
Mentioned in SAL (#wikimedia-cloud) [2023-10-11T12:01:48Z] <taavi> reboot tools-sgecron-2 T348634
Mentioned in SAL (#wikimedia-cloud) [2023-10-11T12:04:04Z] <taavi> reboot k8s workers 72, 75, 82 T348634
I found some log entries:
aborrero@cloudcephmon1001:~ $ sudo journalctl | grep "slow request"
[..]
Oct 11 11:39:15 cloudcephmon1001 ceph-mon[818]: osd.213 osd.213 70096 : slow request osd_op(client.708007507.0:23277951 3.43c 3:3c28fde9:::rbd_data.6612a153c4b8d9.0000000000005b0e:head [write 3313664~4096 in=4096b] snapc 3d2=[] ondisk+write+known_if_redirected e39372647) initiated 2023-10-11T11:25:06.023363+0000 currently delayed
Oct 11 11:39:15 cloudcephmon1001 ceph-mon[818]: osd.213 osd.213 70097 : slow request osd_op(client.620994630.0:11346275 3.43c 3:3c21029d:::rbd_data.0e21416b8b4567.0000000000000c01:head [write 1339904~4096 in=4096b] snapc 13f9eb=[13f9eb,fd531] ondisk+write+known_if_redirected e39372957) initiated 2023-10-11T11:31:46.138588+0000 currently delayed
This is for 2 OSDs:
aborrero@cloudcephmon1001:~ $ sudo journalctl | grep "slow request" | awk -F' ' '{print $6}' | sort | uniq
osd.106
osd.213
aborrero@cloudcephmon1001:~ $ sudo ceph osd status | grep ^213
213  cloudcephosd1027  1193G  595G  10  144k   41  2147k  exists,up
aborrero@cloudcephmon1001:~ $ sudo ceph osd status | grep ^106
106  cloudcephosd1014  1255G  532G  64  2179k  39  2283k  exists,up
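For reference, the slow ops can also be inspected directly through the OSD admin socket on the affected host. This is only a sketch (assuming root access on cloudcephosd1027 and a recent enough Ceph release for the historic slow ops dump), not output from this incident:

# on the OSD host, list the operations currently in flight on the affected OSD
sudo ceph daemon osd.213 dump_ops_in_flight
# recent operations that exceeded the slow-op threshold (newer Ceph releases)
sudo ceph daemon osd.213 dump_historic_slow_ops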
Mentioned in SAL (#wikimedia-cloud) [2023-10-11T12:11:54Z] <taavi> reboot k8s workers 48, 60, 65, 68, 70, 76 T348634
Found potential SMART disk errors on cloudcephosd1027:
aborrero@cloudcephosd1027:~ $ sudo journalctl | grep smart_failure | tail -20
Oct 11 09:18:01 cloudcephosd1027 smart_failure[2819002]: This message was generated by the smartd daemon running on:
Oct 11 09:18:01 cloudcephosd1027 smart_failure[2819002]:
Oct 11 09:18:01 cloudcephosd1027 smart_failure[2819002]: host name: cloudcephosd1027
Oct 11 09:18:01 cloudcephosd1027 smart_failure[2819002]: DNS domain: eqiad.wmnet
Oct 11 09:18:01 cloudcephosd1027 smart_failure[2819002]:
Oct 11 09:18:01 cloudcephosd1027 smart_failure[2819002]: The following warning/error was logged by the smartd daemon:
Oct 11 09:18:01 cloudcephosd1027 smart_failure[2819002]:
Oct 11 09:18:01 cloudcephosd1027 smart_failure[2819002]: Device: /dev/sdi [SAT], 1080 Offline uncorrectable sectors
Oct 11 09:18:01 cloudcephosd1027 smart_failure[2819002]:
Oct 11 09:18:01 cloudcephosd1027 smart_failure[2819002]: Device info:
Oct 11 09:18:01 cloudcephosd1027 smart_failure[2819002]: HFS1T9G32FEH-BA10A, S/N:KSA8N4825I0408C5E, WWN:5-ace42e-025346fa5, FW:DD02, 1.92 TB
Oct 11 09:18:01 cloudcephosd1027 smart_failure[2819002]:
Oct 11 09:18:01 cloudcephosd1027 smart_failure[2819002]: For details see host's SYSLOG.
Oct 11 09:18:01 cloudcephosd1027 smart_failure[2819002]:
Oct 11 09:18:01 cloudcephosd1027 smart_failure[2819002]: You can also use the smartctl utility for further investigation.
Oct 11 09:18:01 cloudcephosd1027 smart_failure[2819002]: The original message about this issue was sent at Tue Sep 13 02:57:36 2022 UTC
Oct 11 09:18:01 cloudcephosd1027 smart_failure[2819002]: Another message will be sent in 24 hours if the problem persists.
This affects the following disks:
aborrero@cloudcephosd1027:~ $ sudo journalctl | grep smart_failure | grep "Device: " | awk -F' ' '{print $7}' | sort | uniq
/dev/sdc
/dev/sdd
/dev/sde
/dev/sdf
/dev/sdg
/dev/sdh
/dev/sdi
/dev/sdj
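As the smartd message itself suggests, smartctl can be used to dig further into any of these disks. A sketch (not run as part of this incident), using /dev/sdi as the example device:

# full SMART attribute table and error log for one of the flagged disks
sudo smartctl -a /dev/sdi
# overall health self-assessment only
sudo smartctl -H /dev/sdi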
User-facing issues are fixed and the immediate incident is over; follow-up work is being tracked in subtasks.
Mentioned in SAL (#wikimedia-cloud) [2023-10-11T14:16:30Z] <dcaro> rebooting tools-sgeweblight-10-16 due to stuck NFS (T348634)
Mentioned in SAL (#wikimedia-cloud-feed) [2023-10-11T14:24:25Z] <wm-bot2> dcaro@urcuchillay START - Cookbook wmcs.toolforge.grid.reboot_workers for weblight nodes (T348634)
Mentioned in SAL (#wikimedia-cloud-feed) [2023-10-11T14:41:37Z] <wm-bot2> dcaro@urcuchillay END (FAIL) - Cookbook wmcs.toolforge.grid.reboot_workers (exit_code=99) for weblight nodes (T348634)
It's happening again, two OSDs affected:
root@cloudcephmon1001:~# ceph osd find 111
{
    "osd": 111,
    "addrs": {
        "addrvec": [
            {
                "type": "v2",
                "addr": "10.64.20.65:6828",
                "nonce": 31822
            },
            {
                "type": "v1",
                "addr": "10.64.20.65:6829",
                "nonce": 31822
            }
        ]
    },
    "osd_fsid": "fc1a8902-c160-4338-964a-079e2c32eec8",
    "host": "cloudcephosd1014",
    "crush_location": {
        "host": "cloudcephosd1014",
        "rack": "D5",
        "root": "default"
    }
}
root@cloudcephmon1001:~# ceph osd find 194
{
    "osd": 194,
    "addrs": {
        "addrvec": [
            {
                "type": "v2",
                "addr": "10.64.148.2:6812",
                "nonce": 2077
            },
            {
                "type": "v1",
                "addr": "10.64.148.2:6813",
                "nonce": 2077
            }
        ]
    },
    "osd_fsid": "3d130038-f2e2-4c60-9c30-c3a39668265c",
    "host": "cloudcephosd1025",
    "crush_location": {
        "host": "cloudcephosd1025",
        "rack": "E4",
        "root": "default"
    }
}
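For completeness, a quick way to confirm which OSDs are currently flagged for slow ops from a mon host (sketch only, assuming the SLOW_OPS health warning is active):

# health detail lists the OSDs with slow ops and how long the oldest one has been blocked
sudo ceph health detail | grep -i 'slow ops'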