
ceph slow ops 2023-10-11
Closed, Resolved · Public

Event Timeline

taavi triaged this task as High priority. Oct 11 2023, 11:50 AM
taavi created this task.
11:49:08 <taavi> !log tools reboot tools-sgeexec-10-19

Mentioned in SAL (#wikimedia-cloud) [2023-10-11T12:04:04Z] <taavi> reboot k8s workers 72, 75, 82 T348634

I found some log entries:

aborrero@cloudcephmon1001:~ $ sudo journalctl | grep "slow request"
[..]
Oct 11 11:39:15 cloudcephmon1001 ceph-mon[818]: osd.213 osd.213 70096 : slow request osd_op(client.708007507.0:23277951 3.43c 3:3c28fde9:::rbd_data.6612a153c4b8d9.0000000000005b0e:head [write 3313664~4096 in=4096b] snapc 3d2=[] ondisk+write+known_if_redirected e39372647) initiated 2023-10-11T11:25:06.023363+0000 currently delayed
Oct 11 11:39:15 cloudcephmon1001 ceph-mon[818]: osd.213 osd.213 70097 : slow request osd_op(client.620994630.0:11346275 3.43c 3:3c21029d:::rbd_data.0e21416b8b4567.0000000000000c01:head [write 1339904~4096 in=4096b] snapc 13f9eb=[13f9eb,fd531] ondisk+write+known_if_redirected e39372957) initiated 2023-10-11T11:31:46.138588+0000 currently delayed

These slow requests come from 2 OSDs:

aborrero@cloudcephmon1001:~ $ sudo journalctl | grep "slow request" | awk -F' ' '{print $6}' | sort | uniq
osd.106
osd.213
aborrero@cloudcephmon1001:~ $ sudo ceph osd status | grep ^213
213  cloudcephosd1027  1193G   595G     10      144k     41     2147k  exists,up  
aborrero@cloudcephmon1001:~ $ sudo ceph osd status | grep ^106
106  cloudcephosd1014  1255G   532G     64     2179k     39     2283k  exists,up  
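For reference, roughly how the same slow ops could be cross-checked directly in Ceph (a sketch, not necessarily what was run during the incident; osd.213 and its host cloudcephosd1027 are taken from the output above):

# on a mon host: health detail lists slow ops and the OSDs involved
sudo ceph health detail

# on the OSD's host: dump the ops currently in flight via the admin socket
sudo ceph daemon osd.213 dump_ops_in_flight
sudo ceph daemon osd.213 dump_historic_slow_ops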

Mentioned in SAL (#wikimedia-cloud) [2023-10-11T12:11:54Z] <taavi> reboot k8s workers 48, 60, 65, 68, 70, 76 T348634

Found potential SMART disk errors on cloudcephosd1027:

aborrero@cloudcephosd1027:~ $ sudo journalctl | grep smart_failure | tail -20
Oct 11 09:18:01 cloudcephosd1027 smart_failure[2819002]: This message was generated by the smartd daemon running on:
Oct 11 09:18:01 cloudcephosd1027 smart_failure[2819002]: 
Oct 11 09:18:01 cloudcephosd1027 smart_failure[2819002]:    host name:  cloudcephosd1027
Oct 11 09:18:01 cloudcephosd1027 smart_failure[2819002]:    DNS domain: eqiad.wmnet
Oct 11 09:18:01 cloudcephosd1027 smart_failure[2819002]: 
Oct 11 09:18:01 cloudcephosd1027 smart_failure[2819002]: The following warning/error was logged by the smartd daemon:
Oct 11 09:18:01 cloudcephosd1027 smart_failure[2819002]: 
Oct 11 09:18:01 cloudcephosd1027 smart_failure[2819002]: Device: /dev/sdi [SAT], 1080 Offline uncorrectable sectors
Oct 11 09:18:01 cloudcephosd1027 smart_failure[2819002]: 
Oct 11 09:18:01 cloudcephosd1027 smart_failure[2819002]: Device info:
Oct 11 09:18:01 cloudcephosd1027 smart_failure[2819002]: HFS1T9G32FEH-BA10A, S/N:KSA8N4825I0408C5E, WWN:5-ace42e-025346fa5, FW:DD02, 1.92 TB
Oct 11 09:18:01 cloudcephosd1027 smart_failure[2819002]: 
Oct 11 09:18:01 cloudcephosd1027 smart_failure[2819002]: For details see host's SYSLOG.
Oct 11 09:18:01 cloudcephosd1027 smart_failure[2819002]: 
Oct 11 09:18:01 cloudcephosd1027 smart_failure[2819002]: You can also use the smartctl utility for further investigation.
Oct 11 09:18:01 cloudcephosd1027 smart_failure[2819002]: The original message about this issue was sent at Tue Sep 13 02:57:36 2022 UTC
Oct 11 09:18:01 cloudcephosd1027 smart_failure[2819002]: Another message will be sent in 24 hours if the problem persists.

This affects these disks:

aborrero@cloudcephosd1027:~ $ sudo journalctl | grep smart_failure | grep "Device: " | awk -F' ' '{print $7}' | sort | uniq
/dev/sdc
/dev/sdd
/dev/sde
/dev/sdf
/dev/sdg
/dev/sdh
/dev/sdi
/dev/sdj
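As a follow-up check on cloudcephosd1027, the disk state can be confirmed with smartctl (which the smartd message itself suggests) and the failing device mapped back to its OSD with ceph-volume. A sketch, using /dev/sdi from the log above:

# SMART health status and attribute/error counters for the flagged device
sudo smartctl -H /dev/sdi
sudo smartctl -A /dev/sdi

# map OSD ids to their underlying block devices on this host
sudo ceph-volume lvm list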
taavi claimed this task.

User-facing issues are fixed and the immediate issue is over; follow-up work is being tracked in subtasks.

Mentioned in SAL (#wikimedia-cloud) [2023-10-11T14:16:30Z] <dcaro> rebooting tools-sgeweblight-10-16 due to stuck NFS (T348634)

Mentioned in SAL (#wikimedia-cloud-feed) [2023-10-11T14:24:25Z] <wm-bot2> dcaro@urcuchillay START - Cookbook wmcs.toolforge.grid.reboot_workers for weblight nodes (T348634)

Mentioned in SAL (#wikimedia-cloud-feed) [2023-10-11T14:41:37Z] <wm-bot2> dcaro@urcuchillay END (FAIL) - Cookbook wmcs.toolforge.grid.reboot_workers (exit_code=99) for weblight nodes (T348634)

dcaro renamed this task from ceph slowdown 2023-10-11 to ceph slow ops 2023-10-11. Oct 12 2023, 2:43 PM
dcaro subscribed.

It's happening again, two OSDs affected:

root@cloudcephmon1001:~# ceph osd find 111
{
    "osd": 111,
    "addrs": {
        "addrvec": [
            {
                "type": "v2",
                "addr": "10.64.20.65:6828",
                "nonce": 31822
            },
            {
                "type": "v1",
                "addr": "10.64.20.65:6829",
                "nonce": 31822
            }
        ]
    },
    "osd_fsid": "fc1a8902-c160-4338-964a-079e2c32eec8",
    "host": "cloudcephosd1014",
    "crush_location": {
        "host": "cloudcephosd1014",
        "rack": "D5",
        "root": "default"
    }
}
root@cloudcephmon1001:~# ceph osd find 194
{
    "osd": 194,
    "addrs": {
        "addrvec": [
            {
                "type": "v2",
                "addr": "10.64.148.2:6812",
                "nonce": 2077
            },
            {
                "type": "v1",
                "addr": "10.64.148.2:6813",
                "nonce": 2077
            }
        ]
    },
    "osd_fsid": "3d130038-f2e2-4c60-9c30-c3a39668265c",
    "host": "cloudcephosd1025",
    "crush_location": {
        "host": "cloudcephosd1025",
        "rack": "E4",
        "root": "default"
    }
}
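If one of these OSDs needs to be pulled because of a failing disk, the usual sequence would be roughly the following (a sketch, not necessarily what was done for this task; osd 111 taken from the output above):

# rebalance data away from the OSD, wait for recovery, then stop the daemon on its host
sudo ceph osd out 111
sudo ceph -s
sudo systemctl stop ceph-osd@111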