Status | Subtype | Assigned | Task
---|---|---|---
Open | | dcaro | T334240 [cloudceph] Slow operations - tracking task
Resolved | | taavi | T348634 ceph slow ops 2023-10-11
In Progress | | dcaro | T348643 cloudcephosd1021-1034: hard drive sector errors increasing
Resolved | | taavi | T349425 CephSlowOps Ceph cluster in has slow ops, which might be blocking some writes
Resolved | | taavi | T352570 CephSlowOps Ceph cluster in eqiad has slow ops, which might be blocking some writes
In Progress | | dcaro | T348716 [ceph] export number of bad sectors per-disk
Open | | dcaro | T349694 [ceph] Enable disk failure prediction
Open | | None | T348633 [api-gateway] add alert for uptime
Event Timeline
Mentioned in SAL (#wikimedia-cloud) [2023-10-11T12:01:48Z] <taavi> reboot tools-sgecron-2 T348634
Mentioned in SAL (#wikimedia-cloud) [2023-10-11T12:04:04Z] <taavi> reboot k8s workers 72, 75, 82 T348634
I found some log entries:
aborrero@cloudcephmon1001:~ $ sudo journalctl | grep "slow request"
[..]
Oct 11 11:39:15 cloudcephmon1001 ceph-mon[818]: osd.213 osd.213 70096 : slow request osd_op(client.708007507.0:23277951 3.43c 3:3c28fde9:::rbd_data.6612a153c4b8d9.0000000000005b0e:head [write 3313664~4096 in=4096b] snapc 3d2=[] ondisk+write+known_if_redirected e39372647) initiated 2023-10-11T11:25:06.023363+0000 currently delayed
Oct 11 11:39:15 cloudcephmon1001 ceph-mon[818]: osd.213 osd.213 70097 : slow request osd_op(client.620994630.0:11346275 3.43c 3:3c21029d:::rbd_data.0e21416b8b4567.0000000000000c01:head [write 1339904~4096 in=4096b] snapc 13f9eb=[13f9eb,fd531] ondisk+write+known_if_redirected e39372957) initiated 2023-10-11T11:31:46.138588+0000 currently delayed
This is for 2 OSDs:
aborrero@cloudcephmon1001:~ $ sudo journalctl | grep "slow request" | awk -F' ' '{print $6}' | sort | uniq
osd.106
osd.213
aborrero@cloudcephmon1001:~ $ sudo ceph osd status | grep ^213
213  cloudcephosd1027  1193G  595G  10  144k   41  2147k  exists,up
aborrero@cloudcephmon1001:~ $ sudo ceph osd status | grep ^106
106  cloudcephosd1014  1255G  532G  64  2179k  39  2283k  exists,up
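For reference, the slow ops can also be inspected directly through the OSD admin socket on the affected host. This is only a sketch (assuming root access on cloudcephosd1027 and a recent enough Ceph release for the historic slow ops dump), not output from this incident:

# on the OSD host, list the operations currently in flight on the affected OSD
sudo ceph daemon osd.213 dump_ops_in_flight
# recent operations that exceeded the slow-op threshold (newer Ceph releases)
sudo ceph daemon osd.213 dump_historic_slow_ops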
Mentioned in SAL (#wikimedia-cloud) [2023-10-11T12:11:54Z] <taavi> reboot k8s workers 48, 60, 65, 68, 70, 76 T348634
Found potential SMART disk errors on cloudcephosd1027:
aborrero@cloudcephosd1027:~ $ sudo journalctl | grep smart_failure | tail -20
Oct 11 09:18:01 cloudcephosd1027 smart_failure[2819002]: This message was generated by the smartd daemon running on:
Oct 11 09:18:01 cloudcephosd1027 smart_failure[2819002]:
Oct 11 09:18:01 cloudcephosd1027 smart_failure[2819002]: host name: cloudcephosd1027
Oct 11 09:18:01 cloudcephosd1027 smart_failure[2819002]: DNS domain: eqiad.wmnet
Oct 11 09:18:01 cloudcephosd1027 smart_failure[2819002]:
Oct 11 09:18:01 cloudcephosd1027 smart_failure[2819002]: The following warning/error was logged by the smartd daemon:
Oct 11 09:18:01 cloudcephosd1027 smart_failure[2819002]:
Oct 11 09:18:01 cloudcephosd1027 smart_failure[2819002]: Device: /dev/sdi [SAT], 1080 Offline uncorrectable sectors
Oct 11 09:18:01 cloudcephosd1027 smart_failure[2819002]:
Oct 11 09:18:01 cloudcephosd1027 smart_failure[2819002]: Device info:
Oct 11 09:18:01 cloudcephosd1027 smart_failure[2819002]: HFS1T9G32FEH-BA10A, S/N:KSA8N4825I0408C5E, WWN:5-ace42e-025346fa5, FW:DD02, 1.92 TB
Oct 11 09:18:01 cloudcephosd1027 smart_failure[2819002]:
Oct 11 09:18:01 cloudcephosd1027 smart_failure[2819002]: For details see host's SYSLOG.
Oct 11 09:18:01 cloudcephosd1027 smart_failure[2819002]:
Oct 11 09:18:01 cloudcephosd1027 smart_failure[2819002]: You can also use the smartctl utility for further investigation.
Oct 11 09:18:01 cloudcephosd1027 smart_failure[2819002]: The original message about this issue was sent at Tue Sep 13 02:57:36 2022 UTC
Oct 11 09:18:01 cloudcephosd1027 smart_failure[2819002]: Another message will be sent in 24 hours if the problem persists.
This affects the following disks:
aborrero@cloudcephosd1027:~ $ sudo journalctl | grep smart_failure | grep "Device: " | awk -F' ' '{print $7}' | sort | uniq
/dev/sdc
/dev/sdd
/dev/sde
/dev/sdf
/dev/sdg
/dev/sdh
/dev/sdi
/dev/sdj
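As the smartd message itself suggests, smartctl can be used to dig further into any of these disks. A sketch (not run as part of this incident), using /dev/sdi as the example device:

# full SMART attribute table and error log for one of the flagged disks
sudo smartctl -a /dev/sdi
# overall health self-assessment only
sudo smartctl -H /dev/sdi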
User-facing issues are fixed and the immediate incident is over; follow-up work is being tracked in subtasks.
Mentioned in SAL (#wikimedia-cloud) [2023-10-11T14:16:30Z] <dcaro> rebooting tools-sgeweblight-10-16 due to stuck NFS (T348634)
Mentioned in SAL (#wikimedia-cloud-feed) [2023-10-11T14:24:25Z] <wm-bot2> dcaro@urcuchillay START - Cookbook wmcs.toolforge.grid.reboot_workers for weblight nodes (T348634)
Mentioned in SAL (#wikimedia-cloud-feed) [2023-10-11T14:41:37Z] <wm-bot2> dcaro@urcuchillay END (FAIL) - Cookbook wmcs.toolforge.grid.reboot_workers (exit_code=99) for weblight nodes (T348634)
It's happening again, two OSDs affected:
root@cloudcephmon1001:~# ceph osd find 111
{
    "osd": 111,
    "addrs": {
        "addrvec": [
            {
                "type": "v2",
                "addr": "10.64.20.65:6828",
                "nonce": 31822
            },
            {
                "type": "v1",
                "addr": "10.64.20.65:6829",
                "nonce": 31822
            }
        ]
    },
    "osd_fsid": "fc1a8902-c160-4338-964a-079e2c32eec8",
    "host": "cloudcephosd1014",
    "crush_location": {
        "host": "cloudcephosd1014",
        "rack": "D5",
        "root": "default"
    }
}
root@cloudcephmon1001:~# ceph osd find 194
{
    "osd": 194,
    "addrs": {
        "addrvec": [
            {
                "type": "v2",
                "addr": "10.64.148.2:6812",
                "nonce": 2077
            },
            {
                "type": "v1",
                "addr": "10.64.148.2:6813",
                "nonce": 2077
            }
        ]
    },
    "osd_fsid": "3d130038-f2e2-4c60-9c30-c3a39668265c",
    "host": "cloudcephosd1025",
    "crush_location": {
        "host": "cloudcephosd1025",
        "rack": "E4",
        "root": "default"
    }
}
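For completeness, a quick way to confirm which OSDs are currently flagged for slow ops from a mon host (sketch only, assuming the SLOW_OPS health warning is active):

# health detail lists the OSDs with slow ops and how long the oldest one has been blocked
sudo ceph health detail | grep -i 'slow ops'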