
[cloudceph] Slow operations - tracking task
Open, High, Public

Description

We sometimes get slow operations, and those events correlate with VMs getting "stuck" while they happen.

This task is for tracking such events for long-term debugging, and for collecting any ideas or improvements made over the course of the investigation.
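
When one of these events happens, capturing the cluster state before it clears makes the later debugging much easier. A minimal capture sketch (assumes it runs on a cloudcephmon host with an admin keyring; output paths and filenames are illustrative):
---
# Snapshot the cluster state into a timestamped directory.
ts=$(date +%Y%m%dT%H%M%S)
out=/var/tmp/slow-ops-$ts
mkdir -p "$out"
ceph health detail > "$out/health-detail.txt"
ceph status        > "$out/status.txt"
ceph osd perf      > "$out/osd-perf.txt"
# The per-OSD op dump has to be taken on the host that owns the flagged OSD:
#   ceph daemon osd.<id> dump_ops_in_flight > ops-in-flight.json
---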

2024-02-07

Slow operations brought NFS down again

TO BE FILLED UP

2023-04-04

From IRC:

<taavi> Taavi Väänänen tools-puppetdb-1 seems down
12:29:23 ceph is in HEALTH_WARN
12:29:25 dcaro:
12:29:40 <dcaro> David Caro oh, let me check
12:30:44 getting slow ops
12:30:47 https://www.irccloud.com/pastebin/J3vS2U5e/
---
root@cloudcephmon1001:~# ceph health detail
HEALTH_WARN 8 slow ops, oldest one blocked for 551 sec, osd.218 has slow ops
[WRN] SLOW_OPS: 8 slow ops, oldest one blocked for 551 sec, osd.218 has slow ops
---

12:32:36 the host is heavily writing, though it does not seem excessive
12:37:57 it went away
12:38:15 the slow ops were all "waiting for readable"
hm
12:39:59 this is not the first time we see slow ops randomly appearing, and they are taking down some VMs, what can we do to prevent this happening in the future?
12:40:45 also the alert needs adjusting as it says 'things should still be working as expected'
12:42:16 <dcaro> David Caro and they should according to ceph, writes just take longer, but they happen, maybe that's not acceptable at the libvirt level though
12:43:28 this is the list of clients with stuck operations at that point:
12:43:39 https://www.irccloud.com/pastebin/L6vusXGQ/
---
root@cloudcephosd1028:~# ... | jq -r .ops[].type_data.client_info.client_addr | cut -d: -f1 | sort | uniq -c | sort -n
      1 10.64.148.7
      1 10.64.20.24
      1 10.64.20.50
      1 10.64.20.6
      1 10.64.20.9
      2 10.64.149.7
      2 10.64.149.9
      2 10.64.20.51
      2 10.64.20.7
      2 10.64.20.80
      3 10.64.148.8
      3 10.64.20.73
      4 10.64.20.12
      4 10.64.20.78
      4 10.64.20.8
      5 10.64.20.53
      5 10.64.20.75
      7 10.64.20.81
     12 10.64.148.9
    106 10.64.20.54
---
12:43:49 cloudvirt1030 being the one with more there
12:47:45 answering your question, I think that to prevent this we have to find the cause/causes/bottleneck there
13:03:15 taavi: looking at ceph status changes there's not too many warning status flaps, though some
13:03:17 https://usercontent.irccloud-cdn.com/file/MxnCGwAM/image.png

{F36941730}

13:03:53 some are also when I was tweaking the rack HA, but some are not
13:04:53 can you create a task to keep track of these issues?
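
The command at the start of the pastebin above is cut off; a plausible reconstruction, assuming the JSON came from the admin socket of the OSD reporting the slow ops (osd.218 in this incident), would be:
---
# Count stuck client ops per client IP (reconstruction of the pasted
# pipeline, not the exact command that was run).
ceph daemon osd.218 dump_ops_in_flight \
  | jq -r '.ops[].type_data.client_info.client_addr' \
  | cut -d: -f1 | sort | uniq -c | sort -n

# The same dump also shows where each op is stuck, e.g. the
# "waiting for readable" flag point mentioned above:
ceph daemon osd.218 dump_ops_in_flight \
  | jq -r '.ops[].type_data.flag_point' | sort | uniq -c
---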

Event Timeline

dcaro triaged this task as High priority. Apr 6 2023, 4:52 PM
dcaro created this task.
dcaro added a subscriber: taavi.
dcaro updated the task description.

Change 999068 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] ceph/rbd: slow down snap trimming on ssds, corrected

https://gerrit.wikimedia.org/r/999068

Change 999068 merged by Andrew Bogott:

[operations/puppet@production] ceph/rbd: slow down snap trimming on ssds, corrected

https://gerrit.wikimedia.org/r/999068
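
The patch above throttles snapshot trimming on the SSD OSDs through puppet-managed configuration; the exact value chosen is in the Gerrit change. For illustration only, the runtime equivalent of that knob can be inspected and adjusted with ceph config (the 2.0 below is an example value, not the one from the patch):
---
# Check and, if needed, raise the sleep inserted between snap trim
# operations on SSD-backed OSDs. Example value only.
ceph config get osd osd_snap_trim_sleep_ssd
ceph config set osd osd_snap_trim_sleep_ssd 2.0
---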