
Investigate big spikes up in wmcs ceph dashboards
Open, Medium, Public

Description

While looking at ceph dashboards I noticed big and unrealistic spikes up in certain metrics. Looking at the raw data there are two distinct cases:

  1. For example, ceph_pool_rd (from https://grafana.wikimedia.org/goto/mqOovnrHg?orgId=1)

2025-09-01-133932_2456x960_scrot.png (2456×960 px, 107 KB)

The raw data dips momentarily (not to zero) and then recovers, throwing off rate() and irate(), which expect non-decreasing counters. Note that this behavior lines up with a node undrain, as pointed out by @dcaro: https://sal.toolforge.org/log/BIMm9pgB8tZ8Ohr0F9o5
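The spike mechanics can be illustrated outside PromQL. Below is a minimal Python sketch (sample values are made up) of how rate()/increase() handle a window of counter samples: any decrease between adjacent samples is assumed to be a counter reset to zero, so a momentary dip from a partial report gets the entire new value added on top.

```python
def prom_increase(samples):
    """Increase over a window computed the way Prometheus rate()/increase()
    does it: a decrease between adjacent samples is assumed to be a counter
    reset, so the whole new value is added on top of the running total."""
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        if cur >= prev:
            total += cur - prev
        else:
            total += cur  # assumed reset-to-zero, not a partial report
    return total

# Hypothetical reads counter: 5,000 reads over three scrapes.
print(prom_increase([1_000_000, 1_002_000, 1_005_000]))  # 5000.0
# Same counter with a momentary dip (e.g. mgr briefly missing an OSD's share):
print(prom_increase([1_000_000, 998_000, 1_005_000]))    # 1005000.0 -- spike
```

A dip of only 2,000 on a counter around a million thus shows up as an increase of over a million, which matches the magnitude of the spikes on the dashboard.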

  2. In other cases the metric for certain pools drops to zero for a few Prometheus scrapes and then comes back (from https://grafana.wikimedia.org/goto/6z3Cv7rHg?orgId=1)

2025-09-01-134757_1658x721_scrot.png (1658×721 px, 70 KB)

For case 1, I am assuming that ceph-mgr doesn't yet have the full sum of all counters from the OSDs and thus temporarily reports a lower count. This seems to be the same issue reported upstream at https://tracker.ceph.com/issues/53190
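If this is mgr under-reporting rather than real counter resets, the affected windows are at least detectable: PromQL's resets() counts exactly these adjacent-sample decreases, so a dashboard could flag (or mask) windows where something like `resets(ceph_pool_rd[5m]) > 0` even though no OSD actually restarted. A sketch of that logic:

```python
def prom_resets(samples):
    # Mirrors PromQL's resets(): counts adjacent-sample decreases, each of
    # which rate()/increase() would interpret as a counter reset.
    return sum(1 for prev, cur in zip(samples, samples[1:]) if cur < prev)

# The dip from a partial mgr report registers as a "reset"...
print(prom_resets([1_000_000, 998_000, 1_005_000]))    # 1
# ...while a healthy monotonic window registers none.
print(prom_resets([1_000_000, 1_002_000, 1_005_000]))  # 0
```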

For case 2, the mgr logs show a restart at about the same time as the "missing" data:

Aug 28 15:49:36 cloudcephmon1004 ceph-mgr[318826]: 2025-08-28T15:49:36.888+0000 7f70c97fa6c0  0 [prometheus INFO cherrypy.access.140124184029200] 2620:0:861:102:10:64:16:62 - - [28/Aug/2025:15:49:36] "GET /metrics HTTP/1.1" 200 2773309 "" "Prometheus/2.48.1"
Aug 28 15:50:22 cloudcephmon1004 ceph-mgr[3384996]: 2025-08-28T15:50:22.932+0000 7fdadaab7440  1 mgr[py] Loading python module 'prometheus'
Aug 28 15:50:23 cloudcephmon1004 ceph-mgr[3384996]: 2025-08-28T15:50:23.148+0000 7fdadaab7440 -1 mgr[py] Module prometheus has missing NOTIFY_TYPES member
Aug 28 15:50:23 cloudcephmon1004 ceph-mgr[3384996]: 2025-08-28T15:50:23.148+0000 7fdadaab7440 -1 mgr[py] Module prometheus has missing NOTIFY_TYPES member
Aug 28 15:50:26 cloudcephmon1004 ceph-mgr[3384996]: 2025-08-28T15:50:26.268+0000 7fdaa30bd6c0  0 [prometheus DEBUG root] setting log level based on debug_mgr: INFO (2/5)
Aug 28 15:50:26 cloudcephmon1004 ceph-mgr[3384996]: 2025-08-28T15:50:26.272+0000 7fdaa30bd6c0  1 mgr load Constructed class from module: prometheus
Aug 28 15:50:26 cloudcephmon1004 ceph-mgr[3384996]: 2025-08-28T15:50:26.420+0000 7fda80ff96c0  0 [prometheus INFO root] server_addr: :: server_port: 9283
Aug 28 15:50:26 cloudcephmon1004 ceph-mgr[3384996]: 2025-08-28T15:50:26.420+0000 7fda80ff96c0  0 [prometheus INFO root] Cache enabled
Aug 28 15:50:26 cloudcephmon1004 ceph-mgr[3384996]: 2025-08-28T15:50:26.432+0000 7fda777fe6c0  0 [prometheus INFO root] starting metric collection thread
Aug 28 15:50:26 cloudcephmon1004 ceph-mgr[3384996]: 2025-08-28T15:50:26.500+0000 7fda80ff96c0  0 [prometheus INFO root] Starting engine...
Aug 28 15:50:26 cloudcephmon1004 ceph-mgr[3384996]: 2025-08-28T15:50:26.500+0000 7fda80ff96c0  0 [prometheus INFO cherrypy.error] [28/Aug/2025:15:50:26] ENGINE Bus STARTING
Aug 28 15:50:26 cloudcephmon1004 ceph-mgr[3384996]: 2025-08-28T15:50:26.720+0000 7fda80ff96c0  0 [prometheus INFO cherrypy.error] [28/Aug/2025:15:50:26] ENGINE Serving on http://:::9283
Aug 28 15:50:26 cloudcephmon1004 ceph-mgr[3384996]: 2025-08-28T15:50:26.720+0000 7fda80ff96c0  0 [prometheus INFO cherrypy.error] [28/Aug/2025:15:50:26] ENGINE Bus STARTED
Aug 28 15:50:26 cloudcephmon1004 ceph-mgr[3384996]: 2025-08-28T15:50:26.720+0000 7fda80ff96c0  0 [prometheus INFO root] Engine started.
Aug 28 15:50:32 cloudcephmon1004 ceph-mgr[3384996]: 2620:0:861:101:10:64:0:82 - - [28/Aug/2025:15:50:32] "GET /metrics HTTP/1.1" 200 1298934 "" "Prometheus/2.48.1"
Aug 28 15:50:32 cloudcephmon1004 ceph-mgr[3384996]: 2025-08-28T15:50:32.463+0000 7fda2effd6c0  0 [prometheus INFO cherrypy.access.140576699149968] 2620:0:861:101:10:64:0:82 - - [28/Aug/2025:15:50:32] "GET /metrics HTTP/1.1" 200 1298934 "" "Prometheus/2.48.1"
Aug 28 15:50:36 cloudcephmon1004 ceph-mgr[3384996]: 2620:0:861:102:10:64:16:62 - - [28/Aug/2025:15:50:36] "GET /metrics HTTP/1.1" 200 1298934 "" "Prometheus/2.48.1"
Aug 28 15:50:36 cloudcephmon1004 ceph-mgr[3384996]: 2025-08-28T15:50:36.887+0000 7fda2dffb6c0  0 [prometheus INFO cherrypy.access.140576699149968] 2620:0:861:102:10:64:16:62 - - [28/Aug/2025:15:50:36] "GET /metrics HTTP/1.1" 200 1298934 "" "Prometheus/2.48.1"

Note that the content length returned to Prometheus is lower after the restart (1298934 bytes vs 2773309 before), I'm assuming because not all data has come in yet.
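For these restart gaps, where a pool's counter reads as zero for a couple of scrapes, the reset heuristic misfires the same way, and one possible dashboard-side workaround is to drop the zero samples before rating (in PromQL, perhaps a subquery along the lines of `rate((ceph_pool_rd > 0)[5m:])`; the exact expression is untested here). A sketch with hypothetical values:

```python
def counter_increase(samples):
    # Prometheus-style increase: a decrease between adjacent samples is
    # treated as a counter reset, so the full new value is added.
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        total += cur - prev if cur >= prev else cur
    return total

# Hypothetical per-pool counter: two scrapes read 0 while the restarted
# mgr has no data for the pool yet, then the real value comes back.
raw = [1_000_000, 0, 0, 1_004_000]
cleaned = [s for s in raw if s > 0]  # drop the empty scrapes first

print(counter_increase(raw))      # 1004000.0 -- bogus spike
print(counter_increase(cleaned))  # 4000.0 -- the real increase
```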

Event Timeline

Something worth noting from https://docs.ceph.com/en/latest/mgr/prometheus/ is that scrape_interval is set to 15s, whereas in production the effective scrape interval is 30s (two Prometheus hosts each scraping every 60s).


I vaguely remember that at some point we were having issues with Prometheus overwhelming the mgr process when scraping the stats, and we tweaked the scrape interval; that might not be an issue now (or might be the cause of the restart :/).

re: case 2, the mgr was restarted, apparently manually; IIRC it was during the Ceph outage:

Aug 28 15:50:20 cloudcephmon1004 ceph-mgr[318826]: 2025-08-28T15:50:20.592+0000 7f71217fa6c0  0 [progress INFO root] Updated progress to 0.6666666666666666 Rebalancing after osd.175 marked in (15s)
Aug 28 15:50:21 cloudcephmon1004 systemd[1]: Stopping ceph-mgr@cloudcephmon1004.service - Ceph cluster manager daemon...
Aug 28 15:50:21 cloudcephmon1004 systemd[1]: ceph-mgr@cloudcephmon1004.service: Deactivated successfully.
Aug 28 15:50:21 cloudcephmon1004 systemd[1]: Stopped ceph-mgr@cloudcephmon1004.service - Ceph cluster manager daemon.
Aug 28 15:50:21 cloudcephmon1004 systemd[1]: ceph-mgr@cloudcephmon1004.service: Consumed 10h 1min 12.100s CPU time.
Aug 28 15:50:21 cloudcephmon1004 systemd[1]: Started ceph-mgr@cloudcephmon1004.service - Ceph cluster manager daemon.
Aug 28 15:50:21 cloudcephmon1004 ceph-mgr[3384996]: 2025-08-28T15:50:21.752+0000 7fdadaab7440  0 set uid:gid to 64045:64045 (ceph:ceph)
Aug 28 15:50:21 cloudcephmon1004 ceph-mgr[3384996]: 2025-08-28T15:50:21.752+0000 7fdadaab7440  0 ceph version 16.2.15 (618f440892089921c3e944a991122ddc44e60516) pacific (stable), process ceph-mgr, pid 3384996

That's actually a relief :)

fgiunchedi triaged this task as Medium priority.Sep 1 2025, 2:31 PM