While looking at the ceph dashboards I noticed large, unrealistic spikes in certain metrics. Looking at the raw data, there are two distinct cases:
- For example ceph_pool_rd (from https://grafana.wikimedia.org/goto/mqOovnrHg?orgId=1): the raw data dips momentarily (not to zero) and then climbs back up, throwing off the rate and irate logic, which expects non-decreasing counters (see the sketch after this list). Note that this behavior lines up with a node undrain, as pointed out by @dcaro: https://sal.toolforge.org/log/BIMm9pgB8tZ8Ohr0F9o5
- In other cases the metric for certain pools drops to zero for a few Prometheus scrapes and then comes back up (from https://grafana.wikimedia.org/goto/6z3Cv7rHg?orgId=1)
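To make the rate/irate point concrete, here is a minimal Python sketch of the counter-reset heuristic Prometheus applies (made-up sample values, not the actual Prometheus implementation): whenever a sample is lower than the previous one, the counter is assumed to have been reset to zero and the entire new value is counted as increase, which turns a small dip into a huge apparent rate.

```python
# Minimal sketch of Prometheus's counter-reset handling in irate(),
# using made-up numbers; not the actual Prometheus implementation.

def irate(samples):
    """samples: list of (timestamp_in_seconds, counter_value) tuples."""
    (t1, v1), (t2, v2) = samples[-2], samples[-1]
    if v2 >= v1:
        increase = v2 - v1   # normal, monotonically increasing counter
    else:
        increase = v2        # assumed reset: counter restarted from zero
    return increase / (t2 - t1)

# steady counter: ~10 reads/s, as expected
print(irate([(0, 5_000_000), (60, 5_000_600)]))  # 10.0

# case 1: the aggregate briefly dips (not to zero); the drop is taken as
# a reset, so the whole 4_999_000 counts as one minute's worth of increase
print(irate([(0, 5_000_000), (60, 4_999_000)]))  # ~83316.7/s, a bogus spike

# case 2: the series comes back after reporting zero; the full counter
# value lands in a single scrape interval, again a bogus spike
print(irate([(0, 0), (60, 5_000_600)]))          # ~83343.3/s
```

rate() applies the same reset detection over every adjacent sample pair in its range, so both patterns show up as spikes on the dashboards.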
For case 1, I am assuming that ceph-mgr doesn't yet have the full sum of the counters from all OSDs and thus temporarily reports a lower count. This looks like the same issue reported upstream at https://tracker.ceph.com/issues/53190
For case 2, the mgr logs show a restart at about the same time as the "missing" data:
```
Aug 28 15:49:36 cloudcephmon1004 ceph-mgr[318826]: 2025-08-28T15:49:36.888+0000 7f70c97fa6c0 0 [prometheus INFO cherrypy.access.140124184029200] 2620:0:861:102:10:64:16:62 - - [28/Aug/2025:15:49:36] "GET /metrics HTTP/1.1" 200 2773309 "" "Prometheus/2.48.1"
Aug 28 15:50:22 cloudcephmon1004 ceph-mgr[3384996]: 2025-08-28T15:50:22.932+0000 7fdadaab7440 1 mgr[py] Loading python module 'prometheus'
Aug 28 15:50:23 cloudcephmon1004 ceph-mgr[3384996]: 2025-08-28T15:50:23.148+0000 7fdadaab7440 -1 mgr[py] Module prometheus has missing NOTIFY_TYPES member
Aug 28 15:50:23 cloudcephmon1004 ceph-mgr[3384996]: 2025-08-28T15:50:23.148+0000 7fdadaab7440 -1 mgr[py] Module prometheus has missing NOTIFY_TYPES member
Aug 28 15:50:26 cloudcephmon1004 ceph-mgr[3384996]: 2025-08-28T15:50:26.268+0000 7fdaa30bd6c0 0 [prometheus DEBUG root] setting log level based on debug_mgr: INFO (2/5)
Aug 28 15:50:26 cloudcephmon1004 ceph-mgr[3384996]: 2025-08-28T15:50:26.272+0000 7fdaa30bd6c0 1 mgr load Constructed class from module: prometheus
Aug 28 15:50:26 cloudcephmon1004 ceph-mgr[3384996]: 2025-08-28T15:50:26.420+0000 7fda80ff96c0 0 [prometheus INFO root] server_addr: :: server_port: 9283
Aug 28 15:50:26 cloudcephmon1004 ceph-mgr[3384996]: 2025-08-28T15:50:26.420+0000 7fda80ff96c0 0 [prometheus INFO root] Cache enabled
Aug 28 15:50:26 cloudcephmon1004 ceph-mgr[3384996]: 2025-08-28T15:50:26.432+0000 7fda777fe6c0 0 [prometheus INFO root] starting metric collection thread
Aug 28 15:50:26 cloudcephmon1004 ceph-mgr[3384996]: 2025-08-28T15:50:26.500+0000 7fda80ff96c0 0 [prometheus INFO root] Starting engine...
Aug 28 15:50:26 cloudcephmon1004 ceph-mgr[3384996]: 2025-08-28T15:50:26.500+0000 7fda80ff96c0 0 [prometheus INFO cherrypy.error] [28/Aug/2025:15:50:26] ENGINE Bus STARTING
Aug 28 15:50:26 cloudcephmon1004 ceph-mgr[3384996]: 2025-08-28T15:50:26.720+0000 7fda80ff96c0 0 [prometheus INFO cherrypy.error] [28/Aug/2025:15:50:26] ENGINE Serving on http://:::9283
Aug 28 15:50:26 cloudcephmon1004 ceph-mgr[3384996]: 2025-08-28T15:50:26.720+0000 7fda80ff96c0 0 [prometheus INFO cherrypy.error] [28/Aug/2025:15:50:26] ENGINE Bus STARTED
Aug 28 15:50:26 cloudcephmon1004 ceph-mgr[3384996]: 2025-08-28T15:50:26.720+0000 7fda80ff96c0 0 [prometheus INFO root] Engine started.
Aug 28 15:50:32 cloudcephmon1004 ceph-mgr[3384996]: 2620:0:861:101:10:64:0:82 - - [28/Aug/2025:15:50:32] "GET /metrics HTTP/1.1" 200 1298934 "" "Prometheus/2.48.1"
Aug 28 15:50:32 cloudcephmon1004 ceph-mgr[3384996]: 2025-08-28T15:50:32.463+0000 7fda2effd6c0 0 [prometheus INFO cherrypy.access.140576699149968] 2620:0:861:101:10:64:0:82 - - [28/Aug/2025:15:50:32] "GET /metrics HTTP/1.1" 200 1298934 "" "Prometheus/2.48.1"
Aug 28 15:50:36 cloudcephmon1004 ceph-mgr[3384996]: 2620:0:861:102:10:64:16:62 - - [28/Aug/2025:15:50:36] "GET /metrics HTTP/1.1" 200 1298934 "" "Prometheus/2.48.1"
Aug 28 15:50:36 cloudcephmon1004 ceph-mgr[3384996]: 2025-08-28T15:50:36.887+0000 7fda2dffb6c0 0 [prometheus INFO cherrypy.access.140576699149968] 2620:0:861:102:10:64:16:62 - - [28/Aug/2025:15:50:36] "GET /metrics HTTP/1.1" 200 1298934 "" "Prometheus/2.48.1"
```
Note that the content length returned to Prometheus is lower after the restart (1298934 bytes vs 2773309 before), I'm assuming because not all the data has come in yet.
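As a quick sanity check of that assumption, one could compare the exporter output right after a mgr restart against a scrape a minute or two later; a hypothetical snippet (host and port taken from the logs above, not part of our tooling):

```python
# Fetch the mgr's /metrics endpoint and count bytes and samples, to
# compare the output right after a restart with a later scrape.
import urllib.request

URL = "http://cloudcephmon1004:9283/metrics"  # host/port from the logs above

with urllib.request.urlopen(URL, timeout=10) as resp:
    body = resp.read()

# in the Prometheus exposition format, every non-empty line that is not
# a "#" comment (HELP/TYPE) is one sample
samples = [l for l in body.decode().splitlines()
           if l and not l.startswith("#")]
print(f"{len(body)} bytes, {len(samples)} samples")
```

A noticeably lower sample count right after the restart would confirm that the mgr is serving an incomplete metric set rather than, say, truncating the payload.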

