
[ceph] Metrics stopped responding during the drain
Closed, ResolvedPublic

Description

This is what the Prometheus node gets when curling the metrics endpoint:

root@prometheus1005:~# curl http://cloudcephmon1001:9283/metrics
<!DOCTYPE html PUBLIC
"-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html>
<head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8"></meta>
    <title>503 Service Unavailable</title>
    <style type="text/css">
    #powered_by {
        margin-top: 20px;
        border-top: 2px solid black;
        font-style: italic;
    }

    #traceback {
        color: red;
    }
    </style>
</head>
    <body>
        <h2>503 Service Unavailable</h2>
        <p>Gathering data took 164.92 seconds, metrics are stale for 149.92 seconds, returning "service unavailable".</p>
        <pre id="traceback">Traceback (most recent call last):
  File "/lib/python3/dist-packages/cherrypy/_cprequest.py", line 670, in respond
    response.body = self.handler()
  File "/lib/python3/dist-packages/cherrypy/lib/encoding.py", line 220, in __call__
    self.body = self.oldhandler(*args, **kwargs)
  File "/lib/python3/dist-packages/cherrypy/_cpdispatch.py", line 60, in __call__
    return self.callable(*self.args, **self.kwargs)
  File "/usr/share/ceph/mgr/prometheus/module.py", line 1206, in metrics
    return self._metrics(_global_instance)
  File "/usr/share/ceph/mgr/prometheus/module.py", line 1245, in _metrics
    raise cherrypy.HTTPError(503, msg)
cherrypy._cperror.HTTPError: (503, 'Gathering data took 164.92 seconds, metrics are stale for 149.92 seconds, returning "service unavailable".')
</pre>
    <div id="powered_by">
      <span>
        Powered by <a href="http://www.cherrypy.org">CherryPy 8.9.1</a>
      </span>
    </div>
    </body>
</html>

It might just be the extra load from the drain, but we should look into it.

Event Timeline

This explains the issue:
https://docs.ceph.com/en/latest/mgr/prometheus/

It turns out that Ceph keeps a cache of the metrics that it refreshes whenever it can; if that cache is older than the scrape interval, you can choose whether the module should return an error (the default) or serve the stale cache.

For us, gathering the data seems to take longer than the scrape_interval, but the cache is still being refreshed from time to time, so I think it's safe to return the cache instead:
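The decision described above can be sketched roughly as follows. This is an illustrative Python sketch, not the actual mgr/prometheus module code; the function name and return strings are made up, and it assumes the default scrape_interval of 15 seconds (which matches the traceback: 164.92s elapsed minus 149.92s stale).

```python
def serve_metrics(cache_age_s: float, scrape_interval_s: float,
                  strategy: str = "fail") -> str:
    """Decide what a scrape returns, given the age of the metrics cache."""
    if cache_age_s <= scrape_interval_s:
        return "fresh cache"  # cache is recent enough, serve it normally
    if strategy == "return":
        return "stale cache"  # serve old data rather than failing the scrape
    # Default "fail" strategy: reply with an error, like the 503 above.
    return "503 Service Unavailable"

# With the default strategy, a slow gather produces 503s...
print(serve_metrics(cache_age_s=149.92, scrape_interval_s=15))
# ...while the "return" strategy keeps serving the (stale) cache.
print(serve_metrics(cache_age_s=149.92, scrape_interval_s=15, strategy="return"))
```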

ceph config set mgr mgr/prometheus/stale_cache_strategy return
dcaro claimed this task.

I've also set the scrape_interval to the same value we have on the Prometheus side (300), and restarted the mgr (it did not seem to pick up the changes otherwise); now it's working again:

root@cloudcephmon1001:~# ceph config set mgr mgr/prometheus/scrape_interval 300

root@cloudcephmon1001:~# ceph config get mgr 
WHO     MASK  LEVEL     OPTION                               VALUE       RO
mgr           advanced  mgr/balancer/active                  true          
mgr           advanced  mgr/balancer/mode                    upmap         
mgr           advanced  mgr/balancer/upmap_max_deviation     2             
mgr           advanced  mgr/prometheus/scrape_interval       300         * 
mgr           advanced  mgr/prometheus/server_port           9283        * 
mgr           advanced  mgr/prometheus/stale_cache_strategy  return      * 
global        advanced  mon_target_pg_per_osd                100           
global        basic     osd_memory_target                    6442450944    
global        advanced  osd_pool_default_pg_autoscale_mode   on
root@prometheus1005:~# curl -v http://cloudcephmon1001:9283/metrics >/dev/null
* Uses proxy env variable no_proxy == 'wikipedia.org,wikimedia.org,wikibooks.org,wikinews.org,wikiquote.org,wikisource.org,wikiversity.org,wikivoyage.org,wikidata.org,wikiworkshop.org,wikifunctions.org,wiktionary.org,mediawiki.org,wmfusercontent.org,w.wiki,wikimediacloud.org,wmnet,127.0.0.1,::1'
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0*   Trying 2620:0:861:118:10:64:20:67:9283...
* Connected to cloudcephmon1001 (2620:0:861:118:10:64:20:67) port 9283 (#0)
> GET /metrics HTTP/1.1
> Host: cloudcephmon1001:9283
> User-Agent: curl/7.74.0
> Accept: */*
> 
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< Content-Type: text/plain;charset=utf-8
< Server: Ceph-Prometheus
< Date: Thu, 15 Aug 2024 07:06:01 GMT
< Content-Length: 310608
< 
{ [7140 bytes data]
100  303k  100  303k    0     0  37.0M      0 --:--:-- --:--:-- --:--:-- 37.0M
* Connection #0 to host cloudcephmon1001 left intact

Actually, I changed the scrape_interval to 60s, as that's what we have configured on the Prometheus side:

root@prometheus1005:~# cat /srv/prometheus/cloud/prometheus.yml | grep -C3 interval
---
global:
  scrape_interval: 60s
  external_labels:
    site: eqiad
    replica: a

Change #1062962 had a related patch set uploaded (by David Caro; author: David Caro):

[operations/alerts@master] ceph: add alert when we get no data from the cluster

https://gerrit.wikimedia.org/r/1062962

Change #1062962 merged by jenkins-bot:

[operations/alerts@master] ceph: add alert when we get no data from the cluster

https://gerrit.wikimedia.org/r/1062962

Change #1063017 had a related patch set uploaded (by David Caro; author: David Caro):

[operations/alerts@master] ceph: use the right metric for unknown alert

https://gerrit.wikimedia.org/r/1063017

Change #1063017 merged by jenkins-bot:

[operations/alerts@master] ceph: use the right metric for unknown alert

https://gerrit.wikimedia.org/r/1063017