
Stop using public (cached) endpoints for checks on graphite
Closed, Declined (Public)

Description

monitoring::graphite_* checks are set by default to query via https://grafana.wikimedia.org, which means their requests go through the edge caches. The backend application doesn't always send the correct caching headers: some responses to /render include no Cache-Control header at all. For example, the following URL, used by one such check, returns no Cache-Control in its response:

https://graphite1004.eqiad.wmnet/render?format=json&from=-25min&until=-10min&target=movingAverage%28eventlogging.overall.inserted.rate%2C+%2210min%22%29

As a result, the varnish frontend in eqiad cached the response for one specific check tonight while it was alerting, and the check didn't recover even though the metric in graphite did.

The obvious problem is that in this case we're just monitoring something cached in varnish, so we effectively perform one check per day or so.
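To make the symptom easy to reproduce, here is a minimal sketch of a header inspection for a check URL. The helper names has_cache_control and fetch_headers are hypothetical and not part of check_graphite; the sketch only tests for the presence of the header, not whether its value is sensible.

```python
from urllib.request import urlopen


def has_cache_control(headers):
    """Return True if the header mapping contains Cache-Control (any case)."""
    return any(name.lower() == "cache-control" for name in headers)


def fetch_headers(url, timeout=10):
    """Fetch url and return its response headers as a plain dict.

    Hypothetical helper: assumes the host is reachable and its TLS
    certificate is trusted from where the script runs.
    """
    with urlopen(url, timeout=timeout) as response:
        return dict(response.headers.items())
```

Running has_cache_control(fetch_headers(url)) against the /render URL above would be expected to return False for the response described in this task.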

The solution to this problem is to attack the issue from multiple points of view:

Event Timeline

Fun finding: if we drop either the until=Xmin or the from=Xmin parameter from the check_graphite request URL, we get back Cache-Control: max-age=120.

If we define both, I guess the application logic is that the window is a finite, fixed amount of time, so the response never needs refreshing. That would be true if we used dates, but since we're using relative times the window moves with every request, so the logic is clearly broken and I think it's a good idea to override the behaviour.
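The reasoning above can be sketched as a small policy function. This is not graphite-web's code: the function names, the "no-cache" choice for relative windows, and the one-day max-age for fully absolute windows are all assumptions made for illustration. Only the max-age=120 value for requests missing from or until comes from the observation in this task.

```python
def is_relative(timespec):
    """Treat specs such as '-25min' or 'now' as relative to request time.

    Simplification: other relative forms (e.g. 'yesterday') are not handled.
    """
    return timespec.startswith("-") or timespec == "now"


def cache_control_for(params):
    """Pick a Cache-Control value for a /render request (hypothetical policy)."""
    start = params.get("from")
    end = params.get("until")
    if start is None or end is None:
        # Behaviour observed in this task when one bound is omitted.
        return "max-age=120"
    if is_relative(start) or is_relative(end):
        # The window moves with every request, so it must not be cached.
        return "no-cache"
    # Both bounds are absolute: the data window is fixed (assumed value).
    return "max-age=86400"
```

With this policy, the check's own parameters (from=-25min, until=-10min) map to no-cache, which is the override being argued for here.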

For Prometheus, there is just an LVS service IP that goes to a local Apache, which at a quick glance does not seem to have any caching modules enabled.
Looking at a curl, Prometheus does not seem to return any cache-control header.

For Grafana, check_grafana hits URLs like https://grafana.wikimedia.org/api/dashboards/uid/000000201 and https://grafana.wikimedia.org/api/alerts, both of which set cache-control: no-cache (and I also see x-cache-status: pass in the response).

So I think Prometheus and Grafana are both working fine in this regard.
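The manual header inspection described above could be automated along these lines. This is a sketch under assumptions: the function names are hypothetical, and treating no-cache, no-store, or x-cache-status: pass as "not served from cache" is a heuristic, since the actual behaviour depends on the edge cache configuration.

```python
def cache_directives(headers):
    """Return the lower-cased Cache-Control directive names in a header mapping."""
    for name, value in headers.items():
        if name.lower() == "cache-control":
            return {
                part.split("=", 1)[0].strip().lower()
                for part in value.split(",")
                if part.strip()
            }
    return set()


def looks_uncached(headers):
    """Heuristic: True if the response signals it should not be cached."""
    directives = cache_directives(headers)
    status = next(
        (value for name, value in headers.items() if name.lower() == "x-cache-status"),
        "",
    )
    return (
        "no-cache" in directives
        or "no-store" in directives
        or status.strip().lower() == "pass"
    )
```

Note that a response with no Cache-Control header at all, like the Prometheus one mentioned above, is reported as not explicitly uncached by this heuristic, even though nothing in front of it may be caching.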

I guess we could add Apache directives to add cache-control: no-cache to Prometheus responses if we wanted to be extra paranoid.

Joe triaged this task as Medium priority. Apr 4 2019, 1:15 PM

I did add the proper caching headers to graphite, so at least now we won't cache checks at the edge anymore. I still think we need to avoid going through the caches, as any unavailability of the edge layer could cost us the ability to check these metrics.

But this is lower-priority work and I won't take it on right now.

fgiunchedi subscribed.

Boldly declining this since graphite is in life support mode and the lowest-hanging fruit has been addressed (thanks!)