Maniphest T219902

Stop using public (cached) endpoints for checks on graphite
Closed, Declined (Public)

Assigned To: None

Authored By: Joe, Apr 2 2019, 5:19 PM

Description

The monitoring::graphite_* checks query https://grafana.wikimedia.org by default, which means their requests go through the edge caches. The backend application does not always send the correct caching headers: responses to /render sometimes lack a Cache-Control header. For example, the URL of one such check,

https://graphite1004.eqiad.wmnet/render?format=json&from=-25min&until=-10min&target=movingAverage%28eventlogging.overall.inserted.rate%2C+%2210min%22%29

returns no Cache-Control header in its response.

As a result, the Varnish frontend in eqiad cached the response for one specific check tonight while it was alerting, and the check did not recover even though the metric in graphite did.

The obvious problem is that in this case we are only monitoring a response cached in Varnish, so in effect we perform one real check per day or so.
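
To make the failure mode concrete, here is a minimal sketch of how one could classify a response's headers. It is a simplification under stated assumptions: the function names are hypothetical, and a real Varnish setup decides TTLs through its own VCL and defaults, which this does not model. It only captures the point that a response with no Cache-Control header leaves a shared cache free to apply its own lifetime.

```python
def cache_control_directives(headers):
    """Return the lower-cased Cache-Control directive names, or an empty set."""
    for name, value in headers.items():
        if name.lower() == "cache-control":
            return {
                part.strip().split("=", 1)[0].lower()
                for part in value.split(",")
                if part.strip()
            }
    return set()


def edge_may_serve_from_cache(headers):
    """True when nothing in the response stops a shared cache from reusing it.

    Assumption: only no-store, no-cache and private are treated as blocking
    reuse without revalidation; everything else (including a missing header)
    is treated as reusable.
    """
    blocking = {"no-store", "no-cache", "private"}
    return not (cache_control_directives(headers) & blocking)
```

With this model, the /render response above (no Cache-Control at all) counts as reusable by the edge, while a response carrying no-store does not.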

The solution is to attack the problem from several angles:

Fix graphite's behaviour so that it sends Cache-Control: no-store (better) or Cache-Control: max-age=120, as graphite already does for some URLs. The former option is implemented in https://gerrit.wikimedia.org/r/c/operations/puppet/+/500729 and https://gerrit.wikimedia.org/r/c/operations/puppet/+/500730, because with UwsgiHandler we can't set headers with the Apache filter.
Progressively convert all instances of monitoring::graphite_* to explicitly declare the graphite_url variable, as partially done in https://gerrit.wikimedia.org/r/c/operations/puppet/+/500665, where a graphite_url hiera global was introduced.
Audit the check_{prometheus,grafana} puppet resources for similar issues.
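
As an illustration of the first point, forcing Cache-Control: no-store at the application layer could look roughly like the WSGI wrapper below. This is a sketch only and not what the linked puppet changes do; NoStoreMiddleware, demo_render_app and response_headers are hypothetical names introduced for the example.

```python
class NoStoreMiddleware:
    """Wrap a WSGI app so every response carries Cache-Control: no-store."""

    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        def patched_start_response(status, headers, exc_info=None):
            # Drop any Cache-Control the app set, then force no-store.
            kept = [(k, v) for k, v in headers if k.lower() != "cache-control"]
            kept.append(("Cache-Control", "no-store"))
            return start_response(status, kept, exc_info)

        return self.app(environ, patched_start_response)


def demo_render_app(environ, start_response):
    """Hypothetical stand-in for the backend: replies without Cache-Control."""
    start_response("200 OK", [("Content-Type", "application/json")])
    return [b"[]"]


def response_headers(app):
    """Call a WSGI app once with a minimal environ and return its headers."""
    captured = {}

    def start_response(status, headers, exc_info=None):
        captured.update(headers)

    # Consume the body so the app runs to completion.
    b"".join(app({"REQUEST_METHOD": "GET", "PATH_INFO": "/render"}, start_response))
    return captured
```

Wrapping the stand-in app, response_headers(NoStoreMiddleware(demo_render_app)) contains Cache-Control: no-store, whereas the unwrapped app returns no Cache-Control header at all.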

Event Timeline

Joe created this task. Apr 2 2019, 5:19 PM

Restricted Application added a subscriber: Aklapper. Apr 2 2019, 5:19 PM

Fun finding: if we drop either the until=Xmin or the from=Xmin parameter from the check_graphite request URL, we get back Cache-Control: max-age=120.

If we define both, I guess the application's logic is that the window is finite, so the result never needs refreshing. That would be true with absolute dates, but since we're using relative times the logic is clearly broken, and I think it's a good idea to override the behaviour.
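
A small sketch of why a window bounded by relative times is not a fixed result: the same from=-25min and until=-10min pair (taken from the check URL in the description) resolves to a different absolute window each time it is evaluated. The resolve_window helper and the timestamps are hypothetical and only serve the illustration.

```python
from datetime import datetime, timedelta, timezone


def resolve_window(now, from_minutes_ago, until_minutes_ago):
    """Turn relative offsets into the absolute (start, end) window at `now`."""
    start = now - timedelta(minutes=from_minutes_ago)
    end = now - timedelta(minutes=until_minutes_ago)
    return start, end


# Two evaluations of the same query string, one hour apart (made-up times).
t0 = datetime(2019, 4, 2, 17, 0, tzinfo=timezone.utc)
t1 = t0 + timedelta(hours=1)

window_at_t0 = resolve_window(t0, 25, 10)
window_at_t1 = resolve_window(t1, 25, 10)
```

Both windows are 15 minutes long, but they cover different data, so a response cached for the first one is stale for the second.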

For Prometheus, there is just an LVS service IP that goes to the local Apache, which at a quick glance does not seem to have any caching modules enabled.
Judging from a curl request, Prometheus does not seem to return any Cache-Control header.

For Grafana, check_grafana hits URLs like https://grafana.wikimedia.org/api/dashboards/uid/000000201 and https://grafana.wikimedia.org/api/alerts, both of which set cache-control: no-cache (and I also see x-cache-status: pass in the response).

So I think Prometheus and Grafana are both working fine in this regard.

I guess we could add Apache directives that set cache-control: no-cache on Prometheus responses if we wanted to be extra paranoid.

I did add the proper caching headers to graphite, so at least the edge will no longer cache checks. I still think we need to avoid going through the caches, as any unavailability of the edge layer could cost us the ability to check these metrics.

But this is lower-priority work and I won't engage in it right now.

fgiunchedi moved this task from Inbox to Backlog on the observability board. Jul 6 2020, 12:11 PM

lmata edited projects, added SRE Observability; removed observability. Jul 12 2021, 2:21 AM

lmata moved this task from Inbox to Backlog on the SRE Observability board. Jul 15 2021, 4:09 AM

lmata edited projects, added Observability-Metrics; removed SRE Observability. Aug 9 2021, 2:31 AM

Boldly declining this since graphite is in life-support mode and the lowest-hanging fruit has been addressed (thanks!)

lmata moved this task from Inbox to Done on the Observability-Metrics board. Jan 16 2023, 5:42 PM
