monitoring::graphite_* checks query https://grafana.wikimedia.org by default, which means they make requests to the edge caches. The backend application doesn't always send the correct caching headers: responses to /render sometimes lack Cache-Control headers, and for example the URL used by one of these checks has no Cache-Control in its response.
As a result, the varnish frontend in eqiad cached one specific check tonight while it was alerting, and the check didn't recover even though the metric in graphite did.
The obvious problem is that in this case we're effectively monitoring a value cached in varnish, so we perform roughly one real check per day.
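To make the failure mode concrete, here is a minimal sketch (not varnish's actual builtin VCL, just a simplified model of its default behavior, with a hypothetical default TTL) of how a missing Cache-Control header leads to a response being cached and re-served instead of re-checked:

```python
# Simplified model of how a cache picks a TTL for a response.
# Illustration only -- not varnish's real builtin.vcl logic.
DEFAULT_TTL = 86400  # hypothetical fallback TTL when no header says otherwise

def cache_ttl(headers):
    """Return the TTL (in seconds) a cache would use; 0 means 'do not cache'."""
    cc = headers.get("Cache-Control", "")
    directives = [d.strip() for d in cc.split(",") if d.strip()]
    if any(d in directives for d in ("no-store", "no-cache", "private")):
        return 0
    for d in directives:
        if d.startswith("max-age="):
            return int(d.split("=", 1)[1])
    # No caching headers at all: the cache falls back to its default TTL,
    # which is exactly what bit the graphite checks here.
    return DEFAULT_TTL

print(cache_ttl({}))                                # 86400
print(cache_ttl({"Cache-Control": "no-store"}))     # 0
print(cache_ttl({"Cache-Control": "max-age=120"}))  # 120
```

With no Cache-Control at all, every monitoring request within the fallback TTL gets the same cached (possibly alerting) response.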
The solution is to attack the issue from multiple angles:
- Fix graphite's behaviour and send Cache-Control: no-store (better) or Cache-Control: max-age=120, as graphite already does for some URLs. The former option is implemented in https://gerrit.wikimedia.org/r/c/operations/puppet/+/500729 and https://gerrit.wikimedia.org/r/c/operations/puppet/+/500730, since with UwsgiHandler we can't set headers via the apache filter.
- Progressively convert all instances of monitoring::graphite_* to explicitly declare the graphite_url variable, as partially done in https://gerrit.wikimedia.org/r/c/operations/puppet/+/500665, where a graphite_url hiera global was introduced.
- Audit the check_{prometheus,grafana} puppet resources for similar issues.
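Since UwsgiHandler prevents setting the header from apache, the first fix has to happen at the application layer. As an illustration of the idea (hypothetical names, not the actual change in the gerrit patches above), a small WSGI middleware can force Cache-Control: no-store on /render responses:

```python
def no_store_middleware(app, paths=("/render",)):
    """Wrap a WSGI app and force 'Cache-Control: no-store' on matching paths."""
    def wrapped(environ, start_response):
        def patched_start(status, headers, exc_info=None):
            if environ.get("PATH_INFO", "").startswith(paths):
                # Drop any existing Cache-Control header, then add no-store.
                headers = [(k, v) for k, v in headers
                           if k.lower() != "cache-control"]
                headers.append(("Cache-Control", "no-store"))
            return start_response(status, headers, exc_info)
        return app(environ, patched_start)
    return wrapped

# Minimal demo app to show the middleware in action.
def demo_app(environ, start_response):
    start_response("200 OK", [("Content-Type", "image/png")])
    return [b"..."]

app = no_store_middleware(demo_app)
```

Any response the edge caches see for /render would then carry no-store, so varnish never serves a stale check result.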