
Prometheus: Disable following 3xx redirects by default in at least kubernetes pods scraping
Open, Low, Public

Description

In T305899 we realized that, by default, Prometheus follows 3xx redirects issued by targets. While this is a reasonable default for upstream software, it probably isn't what we want in our production environments. For example, in the case described in T305899, Prometheus follows the redirect to https://toolhub.wikimedia.org/metrics and, instead of scraping the actual instance it tried to scrape, ends up scraping a random pod. That has the following effects:

  • It is impossible to reason about any single instance
  • While, probabilistically speaking and following the law of large numbers, aggregate metrics will probably end up being OK, there is no such guarantee for any metric in isolation. That is, stats broken down per method, URI, view, status code, etc. (all of these are Prometheus labels) have a higher chance of just being wrong.
  • The Prometheus scraper has the potential to pollute the edge caches with something that is almost certainly of no use to end users. The risk is vanishingly small, by the way; adding this just for completeness.
  • Prometheus has the potential of scraping stale data. This depends on how the application being scraped uses HTTP caching headers for that endpoint, which is very easy to get wrong. As a test for an actual use case, running 100 curl calls from a Prometheus host for https://toolhub.wikimedia.org/metrics always returns x-cache-status: hit-front [1]. So Prometheus is probably scraping stale data quite often.
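
The repeated-curl check mentioned in the last bullet could be scripted roughly as follows. This is a minimal sketch, not the exact commands that were run: the helper names `tally_cache_status` and `sample_cache_status` are made up for illustration, and it assumes the endpoint returns an `x-cache-status` response header as shown in [1].

```shell
#!/bin/sh
# Hypothetical helper: read raw HTTP response headers on stdin and
# count how often each x-cache-status value appears.
tally_cache_status() {
  grep -i '^x-cache-status:' \
    | tr -d '\r' \
    | awk '{ $1 = ""; sub(/^ /, ""); print }' \
    | sort \
    | uniq -c
}

# Hypothetical helper: fetch only the headers of URL ($1) N times ($2)
# and tally the cache status. Defined here but not invoked.
sample_cache_status() {
  i=0
  while [ "$i" -lt "$2" ]; do
    curl -s -I "$1"
    i=$((i + 1))
  done | tally_cache_status
}

# Example invocation (performs real requests, so left commented out):
# sample_cache_status https://toolhub.wikimedia.org/metrics 100
```

If every response is served from the edge cache, the tally has a single line for the hit status; any other lines show how many requests reached a backend instead.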

Upstream made the behavior configurable in https://github.com/prometheus/prometheus/commit/646556a2632700f7fca42cec51d0100294d43c52, which is present post version 2.26.0. We still run 2.24 in our clusters, so we aren't there yet.
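
Once a version with that change is deployed, disabling redirects for a pod-scraping job might look like the fragment below. This is only a sketch: the job name is a placeholder and the real jobs in our configuration carry more settings (relabeling, TLS, and so on). `follow_redirects` is the upstream scrape-config option, which defaults to true.

```yaml
scrape_configs:
  - job_name: k8s-pods            # placeholder name, not an actual job of ours
    follow_redirects: false       # do not follow 3xx responses from targets
    kubernetes_sd_configs:
      - role: pod
```

With this set, a target that answers with a redirect should show up as a failed scrape rather than silently being replaced by whatever the redirect points at.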

While toolhub specifically is fixed in T306352, as far as serviceops and Kubernetes go, we would disable this behavior the moment we get the chance.

[1] akosiaris@prometheus1005:~$ curl -s -I https://toolhub.wikimedia.org/metrics |grep x-cache-status

Event Timeline

akosiaris added subscribers: fgiunchedi, colewhite, herron.

We don't have the Prometheus upgrade planned at the minute, however +1 to the above