
Prometheus use of Squid proxies
Closed, Resolved · Public

Description

I created the following dashboard to have an overview of what was going through the squid proxies:
https://logstash.wikimedia.org/app/dashboards#/view/58c908a0-a394-11ec-bf8e-43f1807d5bc2

Among the top talkers are the Prometheus hosts, for example prometheus1006 (2620:0:861:102:10:64:16:62):
https://logstash.wikimedia.org/goto/2c5622aa81eb384ddeb7dc310634cfb6
Using HTTP CONNECT (so HTTPS) towards the following (top 5 by hits over 1h):

[2620:0:861:ed1a::1] - 1,137
192.0.66.2 - 120   <- expected, external host (Automattic)
en.wikipedia.org - 120
donate.wikimedia.org - 119
208.80.155.12 - 60

Not sure what those are for, but they probably shouldn't use the proxies to reach internal services (unless it's to monitor the proxies themselves).

However, feel free to close the task if this is working as expected.

Event Timeline

It looks like the Prometheus connections may be coming from blackbox exporter. The proxy configuration was added in 747550 and expanded in 759297. For some of the domains, like store.wikimedia.org, we would need to use the proxy, but for most of them we should be able to connect directly. @herron should be able to add more context on whether the proxy is actually needed and/or how easy it would be to update the config.

3/5 are text-lb endpoints reaching out to wikis, as far as I can tell. Using appservers-ro.discovery.wmnet with the proper "Host:" header would solve at least 2 of those (donate and en) and would allow the checks to not rely on the squid proxies' availability (see the sketch below). The first IPv6 address is text-lb too; I have no idea why we don't have a DNS record for it, but I'd be surprised if we couldn't apply the same logic. As far as Automattic goes (foundation website and blog), yeah, we need the proxy for that. Finally, for payments-listener-eqiad.wikimedia.org (the last IP), my guess is that it just ended up there, but we can always ask fundraising whether it makes sense for them to go via the webproxy (I suspect it doesn't).
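
For illustration, a non-proxied check of that shape could be exercised by hand roughly like this. This is a sketch only: the path and TLS details are assumptions, and the --resolve pinning may need adjusting (or -k) depending on which certificate the appservers present.

# Hypothetical manual equivalent: pin the public name to the internal
# read-only appserver endpoint, so the request carries the right Host header
# and SNI but never touches the edge caches or the squid proxies.
APPSERVERS_RO="$(dig +short appservers-ro.discovery.wmnet | head -1)"
curl -sv -o /dev/null \
  --resolve "en.wikipedia.org:443:${APPSERVERS_RO}" \
  https://en.wikipedia.org/wiki/Main_Page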

The reasoning for checking these via the proxy is that the prometheus hosts can't reach all of the watchrat-checked URLs directly, and it's simpler to have one blackbox exporter configuration that uses a proxy and works for all cases than to split the config out between proxied and non-proxied URLs. Here's the current config: https://phabricator.wikimedia.org/source/operations-puppet/browse/production/modules/prometheus/templates/blackbox_exporter/common.yml.erb$25-34

I agree it would be nice to not need the proxy, and IMO it's also worth considering whether it'd be worthwhile for prometheus hosts to have public addresses, so this kind of config would do the right thing without a proxy.

If it's not causing a problem on the proxies, and this is more a question of "why?", then I think we're OK as configured, but happy to adjust if needed.

Putting aside whether we should split the config or provide an external IP address: I wonder if https://wikitech.wikimedia.org/wiki/Url-downloader should be preferred for things like this, @akosiaris?

@herron

I do think it's better not to go through the proxies when not necessary, in order to:

  • Reduce dependency on a third-party tool (e.g. see T242715 and T300977#7700803)
  • Reduce unnecessary load on the proxies
  • Reduce the risk of the fetched page being cached on the proxies

I don't think adding public IPs for Prometheus hosts would scale well (unless there are needs I'm not aware of).

Looking at blackbox_exporter: according to https://github.com/prometheus/blackbox_exporter/issues/34#issuecomment-286419607 it should support the proxy environment variables without an explicit proxy_url.
If so, it might also support the no_proxy variable, which could be enough to have it behave the right way.
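
If that pans out, the idea would look something like this (a hypothetical sketch: the webproxy host/port and the NO_PROXY list are illustrative placeholders, not our actual values):

# Hypothetical: rely on Go's standard proxy environment variables instead of a
# per-module proxy_url; NO_PROXY keeps internal destinations going direct.
# (Proxy host/port and the NO_PROXY entries below are examples only.)
HTTPS_PROXY=http://webproxy.eqiad.wmnet:8080 \
NO_PROXY=".wmnet,10.0.0.0/8" \
/usr/bin/prometheus-blackbox-exporter --config.file=/etc/prometheus/blackbox.yml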

Otherwise, splitting the config to have everything explicit seems like a good option too!

For checks going through the proxies it would be nice to have a parent/child dependency on the proxies themselves as well.

> Putting aside whether we should split the config or provide an external IP address: I wonder if https://wikitech.wikimedia.org/wiki/Url-downloader should be preferred for things like this, @akosiaris?

Good question. So, I made an effort back in T254011#6181867 (documented by Daniel in https://wikitech.wikimedia.org/wiki/Url-downloader) to differentiate the two sets of proxies. Although the use cases have increased a bit since I first encountered url-downloader back in 2013, the spirit remains the same: allow services (MediaWiki or helper microservices) to reach out to external resources. I'd argue that monitoring doesn't fall into the MediaWiki+siblings paradigm (also known as the Citadel paradigm, per DHH [1]). This isn't set in stone or anything, of course, but it's probably better to stick to the use cases described there and not add more to url-downloader.

[1] https://m.signalvnoise.com/the-majestic-monolith-can-become-the-citadel/

Change 776878 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[operations/puppet@production] Split watchrat URLs by need of proxy usage

https://gerrit.wikimedia.org/r/776878

> The reasoning for checking these via the proxy is that the prometheus hosts can't reach all of the watchrat-checked URLs directly, and it's simpler to have one blackbox exporter configuration that uses a proxy and works for all cases than to split the config out between proxied and non-proxied URLs. Here's the current config: https://phabricator.wikimedia.org/source/operations-puppet/browse/production/modules/prometheus/templates/blackbox_exporter/common.yml.erb$25-34
>
> I agree it would be nice to not need the proxy, and IMO it's also worth considering whether it'd be worthwhile for prometheus hosts to have public addresses, so this kind of config would do the right thing without a proxy.
>
> If it's not causing a problem on the proxies, and this is more a question of "why?", then I think we're OK as configured, but happy to adjust if needed.

I've looked a bit through the code to reacquaint myself with it. For starters, let me say that, not having touched that area in a long while, it was pretty nice to see how this has evolved into a pretty comprehensive blackbox testing framework, well integrated into the rest of our configuration and infrastructure (the per-service targets). Nice work on that!

Now, moving on to the proxied vs non-proxied URLs issue: while I agree with the premise of keeping configuration homogeneous when divergence is not needed, here the picture is heavily skewed towards not needing the proxy.

I've run the following two (needless to say, very crude and weird) tests from one prometheus host (I chose eqsin, out of a desire to get out of my comfort zone), using a local clone of our puppet repo, to simulate the scraping prometheus would do against the local prometheus-blackbox-exporter. Each pipeline extracts the watchrat URLs from hiera, probes each of them via the local exporter, and tallies the probe_success results.

Simulate using the proxy:

grep '^-' hieradata/common/profile/prometheus/ops.yaml | awk '{print "curl -s \x27http://localhost:9115/probe?module=http_connect_23xx_proxied&target=" $2"\x27 | grep ^probe_success"}'| sh | sort | uniq -c | sort -rn
     25 probe_success 1

Simulate NOT using the proxy:

grep '^-' hieradata/common/profile/prometheus/ops.yaml | awk '{print "curl -s \x27http://localhost:9115/probe?module=https_200_300_connect&target=" $2"\x27 | grep ^probe_success"}'| sh | sort | uniq -c | sort -rn
     23 probe_success 1
      2 probe_success 0

Btw, since I was mirroring http_connect_23xx_proxied, I used https_200_300_connect. If we don't care about the return status code (e.g. assume a 404 is OK) and use https_connect instead, the picture doesn't change (even though it is more lenient).
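
For reference, the single-target form of those probes (same exporter endpoint and module; the target here is one of the URLs from the task description):

curl -s 'http://localhost:9115/probe?module=https_200_300_connect&target=https://en.wikipedia.org' | grep ^probe_success
probe_success 1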

Needless to say, only 2 out of the 25 need to use the proxy. Those are (unsurprisingly perhaps) store.wikimedia.org and the foundation website: the former is on Shopify, the latter on Automattic. prometheus1005 btw has an internal (10.x) IP, so that's expected.

Anyway, here's a change for that: https://gerrit.wikimedia.org/r/776878. I expect some relabeling might happen and we might lose some history; I am not sure if we are OK with that or not, please advise.

With that out of the way: those checks still go through all the edge cache layers. While not using the proxy addresses quite a few of the concerns raised in T300977#7700803, a few still remain.

This is a bit more interesting, and I'll have to do some digging and reading to figure out our end goals here; plus, it is off-topic for this task. That being said, the very recently released blackbox exporter [1] adds a hostname parameter. We could use that to point some checks directly at the applayer and not go through the edge caches (as we instruct all services to do).

[1] https://github.com/prometheus/blackbox_exporter/blob/v0.20.0/CHANGELOG.md#0200--2022-03-16
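
To sketch what that could look like (the target and hostname below are illustrative; per the 0.20.0 changelog, the new hostname parameter sets the Host header on the probe request):

# Hypothetical: probe the applayer directly while presenting the public name,
# via the hostname parameter added in blackbox_exporter 0.20.0.
curl -s 'http://localhost:9115/probe?module=https_200_300_connect&target=https://appservers-ro.discovery.wmnet&hostname=en.wikipedia.org' | grep ^probe_success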

Change 788387 had a related patch set uploaded (by Herron; author: Herron):

[operations/alerts@master] watchrat: match jobs 'blackbox/watchrat.*'

https://gerrit.wikimedia.org/r/788387

Change 788387 merged by jenkins-bot:

[operations/alerts@master] watchrat: match jobs 'blackbox/watchrat.*'

https://gerrit.wikimedia.org/r/788387

Change 776878 merged by Herron:

[operations/puppet@production] Split watchrat URLs by need of proxy usage

https://gerrit.wikimedia.org/r/776878

Seeing a significant drop in CONNECT (blue) since https://gerrit.wikimedia.org/r/776878 was applied, looking better!

(Screenshot: Screen Shot 2022-05-03 at 9.02.08 AM.png, 63 KB)

jbond triaged this task as Medium priority. May 3 2022, 1:26 PM
herron claimed this task.