
Move labs/wmcs (OpenStack) Prometheus instance off cloudmetrics hosts to prometheus* hosts
Closed, ResolvedPublic

Description

Ideally both answers are 'no', but regardless I'd like to move the labs instance of Prometheus off cloudmetrics and onto the Prometheus production hardware. My understanding is that this move will help both WMCS (one less component to think about) and Observability (less variance/snowflakes).

Action plan

  • Update cr firewall rules to allow this traffic (T343885#9119388)
  • Add a new profile::prometheus::cloud class (and rename labs to cloud in the process)
  • Allocate space on prometheus LVs (eqiad only) (modules/prometheus/files/provision-fs.sh)
  • Deploy said class to prometheus hosts (eqiad only), making sure alertmanagers parameter is not set so alerts won't be sent.
  • Verify Prometheus can pull metrics from all its jobs.
  • Let metrics data accumulate for TBD days
  • Enable alerts to be sent from prometheus, disable alerts from cloudmetrics
  • Audit accesses on cloudmetrics hosts for Prometheus clients
  • Update cloudmetrics references to point to prometheus.svc/cloud instead
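
The "Verify Prometheus can pull metrics from all its jobs" step could be checked with a script along these lines. This is a minimal sketch, assuming the new instance serves the standard Prometheus HTTP API (`/api/v1/targets`) under its base URL; the function names are made up for illustration and are not part of any existing tooling.

```python
import json
import urllib.request


def fetch_targets(base_url, timeout=10):
    """Fetch the target list from a Prometheus server's HTTP API."""
    url = base_url.rstrip("/") + "/api/v1/targets"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def unhealthy_by_job(targets_response):
    """Group active targets whose health is not 'up' by their job label."""
    failing = {}
    for target in targets_response["data"]["activeTargets"]:
        if target.get("health") != "up":
            job = target.get("labels", {}).get("job", "unknown")
            failing.setdefault(job, []).append(
                (target.get("scrapeUrl"), target.get("lastError", ""))
            )
    return failing
```

Calling `unhealthy_by_job(fetch_targets(base_url))` with the base URL of the new instance should return an empty dict once every job scrapes cleanly; any remaining entries name the job, the scrape URL, and the last error.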

Followups

Event Timeline

lmata triaged this task as Medium priority. Jul 18 2023, 5:38 PM
lmata moved this task from Inbox to Prioritized on the Observability-Metrics board.
lmata subscribed.

This is identified as core work for the year.

Change 963987 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] prometheus: add 'cloud' instance

https://gerrit.wikimedia.org/r/963987

Change 963987 merged by Filippo Giunchedi:

[operations/puppet@production] prometheus: add 'cloud' instance

https://gerrit.wikimedia.org/r/963987

The Prometheus cloud instance is live at https://prometheus-eqiad.wikimedia.org/cloud .

pdns auth can't be scraped, of course (T348437). Also, openstack.eqiad1.wikimediacloud.org yields connection refused (unlike when scraping from cloudmetrics):

prometheus1005:~$ curl openstack.eqiad1.wikimediacloud.org:12345/metrics
curl: (7) Failed to connect to openstack.eqiad1.wikimediacloud.org port 12345: Connection refused
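
When checking an exporter from a Prometheus host, it can help to tell a refused connection (nothing listening on that address) apart from a timeout or DNS failure (more likely a firewall or routing issue). The following is only a sketch; `probe_tcp` is a hypothetical helper, not an existing tool.

```python
import socket


def probe_tcp(host, port, timeout=3.0):
    """Return 'open', 'refused', or 'unreachable' for a TCP endpoint."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The host answered, but nothing accepts connections on this port.
        return "refused"
    except OSError:
        # Covers timeouts, DNS failures, and routing errors.
        return "unreachable"
```

For the case above, `probe_tcp("openstack.eqiad1.wikimediacloud.org", 12345)` would be expected to return "refused", matching the curl error.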

The openstack exporter was fixed in https://gerrit.wikimedia.org/r/c/operations/puppet/+/960023, looks like the profile was forked before that?

Change 964530 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] prometheus: use openstack_exporter_host in prometheus cloud

https://gerrit.wikimedia.org/r/964530

Change 964530 merged by Filippo Giunchedi:

[operations/puppet@production] prometheus: use openstack_exporter_host in prometheus cloud

https://gerrit.wikimedia.org/r/964530

> The openstack exporter was fixed in https://gerrit.wikimedia.org/r/c/operations/puppet/+/960023, looks like the profile was forked before that?

Thank you, that was indeed it. We now have parity between prometheus and cloudmetrics Prometheus scraping!

Change 964540 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] P:prometheus::cloud: make openstack deployment a parameter

https://gerrit.wikimedia.org/r/964540

Change 964540 merged by Majavah:

[operations/puppet@production] P:prometheus::cloud: make openstack deployment a parameter

https://gerrit.wikimedia.org/r/964540

> pdns auth can't be scraped of course (T348437).

There is an option to control what IP the pdns web server listens on. @taavi does that really need to listen on cloud-private? I thought designate sent updates using the traditional DNS NOTIFY / XFR method (i.e. UDP 53 to the auth service)?

One option might be to set the webserver-address variable specifically to 0.0.0.0 and see whether it will listen on all IPs. We can't do that in general with pdns, as the recursor and auth service both listen on the same port but on different IPs.

If that won't work I don't think it's unreasonable to use socat or similar to listen on their 10.x WMF IPs and proxy the traffic to the 172.x IP pdns is using.
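
For illustration, the "socat or similar" fallback amounts to a plain TCP relay: listen on one address and copy bytes to and from another. The sketch below shows the idea with Python's asyncio; `start_forwarder` is a hypothetical name, and the listen and target addresses and ports would be whatever the 10.x and 172.x endpoints actually are.

```python
import asyncio


async def _pipe(reader, writer):
    """Copy bytes from reader to writer until EOF, then close the writer."""
    try:
        while True:
            data = await reader.read(65536)
            if not data:
                break
            writer.write(data)
            await writer.drain()
    except ConnectionError:
        pass
    finally:
        writer.close()


async def start_forwarder(listen_host, listen_port, target_host, target_port):
    """Listen on one address and relay each connection to the target."""

    async def handle(client_reader, client_writer):
        try:
            upstream_reader, upstream_writer = await asyncio.open_connection(
                target_host, target_port
            )
        except OSError:
            # Target not reachable: drop the client connection.
            client_writer.close()
            return
        # Relay both directions until either side closes.
        await asyncio.gather(
            _pipe(client_reader, upstream_writer),
            _pipe(upstream_reader, client_writer),
        )

    return await asyncio.start_server(handle, listen_host, listen_port)
```

A relay like this adds another moving part to monitor, which is one reason to try the listen-address option first.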

> There is an option to control what IP the pdns web server listens on. @taavi does that really need to listen on cloud-private? I thought designate sent updates using the traditional DNS NOTIFY / XFR method (i.e. UDP 53 to the auth service)?

My understanding is that Designate uses both: there are some things (for example, deleting zones) that can't be done using normal zone transfers, so the API is used for them. The Designate pools config file definitely has the API endpoint and credentials, and we've seen issues when those are wrong.

> One option might be to set the webserver-address variable specifically to 0.0.0.0 and see whether it will listen on all IPs. We can't do that in general with pdns, as the recursor and auth service both listen on the same port but on different IPs.

Oh, this indeed seems to work! I'll send a patch for that.

Change 966494 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] P:openstack::pdns::auth: make pdns web server listen on all IPs

https://gerrit.wikimedia.org/r/966494

Change 967863 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] prometheus: switch alerts to cloud prometheus

https://gerrit.wikimedia.org/r/967863

Change 967863 merged by Filippo Giunchedi:

[operations/puppet@production] prometheus: switch alerts to cloud prometheus

https://gerrit.wikimedia.org/r/967863

Change 967874 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] grafana: point prometheus/labs to prometheus hosts

https://gerrit.wikimedia.org/r/967874

Change 966494 merged by Majavah:

[operations/puppet@production] P:openstack::pdns::auth: make pdns web server listen on all IPs

https://gerrit.wikimedia.org/r/966494

Change 967874 merged by Filippo Giunchedi:

[operations/puppet@production] grafana: point prometheus/labs to prometheus hosts

https://gerrit.wikimedia.org/r/967874

Change 967898 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/dns@master] wmnet: drop cloudmetrics CNAMEs

https://gerrit.wikimedia.org/r/967898

Change 967898 merged by Majavah:

[operations/dns@master] wmnet: drop cloudmetrics CNAMEs

https://gerrit.wikimedia.org/r/967898

Change 968238 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] prometheus: enable thanos upload for cloud instance

https://gerrit.wikimedia.org/r/968238

Change 968239 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] prometheus: enable pint for 'cloud' instance

https://gerrit.wikimedia.org/r/968239

Change 968238 merged by Filippo Giunchedi:

[operations/puppet@production] prometheus: enable thanos upload for cloud instance

https://gerrit.wikimedia.org/r/968238

Change 968239 merged by Filippo Giunchedi:

[operations/puppet@production] prometheus: enable pint for 'cloud' instance

https://gerrit.wikimedia.org/r/968239

Change 968278 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] hieradata: drop prometheus access for cloudmetrics1003/4

https://gerrit.wikimedia.org/r/968278

Change 968279 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] P:alertmanager: drop cloudmetrics hosts

https://gerrit.wikimedia.org/r/968279

Change 968280 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] P:wmcs::prometheus: drop profile

https://gerrit.wikimedia.org/r/968280

Change 968284 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] prometheus: add cloud replica label

https://gerrit.wikimedia.org/r/968284

Change 968284 merged by Filippo Giunchedi:

[operations/puppet@production] prometheus: add cloud replica label

https://gerrit.wikimedia.org/r/968284

As far as I'm concerned, the prometheus hosts bits of this task are done! There are a bunch of followups, though I think the most important/urgent one would be to decom/clean up cloudmetrics. The rest can be tracked separately (if at all!)

Change 968278 merged by Majavah:

[operations/puppet@production] hieradata: drop prometheus access for cloudmetrics1003/4

https://gerrit.wikimedia.org/r/968278

Change 968279 merged by Majavah:

[operations/puppet@production] P:alertmanager: drop cloudmetrics hosts

https://gerrit.wikimedia.org/r/968279

Change 968280 merged by Majavah:

[operations/puppet@production] P:wmcs::prometheus: drop profile

https://gerrit.wikimedia.org/r/968280

fgiunchedi claimed this task.
fgiunchedi updated the task description.

I've opened followup tasks for the remaining actions, calling this done!