To support proper host failover and maintenance for T301944 let's add eqiad/codfw load balancing for TLS port 443 backed by the prometheus hosts.
Related Objects
Mentioned In
- T353912: Observability Bookworm upgrades
- T373369: Service puppetmaster1001:8141 has failed probes (http_puppetmaster1003_eqiad_wmnet_backend_https_ip4)
- T371087: Configure Prometheus instance centrally
- T356386: Move all o11y services to discovery.wmnet
Mentioned Here
- T373369: Service puppetmaster1001:8141 has failed probes (http_puppetmaster1003_eqiad_wmnet_backend_https_ip4)
- T256098: Segfault for systemd-sysusers.service on stat1007
- T372472: docker-registry.wikimedia.org/dcl-puppet-pki fails to install debmonitor-client
- T246998: Enable SSO for Kibana
- P51863 401 errors on prometheus
- T331512: Support for multiple SSO thanos-web backends
- T301944: Web interface to navigate Prometheus alerts and their status
Event Timeline
@herron I see there's an unresolved conversation in that patch. Since @Vgutierrez +1ed it before that conversation, I just want to make sure that it is, indeed, ready for merging.
Yes afaik it is ready to go. In terms of HTTPS/TLS the backends are configured to serve https://prometheus-$site.wikimedia.org with SNI, for instance https://prometheus-eqiad.wikimedia.org/ops
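For anyone wanting to double-check the SNI behaviour described above, here is a minimal sketch, not something taken from the patch. It assumes the backend is reachable on port 443 and that the issuing CA is in the local trust store. The helper names are made up for illustration.

```python
import socket
import ssl


def public_name_for_site(site: str) -> str:
    """Name the backends are said to serve via SNI, e.g. prometheus-eqiad.wikimedia.org."""
    return f"prometheus-{site}.wikimedia.org"


def served_dns_names(address: str, server_name: str, port: int = 443, timeout: float = 5.0) -> list:
    """Connect to `address`, send `server_name` as SNI, and return the DNS SANs of the cert.

    Assumption: the certificate chains to a CA in the system trust store;
    otherwise the handshake raises ssl.SSLCertVerificationError.
    """
    context = ssl.create_default_context()
    with socket.create_connection((address, port), timeout=timeout) as sock:
        # server_hostname is what gets sent in the TLS SNI extension
        with context.wrap_socket(sock, server_hostname=server_name) as tls:
            cert = tls.getpeercert()
    return [value for kind, value in cert.get("subjectAltName", ()) if kind == "DNS"]
```

Calling `served_dns_names(<backend host>, public_name_for_site("eqiad"))` should then list the names on the certificate selected for that SNI value.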
Change 863380 merged by BCornwall:
[operations/puppet@production] service::catalog: add prometheus-https
Change 929421 had a related patch set uploaded (by BCornwall; author: BCornwall):
[operations/puppet@production] prometheus: Disable SNI support in Envoy tlsproxy
Change 928942 had a related patch set uploaded (by BCornwall; author: BCornwall):
[operations/puppet@production] Revert "service::catalog: add prometheus-https"
Change 928942 merged by BCornwall:
[operations/puppet@production] Revert "service::catalog: add prometheus-https"
Mentioned in SAL (#wikimedia-operations) [2023-06-12T22:22:32Z] <brett> Roll restarting pybal on lvs2014 to revert prometheus service rollout - T326657
Mentioned in SAL (#wikimedia-operations) [2023-06-13T08:25:39Z] <vgutierrez> cleaning up prometheus-https service from IPVS on lvs2014 - T326657
Change 929421 merged by BCornwall:
[operations/puppet@production] prometheus: Disable SNI support in Envoy tlsproxy
Change 929768 had a related patch set uploaded (by BCornwall; author: BCornwall):
[operations/puppet@production] prometheus: Add global_cert_name to Envoy config
Change 930184 had a related patch set uploaded (by Jbond; author: jbond):
[operations/puppet@production] tlsproxy::envoy: update support for profile::tlsproxy::envoy::services
Change 930185 had a related patch set uploaded (by Jbond; author: jbond):
[operations/puppet@production] promethus: switch to using cfssl
Change 930187 had a related patch set uploaded (by Jbond; author: jbond):
[operations/puppet@production] promethus: switch to using cfssl
Change 930185 abandoned by Jbond:
[operations/puppet@production] promethus: switch to using cfssl
Reason:
See comments
Change 930184 merged by Jbond:
[operations/puppet@production] tlsproxy::envoy: update support for profile::tlsproxy::envoy::services
Change 929768 abandoned by BCornwall:
[operations/puppet@production] prometheus: Disable SNI support in Envoy tlsproxy
Reason:
I5b10a4a2ad3a34b8ad2ef48052b13c93c62aedd0 supersedes this
Change 930187 merged by Herron:
[operations/puppet@production] promethus: switch to using cfssl
Change 939326 had a related patch set uploaded (by Herron; author: Herron):
[operations/puppet@production] service::catalog: add prometheus-https
Change 939326 merged by Herron:
[operations/puppet@production] service::catalog: add prometheus-https
Mentioned in SAL (#wikimedia-operations) [2023-07-20T14:45:41Z] <herron> roll restart codfw/eqiad low-traffic pybals to add prometheus-https T326657
Change 940201 had a related patch set uploaded (by Herron; author: Herron):
[operations/puppet@production] prometheus: add prometheus.svc.site.wmnet SANs to cfssl cert
Change 940201 merged by Herron:
[operations/puppet@production] prometheus: add prometheus.svc.site.wmnet SANs to cfssl cert
I don't think we're done yet: trafficserver is still using hostnames rather than the prometheus.svc records.
@fgiunchedi is that a matter of just updating https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/hieradata/common/profile/trafficserver/backend.yaml#176 and the -codfw underneath to use prometheus.svc.SITE.wmnet?
Change 948624 had a related patch set uploaded (by BCornwall; author: BCornwall):
[operations/puppet@production] trafficserver: Use svc urls for eqiad/codfw
Change 948624 merged by BCornwall:
[operations/puppet@production] trafficserver: Use svc urls for eqiad/codfw
@fgiunchedi Now that this is merged, would you say that this is complete? Thanks for the feedback.
Thank you for following up. I think we're closer: at the moment we can have only one host pooled at a time (same issue as T331512). I checked conftool and I think we'll need two distinct services there, i.e. to be able to change prometheus (the existing internal endpoint) and prometheus-https (the web interface) independently:
# confctl select service=prometheus.* get
{"prometheus2005.codfw.wmnet": {"weight": 10, "pooled": "yes"}, "tags": "dc=codfw,cluster=prometheus,service=prometheus"}
{"prometheus2006.codfw.wmnet": {"weight": 10, "pooled": "yes"}, "tags": "dc=codfw,cluster=prometheus,service=prometheus"}
{"prometheus1005.eqiad.wmnet": {"weight": 10, "pooled": "yes"}, "tags": "dc=eqiad,cluster=prometheus,service=prometheus"}
{"prometheus1006.eqiad.wmnet": {"weight": 10, "pooled": "yes"}, "tags": "dc=eqiad,cluster=prometheus,service=prometheus"}
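To make the output above easier to reason about, here is a small sketch that groups pooled hosts by datacenter. It assumes one JSON object per line, exactly as pasted in this comment. The function name is made up for illustration and is not part of conftool.

```python
import json
from collections import defaultdict

# Sample copied from the confctl output quoted in this thread.
SAMPLE = """\
{"prometheus2005.codfw.wmnet": {"weight": 10, "pooled": "yes"}, "tags": "dc=codfw,cluster=prometheus,service=prometheus"}
{"prometheus2006.codfw.wmnet": {"weight": 10, "pooled": "yes"}, "tags": "dc=codfw,cluster=prometheus,service=prometheus"}
{"prometheus1005.eqiad.wmnet": {"weight": 10, "pooled": "yes"}, "tags": "dc=eqiad,cluster=prometheus,service=prometheus"}
{"prometheus1006.eqiad.wmnet": {"weight": 10, "pooled": "yes"}, "tags": "dc=eqiad,cluster=prometheus,service=prometheus"}
"""


def pooled_hosts_by_dc(confctl_output: str) -> dict:
    """Return {dc: [hosts with pooled == "yes"]} from line-delimited confctl JSON."""
    pooled = defaultdict(list)
    for line in confctl_output.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the echoed command line
        record = json.loads(line)
        # "tags" is a comma-separated list of key=value pairs
        tags = dict(pair.split("=", 1) for pair in record.pop("tags").split(","))
        # every remaining key is a hostname mapped to its pool state
        for host, state in record.items():
            if state.get("pooled") == "yes":
                pooled[tags["dc"]].append(host)
    return dict(pooled)
```

With the sample above, both hosts in each site show up as pooled under the single `service=prometheus` tag. That is why the internal endpoint and the web interface cannot currently be pooled independently.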
Since both ports 80 and 443 are serviced by the same backing apache and prometheus instances, what would separate services from a confctl perspective gain us in practice?
Port 80 is used for internal read queries from e.g. grafana, and thus benefits from redundancy. Port 443 will be used for SSO access, which ATM can function only with one backend host active at a time. The overall issue is tracked at T331512: Support for multiple SSO thanos-web backends; once that is resolved we can have redundancy for both 80 and 443.
Hi,
Just FYI, JS and CSS are currently broken on prometheus-{eqiad,codfw}.wikipedia.org due to 401 and 403 errors, with some CORS sprinkled in
Change 953207 had a related patch set uploaded (by Herron; author: Herron):
[operations/puppet@production] Revert "trafficserver: Use svc urls for eqiad/codfw"
Change 953207 merged by Herron:
[operations/puppet@production] Revert "trafficserver: Use svc urls for eqiad/codfw"
Thanks for reporting. I reverted the above to stop these errors while we work on a longer term fix.
With oauth2-proxy deployed successfully for thanos.w.o we can deploy it to prometheus too and thus have prometheus-https backed by multiple hosts with SSO
An example task of such migration is https://phabricator.wikimedia.org/T246998, which basically translates to:
- provision a new oidc client for prometheus in idp
- introduce a prometheus apache configuration to proxy requests for prometheus-SITE.wikimedia.org to oauth2-proxy
- configure oauth2-proxy to proxy authenticated requests to prometheus.svc.SITE.wmnet
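As a rough illustration of the last two steps, the sketch below assembles an oauth2-proxy command line for one site. It is not the puppet configuration: the issuer URL, client id, upstream URL and listen address are placeholders to be supplied by the caller, and the function name is made up. Further required settings (client secret, cookie secret, allowed email domains) are deliberately left out.

```python
def oauth2_proxy_args(site, issuer_url, client_id, upstream_url, listen="127.0.0.1:4180"):
    """Build an illustrative oauth2-proxy argument list for prometheus-SITE.wikimedia.org.

    All values are assumptions supplied by the caller; only the flag names
    are real oauth2-proxy options.
    """
    public_host = f"prometheus-{site}.wikimedia.org"
    return [
        "oauth2-proxy",
        "--provider=oidc",                       # generic OpenID Connect provider
        f"--oidc-issuer-url={issuer_url}",       # the idp's OIDC issuer (placeholder)
        f"--client-id={client_id}",              # the new OIDC client provisioned in idp
        f"--redirect-url=https://{public_host}/oauth2/callback",
        f"--upstream={upstream_url}",            # where authenticated requests are proxied
        f"--http-address={listen}",              # apache proxies to this local address
    ]
```

For example, `oauth2_proxy_args("eqiad", "https://idp.example.org/oidc", "prometheus", "https://prometheus.svc.eqiad.wmnet")` uses a placeholder issuer and an assumed upstream scheme. Apache would then forward requests for the public name to the `--http-address` listener.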
Change #1061944 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):
[operations/puppet@production] idp: add prometheus OIDC client
Change #1061944 merged by Filippo Giunchedi:
[operations/puppet@production] idp: add prometheus OIDC client
Change #1062393 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):
[operations/puppet@production] prometheus: add oauth2-proxy for OIDC authentication
Change #1062393 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: add oauth2-proxy for OIDC authentication
Change #1063737 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):
[operations/puppet@production] prometheus: fix auth_cas vhost configuration
Change #1063737 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: fix auth_cas vhost configuration
Change #1063761 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):
[operations/puppet@production] prometheus: enable oidc auth
Change #1063761 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: enable oidc auth
For now oauth2-proxy can't be reliably installed on bullseye systems because systemd-sysusers segfaults there with long /etc/gshadow entries. See also T372472: docker-registry.wikimedia.org/dcl-puppet-pki fails to install debmonitor-client and T256098: Segfault for systemd-sysusers.service on stat1007
Update after team meeting: I'll be starting the in-place Bookworm upgrade since it'll unblock this issue, it is something we have to do anyway, and I have a prometheus host in Pontoon running on Bookworm with no obvious problems.
Mentioned in SAL (#wikimedia-operations) [2024-08-22T09:34:29Z] <godog> start prometheus2006 bookworm upgrade - T326657
I tested the bookworm in place upgrade on prometheus2006 and things seem to be working as expected. I did the following:
depool # only for codfw/eqiad
disable-puppet 'bookworm upgrade'
sed -i 's/bullseye/bookworm/g' /etc/apt/sources.list /etc/apt/sources.list.d/*.list
rm /etc/apt/sources.list.d/repository_puppet.list
apt update # the debmonitor update will timeout, that's fine
# first run of upgrades, should run unattended. the debmonitor update may timeout
DEBIAN_FRONTEND=noninteractive apt -y -o Dpkg::Options::="--force-confdef" -o Dpkg::Options::="--force-confold" upgrade
# ditto
DEBIAN_FRONTEND=noninteractive apt -y -o Dpkg::Options::="--force-confdef" -o Dpkg::Options::="--force-confold" dist-upgrade
run-puppet-agent --force
apt update
apt -y upgrade
run-puppet-agent
Then reboot the host via sre.hosts.reboot-single, wait for reboot, then
pool # only for codfw/eqiad
Mentioned in SAL (#wikimedia-operations) [2024-08-23T07:27:16Z] <godog> start prometheus1006 bookworm upgrade - T326657
The two hosts on Bookworm (prometheus2006 and prometheus1006) work well. The only problem I could find is that probes for puppetmaster https endpoints (not puppetserver!) are failing. This is a long-standing issue: said endpoints use certs without SANs, and Bookworm's prometheus-blackbox-exporter has been compiled with a newer golang (>= 1.17), which no longer allows accepting certs without SANs.
Given the following:
- puppetmaster hosts are going away
- if said endpoints are failing then we notice anyway because puppet agent runs start failing
I'm for acking the probefailure alerts for puppetmaster hosts only, given that they are effectively a false positive now. What do you think @jhathaway ?
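To illustrate why the puppetmaster probes fail, here is a simplified sketch of the two hostname-matching behaviours. It is not the Go implementation: it works on the dict shape returned by Python's `ssl.SSLSocket.getpeercert()`, does exact matching only (no wildcards), and the function names are made up.

```python
def dns_sans(cert: dict) -> list:
    """DNS entries from the subjectAltName extension, if any."""
    return [value for kind, value in cert.get("subjectAltName", ()) if kind == "DNS"]


def common_names(cert: dict) -> list:
    """commonName values from the certificate subject."""
    return [value for rdn in cert.get("subject", ()) for key, value in rdn if key == "commonName"]


def hostname_accepted(cert: dict, hostname: str, allow_cn_fallback: bool = False) -> bool:
    """Return True if `hostname` matches the cert.

    allow_cn_fallback=True mimics the legacy behaviour (what x509ignoreCN=0
    re-enabled on older Go); False mimics Go >= 1.17, where a cert without
    SANs can never match a hostname.
    """
    sans = dns_sans(cert)
    if sans:
        return hostname in sans  # when SANs exist, the CN is never consulted
    if allow_cn_fallback:
        return hostname in common_names(cert)
    return False
```

A cert carrying only `commonName=puppetmaster1001.eqiad.wmnet` is accepted with `allow_cn_fallback=True` and rejected without it. That matches the behaviour change described above.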
Change #1066685 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):
[operations/puppet@production] prometheus: remove x509ignoreCN=0 from blackbox exporter
Change #1066685 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: remove x509ignoreCN=0 from blackbox exporter
sure no problem! it is done as part of T373369: Service puppetmaster1001:8141 has failed probes (http_puppetmaster1003_eqiad_wmnet_backend_https_ip4)
Mentioned in SAL (#wikimedia-operations) [2024-08-27T11:12:06Z] <godog> start prometheus6002 bookworm upgrade - T326657
Mentioned in SAL (#wikimedia-operations) [2024-08-27T11:20:37Z] <godog> start prometheus7001 bookworm upgrade - T326657
Mentioned in SAL (#wikimedia-operations) [2024-08-27T15:54:44Z] <denisse> Start prometheus4002 Bookworm upgrade - T326657
Mentioned in SAL (#wikimedia-operations) [2024-08-27T16:25:23Z] <denisse> Start prometheus5002 Bookworm upgrade - T326657
Mentioned in SAL (#wikimedia-operations) [2024-08-28T09:40:08Z] <godog> start prometheus1005 bookworm upgrade - T326657
Mentioned in SAL (#wikimedia-operations) [2024-08-28T10:41:32Z] <godog> start prometheus2005 bookworm upgrade - T326657
Mentioned in SAL (#wikimedia-operations) [2024-09-02T12:24:04Z] <godog> enable oidc for prometheus public web interface - T326657
Change #1071815 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):
[operations/puppet@production] hieradata: switch prometheus-https service to production
Change #1071816 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):
[operations/puppet@production] trafficserver: use prometheus svc records for eqiad/codfw
Change #1071815 merged by Filippo Giunchedi:
[operations/puppet@production] hieradata: switch prometheus-https service to production
Change #1071816 merged by Filippo Giunchedi:
[operations/puppet@production] trafficserver: use prometheus svc records for eqiad/codfw
This is done, I've set the service as non-paging since we're using it for the prometheus web interface (i.e. humans) whereas the http service is paging since that's for automated access