Add prometheus-https load balancer
Closed, Resolved (Public)

Description

To support proper host failover and maintenance for T301944, let's add eqiad/codfw load balancing for TLS port 443 backed by the prometheus hosts.

Event Timeline

BCornwall moved this task from Backlog to Traffic team actively servicing on the Traffic board.
BCornwall changed the task status from In Progress to Stalled. May 1 2023, 4:31 PM
BCornwall removed a project: SRE.

@herron I see there's an unresolved conversation in that patch. Since @Vgutierrez +1ed it before that conversation, I just want to make sure that it is, indeed, ready for merging.

Yes afaik it is ready to go. In terms of HTTPS/TLS the backends are configured to serve https://prometheus-$site.wikimedia.org with SNI, for instance https://prometheus-eqiad.wikimedia.org/ops

Change 863380 merged by BCornwall:

[operations/puppet@production] service::catalog: add prometheus-https

https://gerrit.wikimedia.org/r/863380

Change 929421 had a related patch set uploaded (by BCornwall; author: BCornwall):

[operations/puppet@production] prometheus: Disable SNI support in Envoy tlsproxy

https://gerrit.wikimedia.org/r/929421

Change 928942 had a related patch set uploaded (by BCornwall; author: BCornwall):

[operations/puppet@production] Revert "service::catalog: add prometheus-https"

https://gerrit.wikimedia.org/r/928942

Change 928942 merged by BCornwall:

[operations/puppet@production] Revert "service::catalog: add prometheus-https"

https://gerrit.wikimedia.org/r/928942

Mentioned in SAL (#wikimedia-operations) [2023-06-12T22:22:32Z] <brett> Roll restarting pybal on lvs2014 to revert prometheus service rollout - T326657

Mentioned in SAL (#wikimedia-operations) [2023-06-13T08:25:39Z] <vgutierrez> cleaning up prometheus-https service from IPVS on lvs2014 - T326657

Change 929421 merged by BCornwall:

[operations/puppet@production] prometheus: Disable SNI support in Envoy tlsproxy

https://gerrit.wikimedia.org/r/929421

Change 929768 had a related patch set uploaded (by BCornwall; author: BCornwall):

[operations/puppet@production] prometheus: Add global_cert_name to Envoy config

https://gerrit.wikimedia.org/r/929768

Change 930184 had a related patch set uploaded (by Jbond; author: jbond):

[operations/puppet@production] tlsproxy::envoy: update support for profile::tlsproxy::envoy::services

https://gerrit.wikimedia.org/r/930184

Change 930185 had a related patch set uploaded (by Jbond; author: jbond):

[operations/puppet@production] promethus: switch to using cfssl

https://gerrit.wikimedia.org/r/930185

Change 930187 had a related patch set uploaded (by Jbond; author: jbond):

[operations/puppet@production] promethus: switch to using cfssl

https://gerrit.wikimedia.org/r/930187

Change 930185 abandoned by Jbond:

[operations/puppet@production] promethus: switch to using cfssl

Reason:

See comments

https://gerrit.wikimedia.org/r/930185

Change 930184 merged by Jbond:

[operations/puppet@production] tlsproxy::envoy: update support for profile::tlsproxy::envoy::services

https://gerrit.wikimedia.org/r/930184

Change 929768 abandoned by BCornwall:

[operations/puppet@production] prometheus: Disable SNI support in Envoy tlsproxy

Reason:

I5b10a4a2ad3a34b8ad2ef48052b13c93c62aedd0 supersedes this

https://gerrit.wikimedia.org/r/929768

Change 930187 merged by Herron:

[operations/puppet@production] promethus: switch to using cfssl

https://gerrit.wikimedia.org/r/930187

Change 939326 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] service::catalog: add prometheus-https

https://gerrit.wikimedia.org/r/939326

BCornwall changed the task status from Stalled to In Progress. Jul 19 2023, 9:34 PM

Change 939326 merged by Herron:

[operations/puppet@production] service::catalog: add prometheus-https

https://gerrit.wikimedia.org/r/939326

Mentioned in SAL (#wikimedia-operations) [2023-07-20T14:45:41Z] <herron> roll restart codfw/eqiad low-traffic pybals to add prometheus-https T326657

Change 940201 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] prometheus: add prometheus.svc.site.wmnet SANs to cfssl cert

https://gerrit.wikimedia.org/r/940201

Change 940201 merged by Herron:

[operations/puppet@production] prometheus: add prometheus.svc.site.wmnet SANs to cfssl cert

https://gerrit.wikimedia.org/r/940201

I don't think we're done yet: trafficserver is still using hostnames and not prometheus.svc records.

Change 948624 had a related patch set uploaded (by BCornwall; author: BCornwall):

[operations/puppet@production] trafficserver: Use svc urls for eqiad/codfw

https://gerrit.wikimedia.org/r/948624

Change 948624 merged by BCornwall:

[operations/puppet@production] trafficserver: Use svc urls for eqiad/codfw

https://gerrit.wikimedia.org/r/948624

@fgiunchedi Now that this is merged, would you say that this is complete? Thanks for the feedback.

Thank you for following up, I think we're closer: at the moment we can have only one host pooled at a time (same issue as T331512). I checked conftool and I think we'll need two distinct services there, i.e. to be able to change prometheus (the existing internal endpoint) and prometheus-https (the web interface) independently:

# confctl select service=prometheus.* get
{"prometheus2005.codfw.wmnet": {"weight": 10, "pooled": "yes"}, "tags": "dc=codfw,cluster=prometheus,service=prometheus"}
{"prometheus2006.codfw.wmnet": {"weight": 10, "pooled": "yes"}, "tags": "dc=codfw,cluster=prometheus,service=prometheus"}
{"prometheus1005.eqiad.wmnet": {"weight": 10, "pooled": "yes"}, "tags": "dc=eqiad,cluster=prometheus,service=prometheus"}
{"prometheus1006.eqiad.wmnet": {"weight": 10, "pooled": "yes"}, "tags": "dc=eqiad,cluster=prometheus,service=prometheus"}
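
To see at a glance which backends are pooled per site, the `confctl` output above can be summarized with a short script. This is only a sketch: it assumes the output is one JSON object per line in exactly the shape shown above, and the helper name `pooled_by_dc` is made up for illustration (it is not part of conftool).

```python
import json
from collections import defaultdict

# Sample copied from the confctl output above: one JSON object per line.
CONFCTL_OUTPUT = """\
{"prometheus2005.codfw.wmnet": {"weight": 10, "pooled": "yes"}, "tags": "dc=codfw,cluster=prometheus,service=prometheus"}
{"prometheus2006.codfw.wmnet": {"weight": 10, "pooled": "yes"}, "tags": "dc=codfw,cluster=prometheus,service=prometheus"}
{"prometheus1005.eqiad.wmnet": {"weight": 10, "pooled": "yes"}, "tags": "dc=eqiad,cluster=prometheus,service=prometheus"}
{"prometheus1006.eqiad.wmnet": {"weight": 10, "pooled": "yes"}, "tags": "dc=eqiad,cluster=prometheus,service=prometheus"}
"""


def pooled_by_dc(output):
    """Group pooled hostnames by datacenter (hypothetical helper)."""
    result = defaultdict(list)
    for line in output.splitlines():
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # "tags" is a comma-separated list of key=value pairs.
        tags = dict(item.split("=", 1) for item in record.pop("tags").split(","))
        # The only remaining key is the hostname, mapped to its pool state.
        for host, state in record.items():
            if state.get("pooled") == "yes":
                result[tags["dc"]].append(host)
    return dict(result)


if __name__ == "__main__":
    for dc, hosts in sorted(pooled_by_dc(CONFCTL_OUTPUT).items()):
        print(dc, len(hosts), ", ".join(hosts))
```

All four hosts carry the single tag `service=prometheus`, which is why pooling state cannot currently differ between the internal endpoint and the web interface.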

@herron is this something you have the bandwidth to take care of?

Since both ports 80 and 443 are serviced by the same backing apache and prometheus instances what would separate services from confctl perspective gain us in practice?

Port 80 is used for internal read queries from e.g. grafana, and thus benefits from redundancy. Port 443 will be used for SSO access, which at the moment can function with only one backend host active at a time. The overall issue is tracked at T331512: Support for multiple SSO thanos-web backends; once that is resolved, we can have redundancy for both 80 and 443.

Hi,

Just FYI, JS and CSS are currently broken on prometheus-{eqiad,codfw}.wikimedia.org due to 401 and 403 errors, with some CORS sprinkled in

```
GET
https://prometheus-codfw.wikimedia.org/ops/classic/static/css/prometheus.css?v=2.24.1+ds-1+wmf1&ticket=ST-9872-NEB3EeaIw3EVPsJvt2AQUKUvF-Q-idp1002
[HTTP/2 401 Unauthorized 530ms]

GET
https://prometheus-codfw.wikimedia.org/ops/classic/static/vendor/fuzzy/fuzzy.js?v=2.24.1+ds-1+wmf1&ticket=ST-9873-hF-E6qSnOtJXNg4Sp8oHlwwFNao-idp1002
[HTTP/2 401 Unauthorized 370ms]

GET
https://prometheus-codfw.wikimedia.org/ops/classic/static/vendor/bootstrap3-typeahead/bootstrap3-typeahead.js?v=2.24.1+ds-1+wmf1&ticket=ST-9874-kZrzV5v4UXqzGK9QafKv9azkWUM-idp1002
[HTTP/2 401 Unauthorized 532ms]

GET
https://prometheus-codfw.wikimedia.org/ops/classic/static/eonasdan-bootstrap-datetimepicker/bootstrap-datetimepicker.min.css?v=2.24.1+ds-1+wmf1&ticket=ST-9875-ZqTBQESIkA7RXzHqZzeJ1oWfBK0-idp1002
[HTTP/2 401 Unauthorized 538ms]

GET
https://prometheus-codfw.wikimedia.org/ops/classic/static/rickshaw/rickshaw.min.js?v=2.24.1+ds-1+wmf1&ticket=ST-9876-YJ8GzqDKPJ6En1VvALAbxNe5RsM-idp1002
[HTTP/2 401 Unauthorized 534ms]

GET
https://prometheus-codfw.wikimedia.org/ops/classic/static/bootstrap4/js/bootstrap.min.js?v=2.24.1+ds-1+wmf1&ticket=ST-9877-8WHTT6qODhI-Co47gvjWEAs-gUs-idp1002
[HTTP/2 401 Unauthorized 290ms]

GET
https://prometheus-codfw.wikimedia.org/ops/classic/static/popper.js/popper.min.js?v=2.24.1+ds-1+wmf1&ticket=ST-9878-s5HbbUQr0DxaoCfqK3Gw8qi4eMY-idp1002
[HTTP/2 401 Unauthorized 288ms]

GET
https://prometheus-codfw.wikimedia.org/ops/classic/static/css/graph.css?v=2.24.1+ds-1+wmf1&ticket=ST-9879-eIDaHmDJzEZr976s85bNr-v-Oew-idp1002
[HTTP/2 401 Unauthorized 536ms]

GET
https://prometheus-codfw.wikimedia.org/ops/classic/static/rickshaw/rickshaw.min.css?v=2.24.1+ds-1+wmf1&ticket=ST-9880-G55vcRK7bqhPa0jZNqrMiYLDOKo-idp1002
[HTTP/2 401 Unauthorized 529ms]

GET
https://prometheus-codfw.wikimedia.org/ops/classic/static/eonasdan-bootstrap-datetimepicker/bootstrap-datetimepicker.min.js?v=2.24.1+ds-1+wmf1&ticket=ST-9881-Iw39oWPbte-M-XSVTbaSxNJN6MU-idp1002
[HTTP/2 401 Unauthorized 541ms]

GET
https://prometheus-codfw.wikimedia.org/ops/classic/static/mustache/mustache.min.js?v=2.24.1+ds-1+wmf1&ticket=ST-9882-5LXFFynBqDnxE7-6ciFd-PO928E-idp1002
[HTTP/2 401 Unauthorized 535ms]

GET
https://prometheus-codfw.wikimedia.org/ops/classic/static/js/graph/index.js?v=2.24.1+ds-1+wmf1&ticket=ST-9883-q2sGG-rhmnhONdFxJe8t-gBaekk-idp1002
[HTTP/2 401 Unauthorized 311ms]

Loading failed for the <script> with source “https://prometheus-codfw.wikimedia.org/ops/classic/static/popper.js/popper.min.js?v=2.24.1%2bds-1%2bwmf1”. graph:9:89
GET
https://prometheus-codfw.wikimedia.org/ops/classic/static/bootstrap4/js/bootstrap.min.js?v=2.24.1+ds-1+wmf1&ticket=ST-9884-u9CgH-6Vo-w2UWx6xkavsoHVROk-idp1002
[HTTP/2 401 Unauthorized 279ms]

Loading failed for the <script> with source “https://prometheus-codfw.wikimedia.org/ops/classic/static/bootstrap4/js/bootstrap.min.js?v=2.24.1%2bds-1%2bwmf1”. graph:10:96
GET
https://prometheus-codfw.wikimedia.org/ops/classic/static/css/prometheus.css?v=2.24.1+ds-1+wmf1&ticket=ST-9885-iMSy5rA-M66kMmMTZXwUOIzuB64-idp1002
[HTTP/2 401 Unauthorized 288ms]

GET
https://prometheus-codfw.wikimedia.org/ops/classic/static/eonasdan-bootstrap-datetimepicker/bootstrap-datetimepicker.min.css?v=2.24.1+ds-1+wmf1&ticket=ST-9886-ZM6X5tADyQrO0Dc6BFx37gJLKCg-idp1002
[HTTP/2 401 Unauthorized 597ms]

GET
https://prometheus-codfw.wikimedia.org/ops/classic/static/rickshaw/rickshaw.min.css?v=2.24.1+ds-1+wmf1&ticket=ST-9887-A0Oe6sVVQZRoU2vYW5ZOSVwud9U-idp1002
[HTTP/2 401 Unauthorized 357ms]

GET
https://prometheus-codfw.wikimedia.org/ops/classic/static/rickshaw/rickshaw.min.js?v=2.24.1+ds-1+wmf1&ticket=ST-9888-EbZhZTkHgAcw-ozxU4nLvpfQ25M-idp1002
[HTTP/2 401 Unauthorized 528ms]

Loading failed for the <script> with source “https://prometheus-codfw.wikimedia.org/ops/classic/static/rickshaw/rickshaw.min.js?v=2.24.1%2bds-1%2bwmf1”. graph:29:86
GET
https://prometheus-codfw.wikimedia.org/ops/classic/static/eonasdan-bootstrap-datetimepicker/bootstrap-datetimepicker.min.js?v=2.24.1+ds-1+wmf1&ticket=ST-9889-7i7MnvFIUJJfFAGgczCS2WqoQeE-idp1002
[HTTP/2 401 Unauthorized 570ms]

Loading failed for the <script> with source “https://prometheus-codfw.wikimedia.org/ops/classic/static/eonasdan-bootstrap-datetimepicker/bootstrap-datetimepicker.min.js?v=2.24.1%2bds-1%2bwmf1”. graph:32:127
GET
https://prometheus-codfw.wikimedia.org/ops/classic/static/vendor/bootstrap3-typeahead/bootstrap3-typeahead.js?v=2.24.1+ds-1+wmf1&ticket=ST-9890-cg2K4cO2wE-G4dYeQw-fRldHero-idp1002
[HTTP/2 401 Unauthorized 288ms]

Loading failed for the <script> with source “https://prometheus-codfw.wikimedia.org/ops/classic/static/vendor/bootstrap3-typeahead/bootstrap3-typeahead.js?v=2.24.1%2bds-1%2bwmf1”. graph:33:113
GET
https://prometheus-codfw.wikimedia.org/ops/classic/static/vendor/fuzzy/fuzzy.js?v=2.24.1+ds-1+wmf1&ticket=ST-9891-1AQDVP8vVW5ikNiU1yBeT8qwHhk-idp1002
[HTTP/2 401 Unauthorized 354ms]

Loading failed for the <script> with source “https://prometheus-codfw.wikimedia.org/ops/classic/static/vendor/fuzzy/fuzzy.js?v=2.24.1%2bds-1%2bwmf1”. graph:34:83
GET
https://prometheus-codfw.wikimedia.org/ops/classic/static/mustache/mustache.min.js?v=2.24.1+ds-1+wmf1&ticket=ST-9892-xfi4fdBGve8YXpNwpgCIZs-OuWw-idp1002
[HTTP/2 401 Unauthorized 553ms]

Loading failed for the <script> with source “https://prometheus-codfw.wikimedia.org/ops/classic/static/mustache/mustache.min.js?v=2.24.1%2bds-1%2bwmf1”. graph:36:86
GET
https://prometheus-codfw.wikimedia.org/ops/classic/static/js/graph/index.js?v=2.24.1+ds-1+wmf1&ticket=ST-9893-a210YRv5cYylWYsK55WnPy5-HJQ-idp1002
[HTTP/2 401 Unauthorized 553ms]

GET
https://prometheus-codfw.wikimedia.org/ops/classic/static/css/graph.css?v=2.24.1+ds-1+wmf1&ticket=ST-9894-zVPEDYVTq4bbwixhl0Yead0q8Po-idp1002
[HTTP/2 401 Unauthorized 640ms]

GET
https://prometheus-codfw.wikimedia.org/ops/classic/static/img/favicon.ico?v=2.24.1+ds-1+wmf1&ticket=ST-9895-HFgcW91Kk6hC8eE8Pu3wsqPxE70-idp1002
[HTTP/2 401 Unauthorized 341ms]

GET
https://idp.wikimedia.org/login?service=https://prometheus-codfw.wikimedia.org/ops/classic/static/vendor/bootstrap4-glyphicons/fonts/glyphicons/glyphicons-halflings-regular.woff2
CORS Missing Allow Origin

Cross-Origin Request Blocked: The Same Origin Policy disallows reading the remote resource at https://idp.wikimedia.org/login?service=https%3a%2f%2fprometheus-codfw.wikimedia.org%2fops%2fclassic%2fstatic%2fvendor%2fbootstrap4-glyphicons%2ffonts%2fglyphicons%2fglyphicons-halflings-regular.woff2. (Reason: CORS header ‘Access-Control-Allow-Origin’ missing). Status code: 403.

downloadable font: download failed (font-family: "Glyphicons Halflings" style:normal weight:400 stretch:100 src index:1): bad URI or cross-site access not allowed source: https://prometheus-codfw.wikimedia.org/ops/classic/static/vendor/bootstrap4-glyphicons/fonts/glyphicons/glyphicons-halflings-regular.woff2
GET
https://idp.wikimedia.org/login?service=https://prometheus-codfw.wikimedia.org/ops/classic/static/vendor/bootstrap4-glyphicons/fonts/glyphicons/glyphicons-halflings-regular.woff
CORS Missing Allow Origin

Cross-Origin Request Blocked: The Same Origin Policy disallows reading the remote resource at https://idp.wikimedia.org/login?service=https%3a%2f%2fprometheus-codfw.wikimedia.org%2fops%2fclassic%2fstatic%2fvendor%2fbootstrap4-glyphicons%2ffonts%2fglyphicons%2fglyphicons-halflings-regular.woff. (Reason: CORS header ‘Access-Control-Allow-Origin’ missing). Status code: 403.

downloadable font: download failed (font-family: "Glyphicons Halflings" style:normal weight:400 stretch:100 src index:2): bad URI or cross-site access not allowed source: https://prometheus-codfw.wikimedia.org/ops/classic/static/vendor/bootstrap4-glyphicons/fonts/glyphicons/glyphicons-halflings-regular.woff
GET
https://idp.wikimedia.org/login?service=https://prometheus-codfw.wikimedia.org/ops/classic/static/vendor/bootstrap4-glyphicons/fonts/glyphicons/glyphicons-halflings-regular.ttf
CORS Missing Allow Origin

Cross-Origin Request Blocked: The Same Origin Policy disallows reading the remote resource at https://idp.wikimedia.org/login?service=https%3a%2f%2fprometheus-codfw.wikimedia.org%2fops%2fclassic%2fstatic%2fvendor%2fbootstrap4-glyphicons%2ffonts%2fglyphicons%2fglyphicons-halflings-regular.ttf. (Reason: CORS header ‘Access-Control-Allow-Origin’ missing). Status code: 403.

downloadable font: download failed (font-family: "Glyphicons Halflings" style:normal weight:400 stretch:100 src index:3): bad URI or cross-site access not allowed source: https://prometheus-codfw.wikimedia.org/ops/classic/static/vendor/bootstrap4-glyphicons/fonts/glyphicons/glyphicons-halflings-regular.ttf
```
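
A console dump like the one above is easier to reason about once it is tallied. The following is a rough sketch, not an official tool: it assumes each request URL and each `[HTTP/...]` status sits on its own line with no leading line-number prefix, and the function name `summarize` is invented for this example. The sample holds the first two entries from the log above.

```python
import re
from collections import Counter
from urllib.parse import parse_qs, urlsplit

# First two entries from the browser console output above.
SAMPLE = """\
GET
https://prometheus-codfw.wikimedia.org/ops/classic/static/css/prometheus.css?v=2.24.1+ds-1+wmf1&ticket=ST-9872-NEB3EeaIw3EVPsJvt2AQUKUvF-Q-idp1002
[HTTP/2 401 Unauthorized 530ms]

GET
https://prometheus-codfw.wikimedia.org/ops/classic/static/vendor/fuzzy/fuzzy.js?v=2.24.1+ds-1+wmf1&ticket=ST-9873-hF-E6qSnOtJXNg4Sp8oHlwwFNao-idp1002
[HTTP/2 401 Unauthorized 370ms]
"""


def summarize(log):
    """Return (status counts, request paths, ticket values) for a console log."""
    # Status lines look like "[HTTP/2 401 Unauthorized 530ms]".
    statuses = Counter(re.findall(r"\[HTTP/\S+ (\d{3}) ", log))
    paths, tickets = set(), set()
    # Only consider lines that consist of a URL and nothing else.
    for url in re.findall(r"^https://\S+$", log, flags=re.MULTILINE):
        parts = urlsplit(url)
        paths.add(parts.path)
        tickets.update(parse_qs(parts.query).get("ticket", []))
    return statuses, paths, tickets


if __name__ == "__main__":
    statuses, paths, tickets = summarize(SAMPLE)
    print("status counts:", dict(statuses))
    print("distinct paths:", len(paths), "distinct tickets:", len(tickets))
```

In the sample, every static asset request carries its own `ticket=ST-...` value and still ends in a 401.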

Change 953207 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] Revert "trafficserver: Use svc urls for eqiad/codfw"

https://gerrit.wikimedia.org/r/953207

Change 953207 merged by Herron:

[operations/puppet@production] Revert "trafficserver: Use svc urls for eqiad/codfw"

https://gerrit.wikimedia.org/r/953207

Thanks for reporting. I reverted the above to stop these errors while we work on a longer term fix.

With oauth2-proxy deployed successfully for thanos.w.o we can deploy it to prometheus too and thus have prometheus-https backed by multiple hosts with SSO

An example task of such migration is https://phabricator.wikimedia.org/T246998, which basically translates to:

  • provision a new oidc client for prometheus in idp
  • introduce a prometheus apache configuration to proxy requests for prometheus-SITE.wikimedia.org to oauth2-proxy
  • configure oauth2-proxy to proxy authenticated requests to prometheus.svc.SITE.wmnet
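
The hostname relationship in the steps above (public `prometheus-SITE.wikimedia.org` in front, `prometheus.svc.SITE.wmnet` behind oauth2-proxy) can be written down as a tiny mapping. This is purely illustrative: the function name `upstream_for` and the site-name pattern are assumptions, and the real routing lives in the apache and oauth2-proxy configuration, not in Python.

```python
import re

# Assumed shape of the public hostname: prometheus-SITE.wikimedia.org
PUBLIC_HOST = re.compile(r"^prometheus-(?P<site>[a-z0-9]+)\.wikimedia\.org$")


def upstream_for(public_host):
    """Map a public Prometheus hostname to its internal svc record (hypothetical helper)."""
    match = PUBLIC_HOST.match(public_host)
    if match is None:
        raise ValueError(f"not a prometheus public hostname: {public_host!r}")
    return f"prometheus.svc.{match.group('site')}.wmnet"


if __name__ == "__main__":
    for host in ("prometheus-eqiad.wikimedia.org", "prometheus-codfw.wikimedia.org"):
        print(host, "->", upstream_for(host))
```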
fgiunchedi raised the priority of this task from Low to Medium.

Change #1061944 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] idp: add prometheus OIDC client

https://gerrit.wikimedia.org/r/1061944

Change #1061944 merged by Filippo Giunchedi:

[operations/puppet@production] idp: add prometheus OIDC client

https://gerrit.wikimedia.org/r/1061944

Change #1062393 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] prometheus: add oauth2-proxy for OIDC authentication

https://gerrit.wikimedia.org/r/1062393

Change #1062393 merged by Filippo Giunchedi:

[operations/puppet@production] prometheus: add oauth2-proxy for OIDC authentication

https://gerrit.wikimedia.org/r/1062393

Change #1063737 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] prometheus: fix auth_cas vhost configuration

https://gerrit.wikimedia.org/r/1063737

Change #1063737 merged by Filippo Giunchedi:

[operations/puppet@production] prometheus: fix auth_cas vhost configuration

https://gerrit.wikimedia.org/r/1063737

Change #1063761 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] prometheus: enable oidc auth

https://gerrit.wikimedia.org/r/1063761

Change #1063761 merged by Filippo Giunchedi:

[operations/puppet@production] prometheus: enable oidc auth

https://gerrit.wikimedia.org/r/1063761

For now oauth2-proxy can't be reliably installed on bullseye systems because systemd-sysusers segfaults there with long /etc/gshadow entries. See also T372472: docker-registry.wikimedia.org/dcl-puppet-pki fails to install debmonitor-client and T256098: Segfault for systemd-sysusers.service on stat1007

Update after team meeting: I'll be starting the in-place Bookworm upgrade, since it will unblock this issue, it is something we have to do anyway, and I have a prometheus host in Pontoon running on Bookworm with no obvious problems.

Mentioned in SAL (#wikimedia-operations) [2024-08-22T09:34:29Z] <godog> start prometheus2006 bookworm upgrade - T326657

I tested the in-place Bookworm upgrade on prometheus2006 and things seem to be working as expected. I did the following:

depool # only for codfw/eqiad
disable-puppet 'bookworm upgrade'
sed -i 's/bullseye/bookworm/g' /etc/apt/sources.list /etc/apt/sources.list.d/*.list
rm /etc/apt/sources.list.d/repository_puppet.list
apt update  #  the debmonitor update will timeout, that's fine
# first run of upgrades, should run unattended. the debmonitor update may timeout
DEBIAN_FRONTEND=noninteractive apt -y -o Dpkg::Options::="--force-confdef" -o Dpkg::Options::="--force-confold"  upgrade
# ditto
DEBIAN_FRONTEND=noninteractive apt -y -o Dpkg::Options::="--force-confdef" -o Dpkg::Options::="--force-confold"  dist-upgrade
run-puppet-agent --force
apt update
apt -y upgrade
run-puppet-agent

Then reboot the host via sre.hosts.reboot-single, wait for reboot, then

pool # only for codfw/eqiad

Mentioned in SAL (#wikimedia-operations) [2024-08-23T07:27:16Z] <godog> start prometheus1006 bookworm upgrade - T326657

The two hosts on Bookworm (prometheus2006 and prometheus1006) work well. The only problem I could find is that probes for puppetmaster https endpoints (not puppetserver!) are failing. This is a long-standing issue: those endpoints use certs without SANs, and Bookworm's prometheus-blackbox-exporter has been compiled with a newer golang (>= 1.17), which no longer allows accepting certs without SANs.

Given the following:

  • puppetmaster hosts are going away
  • if said endpoints are failing then we notice anyways because puppet agent run starts failing

I'm for acking the probefailure alerts for puppetmaster hosts only, given that they are effectively a false positive now. What do you think @jhathaway ?
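
To make the SAN requirement concrete, here is a small sketch of hostname matching that looks only at SANs and never falls back to the CN. It is not how blackbox-exporter or Go implement verification; it only mimics the rule. The certificate dicts follow the shape returned by Python's `ssl.SSLSocket.getpeercert()`, the hostnames in them are made-up sample data, and wildcard matching is deliberately left out.

```python
def san_dns_names(cert):
    """Return the DNS names listed in a getpeercert()-style dict."""
    return [value for kind, value in cert.get("subjectAltName", ()) if kind == "DNS"]


def matches_without_cn_fallback(cert, hostname):
    """Accept the cert only if a SAN DNS entry equals the hostname exactly.

    The subject CN is ignored on purpose; wildcards are not handled.
    """
    wanted = hostname.lower()
    return any(name.lower() == wanted for name in san_dns_names(cert))


# Illustrative sample data (not real certificates).
CN_ONLY_CERT = {
    "subject": ((("commonName", "puppetmaster1001.eqiad.wmnet"),),),
}
CERT_WITH_SAN = {
    "subject": ((("commonName", "prometheus.svc.eqiad.wmnet"),),),
    "subjectAltName": (("DNS", "prometheus.svc.eqiad.wmnet"),),
}

if __name__ == "__main__":
    print(matches_without_cn_fallback(CN_ONLY_CERT, "puppetmaster1001.eqiad.wmnet"))
    print(matches_without_cn_fallback(CERT_WITH_SAN, "prometheus.svc.eqiad.wmnet"))
```

A cert whose only identity is the CN fails this check even when the CN matches the hostname, which is the kind of failure the probes report.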

Change #1066685 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] prometheus: remove x509ignoreCN=0 from blackbox exporter

https://gerrit.wikimedia.org/r/1066685

Change #1066685 merged by Filippo Giunchedi:

[operations/puppet@production] prometheus: remove x509ignoreCN=0 from blackbox exporter

https://gerrit.wikimedia.org/r/1066685

I'm for acking the probefailure alerts for puppetmaster hosts only, given that they are effectively a false positive now. What do you think @jhathaway ?

that sounds fine, thanks for asking

sure no problem! it is done as part of T373369: Service puppetmaster1001:8141 has failed probes (http_puppetmaster1003_eqiad_wmnet_backend_https_ip4)

Mentioned in SAL (#wikimedia-operations) [2024-08-27T11:12:06Z] <godog> start prometheus6002 bookworm upgrade - T326657

Mentioned in SAL (#wikimedia-operations) [2024-08-27T11:20:37Z] <godog> start prometheus7001 bookworm upgrade - T326657

Mentioned in SAL (#wikimedia-operations) [2024-08-27T15:54:44Z] <denisse> Start prometheus4002 Bookworm upgrade - T326657

Mentioned in SAL (#wikimedia-operations) [2024-08-27T16:25:23Z] <denisse> Start prometheus5002 Bookworm upgrade - T326657

Mentioned in SAL (#wikimedia-operations) [2024-08-28T09:40:08Z] <godog> start prometheus1005 bookworm upgrade - T326657

Mentioned in SAL (#wikimedia-operations) [2024-08-28T10:41:32Z] <godog> start prometheus2005 bookworm upgrade - T326657

Mentioned in SAL (#wikimedia-operations) [2024-09-02T12:24:04Z] <godog> enable oidc for prometheus public web interface - T326657

Change #1071815 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] hieradata: switch prometheus-https service to production

https://gerrit.wikimedia.org/r/1071815

Change #1071816 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] trafficserver: use prometheus svc records for eqiad/codfw

https://gerrit.wikimedia.org/r/1071816

Change #1071815 merged by Filippo Giunchedi:

[operations/puppet@production] hieradata: switch prometheus-https service to production

https://gerrit.wikimedia.org/r/1071815

Change #1071816 merged by Filippo Giunchedi:

[operations/puppet@production] trafficserver: use prometheus svc records for eqiad/codfw

https://gerrit.wikimedia.org/r/1071816

This is done. I've set the service as non-paging since we're using it for the prometheus web interface (i.e. humans), whereas the http service is paging since that's for automated access.