
Prometheus doesn't reload or alert on expired client certificates
Open, High, Public

Description

After discovering a hole in k8s apiserver metrics, @fgiunchedi and I investigated and found that new pki certs had been deployed to prometheus but never picked up; the expired certificates were still being used, resulting in metrics queries being answered with 401s.

Smoking gun from kube-apiserver:

Aug 04 12:34:46 kubemaster1001 kube-apiserver[152161]: E0804 12:34:46.650786  152161 authentication.go:63] "Unable to authenticate the request" err="[x509: certificate has expired or is not yet valid: current time 2023-08-04T12:34:46Z is after 2023-08-02T08:44:00Z, verifying certificate SN=701251950718436174693962379298597088894617122879, SKID=5F:4D:28:59:E7:F3:A7:B3:9B:9F:F7:65:A0:44:C4:39:BE:A1:82:85, AKID=06:94:D5:26:9E:07:DF:85:0D:DF:92:AC:80:03:53:CC:88:A3:EC:49 failed: x509: certificate has expired or is not yet valid: current time 2023-08-04T12:34:46Z is after 2023-08-02T08:44:00Z]"
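For reference, a quick way to confirm the mismatch is to compare the serial and expiry in the error above against the certificate currently on disk. A minimal sketch, assuming python3-cryptography is installed and using a hypothetical path for the deployed client cert:

#!/usr/bin/env python3
# Sketch: print serial and expiry of the client cert currently on disk,
# to compare against the SN and expiry in the kube-apiserver error above.
# The path is hypothetical; the real cert is managed by cfssl via puppet.
from cryptography import x509

CERT_PATH = "/etc/prometheus/k8s_client.crt"  # hypothetical location

with open(CERT_PATH, "rb") as f:
    cert = x509.load_pem_x509_certificate(f.read())

print("serial  :", cert.serial_number)
print("expires :", cert.not_valid_after)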

A simple reload didn't fix it, so both prometheus@k8s instances in eqiad were restarted.

12:32:26         godog │ !log bounce prometheus@k8s on prometheus100[56] to test failure to reload certs

Prometheus should restart on a new certificate deployment, or at least alert on unhealthy jobs caused by 401s.
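As a stopgap for the alerting part, unhealthy targets (and their scrape errors) are exposed by the Prometheus targets API, so the 401s are at least visible. A rough ad-hoc check, not the eventual alert, with a hypothetical local URL for the k8s instance:

#!/usr/bin/env python3
# Sketch: list scrape targets whose last error looks like a 401 / expired
# client certificate. URL (port and path prefix) is a hypothetical example.
import json
import urllib.request

TARGETS_URL = "http://localhost:9906/k8s/api/v1/targets"  # hypothetical

with urllib.request.urlopen(TARGETS_URL) as resp:
    targets = json.load(resp)["data"]["activeTargets"]

for t in targets:
    err = t.get("lastError", "")
    if "401" in err or "certificate has expired" in err:
        print(t["labels"].get("job"), t["scrapeUrl"], "->", err)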

Event Timeline

The certificates of the wikikube staging clusters have an expiry time of 3 days (and I tested the hot reloading initially), so this works in general. Maybe some other configuration issue prevented Prometheus from reloading when the certificate changed?

Yes, I think something went wrong and Prometheus couldn't reload the certs for whatever reason. In terms of alerting, I'm thinking of errors in service discovery on the Prometheus side, and certainly errors related to k8s service discovery.

This happened again on prometheus100[56]:

/var/log/syslog.1
Aug 20 15:18:33 prometheus1006 puppet-agent[2698049]: (Cfssl::Cert[wikikube_staging__prometheus]) Scheduling refresh of Exec[prometheus@k8s-staging-reload]
Aug 20 15:18:33 prometheus1006 puppet-agent[2698049]: (/Stage[main]/Profile::Prometheus::K8s/Prometheus::Server[k8s-staging]/Exec[prometheus@k8s-staging-reload]) Triggered 'refresh' from 1 event

/var/log/prometheus/server.log.1
Aug 20 15:18:33 prometheus1006 prometheus@k8s-staging[1040]: level=info ts=2023-08-20T15:18:33.435Z caller=main.go:879 msg="Loading configuration file" filename=/srv/prometheus/k8s-staging/prometheus.yml
Aug 20 15:18:33 prometheus1006 prometheus@k8s-staging[1040]: level=info ts=2023-08-20T15:18:33.503Z caller=main.go:910 msg="Completed loading of configuration file" filename=/srv/prometheus/k8s-staging/prometheus.yml totalDuration=67.781527ms remote_storage=5.325µs web_handler=1.113µs query_engine=1.796µs scrape=19.67689ms scrape_sd=4.486601ms notify=15.939µs notify_sd=34.816µs rules=16.998914ms



/var/log/syslog.1
Aug 20 15:17:28 prometheus1005 puppet-agent[2889137]: (Cfssl::Cert[wikikube_staging__prometheus]) Scheduling refresh of Exec[prometheus@k8s-staging-reload]
Aug 20 15:17:28 prometheus1005 puppet-agent[2889137]: (/Stage[main]/Profile::Prometheus::K8s/Prometheus::Server[k8s-staging]/Exec[prometheus@k8s-staging-reload]) Triggered 'refresh' from 1 event

/var/log/prometheus/server.log.1 
Aug 20 15:17:28 prometheus1005 prometheus@k8s-staging[1046]: level=info ts=2023-08-20T15:17:28.221Z caller=main.go:879 msg="Loading configuration file" filename=/srv/prometheus/k8s-staging/prometheus.yml
Aug 20 15:17:28 prometheus1005 prometheus@k8s-staging[1046]: level=info ts=2023-08-20T15:17:28.254Z caller=main.go:910 msg="Completed loading of configuration file" filename=/srv/prometheus/k8s-staging/prometheus.yml totalDuration=32.465262ms remote_storage=4.208µs web_handler=2.386µs query_engine=2.047µs scrape=3.352087ms scrape_sd=5.627388ms notify=13.886µs notify_sd=20.155µs rules=15.314358ms
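So the reload exec clearly fires and the config reload completes, yet the old certificate keeps being used. A crude safety net would be to restart the instance whenever the client cert on disk is newer than the running process; a minimal sketch of that idea, with a hypothetical cert path:

#!/usr/bin/env python3
# Sketch: restart a Prometheus instance if its client cert on disk is newer
# than the running process. Unit name and cert path are illustrative only.
import os
import subprocess

UNIT = "prometheus@k8s-staging.service"
CERT = "/etc/prometheus/k8s-staging_client.crt"  # hypothetical path

def unit_main_pid(unit):
    out = subprocess.check_output(
        ["systemctl", "show", "--value", "-p", "MainPID", unit], text=True)
    return int(out.strip())

def process_start_epoch(pid):
    # starttime is field 22 of /proc/<pid>/stat, in clock ticks since boot
    with open(f"/proc/{pid}/stat") as f:
        after_comm = f.read().rsplit(")", 1)[1].split()
    ticks = int(after_comm[19])
    with open("/proc/stat") as f:
        btime = next(int(line.split()[1]) for line in f if line.startswith("btime"))
    return btime + ticks / os.sysconf("SC_CLK_TCK")

pid = unit_main_pid(UNIT)
if pid and os.stat(CERT).st_mtime > process_start_epoch(pid):
    print(f"{CERT} is newer than {UNIT} (pid {pid}), restarting")
    subprocess.check_call(["systemctl", "restart", UNIT])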

Mentioned in SAL (#wikimedia-operations) [2023-08-21T09:51:11Z] <jayme> restarted prometheus@k8s on prometheus100[56] - T343529

Sigh, sorry this fell off my radar. I'll implement alerting first so that at least we have notifications.

Change 951526 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/alerts@master] sre: add bandaid alert for prometheus not reloading its k8s certs

https://gerrit.wikimedia.org/r/951526

Change 951526 merged by Filippo Giunchedi:

[operations/alerts@master] sre: add bandaid alert for prometheus not reloading its k8s certs

https://gerrit.wikimedia.org/r/951526

Change 952301 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/alerts@master] sre: move KubernetesAPINotScrapable to k8s-specific alerts

https://gerrit.wikimedia.org/r/952301

Change 952301 merged by Filippo Giunchedi:

[operations/alerts@master] sre: move KubernetesAPINotScrapable to k8s-specific alerts

https://gerrit.wikimedia.org/r/952301

Mentioned in SAL (#wikimedia-operations) [2023-10-09T11:51:26Z] <godog> restart k8s-aux in eqiad to pick up new certs - T343529

Mentioned in SAL (#wikimedia-operations) [2023-10-26T08:02:52Z] <godog> restart prometheus k8s k8s-aux - T343529

Mentioned in SAL (#wikimedia-operations) [2023-11-13T08:55:46Z] <godog> bounce prometheus eqiad for k8s / k8s-aux - T343529

Since this issue keeps recurring we'll have to upgrade Prometheus (something we need to do anyway at this point).

I gave a quick try at building unstable's Prometheus on Bullseye (what the Prometheus hosts run) and it isn't straightforward, due to the dependencies that would need to be backported too. Building for Bookworm seems more straightforward, though we'll also need to upgrade the Prometheus hosts to Bookworm (in place) first.

Mentioned in SAL (#wikimedia-operations) [2023-11-27T08:41:33Z] <godog> restart prometheus/k8s-staging in eqiad - T343529

Mentioned in SAL (#wikimedia-operations) [2024-01-02T08:27:23Z] <jayme> restart prometheus@k8s prometheus@k8s-aux in eqiad - T343529

Prometheus was upgraded as part of T354399: Prometheus @ k8s OOM loop, so this task will need monitoring for recurrence (hopefully it's fixed, though).

Despite the upgrade, this just happened again on k8s / k8s-aux in eqiad, so more investigation is needed.

Mentioned in SAL (#wikimedia-operations) [2024-02-05T14:28:57Z] <godog> bounce prometheus@k8s and @k8s-aux in eqiad - T343529

And again, I just bumped eqiad prometheus@k8s-aux.

Mentioned in SAL (#wikimedia-operations) [2024-02-22T09:03:39Z] <jayme> restart prometheus@k8s in eqiad - T343529

Mentioned in SAL (#wikimedia-operations) [2024-03-07T16:06:31Z] <claime> bouncing prometheus@k8s.service - T343529

Mentioned in SAL (#wikimedia-operations) [2024-03-07T16:29:52Z] <cdanis> T343529 ✔ cdanis@prometheus2005.codfw.wmnet ~ 🕦☕sudo systemctl restart thanos-sidecar@k8s.service

Mentioned in SAL (#wikimedia-operations) [2024-03-27T13:59:25Z] <godog> bounce prometheus@k8s-aux in eqiad - T343529

Mentioned in SAL (#wikimedia-operations) [2024-04-15T07:48:54Z] <jayme> restarting k8s-mlstaging and k8s-staging prometheus instances - T343529

Mentioned in SAL (#wikimedia-operations) [2024-04-15T10:31:44Z] <godog> bounce prometheus@k8s-staging in eqiad - T343529