Page MenuHomePhabricator

flapping icinga Letsencrypt TLS cert alerts around renewal time
Open, HighPublic1 Estimated Story Points

Description

We are getting multiple (new?) Icinga CRITs for the same thing, that TLS cert for cloudelastic.wikimedia.org expires in 7 days.

But these are Letsencrypt certs and it looks like both renewal period is 7 days and monitoring is set to go CRIT at 7 days.

For some reason one of them recovered shortly after but the others have not and after refreshing all 3 in Icinga they are still CRIT.

This does not seem to be an issue with the actual renewal, we saw at least one of them get a new cert as well, but I think there is at least this to fix here:

  • change puppet code so that we don't check the same cert for the same host name on multiple servers? to avoid duplicate alerts?
  • change thresholds so there are no races on the day of renewal (btw the new one it just got will expired on Christmas :)

current status is still like in screenshot below

Screenshot at 2021-10-19 12-50-38.png (113×1 px, 49 KB)

but here is the new cert already, I confirmed that:

[puppetmaster1001:~] $ curl -6 -S -vvv https://cloudelastic.wikimedia.org:9243
* Server certificate:
*  subject: CN=cloudelastic.wikimedia.org
*  start date: Sep 27 19:00:30 2021 GMT
*  expire date: Dec 26 19:00:29 2021 GMT
*  subjectAltName: host "cloudelastic.wikimedia.org" matched cert's "cloudelastic.wikimedia.org"
*  issuer: C=US; O=Let's Encrypt; CN=R3
*  SSL certificate verify ok.

Related Objects

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript
Dzahn triaged this task as Low priority.Oct 19 2021, 8:36 PM
Dzahn edited projects, added Discovery-Search; removed cloud-services-team (Kanban).
Dzahn added a subscriber: RKemper.
20:36 <+icinga-wm> ACKNOWLEDGEMENT - WMF Cloud -Psi Cluster- - Public Internet Port - HTTPS on cloudelastic.wikimedia.org is CRITICAL: CRITICAL - Certificate cloudelastic.wikimedia.org 
                   expires in 7 day(s) (Wed 27 Oct 2021 07:00:23 PM GMT +0000). daniel_zahn https://phabricator.wikimedia.org/T293826 
                   https://wikitech.wikimedia.org/wiki/Search%23Administration

ACKed and prio low .. I guess.

additional issue is these alerts are flapping and keep coming back so a simple ACK wasn't enough. downtiming for a day for now

same thing happened today for lists.wikimedia.org, it alerted and then recovered 2 minutes later. In general we have renewal = 7 days and alerting = 7 days.

we should change them all to something like "6 days and 23 hours"

Dzahn renamed this task from cloudelastic icinga TLS cert alerts to flapping icinga Letsencrypt TLS cert alerts around renewal time .Oct 20 2021, 6:19 PM

this sounds like https://etbe.coker.com.au/2021/10/20/strange-apache-reload-issue/ which I read yesterday via Planet Debian

Interesting. Lists runs apache, although the cloudelastic hosts look like nginx.

I'm seeing 6 days being logged as well now. I wonder if the 7 days monitoring is coincidental, since that's the threshold we pass to check_http for check_https_lvs_on_port.

Oct 20 19:06:09 alert1001 icinga: SERVICE ALERT: cloudelastic.wikimedia.org;WMF Cloud -Omega Cluster- - Public Internet Port - HTTPS;CRITICAL;SOFT;2;CRITICAL - Certificate 'cloudelastic.wikimedia.org' expires in 6 day(s) (Wed 27 Oct 2021 07:00:23 PM GMT +0000)

Both the apache and nginx on cloudelastic and lists have been running for months, so fully restarting them seems worth a shot to see if that stops the flapping (for a few weeks/months anyway). I haven't touched them yet in case someone wants to review them in their current state first.

lists.wm.o is serving both the old and new cert, just like the blog post mentioned:

1km@cashew ~> while true; echo -n | openssl s_client -servername lists.wikimedia.org -connect lists.wikimedia.org:443 2>/dev/null | openssl x509 -noout -dates | grep notAfter; end
2notAfter=Oct 28 09:00:44 2021 GMT
3notAfter=Dec 27 09:00:28 2021 GMT
4notAfter=Oct 28 09:00:44 2021 GMT
5notAfter=Dec 27 09:00:28 2021 GMT
6notAfter=Dec 27 09:00:28 2021 GMT
7notAfter=Oct 28 09:00:44 2021 GMT
8notAfter=Oct 28 09:00:44 2021 GMT
9notAfter=Dec 27 09:00:28 2021 GMT
10notAfter=Oct 28 09:00:44 2021 GMT
11notAfter=Oct 28 09:00:44 2021 GMT
12notAfter=Dec 27 09:00:28 2021 GMT
13notAfter=Dec 27 09:00:28 2021 GMT
14notAfter=Oct 28 09:00:44 2021 GMT
15notAfter=Oct 28 09:00:44 2021 GMT
16notAfter=Oct 28 09:00:44 2021 GMT
17notAfter=Oct 28 09:00:44 2021 GMT
18notAfter=Oct 28 09:00:44 2021 GMT
19notAfter=Oct 28 09:00:44 2021 GMT
20notAfter=Oct 28 09:00:44 2021 GMT
21notAfter=Oct 28 09:00:44 2021 GMT
22notAfter=Oct 28 09:00:44 2021 GMT
23notAfter=Oct 28 09:00:44 2021 GMT
24notAfter=Dec 27 09:00:28 2021 GMT
25notAfter=Oct 28 09:00:44 2021 GMT
26notAfter=Oct 28 09:00:44 2021 GMT
27notAfter=Dec 27 09:00:28 2021 GMT
28notAfter=Dec 27 09:00:28 2021 GMT
29notAfter=Dec 27 09:00:28 2021 GMT
30notAfter=Dec 27 09:00:28 2021 GMT
31notAfter=Dec 27 09:00:28 2021 GMT
32notAfter=Dec 27 09:00:28 2021 GMT
33notAfter=Dec 27 09:00:28 2021 GMT
34notAfter=Dec 27 09:00:28 2021 GMT
35notAfter=Dec 27 09:00:28 2021 GMT
36notAfter=Oct 28 09:00:44 2021 GMT
37notAfter=Dec 27 09:00:28 2021 GMT
38notAfter=Dec 27 09:00:28 2021 GMT
39notAfter=Dec 27 09:00:28 2021 GMT
40notAfter=Dec 27 09:00:28 2021 GMT
41notAfter=Dec 27 09:00:28 2021 GMT
42notAfter=Dec 27 09:00:28 2021 GMT
43notAfter=Oct 28 09:00:44 2021 GMT
44notAfter=Dec 27 09:00:28 2021 GMT
45notAfter=Oct 28 09:00:44 2021 GMT
46notAfter=Dec 27 09:00:28 2021 GMT
47notAfter=Dec 27 09:00:28 2021 GMT
48notAfter=Dec 27 09:00:28 2021 GMT
49notAfter=Dec 27 09:00:28 2021 GMT
50notAfter=Oct 28 09:00:44 2021 GMT
51notAfter=Dec 27 09:00:28 2021 GMT
52notAfter=Dec 27 09:00:28 2021 GMT
53notAfter=Dec 27 09:00:28 2021 GMT
54notAfter=Dec 27 09:00:28 2021 GMT
55notAfter=Dec 27 09:00:28 2021 GMT
56notAfter=Oct 28 09:00:44 2021 GMT
57notAfter=Oct 28 09:00:44 2021 GMT
58notAfter=Dec 27 09:00:28 2021 GMT
59notAfter=Dec 27 09:00:28 2021 GMT
60notAfter=Dec 27 09:00:28 2021 GMT
61notAfter=Dec 27 09:00:28 2021 GMT
62notAfter=Dec 27 09:00:28 2021 GMT
63notAfter=Oct 28 09:00:44 2021 GMT
64notAfter=Dec 27 09:00:28 2021 GMT
65notAfter=Dec 27 09:00:28 2021 GMT
66notAfter=Dec 27 09:00:28 2021 GMT
67notAfter=Oct 28 09:00:44 2021 GMT
68notAfter=Dec 27 09:00:28 2021 GMT
69notAfter=Oct 28 09:00:44 2021 GMT
70notAfter=Oct 28 09:00:44 2021 GMT
71notAfter=Oct 28 09:00:44 2021 GMT
72notAfter=Dec 27 09:00:28 2021 GMT
73notAfter=Oct 28 09:00:44 2021 GMT
74notAfter=Oct 28 09:00:44 2021 GMT
75notAfter=Oct 28 09:00:44 2021 GMT
76notAfter=Oct 28 09:00:44 2021 GMT
77notAfter=Dec 27 09:00:28 2021 GMT
78notAfter=Oct 28 09:00:44 2021 GMT
79notAfter=Oct 28 09:00:44 2021 GMT
80notAfter=Dec 27 09:00:28 2021 GMT
81notAfter=Oct 28 09:00:44 2021 GMT
82notAfter=Oct 28 09:00:44 2021 GMT
83notAfter=Oct 28 09:00:44 2021 GMT
84notAfter=Dec 27 09:00:28 2021 GMT
85notAfter=Dec 27 09:00:28 2021 GMT
86notAfter=Dec 27 09:00:28 2021 GMT
87notAfter=Dec 27 09:00:28 2021 GMT
88notAfter=Dec 27 09:00:28 2021 GMT
89notAfter=Dec 27 09:00:28 2021 GMT
90notAfter=Dec 27 09:00:28 2021 GMT
91notAfter=Oct 28 09:00:44 2021 GMT
92notAfter=Dec 27 09:00:28 2021 GMT
93^C⏎

Legoktm raised the priority of this task from Low to High.Oct 21 2021, 4:01 AM

Mentioned in SAL (#wikimedia-operations) [2021-10-21T04:04:02Z] <legoktm> restarted apache on lists1001 so it only uses new TLS cert (T293826)

After restarting apache, and ~50 requests later, I'm only getting the new certificate. Marking as high because the monitoring is picking up a real issue, it's just flappy since ~half the requests pass and half fail.

Looking in syslog doesn't show much interesting AFAICT, except it seems like puppet/acme-chief is reloading apache every 2-3 days (not sure if that's intentional):

syslog.log-20211009.gz:Oct  8 10:23:33 lists1001 puppet-agent[32311]: (/etc/acmecerts/lists) Scheduling refresh of Service[apache2]
syslog.log-20211011.gz:Oct 10 22:22:46 lists1001 puppet-agent[22789]: (/etc/acmecerts/lists) Scheduling refresh of Service[apache2]
syslog.log-20211014.gz:Oct 13 10:23:26 lists1001 puppet-agent[8325]: (/etc/acmecerts/lists) Scheduling refresh of Service[apache2]
syslog.log-20211016.gz:Oct 15 22:23:11 lists1001 puppet-agent[11417]: (/etc/acmecerts/lists) Scheduling refresh of Service[apache2]
syslog.log-20211019.gz:Oct 18 11:23:03 lists1001 puppet-agent[30781]: (/etc/acmecerts/lists) Scheduling refresh of Service[apache2]

(doesn't actually matter, seems like apache reloads config at midnight every day anyways)

Mentioned in SAL (#wikimedia-operations) [2021-10-21T06:35:55Z] <elukey> systemctl reload nginx on cloudelastic100[5,6] to pick up the new TLS certificate and clear alerts - T293826

Change 732611 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] tlsproxy::localssl: acme_chief should notify nginx

https://gerrit.wikimedia.org/r/732611

Maybe it is totally off, but I saw that the cloudelastic nodes use the tlsproxy::localssl define (via elasticsearch::tlsproxy) and it doesn't reload nginx when a new cert is available. Created https://gerrit.wikimedia.org/r/c/operations/puppet/+/732611 as proposal to improve the current situation, I see the same pattern applied to other acme_chief::cert usages.

MPhamWMF set the point value for this task to 1.Oct 25 2021, 3:49 PM
MPhamWMF moved this task from Incoming to Waiting on the Discovery-Search (Current work) board.

Change 732611 merged by Elukey:

[operations/puppet@production] tlsproxy::localssl: acme_chief should notify nginx

https://gerrit.wikimedia.org/r/732611

The nginx problems should be fixed with https://gerrit.wikimedia.org/r/732611 in theory (so elastic/cloudelastic/relforge/etc..).

Mentioned in SAL (#wikimedia-operations) [2021-10-21T04:04:02Z] <legoktm> restarted apache on lists1001 so it only uses new TLS cert (T293826)

This one seems sneaky to reproduce, I am wondering if recent versions of httpd have it fixed (upstream is now at 2.4.51, and several changes to mpm-event happened since .38, see https://www.apachelounge.com/Changelog-2.4.html and grep for event).

@Dzahn do you know if the list[12]* nodes are scheduled to be upgraded to Bullseye during the next few weeks? It would be really nice to test the new version of httpd, I think it should solve the issue that @Legoktm brought up.

@elukey Not list* but we could potentially test it with librenms.wikimedia.org. That fulfills the requirements of "uses LE", "has public IP" and "uses httpd" at least, unlike most other things.

Another option could be to use httpd from buster-backports, but https://packages.debian.org/buster-backports/apache2 still lists 2.4.48 (meanwhile on salsa I see .51, that contains a lot of CVE fixes from upstream). If the buster-backport version gets updated to .51 we could install it and avoid bullseye.

Mentioned in SAL (#wikimedia-cloud) [2021-12-06T13:23:49Z] <majavah> root@toolserver-proxy-01:~# systemctl restart apache2.service # working around T293826