
flapping icinga Letsencrypt TLS cert alerts around renewal time
Open, High, Public

Description

We are getting multiple (new?) Icinga CRITs for the same thing: the TLS cert for cloudelastic.wikimedia.org expires in 7 days.

But these are Let's Encrypt certs, and it looks like both the renewal window and the monitoring CRIT threshold are set to 7 days.

For some reason one of them recovered shortly after, but the others have not, and after refreshing all 3 checks in Icinga they are still CRIT.

This does not seem to be an issue with the actual renewal (we saw at least one of them get a new cert), but I think there is at least this to fix here:

  • change the puppet code so that we don't check the same cert for the same host name on multiple servers, to avoid duplicate alerts
  • change thresholds so there are no races on the day of renewal, as sketched below (btw, the new cert it just got will expire around Christmas :)
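As a rough illustration of the race (this is a sketch, not a production check; it assumes openssl and GNU date on the client), the number of whole days of validity left on the cert that is actually being served can be computed like this:

# days of validity left on the cert served on the public port (9243, same as the curl check below)
end_date=$(echo | openssl s_client -servername cloudelastic.wikimedia.org \
    -connect cloudelastic.wikimedia.org:9243 2>/dev/null \
  | openssl x509 -noout -enddate | cut -d= -f2)
days_left=$(( ($(date -d "$end_date" +%s) - $(date +%s)) / 86400 ))
echo "days left: ${days_left}"
# with renewal and the CRIT threshold both at 7 days, this value sits right
# on the alert boundary around renewal time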

Current status is still as in the screenshot below:

Screenshot at 2021-10-19 12-50-38.png (113×1 px, 49 KB)

But the new cert is already in place; I confirmed it:

[puppetmaster1001:~] $ curl -6 -S -vvv https://cloudelastic.wikimedia.org:9243
* Server certificate:
*  subject: CN=cloudelastic.wikimedia.org
*  start date: Sep 27 19:00:30 2021 GMT
*  expire date: Dec 26 19:00:29 2021 GMT
*  subjectAltName: host "cloudelastic.wikimedia.org" matched cert's "cloudelastic.wikimedia.org"
*  issuer: C=US; O=Let's Encrypt; CN=R3
*  SSL certificate verify ok.

See also T308908#7957275 for a bit more debugging. It notably shows that an Apache worker is not properly restarted after a graceful reload (it shows up as "no (old gen)" in the Apache status page) and thus keeps running with the old certificates.

The upstream Apache bug is most probably 63169: MPM event, stuck process after graceful: no (old gen), the fix for which is in Apache 2.4.49 (we run 2.4.38-3+deb10u7).

Event Timeline

Restricted Application added a subscriber: Aklapper.
Dzahn triaged this task as Low priority. Oct 19 2021, 8:36 PM
Dzahn edited projects, added Discovery-Search; removed cloud-services-team (Kanban).
Dzahn added a subscriber: RKemper.
20:36 <+icinga-wm> ACKNOWLEDGEMENT - WMF Cloud -Psi Cluster- - Public Internet Port - HTTPS on cloudelastic.wikimedia.org is CRITICAL: CRITICAL - Certificate cloudelastic.wikimedia.org 
                   expires in 7 day(s) (Wed 27 Oct 2021 07:00:23 PM GMT +0000). daniel_zahn https://phabricator.wikimedia.org/T293826 
                   https://wikitech.wikimedia.org/wiki/Search%23Administration

ACKed and prio low .. I guess.

An additional issue is that these alerts are flapping and keep coming back, so a simple ACK wasn't enough. Downtiming for a day for now.

The same thing happened today for lists.wikimedia.org: it alerted and then recovered 2 minutes later. In general we have renewal = 7 days and alerting = 7 days.

We should change them all to something like "6 days and 23 hours".

Dzahn renamed this task from cloudelastic icinga TLS cert alerts to flapping icinga Letsencrypt TLS cert alerts around renewal time. Oct 20 2021, 6:19 PM

This sounds like https://etbe.coker.com.au/2021/10/20/strange-apache-reload-issue/, which I read yesterday via Planet Debian.

Interesting. Lists runs Apache, while the cloudelastic hosts look like they run nginx.

I'm seeing 6 days being logged as well now. I wonder if the 7-day value in the monitoring is coincidental, since that's the threshold we pass to check_http for check_https_lvs_on_port.

Oct 20 19:06:09 alert1001 icinga: SERVICE ALERT: cloudelastic.wikimedia.org;WMF Cloud -Omega Cluster- - Public Internet Port - HTTPS;CRITICAL;SOFT;2;CRITICAL - Certificate 'cloudelastic.wikimedia.org' expires in 6 day(s) (Wed 27 Oct 2021 07:00:23 PM GMT +0000)
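For context, the underlying monitoring-plugins check_http certificate check takes its thresholds in whole days; a sketch with illustrative values (this is not the actual check_https_lvs_on_port definition):

# -C WARN,CRIT: WARNING when fewer than 10 days of validity remain, CRITICAL below 7
/usr/lib/nagios/plugins/check_http -H cloudelastic.wikimedia.org -p 9243 \
    --ssl -C 10,7

Since -C only takes whole days, a margin like "6 days and 23 hours" effectively means renewing earlier than the alert threshold rather than encoding the extra hour in the check.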

Both the Apache on lists and the nginx on cloudelastic have been running for months, so fully restarting them seems worth a shot to see if that stops the flapping (for a few weeks/months anyway). I haven't touched them yet in case someone wants to review them in their current state first.

lists.wm.o is serving both the old and new cert, just like the blog post mentioned:

km@cashew ~> while true; echo -n | openssl s_client -servername lists.wikimedia.org -connect lists.wikimedia.org:443 2>/dev/null | openssl x509 -noout -dates | grep notAfter; end
notAfter=Oct 28 09:00:44 2021 GMT
notAfter=Dec 27 09:00:28 2021 GMT
notAfter=Oct 28 09:00:44 2021 GMT
notAfter=Dec 27 09:00:28 2021 GMT
notAfter=Dec 27 09:00:28 2021 GMT
notAfter=Oct 28 09:00:44 2021 GMT
notAfter=Oct 28 09:00:44 2021 GMT
notAfter=Dec 27 09:00:28 2021 GMT
notAfter=Oct 28 09:00:44 2021 GMT
notAfter=Oct 28 09:00:44 2021 GMT
notAfter=Dec 27 09:00:28 2021 GMT
notAfter=Dec 27 09:00:28 2021 GMT
notAfter=Oct 28 09:00:44 2021 GMT
notAfter=Oct 28 09:00:44 2021 GMT
notAfter=Oct 28 09:00:44 2021 GMT
notAfter=Oct 28 09:00:44 2021 GMT
notAfter=Oct 28 09:00:44 2021 GMT
notAfter=Oct 28 09:00:44 2021 GMT
notAfter=Oct 28 09:00:44 2021 GMT
notAfter=Oct 28 09:00:44 2021 GMT
notAfter=Oct 28 09:00:44 2021 GMT
notAfter=Oct 28 09:00:44 2021 GMT
notAfter=Dec 27 09:00:28 2021 GMT
notAfter=Oct 28 09:00:44 2021 GMT
notAfter=Oct 28 09:00:44 2021 GMT
notAfter=Dec 27 09:00:28 2021 GMT
notAfter=Dec 27 09:00:28 2021 GMT
notAfter=Dec 27 09:00:28 2021 GMT
notAfter=Dec 27 09:00:28 2021 GMT
notAfter=Dec 27 09:00:28 2021 GMT
notAfter=Dec 27 09:00:28 2021 GMT
notAfter=Dec 27 09:00:28 2021 GMT
notAfter=Dec 27 09:00:28 2021 GMT
notAfter=Dec 27 09:00:28 2021 GMT
notAfter=Oct 28 09:00:44 2021 GMT
notAfter=Dec 27 09:00:28 2021 GMT
notAfter=Dec 27 09:00:28 2021 GMT
notAfter=Dec 27 09:00:28 2021 GMT
notAfter=Dec 27 09:00:28 2021 GMT
notAfter=Dec 27 09:00:28 2021 GMT
notAfter=Dec 27 09:00:28 2021 GMT
notAfter=Oct 28 09:00:44 2021 GMT
notAfter=Dec 27 09:00:28 2021 GMT
notAfter=Oct 28 09:00:44 2021 GMT
notAfter=Dec 27 09:00:28 2021 GMT
notAfter=Dec 27 09:00:28 2021 GMT
notAfter=Dec 27 09:00:28 2021 GMT
notAfter=Dec 27 09:00:28 2021 GMT
notAfter=Oct 28 09:00:44 2021 GMT
notAfter=Dec 27 09:00:28 2021 GMT
notAfter=Dec 27 09:00:28 2021 GMT
notAfter=Dec 27 09:00:28 2021 GMT
notAfter=Dec 27 09:00:28 2021 GMT
notAfter=Dec 27 09:00:28 2021 GMT
notAfter=Oct 28 09:00:44 2021 GMT
notAfter=Oct 28 09:00:44 2021 GMT
notAfter=Dec 27 09:00:28 2021 GMT
notAfter=Dec 27 09:00:28 2021 GMT
notAfter=Dec 27 09:00:28 2021 GMT
notAfter=Dec 27 09:00:28 2021 GMT
notAfter=Dec 27 09:00:28 2021 GMT
notAfter=Oct 28 09:00:44 2021 GMT
notAfter=Dec 27 09:00:28 2021 GMT
notAfter=Dec 27 09:00:28 2021 GMT
notAfter=Dec 27 09:00:28 2021 GMT
notAfter=Oct 28 09:00:44 2021 GMT
notAfter=Dec 27 09:00:28 2021 GMT
notAfter=Oct 28 09:00:44 2021 GMT
notAfter=Oct 28 09:00:44 2021 GMT
notAfter=Oct 28 09:00:44 2021 GMT
notAfter=Dec 27 09:00:28 2021 GMT
notAfter=Oct 28 09:00:44 2021 GMT
notAfter=Oct 28 09:00:44 2021 GMT
notAfter=Oct 28 09:00:44 2021 GMT
notAfter=Oct 28 09:00:44 2021 GMT
notAfter=Dec 27 09:00:28 2021 GMT
notAfter=Oct 28 09:00:44 2021 GMT
notAfter=Oct 28 09:00:44 2021 GMT
notAfter=Dec 27 09:00:28 2021 GMT
notAfter=Oct 28 09:00:44 2021 GMT
notAfter=Oct 28 09:00:44 2021 GMT
notAfter=Oct 28 09:00:44 2021 GMT
notAfter=Dec 27 09:00:28 2021 GMT
notAfter=Dec 27 09:00:28 2021 GMT
notAfter=Dec 27 09:00:28 2021 GMT
notAfter=Dec 27 09:00:28 2021 GMT
notAfter=Dec 27 09:00:28 2021 GMT
notAfter=Dec 27 09:00:28 2021 GMT
notAfter=Dec 27 09:00:28 2021 GMT
notAfter=Oct 28 09:00:44 2021 GMT
notAfter=Dec 27 09:00:28 2021 GMT
^C⏎

Legoktm raised the priority of this task from Low to High. Oct 21 2021, 4:01 AM

Mentioned in SAL (#wikimedia-operations) [2021-10-21T04:04:02Z] <legoktm> restarted apache on lists1001 so it only uses new TLS cert (T293826)

After restarting apache, and ~50 requests later, I'm only getting the new certificate. Marking as High because the monitoring is picking up a real issue; it's just flappy since roughly half the requests pass and half fail.

Looking in syslog doesn't show much of interest AFAICT, except that puppet/acme-chief seems to be reloading apache every 2-3 days (not sure if that's intentional):

syslog.log-20211009.gz:Oct  8 10:23:33 lists1001 puppet-agent[32311]: (/etc/acmecerts/lists) Scheduling refresh of Service[apache2]
syslog.log-20211011.gz:Oct 10 22:22:46 lists1001 puppet-agent[22789]: (/etc/acmecerts/lists) Scheduling refresh of Service[apache2]
syslog.log-20211014.gz:Oct 13 10:23:26 lists1001 puppet-agent[8325]: (/etc/acmecerts/lists) Scheduling refresh of Service[apache2]
syslog.log-20211016.gz:Oct 15 22:23:11 lists1001 puppet-agent[11417]: (/etc/acmecerts/lists) Scheduling refresh of Service[apache2]
syslog.log-20211019.gz:Oct 18 11:23:03 lists1001 puppet-agent[30781]: (/etc/acmecerts/lists) Scheduling refresh of Service[apache2]

(doesn't actually matter, it seems apache reloads its config at midnight every day anyway)
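Either way, a quick comparison of the on-disk cert with what a worker actually serves shows whether a reload really took effect (sketch; the exact filename under /etc/acmecerts/lists/ is illustrative):

openssl x509 -noout -enddate -in /etc/acmecerts/lists/live/rsa-2048.crt
echo | openssl s_client -servername lists.wikimedia.org \
    -connect lists.wikimedia.org:443 2>/dev/null \
  | openssl x509 -noout -enddate
# if repeated runs of the second command sometimes disagree with the first
# (as in the loop above), some worker is still holding the pre-reload cert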

Mentioned in SAL (#wikimedia-operations) [2021-10-21T06:35:55Z] <elukey> systemctl reload nginx on cloudelastic100[5,6] to pick up the new TLS certificate and clear alerts - T293826

Change 732611 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] tlsproxy::localssl: acme_chief should notify nginx

https://gerrit.wikimedia.org/r/732611

Maybe this is totally off, but I saw that the cloudelastic nodes use the tlsproxy::localssl define (via elasticsearch::tlsproxy) and it doesn't reload nginx when a new cert is available. I created https://gerrit.wikimedia.org/r/c/operations/puppet/+/732611 as a proposal to improve the current situation; I see the same pattern applied to other acme_chief::cert usages.

MPhamWMF set the point value for this task to 1. Oct 25 2021, 3:49 PM
MPhamWMF moved this task from Incoming to Waiting on the Discovery-Search (Current work) board.

Change 732611 merged by Elukey:

[operations/puppet@production] tlsproxy::localssl: acme_chief should notify nginx

https://gerrit.wikimedia.org/r/732611

The nginx problems (so elastic/cloudelastic/relforge/etc.) should in theory be fixed with https://gerrit.wikimedia.org/r/732611.

> Mentioned in SAL (#wikimedia-operations) [2021-10-21T04:04:02Z] <legoktm> restarted apache on lists1001 so it only uses new TLS cert (T293826)

This one seems tricky to reproduce; I am wondering if recent versions of httpd have it fixed (upstream is now at 2.4.51, and several changes to mpm-event have happened since .38; see https://www.apachelounge.com/Changelog-2.4.html and grep for "event").

@Dzahn do you know if the list[12]* nodes are scheduled to be upgraded to Bullseye during the next few weeks? It would be really nice to test the new version of httpd; I think it should solve the issue that @Legoktm brought up.

@elukey Not list*, but we could potentially test it with librenms.wikimedia.org. That at least fulfills the requirements of "uses LE", "has a public IP" and "uses httpd", unlike most other things.

Another option could be to use httpd from buster-backports, but https://packages.debian.org/buster-backports/apache2 still lists 2.4.48 (meanwhile on salsa I see .51, which contains a lot of CVE fixes from upstream). If the buster-backports version gets updated to .51 we could install it and avoid Bullseye.
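A quick way to compare what is installed with what the archive currently offers (assuming the backports suite is configured in the apt sources):

apt-cache policy apache2   # installed version vs. candidates, incl. buster-backports
rmadison apache2           # from devscripts: versions per Debian suite, no host changes needed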

Mentioned in SAL (#wikimedia-cloud) [2021-12-06T13:23:49Z] <majavah> root@toolserver-proxy-01:~# systemctl restart apache2.service # working around T293826

Mentioned in SAL (#wikimedia-operations) [2021-12-19T17:10:32Z] <Amir1> restart apache2 on lists1001 (T293826)

> Looking in syslog doesn't show much of interest AFAICT, except that puppet/acme-chief seems to be reloading apache every 2-3 days (not sure if that's intentional)

That's intended: every time acme-chief fetches fresh OCSP stapling responses, it issues a reload of apache2.

hashar subscribed.

gerrit1001.wikimedia.org is showing the issue:

From apache-status:

Current Time: Wednesday, 25-May-2022 15:16:44 UTC

Restart Time: Friday, 22-Apr-2022 19:59:53 UTC
+------------------------------------------------------------------------+
|    |     |        |Connections    |Threads  |Async connections         |
|Slot|PID  |Stopping|---------------+---------+--------------------------|
|    |     |        |total|accepting|busy|idle|writing|keep-alive|closing|
|----+-----+--------+-----+---------+----+----+-------+----------+-------|
|0   |6376 |no      |1    |yes      |0   |25  |0      |0         |1      |
|----+-----+--------+-----+---------+----+----+-------+----------+-------|
|1   |19045|no      |2    |yes      |1   |24  |0      |0         |1      |
|----+-----+--------+-----+---------+----+----+-------+----------+-------|
|2   |1390 |no (old |0    |yes      |0   |25  |0      |0         |0      |
|    |     |gen)    |     |         |    |    |       |          |       |
|----+-----+--------+-----+---------+----+----+-------+----------+-------|
|3   |6377 |no      |3    |yes      |0   |25  |0      |2         |0      |
|----+-----+--------+-----+---------+----+----+-------+----------+-------|
|Sum |4    |0       |6    |         |1   |99  |0      |2         |2      |
+------------------------------------------------------------------------+

The slot with PID 1390 is marked no (old gen), which is the one from April 23 in the ps output above.
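A couple of quick ways (sketch) to spot such leftover workers on a host without reading the whole status table:

apachectl fullstatus | grep -i 'old gen'   # needs mod_status plus a text browser
ps -o pid,lstart,cmd -C apache2            # compare worker start times with the last reload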

We have Apache 2.4.38-3+deb10u7.

I found upstream bug 63169 (MPM event, stuck process after graceful: no (old gen)), reported against 2.4.37, which says a graceful restart fails to kill some processes.

Looking at the stuck pid 1390:

apache2 1390 www-data   13w      REG       9,0    3119423    396450 /var/log/apache2/gerrit.wikimedia.org.https.error.log.1 (deleted)
apache2 1390 www-data   14w      REG       9,0       4215    396446 /var/log/apache2/gerrit.wikimedia.org.http.error.log.1 (deleted)
apache2 1390 www-data   15w      REG       9,0    2661276    396470 /var/log/apache2/other_vhosts_access.log.1 (deleted)
apache2 1390 www-data   16w      REG       9,0    7803259    396467 /var/log/apache2/other_vhosts_access-json.log.1 (deleted)
apache2 1390 www-data   17w      REG       9,0 9308490812    396448 /var/log/apache2/gerrit.wikimedia.org.https.access.log.1 (deleted)
apache2 1390 www-data   18w      REG       9,0    1172054    396445 /var/log/apache2/gerrit.wikimedia.org.http.access.log.1 (deleted)

The merged patch is https://svn.apache.org/viewvc?view=revision&revision=1893109 and it is included in Apache 2.4.49. That might be the one :]

The brute-force workaround would be to restart Apache entirely?

This just alerted again:

14:09:04 <+icinga-wm> PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 7 day(s) (Sat 25 Jun 2022 07:55:09 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring

PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 6 day(s) (Mon 20 Feb 2023 05:31:14 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
lists1001:~# systemctl restart apache2.service
RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Fri 21 Apr 2023 05:11:22 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
BCornwall removed the point value for this task.

This happened again for lists1001. Requested a restart in #wikimedia-sre (and it has been done).