flapping icinga Letsencrypt TLS cert alerts around renewal time
Open, HighPublic
Actions

Assigned To

None

Authored By

	Dzahn
	Oct 19 2021, 8:31 PM

Description

We are getting multiple (new?) Icinga CRITs for the same thing, that TLS cert for cloudelastic.wikimedia.org expires in 7 days.

But these are Letsencrypt certs and it looks like both renewal period is 7 days and monitoring is set to go CRIT at 7 days.

For some reason one of them recovered shortly after but the others have not and after refreshing all 3 in Icinga they are still CRIT.

This does not seem to be an issue with the actual renewal, we saw at least one of them get a new cert as well, but I think there is at least this to fix here:

change puppet code so that we don't check the same cert for the same host name on multiple servers? to avoid duplicate alerts?
change thresholds so there are no races on the day of renewal (btw the new one it just got will expired on Christmas :)

current status is still like in screenshot below

but here is the new cert already, I confirmed that:

[puppetmaster1001:~] $ curl -6 -S -vvv https://cloudelastic.wikimedia.org:9243

* Server certificate:
*  subject: CN=cloudelastic.wikimedia.org
*  start date: Sep 27 19:00:30 2021 GMT
*  expire date: Dec 26 19:00:29 2021 GMT
*  subjectAltName: host "cloudelastic.wikimedia.org" matched cert's "cloudelastic.wikimedia.org"
*  issuer: C=US; O=Let's Encrypt; CN=R3
*  SSL certificate verify ok.

See also T308908#7957275 for a bit more debugging. It notably shows an Apache 2 worker is not properly restarted after a graceful reload (shows as no (old gen) in Apache status) and thus it keeps running with the old certificates.

Upstream Apache 2 is most probably 63169: MPM event, stuck process after graceful: no (old gen) which is in Apache 2.4.49 (we run 2.4.38-3+deb10u7)

Details

	Subject	Repo	Branch	Lines +/-
	tlsproxy::localssl: acme_chief should notify nginx	operations/puppet	production	+1 -1

Customize query in gerrit

Related Objects
Search...

		Status	Subtype	Assigned	Task
		Open		None	T293826 flapping icinga Letsencrypt TLS cert alerts around renewal time
		Resolved		Ladsgroup	T315294 lists.wikimedia.org flapping between soon to expire and renewed cert (Aug 2022)

Event Timeline

Dzahn created this task.Oct 19 2021, 8:31 PM

Restricted Application edited projects, added cloud-services-team (Kanban); removed cloud-services-team. · View Herald TranscriptOct 19 2021, 8:31 PM

Restricted Application added a subscriber: Aklapper. · View Herald Transcript

Dzahn updated the task description. (Show Details)Oct 19 2021, 8:34 PM

Dzahn added a project: Traffic.

20:36 <+icinga-wm> ACKNOWLEDGEMENT - WMF Cloud -Psi Cluster- - Public Internet Port - HTTPS on cloudelastic.wikimedia.org is CRITICAL: CRITICAL - Certificate cloudelastic.wikimedia.org 
                   expires in 7 day(s) (Wed 27 Oct 2021 07:00:23 PM GMT +0000). daniel_zahn https://phabricator.wikimedia.org/T293826 
                   https://wikitech.wikimedia.org/wiki/Search%23Administration

ACKed and prio low .. I guess.

additional issue is these alerts are flapping and keep coming back so a simple ACK wasn't enough. downtiming for a day for now

same thing happened today for lists.wikimedia.org, it alerted and then recovered 2 minutes later. In general we have renewal = 7 days and alerting = 7 days.

we should change them all to something like "6 days and 23 hours"

Dzahn renamed this task from cloudelastic icinga TLS cert alerts to flapping icinga Letsencrypt TLS cert alerts around renewal time .Oct 20 2021, 6:19 PM

this sounds like https://etbe.coker.com.au/2021/10/20/strange-apache-reload-issue/ which I read yesterday via Planet Debian

In T293826#7445653, @Legoktm wrote:

this sounds like https://etbe.coker.com.au/2021/10/20/strange-apache-reload-issue/ which I read yesterday via Planet Debian

Interesting. Lists runs apache, although the cloudelastic hosts look like nginx.

I'm seeing 6 days being logged as well now. I wonder if the 7 days monitoring is coincidental, since that's the threshold we pass to check_http for check_https_lvs_on_port.

Oct 20 19:06:09 alert1001 icinga: SERVICE ALERT: cloudelastic.wikimedia.org;WMF Cloud -Omega Cluster- - Public Internet Port - HTTPS;CRITICAL;SOFT;2;CRITICAL - Certificate 'cloudelastic.wikimedia.org' expires in 6 day(s) (Wed 27 Oct 2021 07:00:23 PM GMT +0000)

Both the apache and nginx on cloudelastic and lists have been running for months, so fully restarting them seems worth a shot to see if that stops the flapping (for a few weeks/months anyway). I haven't touched them yet in case someone wants to review them in their current state first.

lists.wm.o is serving both the old and new cert, just like the blog post mentioned:

P17556 (An Untitled Masterwork)

1	km@cashew ~> while true; echo -n \| openssl s_client -servername lists.wikimedia.org -connect lists.wikimedia.org:443 2>/dev/null \| openssl x509 -noout -dates \| grep notAfter; end
2	notAfter=Oct 28 09:00:44 2021 GMT
3	notAfter=Dec 27 09:00:28 2021 GMT
4	notAfter=Oct 28 09:00:44 2021 GMT
5	notAfter=Dec 27 09:00:28 2021 GMT
6	notAfter=Dec 27 09:00:28 2021 GMT
7	notAfter=Oct 28 09:00:44 2021 GMT
8	notAfter=Oct 28 09:00:44 2021 GMT
9	notAfter=Dec 27 09:00:28 2021 GMT
10	notAfter=Oct 28 09:00:44 2021 GMT
11	notAfter=Oct 28 09:00:44 2021 GMT
12	notAfter=Dec 27 09:00:28 2021 GMT
13	notAfter=Dec 27 09:00:28 2021 GMT
14	notAfter=Oct 28 09:00:44 2021 GMT
15	notAfter=Oct 28 09:00:44 2021 GMT
16	notAfter=Oct 28 09:00:44 2021 GMT
17	notAfter=Oct 28 09:00:44 2021 GMT
18	notAfter=Oct 28 09:00:44 2021 GMT
19	notAfter=Oct 28 09:00:44 2021 GMT
20	notAfter=Oct 28 09:00:44 2021 GMT
21	notAfter=Oct 28 09:00:44 2021 GMT
22	notAfter=Oct 28 09:00:44 2021 GMT
23	notAfter=Oct 28 09:00:44 2021 GMT
24	notAfter=Dec 27 09:00:28 2021 GMT
25	notAfter=Oct 28 09:00:44 2021 GMT
26	notAfter=Oct 28 09:00:44 2021 GMT
27	notAfter=Dec 27 09:00:28 2021 GMT
28	notAfter=Dec 27 09:00:28 2021 GMT
29	notAfter=Dec 27 09:00:28 2021 GMT
30	notAfter=Dec 27 09:00:28 2021 GMT
31	notAfter=Dec 27 09:00:28 2021 GMT
32	notAfter=Dec 27 09:00:28 2021 GMT
33	notAfter=Dec 27 09:00:28 2021 GMT
34	notAfter=Dec 27 09:00:28 2021 GMT
35	notAfter=Dec 27 09:00:28 2021 GMT
36	notAfter=Oct 28 09:00:44 2021 GMT
37	notAfter=Dec 27 09:00:28 2021 GMT
38	notAfter=Dec 27 09:00:28 2021 GMT
39	notAfter=Dec 27 09:00:28 2021 GMT
40	notAfter=Dec 27 09:00:28 2021 GMT
41	notAfter=Dec 27 09:00:28 2021 GMT
42	notAfter=Dec 27 09:00:28 2021 GMT
43	notAfter=Oct 28 09:00:44 2021 GMT
44	notAfter=Dec 27 09:00:28 2021 GMT
45	notAfter=Oct 28 09:00:44 2021 GMT
46	notAfter=Dec 27 09:00:28 2021 GMT
47	notAfter=Dec 27 09:00:28 2021 GMT
48	notAfter=Dec 27 09:00:28 2021 GMT
49	notAfter=Dec 27 09:00:28 2021 GMT
50	notAfter=Oct 28 09:00:44 2021 GMT
51	notAfter=Dec 27 09:00:28 2021 GMT
52	notAfter=Dec 27 09:00:28 2021 GMT
53	notAfter=Dec 27 09:00:28 2021 GMT
54	notAfter=Dec 27 09:00:28 2021 GMT
55	notAfter=Dec 27 09:00:28 2021 GMT
56	notAfter=Oct 28 09:00:44 2021 GMT
57	notAfter=Oct 28 09:00:44 2021 GMT
58	notAfter=Dec 27 09:00:28 2021 GMT
59	notAfter=Dec 27 09:00:28 2021 GMT
60	notAfter=Dec 27 09:00:28 2021 GMT
61	notAfter=Dec 27 09:00:28 2021 GMT
62	notAfter=Dec 27 09:00:28 2021 GMT
63	notAfter=Oct 28 09:00:44 2021 GMT
64	notAfter=Dec 27 09:00:28 2021 GMT
65	notAfter=Dec 27 09:00:28 2021 GMT
66	notAfter=Dec 27 09:00:28 2021 GMT
67	notAfter=Oct 28 09:00:44 2021 GMT
68	notAfter=Dec 27 09:00:28 2021 GMT
69	notAfter=Oct 28 09:00:44 2021 GMT
70	notAfter=Oct 28 09:00:44 2021 GMT
71	notAfter=Oct 28 09:00:44 2021 GMT
72	notAfter=Dec 27 09:00:28 2021 GMT
73	notAfter=Oct 28 09:00:44 2021 GMT
74	notAfter=Oct 28 09:00:44 2021 GMT
75	notAfter=Oct 28 09:00:44 2021 GMT
76	notAfter=Oct 28 09:00:44 2021 GMT
77	notAfter=Dec 27 09:00:28 2021 GMT
78	notAfter=Oct 28 09:00:44 2021 GMT
79	notAfter=Oct 28 09:00:44 2021 GMT
80	notAfter=Dec 27 09:00:28 2021 GMT
81	notAfter=Oct 28 09:00:44 2021 GMT
82	notAfter=Oct 28 09:00:44 2021 GMT
83	notAfter=Oct 28 09:00:44 2021 GMT
84	notAfter=Dec 27 09:00:28 2021 GMT
85	notAfter=Dec 27 09:00:28 2021 GMT
86	notAfter=Dec 27 09:00:28 2021 GMT
87	notAfter=Dec 27 09:00:28 2021 GMT
88	notAfter=Dec 27 09:00:28 2021 GMT
89	notAfter=Dec 27 09:00:28 2021 GMT
90	notAfter=Dec 27 09:00:28 2021 GMT
91	notAfter=Oct 28 09:00:44 2021 GMT
92	notAfter=Dec 27 09:00:28 2021 GMT
93	^C⏎

Legoktm raised the priority of this task from Low to High.Oct 21 2021, 4:01 AM

Mentioned in SAL (#wikimedia-operations) [2021-10-21T04:04:02Z] <legoktm> restarted apache on lists1001 so it only uses new TLS cert (T293826)

After restarting apache, and ~50 requests later, I'm only getting the new certificate. Marking as high because the monitoring is picking up a real issue, it's just flappy since ~half the requests pass and half fail.

Looking in syslog doesn't show much interesting AFAICT, except it seems like puppet/acme-chief is reloading apache every 2-3 days (not sure if that's intentional):

syslog.log-20211009.gz:Oct  8 10:23:33 lists1001 puppet-agent[32311]: (/etc/acmecerts/lists) Scheduling refresh of Service[apache2]
syslog.log-20211011.gz:Oct 10 22:22:46 lists1001 puppet-agent[22789]: (/etc/acmecerts/lists) Scheduling refresh of Service[apache2]
syslog.log-20211014.gz:Oct 13 10:23:26 lists1001 puppet-agent[8325]: (/etc/acmecerts/lists) Scheduling refresh of Service[apache2]
syslog.log-20211016.gz:Oct 15 22:23:11 lists1001 puppet-agent[11417]: (/etc/acmecerts/lists) Scheduling refresh of Service[apache2]
syslog.log-20211019.gz:Oct 18 11:23:03 lists1001 puppet-agent[30781]: (/etc/acmecerts/lists) Scheduling refresh of Service[apache2]

(doesn't actually matter, seems like apache reloads config at midnight every day anyways)

Mentioned in SAL (#wikimedia-operations) [2021-10-21T06:35:55Z] <elukey> systemctl reload nginx on cloudelastic100[5,6] to pick up the new TLS certificate and clear alerts - T293826

Change 732611 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] tlsproxy::localssl: acme_chief should notify nginx

https://gerrit.wikimedia.org/r/732611

gerritbot added a project: Patch-For-Review.Oct 21 2021, 7:23 AM

Maybe it is totally off, but I saw that the cloudelastic nodes use the tlsproxy::localssl define (via elasticsearch::tlsproxy) and it doesn't reload nginx when a new cert is available. Created https://gerrit.wikimedia.org/r/c/operations/puppet/+/732611 as proposal to improve the current situation, I see the same pattern applied to other acme_chief::cert usages.

• MPhamWMF moved this task from needs triage to Current work on the Discovery-Search board.Oct 25 2021, 3:08 PM

• MPhamWMF edited projects, added Discovery-Search (Current work); removed Discovery-Search.

• MPhamWMF set the point value for this task to 1.Oct 25 2021, 3:49 PM

• MPhamWMF moved this task from Incoming to Waiting on the Discovery-Search (Current work) board.

Change 732611 merged by Elukey:

[operations/puppet@production] tlsproxy::localssl: acme_chief should notify nginx

https://gerrit.wikimedia.org/r/732611

Maintenance_bot removed a project: Patch-For-Review.Oct 26 2021, 10:10 AM

The nginx problems should be fixed with https://gerrit.wikimedia.org/r/732611 in theory (so elastic/cloudelastic/relforge/etc..).

In T293826#7446833, @Stashbot wrote:

Mentioned in SAL (#wikimedia-operations) [2021-10-21T04:04:02Z] <legoktm> restarted apache on lists1001 so it only uses new TLS cert (T293826)

This one seems sneaky to reproduce, I am wondering if recent versions of httpd have it fixed (upstream is now at 2.4.51, and several changes to mpm-event happened since .38, see https://www.apachelounge.com/Changelog-2.4.html and grep for event).

@Dzahn do you know if the list[12]* nodes are scheduled to be upgraded to Bullseye during the next few weeks? It would be really nice to test the new version of httpd, I think it should solve the issue that @Legoktm brought up.

@elukey Not list* but we could potentially test it with librenms.wikimedia.org. That fulfills the requirements of "uses LE", "has public IP" and "uses httpd" at least, unlike most other things.

Another option could be to use httpd from buster-backports, but https://packages.debian.org/buster-backports/apache2 still lists 2.4.48 (meanwhile on salsa I see .51, that contains a lot of CVE fixes from upstream). If the buster-backport version gets updated to .51 we could install it and avoid bullseye.

Gehel moved this task from Waiting to Needs Reporting on the Discovery-Search (Current work) board.Nov 22 2021, 4:35 PM

Gehel removed a project: Discovery-Search (Current work).Nov 29 2021, 1:51 PM

Mentioned in SAL (#wikimedia-cloud) [2021-12-06T13:23:49Z] <majavah> root@toolserver-proxy-01:~# systemctl restart apache2.service # working around T293826

Mentioned in SAL (#wikimedia-operations) [2021-12-19T17:10:32Z] <Amir1> restart apache2 on lists1001 (T293826)

herron moved this task from Inbox to Radar on the observability board.Jan 12 2022, 5:43 PM

RhinosF1 mentioned this in T308908: Icinga Check SSL might have a time based race condition.May 21 2022, 10:05 AM

that's intended, every time that acme-chief fetches fresh OCSP stapling responses it issues a reload of apache2

In T293826#7446839, @Legoktm wrote:
Looking in syslog doesn't show much interesting AFAICT, except it seems like puppet/acme-chief is reloading apache every 2-3 days (not sure if that's intentional):
syslog.log-20211009.gz:Oct  8 10:23:33 lists1001 puppet-agent[32311]: (/etc/acmecerts/lists) Scheduling refresh of Service[apache2]
syslog.log-20211011.gz:Oct 10 22:22:46 lists1001 puppet-agent[22789]: (/etc/acmecerts/lists) Scheduling refresh of Service[apache2]
syslog.log-20211014.gz:Oct 13 10:23:26 lists1001 puppet-agent[8325]: (/etc/acmecerts/lists) Scheduling refresh of Service[apache2]
syslog.log-20211016.gz:Oct 15 22:23:11 lists1001 puppet-agent[11417]: (/etc/acmecerts/lists) Scheduling refresh of Service[apache2]
syslog.log-20211019.gz:Oct 18 11:23:03 lists1001 puppet-agent[30781]: (/etc/acmecerts/lists) Scheduling refresh of Service[apache2]
(doesn't actually matter, seems like apache reloads config at midnight every day anyways)

gerrit1001.wikimedia.org is showing up the issue:

In T308908#7957451, @hashar wrote:

From apache-status:

Current Time: Wednesday, 25-May-2022 15:16:44 UTC

Restart Time: Friday, 22-Apr-2022 19:59:53 UTC

+------------------------------------------------------------------------+
|    |     |        |Connections    |Threads  |Async connections         |
|Slot|PID  |Stopping|---------------+---------+--------------------------|
|    |     |        |total|accepting|busy|idle|writing|keep-alive|closing|
|----+-----+--------+-----+---------+----+----+-------+----------+-------|
|0   |6376 |no      |1    |yes      |0   |25  |0      |0         |1      |
|----+-----+--------+-----+---------+----+----+-------+----------+-------|
|1   |19045|no      |2    |yes      |1   |24  |0      |0         |1      |
|----+-----+--------+-----+---------+----+----+-------+----------+-------|
|2   |1390 |no (old |0    |yes      |0   |25  |0      |0         |0      |
|    |     |gen)    |     |         |    |    |       |          |       |
|----+-----+--------+-----+---------+----+----+-------+----------+-------|
|3   |6377 |no      |3    |yes      |0   |25  |0      |2         |0      |
|----+-----+--------+-----+---------+----+----+-------+----------+-------|
|Sum |4    |0       |6    |         |1   |99  |0      |2         |2      |
+------------------------------------------------------------------------+

The slot with pid 1390 is marked no (old gen) which is the one from April 23 in ps above.

We have Apache 2.4.38-3+deb10u7

I found upstream bug 63169: MPM event, stuck process after graceful: no (old gen) reported against 2.4.37 which says a graceful restart fails to kill some processes.

Looking at the stuck pid 1390:

apache2 1390 www-data   13w      REG       9,0    3119423    396450 /var/log/apache2/gerrit.wikimedia.org.https.error.log.1 (deleted)
apache2 1390 www-data   14w      REG       9,0       4215    396446 /var/log/apache2/gerrit.wikimedia.org.http.error.log.1 (deleted)
apache2 1390 www-data   15w      REG       9,0    2661276    396470 /var/log/apache2/other_vhosts_access.log.1 (deleted)
apache2 1390 www-data   16w      REG       9,0    7803259    396467 /var/log/apache2/other_vhosts_access-json.log.1 (deleted)
apache2 1390 www-data   17w      REG       9,0 9308490812    396448 /var/log/apache2/gerrit.wikimedia.org.https.access.log.1 (deleted)
apache2 1390 www-data   18w      REG       9,0    1172054    396445 /var/log/apache2/gerrit.wikimedia.org.http.access.log.1 (deleted)

The merge patch is https://svn.apache.org/viewvc?view=revision&revision=1893109 and is in Apache 2.4.49. That might be the one :]

The brute workaround is to restart Apache2 entirely?

hashar updated the task description. (Show Details)Jun 3 2022, 8:52 AM

This just alerted again:

14:09:04 <+icinga-wm> PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 7 day(s) (Sat 25 Jun 2022 07:55:09 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring

Ladsgroup closed subtask T315294: lists.wikimedia.org flapping between soon to expire and renewed cert (Aug 2022) as Resolved.Aug 17 2022, 3:53 PM

PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 6 day(s) (Mon 20 Feb 2023 05:31:14 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring

lists1001:~# systemctl restart apache2.service

RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Fri 21 Apr 2023 05:11:22 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring

hashar unsubscribed.Feb 20 2023, 8:29 AM

andrea.denisse subscribed.Feb 22 2023, 3:31 PM

BCornwall moved this task from Backlog to Ready for work on the Traffic board.Mar 6 2023, 11:47 PM

BCornwall subscribed.

BCornwall moved this task from Ready for work to Backlog on the Traffic board.Mar 30 2023, 8:42 PM

BCornwall removed the point value for this task.

This happened again for lists1001. Requested (and it has been) restart in #wikimedia-sre

	F34698472: Screenshot at 2021-10-19 12-50-38.png
	Oct 19 2021, 8:31 PM

flapping icinga Letsencrypt TLS cert alerts around renewal time Open, HighPublicActions

Description

Details

Related ObjectsSearch...

Event Timeline

flapping icinga Letsencrypt TLS cert alerts around renewal time
Open, HighPublic
Actions

Related Objects
Search...