We are getting multiple (new?) Icinga CRITs for the same thing: the TLS cert for cloudelastic.wikimedia.org expires in 7 days.
But these are Let's Encrypt certs, and it looks like the renewal period is 7 days and monitoring is also set to go CRIT at 7 days.
For some reason one of them recovered shortly after, but the others have not, and after refreshing all 3 in Icinga they are still CRIT.
This does not seem to be an issue with the actual renewal, since we saw at least one of them get a new cert. Still, I think there are at least these things to fix here:
- change the Puppet code so that we don't check the same cert for the same host name on multiple servers, to avoid duplicate alerts?
- change thresholds so there are no races on the day of renewal (btw the new one it just got will expire on Christmas :)
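To illustrate the threshold race from the second bullet, here is a minimal sketch. The function names and the "at or below" comparison are assumptions made for illustration, not how the actual Icinga check or the renewal tooling is implemented. The 3-day CRIT value is only a hypothetical alternative, not a proposal taken from this task.

```python
from datetime import datetime, timedelta, timezone


def renewal_due(not_after, now, renew_days):
    """True once the cert is within the (assumed) renewal window."""
    return not_after - now <= timedelta(days=renew_days)


def is_crit(not_after, now, crit_days):
    """True once remaining validity is at or below the (assumed) CRIT threshold."""
    return not_after - now <= timedelta(days=crit_days)


def thresholds_race(renew_days, crit_days):
    """The alert can fire before renewal completes unless CRIT is strictly later."""
    return crit_days >= renew_days


# Expiry of the new cert, taken from the curl output in this task.
not_after = datetime(2021, 12, 26, 19, 0, 29, tzinfo=timezone.utc)
# The instant the cert enters a 7-day renewal window.
now = not_after - timedelta(days=7)

# Renewal at 7 days, CRIT at 7 days: both conditions become true together.
print(renewal_due(not_after, now, 7), is_crit(not_after, now, 7))
# Renewal at 7 days, hypothetical CRIT at 3 days: renewal has a head start.
print(renewal_due(not_after, now, 7), is_crit(not_after, now, 3))
print(thresholds_race(7, 7), thresholds_race(7, 3))
```

With equal thresholds, the check can go CRIT at the same moment renewal becomes due, so any delay in renewal or reload shows up as an alert.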
The current status is still like in the screenshot below,
but the new cert is already being served, I confirmed that:
[puppetmaster1001:~] $ curl -6 -S -vvv https://cloudelastic.wikimedia.org:9243
* Server certificate:
*  subject: CN=cloudelastic.wikimedia.org
*  start date: Sep 27 19:00:30 2021 GMT
*  expire date: Dec 26 19:00:29 2021 GMT
*  subjectAltName: host "cloudelastic.wikimedia.org" matched cert's "cloudelastic.wikimedia.org"
*  issuer: C=US; O=Let's Encrypt; CN=R3
*  SSL certificate verify ok.
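As a rough cross-check of the dates in that output, the following sketch parses the two timestamps with Python's standard `ssl.cert_time_to_seconds` and computes the validity period. The helper names `parse_cert_time` and `days_left` are hypothetical, and the timestamps are copied by hand from the curl output rather than fetched from the server.

```python
import ssl
from datetime import datetime, timezone


def parse_cert_time(value):
    """Convert a timestamp such as 'Dec 26 19:00:29 2021 GMT' to an aware UTC datetime."""
    return datetime.fromtimestamp(ssl.cert_time_to_seconds(value), tz=timezone.utc)


def days_left(expire_date, now):
    """Whole days of validity remaining at `now` (negative once expired)."""
    return (parse_cert_time(expire_date) - now).days


# Values copied from the curl output above.
start = parse_cert_time("Sep 27 19:00:30 2021 GMT")
expire = parse_cert_time("Dec 26 19:00:29 2021 GMT")

# Total validity of the new cert.
print(expire - start)
# Whole days remaining as of the start date.
print(days_left("Dec 26 19:00:29 2021 GMT", start))
```

The validity comes out one second short of 90 days, which is consistent with a freshly issued cert rather than the one that was about to expire.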
See also T308908#7957275 for a bit more debugging. It notably shows that an Apache 2 worker is not properly restarted after a graceful reload (it shows as no (old gen) in the Apache status) and thus keeps running with the old certificates.
The upstream Apache 2 bug is most probably 63169 ("MPM event, stuck process after graceful: no (old gen)"), which is fixed in Apache 2.4.49 (we run 2.4.38-3+deb10u7).