Change Details

We are getting multiple (new?) Icinga CRITs for the same thing: the TLS cert for cloudelastic.wikimedia.org expires in 7 days.

But these are Let's Encrypt certs, and it looks like both the renewal period and the monitoring CRIT threshold are set to 7 days. For some reason one of the checks recovered shortly after, but the others have not, and after refreshing all 3 in Icinga they are still CRIT.

This does not seem to be an issue with the actual renewal (we saw at least one of them get a new cert as well), but I think there is at least this to fix here:

- change the puppet code so that we don't check the same cert for the same host name on multiple servers, to avoid duplicate alerts?
- change the thresholds so there are no races on the day of renewal (btw the new one it just got will expire on Christmas :)

Current status is still as in the screenshot below:

{F34698472}

But the new cert is already being served, I confirmed that:

```
[puppetmaster1001:~] $ curl -6 -S -vvv https://cloudelastic.wikimedia.org:9243
```

```
* Server certificate:
* subject: CN=cloudelastic.wikimedia.org
* start date: Sep 27 19:00:30 2021 GMT
* expire date: Dec 26 19:00:29 2021 GMT
* subjectAltName: host "cloudelastic.wikimedia.org" matched cert's "cloudelastic.wikimedia.org"
* issuer: C=US; O=Let's Encrypt; CN=R3
* SSL certificate verify ok.
```

See also T308908#7957275 for a bit more debugging. It notably shows that an Apache 2 worker is not properly restarted after a graceful reload (it shows as `no (old gen)` in the Apache status) and thus keeps running with the old certificates.

The #upstream Apache 2 bug is most probably [[https://bz.apache.org/bugzilla/show_bug.cgi?id=63169 | 63169: MPM event, stuck process after graceful: no (old gen) ]], which is fixed in Apache 2.4.49 (we run 2.4.38-3+deb10u7).
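
To illustrate the threshold race mentioned above, here is a rough sketch of how a check that goes CRIT at 7 days behaves at the exact moment a 7-day renewal window opens. This is not the actual Icinga check or puppet config: `check_state`, `renewal_due`, and all `warn_days`/`crit_days` values other than the 7-day CRIT are hypothetical and only there to show the ordering problem.

```python
from datetime import datetime, timedelta, timezone

# Expiry date of the new cert, taken from the curl output in this task.
NOT_AFTER = datetime(2021, 12, 26, 19, 0, 29, tzinfo=timezone.utc)


def check_state(now, not_after, warn_days, crit_days):
    """Return an Icinga-style state for a cert expiring at not_after.

    Hypothetical helper, not the real monitoring plugin.
    """
    remaining = not_after - now
    if remaining <= timedelta(days=crit_days):
        return "CRITICAL"
    if remaining <= timedelta(days=warn_days):
        return "WARNING"
    return "OK"


def renewal_due(now, not_after, renew_days):
    """True once the cert is inside the renewal window (hypothetical helper)."""
    return not_after - now <= timedelta(days=renew_days)


# The moment the cert enters a 7-day renewal window, before the renewal
# job has run and before the web server has picked up the new cert.
at_window_start = NOT_AFTER - timedelta(days=7)

# Setup as described in this task: renewal and CRIT both at 7 days.
# warn_days=10 is an assumed value for illustration.
racy = check_state(at_window_start, NOT_AFTER, warn_days=10, crit_days=7)

# Hypothetical alternative: thresholds well inside the renewal window,
# so a renewal has several days to land before anything alerts.
safer = check_state(at_window_start, NOT_AFTER, warn_days=5, crit_days=3)

print(racy, safer)
```

With equal values the check can fire as soon as renewal becomes due, so any delay in the renewal or in the reload shows up as a CRIT. Moving the alert thresholds below the renewal period (the 5 and 3 days here are just example numbers) leaves a margin for that.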