Subject | Repo | Branch | Lines +/-
---|---|---|---
icinga/planet: use letsencrypt check command for https cert monitoring | operations/puppet | production | +1 -1
Event Timeline
This is a Letsencrypt cert, so it probably auto-renews. But we still have 2 alerts in Icinga that flag it as needing attention when it expires in under 30 days.
I acked those.
indeed, it's auto-renewed by acme-chief, we should tune those checks.
The new cert has been issued already and it's being staged to avoid client-side clock skew issues:
Jul 15 12:00:02 acmechief1001 acme-chief-backend[3725]: Staging_time will be enforced for unified / rsa-2048 till 2021-07-22 08:02:06
Should I just remove those checks or adjust them to stop caring about cert expiry? Or should they be kept but with a lower threshold? If traffic doesn't need those alerts anymore, we can just remove them.
@Vgutierrez: A good first task is a self-contained, non-controversial task with a clear approach. It should be well-described with pointers to help a completely new contributor. Given the current short task description I'm removing the good first task tag. Please add details on what exactly has to happen, where, and how for a new contributor, and then add back the good first task project tag. Thanks a lot in advance!
we should reduce the threshold, 3 weeks should be better for a LE acme-chief managed cert
ACK!
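To illustrate what lowering the threshold changes, here is a minimal Python sketch. It is not the Icinga check command or the puppet configuration; the function name `expiry_status`, the constants, and the status mapping are assumptions for illustration only. It shows that a cert with 25 days left warns under the 30-day threshold but stays quiet under the proposed 3-week one.

```python
from datetime import datetime, timedelta, timezone

# Thresholds from the discussion above; the constant names are hypothetical.
WARN_DAYS_DEFAULT = 30      # threshold used by the existing checks
WARN_DAYS_LETSENCRYPT = 21  # proposed 3-week threshold for acme-chief managed certs


def expiry_status(not_after, now, warn_days):
    """Classify a cert by time left: CRITICAL if expired,
    WARNING if fewer than warn_days remain, otherwise OK."""
    remaining = not_after - now
    if remaining <= timedelta(0):
        return "CRITICAL"
    if remaining < timedelta(days=warn_days):
        return "WARNING"
    return "OK"


# Example: a cert that expires 25 days from "now" (dates are made up).
now = datetime(2021, 7, 15, 12, 0, tzinfo=timezone.utc)
not_after = now + timedelta(days=25)

status_default = expiry_status(not_after, now, WARN_DAYS_DEFAULT)
status_le = expiry_status(not_after, now, WARN_DAYS_LETSENCRYPT)
print(status_default, status_le)
```

With the lower threshold, an alert only fires if automatic renewal has actually failed to replace the cert by the time three weeks remain.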
So.. we still want to monitor if TLS works on planet and phabricator, we just don't want to deal with cert expiry anymore. We probably need to create a new checkcommand. That is one of the less obvious parts to get right in Icinga/puppet, so not sure about the "good first task", but I will take it.
But the planet cert (https://en.planet.wikimedia.org/ and other language subdomains of it) is still a DigiCert cert and not a Letsencrypt cert.
That's why the fix isn't just replacing the "check_ssl_http" with "check_ssl_http_letsencrypt" which it would have been if that was the case.
Same with the https://phabricator.wikimedia.org cert, it is still a DigiCert cert for me. So this is about adjusting the monitoring for that, the non-LE certs.
Does the traffic team want to keep the alerts for the DigiCert cert and should it be 30 days as before?
Change 706410 had a related patch set uploaded (by Dzahn; author: Dzahn):
[operations/puppet@production] icinga/planet: use letsencrypt check command for https cert monitoring
Change 706410 merged by Dzahn:
[operations/puppet@production] icinga/planet: use letsencrypt check command for https cert monitoring
Thinking about this again, I think we are good here now. One of the 2 checks was removed, the other stayed. So we do not have a duplicate alert anymore, but we still keep one of them because otherwise we would have no monitoring for expiring non-LE certs (and it depends on geo location which cert is presented to the clients and/or Icinga). Let me know if you disagree.