
(adjust cert monitoring on planet and phabricator) Certificate *.wikipedia.org valid until 2021-08-14 08:01:46
Closed, Resolved (Public)

Event Timeline

This is a Let's Encrypt cert, so it probably auto-renews. But we still have 2 alerts in Icinga that flag it as needing attention once it expires in under 30 days.

I acked those.

Indeed, it's auto-renewed by acme-chief; we should tune those checks.

The new cert has been issued already and it's being staged to avoid client-side clock skew issues:

Jul 15 12:00:02 acmechief1001 acme-chief-backend[3725]: Staging_time will be enforced for unified / rsa-2048 till 2021-07-22 08:02:06
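
For anyone who wants to pull the staging deadline out of a log line like the one above, here is a minimal sketch. The regular expression is derived only from this single line, so the message format in other acme-chief versions is an assumption, and the helper names (`parse_staging`, `still_staged`) are hypothetical and not part of acme-chief.

```python
import re
from datetime import datetime

# The log line quoted in this task, split only for readability.
LOG_LINE = (
    "Jul 15 12:00:02 acmechief1001 acme-chief-backend[3725]: "
    "Staging_time will be enforced for unified / rsa-2048 "
    "till 2021-07-22 08:02:06"
)

# Pattern inferred from the single line above (an assumption, not a spec).
STAGING_RE = re.compile(
    r"Staging_time will be enforced for (?P<cert>\S+) / (?P<key>\S+) "
    r"till (?P<until>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})"
)


def parse_staging(line):
    """Return (cert name, key type, staging deadline) or None if no match."""
    match = STAGING_RE.search(line)
    if match is None:
        return None
    until = datetime.strptime(match.group("until"), "%Y-%m-%d %H:%M:%S")
    return match.group("cert"), match.group("key"), until


def still_staged(line, now):
    """True while `now` is before the staging deadline found in `line`."""
    parsed = parse_staging(line)
    return parsed is not None and now < parsed[2]
```

The timestamps are treated as naive datetimes because the log line carries no timezone; comparing them against a clock in a different timezone would need an explicit conversion.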
Vgutierrez moved this task from Backlog to TLS on the Traffic board.
Vgutierrez added a project: good first task.

Should I just remove those checks, or adjust them to stop caring about cert expiry? Or should they be kept but with a lower threshold? If traffic doesn't need those alerts we can just remove them nowadays.

@Vgutierrez: A good first task is a self-contained, non-controversial task with a clear approach. It should be well-described with pointers to help a completely new contributor. Given the current short task description I'm removing the good first task tag. Please add details on what exactly has to happen where and how for a new contributor, and then add back the good first task project tag. Thanks a lot in advance!

Should I just remove those checks, or adjust them to stop caring about cert expiry? Or should they be kept but with a lower threshold? If traffic doesn't need those alerts we can just remove them nowadays.

We should reduce the threshold; 3 weeks should be better for a LE acme-chief managed cert.

ACK!
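
To make the threshold discussion concrete, here is a small sketch of the comparison such a check performs. It is not the actual Icinga check command; `expiry_alert` is a hypothetical helper, and both timestamps are assumed to be UTC since the task does not state a timezone.

```python
from datetime import datetime, timedelta, timezone


def expiry_alert(not_after, now, threshold_days):
    """Return True when the cert expires within threshold_days of now."""
    return not_after - now < timedelta(days=threshold_days)


# Expiry time from the task title (assumed UTC).
NOT_AFTER = datetime(2021, 8, 14, 8, 1, 46, tzinfo=timezone.utc)
# Timestamp of the acme-chief log line in this task (assumed UTC).
NOW = datetime(2021, 7, 15, 12, 0, 2, tzinfo=timezone.utc)

print(expiry_alert(NOT_AFTER, NOW, 30))  # 30-day threshold, as before
print(expiry_alert(NOT_AFTER, NOW, 21))  # proposed 3-week threshold
```

With these dates the cert has a little under 30 days left, so the 30-day threshold alerts while the 3-week threshold stays quiet.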

So, we still want to monitor whether TLS works on planet and phabricator; we just don't want to deal with cert expiry anymore. We probably need to create a new checkcommand. That is one of the less obvious parts to get right in Icinga/puppet, so I'm not sure about the "good first task", but I will take it.

Dzahn renamed this task from Certificate *.wikipedia.org valid until 2021-08-14 08:01:46 to (adjust cert monitoring on planet and phabricator) Certificate *.wikipedia.org valid until 2021-08-14 08:01:46. Jul 15 2021, 1:45 PM
Dzahn claimed this task.

We should reduce the threshold; 3 weeks should be better for a LE acme-chief managed cert.

But the planet cert (https://en.planet.wikimedia.org/ and other language subdomains of it) is still a DigiCert cert and not a Let's Encrypt cert.

That's why the fix isn't just replacing "check_ssl_http" with "check_ssl_http_letsencrypt", which it would have been if that were the case.

Same with the https://phabricator.wikimedia.org cert, it is still a DigiCert cert for me. So this is about adjusting the monitoring for that, the non-LE certs.

Does the traffic team want to keep the alerts for the DigiCert cert and should it be 30 days as before?
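
One way to picture the LE versus non-LE distinction is a check that picks its threshold from the certificate's issuer. The sketch below is an illustration only: the 21 and 30 day values come from this discussion, while the function names, the mapping and the sample certificate dicts are hypothetical and are not how check_ssl_http or check_ssl_http_letsencrypt are implemented. The dict layout follows what Python's `ssl.SSLSocket.getpeercert()` returns.

```python
import ssl

# Thresholds mentioned in this task; the mapping itself is an assumption.
LETSENCRYPT_THRESHOLD_DAYS = 21
DEFAULT_THRESHOLD_DAYS = 30


def issuer_organization(cert):
    """Extract organizationName from a getpeercert()-style dict, if any."""
    for rdn in cert.get("issuer", ()):
        for key, value in rdn:
            if key == "organizationName":
                return value
    return None


def threshold_days(cert):
    """Shorter threshold for auto-renewed LE certs, longer for the rest."""
    if issuer_organization(cert) == "Let's Encrypt":
        return LETSENCRYPT_THRESHOLD_DAYS
    return DEFAULT_THRESHOLD_DAYS


def seconds_until_expiry(cert, now_epoch):
    """Seconds between now_epoch and the cert's notAfter field."""
    return ssl.cert_time_to_seconds(cert["notAfter"]) - now_epoch


# Hand-written sample dicts for illustration, not captured from a server.
SAMPLE_LE = {
    "issuer": ((("organizationName", "Let's Encrypt"),),),
    "notAfter": "Aug 14 08:01:46 2021 GMT",
}
SAMPLE_OTHER = {
    "issuer": ((("organizationName", "DigiCert Inc"),),),
    "notAfter": "Aug 14 08:01:46 2021 GMT",
}

print(threshold_days(SAMPLE_LE), threshold_days(SAMPLE_OTHER))
```

In a live check the dict would come from a verified TLS connection to the host, which also means the result reflects whichever cert that particular client is served.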

Change 706410 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] icinga/planet: use letsencrypt check command for https cert monitoring

https://gerrit.wikimedia.org/r/706410

Change 706410 merged by Dzahn:

[operations/puppet@production] icinga/planet: use letsencrypt check command for https cert monitoring

https://gerrit.wikimedia.org/r/706410

Thinking about this again, I think we are good here now. One of the 2 checks was removed, the other stayed. So we do not have a duplicate alert anymore, but we still keep one of them because otherwise we would have no monitoring for expiring non-LE certs (and it depends on geo location which one is being presented to the clients and/or Icinga). Let me know if you disagree.