
Certificate *.wikipedia.org valid until 2020-06-20
Closed, Declined, Public

Description

https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=en.planet.wikimedia.org&service=HTTPS-planet
has been alerting for 13 days.

HTTPS-planet - WARNING 2020-05-04 07:29:29 13d 0h 28m 4s 3/3 SSL WARNING - Certificate *.wikipedia.org valid until 2020-06-20 07:01:41 +0000 (expires in 46 days)

There is also no runbook for that alert on the linked page: https://wikitech.wikimedia.org/wiki/Planet.wikimedia.org

Also CC'ing @RobH, as you did cert renewals in the past, but I'm not sure who owns Planet :)

Event Timeline

*.wmfusercontent.org and *.planet.wikimedia.org are SANs of the unified cert. Currently we're using the LE unified cert on the US DCs (codfw, eqiad and ulsfo). LE certs are valid for 90 days, I think we need to adjust those icinga checks. The cert will be automatically renewed in 16 days (2020-05-16) and it will replace the current one 7 days later to avoid clock skew issues. So no need to bother @RobH for this one :)

This isn't specific to planet or about who owns planet, this is the general *.wikipedia.org cert.

Dzahn renamed this task from "en.planet.wikimedia.org - Certificate *.wikipedia.org valid until 2020-06-20" to "Certificate *.wikipedia.org valid until 2020-06-20". May 4 2020, 7:56 AM

Change 594103 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] icinga: replace check_ssl_http with check_ssl_http_letsencrypt

https://gerrit.wikimedia.org/r/594103

Dzahn removed a subscriber: RobH.
colewhite triaged this task as Medium priority. May 4 2020, 11:16 PM
colewhite added a subscriber: RobH.

Change 594103 merged by Dzahn:
[operations/puppet@production] icinga: replace check_ssl_http with check_ssl_http_letsencrypt

https://gerrit.wikimedia.org/r/594103

> Currently we're using the LE unified cert on the US DCs (codfw, eqiad and ulsfo). LE certs are valid for 90 days, I think we need to adjust those icinga checks.

The change I just merged above should have fixed the main issue: it replaces check_ssl_http with check_ssl_http_letsencrypt for these 2 services. The two check commands are identical except for the WARN and CRIT thresholds, which change from WARN 60 / CRIT 30 to WARN 7 / CRIT 3.
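To illustrate why the alert clears under the new thresholds, here is a minimal sketch of fixed-day threshold logic. The function name `classify` and the use of `<=` at the boundaries are assumptions for illustration, not the actual behavior of the Icinga check commands; only the threshold values and the 46-day figure come from this task.

```python
def classify(days_left: int, warn_days: int, crit_days: int) -> str:
    """Return an Icinga-style state for a cert with days_left until expiry.

    Hypothetical helper: the boundary handling (<=) is an assumption.
    """
    if days_left <= crit_days:
        return "CRITICAL"
    if days_left <= warn_days:
        return "WARNING"
    return "OK"


days_left = 46  # from the alert text: "expires in 46 days"

# Thresholds before the change (check_ssl_http): WARN 60 / CRIT 30
old_state = classify(days_left, warn_days=60, crit_days=30)

# Thresholds after the change (check_ssl_http_letsencrypt): WARN 7 / CRIT 3
new_state = classify(days_left, warn_days=7, crit_days=3)

print(old_state, new_state)
```

Under these assumptions a 90-day cert spends its first 30 days above the old WARN threshold and every day after that below it, which is why a fixed 60-day warning does not fit LE certs.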

Even if we still use non-LE certs in some DCs, I believe this is ok since we should also have other monitoring for the expiration of that cert. We do, right?

> Even if we still use non-LE certs in some DCs, I believe this is ok since we should also have other monitoring for the expiration of that cert. We do, right?

It's a good fix "for now", but that was the primary automated monitoring for the manual once-a-year renewals, too. We probably need to unify these two checks so that the criteria can differ depending on the nature of the cert, or perhaps change the warning to be based on a percentage of the cert's total lifespan instead of a fixed count of days, or something like that.
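The percentage-of-lifespan idea could look roughly like the sketch below. This is not an existing check: the function name `classify_by_lifespan` and the 25% / 10% fractions are hypothetical values chosen for illustration, and the example certificates are synthetic.

```python
from datetime import datetime, timedelta, timezone


def classify_by_lifespan(not_before, not_after, now,
                         warn_fraction=0.25, crit_fraction=0.10):
    """Classify a cert by the fraction of its total lifespan that remains.

    Hypothetical helper; the default fractions are illustrative only.
    """
    total = (not_after - not_before).total_seconds()
    remaining = (not_after - now).total_seconds()
    fraction_left = remaining / total
    if fraction_left <= crit_fraction:
        return "CRITICAL"
    if fraction_left <= warn_fraction:
        return "WARNING"
    return "OK"


now = datetime(2020, 5, 5, tzinfo=timezone.utc)
expiry = now + timedelta(days=46)  # both synthetic certs expire in 46 days

# A 90-day cert (LE-style) and a 365-day cert (manual yearly renewal).
le_state = classify_by_lifespan(expiry - timedelta(days=90), expiry, now)
yearly_state = classify_by_lifespan(expiry - timedelta(days=365), expiry, now)

print(le_state, yearly_state)
```

With these illustrative fractions the same check would warn at about 22 days left for a 90-day cert and at about 91 days left for a 365-day cert, so one command could cover both kinds of certificate.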

Or we could make a new Icinga check that isn't check_http for a specific service but runs openssl directly on the cert file in the private repo and has a generic name like "wikipedia unified cert".

> Even if we still use non-LE certs in some DCs, I believe this is ok since we should also have other monitoring for the expiration of that cert. We do, right?

The Icinga check on cp hosts currently warns 30 days before and goes critical 15 days before cert expiration. IMHO 7 / 3 is not enough for the unified cert, even when LE is the issuer, considering our anti-clock-skew measures and that acme-chief should issue the new cert 30 days before the valid one expires.
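The timing behind this concern can be worked through with a little date arithmetic. The sketch below takes the 30-day issuance lead and the 7-day deployment delay quoted in this thread as given (they are treated as assumptions about acme-chief here, not verified) and uses the expiry date from the alert text; the helper name `days_late_before_alert` is hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Expiry of the currently deployed cert, taken from the alert text in this task.
old_not_after = datetime(2020, 6, 20, 7, 1, 41, tzinfo=timezone.utc)

# Figures quoted in this thread, treated as assumptions about acme-chief.
RENEW_BEFORE_EXPIRY = timedelta(days=30)  # new cert issued this long before expiry
DEPLOY_DELAY = timedelta(days=7)          # anti-clock-skew delay before deployment

issued_at = old_not_after - RENEW_BEFORE_EXPIRY
deployed_at = issued_at + DEPLOY_DELAY
days_left_at_swap = (old_not_after - deployed_at).days


def days_late_before_alert(threshold_days):
    """Days a swap can be overdue before the old cert reaches threshold_days left."""
    return days_left_at_swap - threshold_days


print("new cert issued:  ", issued_at.date())
print("new cert deployed:", deployed_at.date())
print("old cert days left at swap:", days_left_at_swap)
print("WARN 7 fires after this many days overdue:", days_late_before_alert(7))
print("CRIT 3 fires after this many days overdue:", days_late_before_alert(3))
```

Under these assumptions the old cert still has 23 days left when it is normally replaced, so WARN 7 / CRIT 3 would only start alerting once a renewal is more than two weeks overdue.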

Change 594722 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] icinga: increase tresholds for check_ssl_http_letsencrypt

https://gerrit.wikimedia.org/r/594722

> the icinga check on cp hosts currently warns 30 days before and goes critical 15 days before cert expiration. IMHO 7 / 3 is not enough for the unified cert even when LE is the issuer considering our anti clock skew measures and that acme-chief should issue the new cert 30 days before the valid one expires

@Vgutierrez Is this good? -> https://gerrit.wikimedia.org/r/c/operations/puppet/+/594722

If every LE certificate checked by that Icinga check is issued by acme-chief, then yes, it's good.

> IMHO 7 / 3 is not enough for the unified cert even when LE is the issuer considering our anti clock skew measures and that acme-chief should issue the new cert 30 days before the valid one expires

I reverted the change for now to how it was before, until we have a better fix. We would still have to check whether every LE certificate checked by that Icinga check is actually issued by acme-chief.

ACKing the alerts again with that task as comment.

Can we close this task, or at least change the task title to focus on the Icinga alerts? There is no issue with cert renewal itself :)

willikins:puppet vgutierrez$ openssl s_client -connect text-lb.eqiad.wikimedia.org:443 2>/dev/null < /dev/null |openssl x509 -noout -dates
notBefore=May 21 09:53:05 2020 GMT
notAfter=Aug 19 09:53:05 2020 GMT
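As a cross-check of the output above, the two dates can be parsed to confirm the lifetime of the renewed cert. This is a minimal sketch: `parse_openssl_dates` is a hypothetical helper, and it assumes the `notBefore=` / `notAfter=` line format shown above with English month names and a trailing `GMT`.

```python
from datetime import datetime, timezone

# Output of `openssl x509 -noout -dates`, copied from the comment above.
OPENSSL_OUTPUT = """\
notBefore=May 21 09:53:05 2020 GMT
notAfter=Aug 19 09:53:05 2020 GMT
"""


def parse_openssl_dates(text):
    """Parse notBefore/notAfter lines into timezone-aware UTC datetimes.

    Hypothetical helper; assumes the exact line format shown above.
    """
    dates = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep and key in ("notBefore", "notAfter"):
            parsed = datetime.strptime(value.strip(), "%b %d %H:%M:%S %Y GMT")
            dates[key] = parsed.replace(tzinfo=timezone.utc)
    return dates


dates = parse_openssl_dates(OPENSSL_OUTPUT)
lifetime = dates["notAfter"] - dates["notBefore"]
print("lifetime in days:", lifetime.days)
```

The parsed lifetime matches the 90-day validity of LE certs mentioned earlier in the task.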

Yes, it should be renamed. But I think it is the traffic team's decision what to do about the monitoring, given that this is the "primary automated monitoring for the manual once-a-year renewals". The services this is on are just one (random) example of what is on these certs.

Change 594722 abandoned by Dzahn:
icinga: increase tresholds for check_ssl_http_letsencrypt

https://gerrit.wikimedia.org/r/594722