
TLS certificates renewal process
Closed, Resolved (Public)

Description

I was reading this study Google published based on data collected from Chrome:
https://ai.google/research/pubs/pub46359

My main takeaway from it is that:

  1. We should consider renewing certificates around 3 months before they expire (for wikis / Tier 1 services.)

The reason: if a disaster leaves us unable to look after the servers for a while, or if problems arise with the certificate authority, it would be nice to have some leeway before content becomes inaccessible and audiences are affected.

  2. We should make sure that at least 24 hours pass before actively using a newly issued certificate (unless it's a disaster recovery).

The reason: clock skew is not uncommon, and 24h is an ample buffer that accommodates 93.3% of clients. Looking at the shape of the graph in detail, the percentage of covered users keeps rising significantly up to a tipping point at around 4-5 days, after which the curve goes flat. Waiting 4-5 days would get us 94% of the remaining clients, which amounts to 99.6% in total (= 93.3 + (6.7 * 0.94)).
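The coverage arithmetic above can be checked with a few lines of Python. This is only a sketch: the function name is made up for illustration, and the 93.3% and 94% figures are the ones read off the study's graph in this description.

```python
def cumulative_coverage(covered_pct: float, share_of_remainder: float) -> float:
    """Total percent of clients covered once a share of the uncovered rest is added."""
    remaining_pct = 100.0 - covered_pct
    return covered_pct + remaining_pct * share_of_remainder


after_24h = 93.3  # percent of clients accommodated by a 24h wait (from the description)
after_5_days = cumulative_coverage(after_24h, 0.94)  # 94% of the remaining 6.7%
print(f"{after_5_days:.1f}%")
```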

After 24h, browsers can reasonably detect the issue and alert the user to it. That said, if we can, I suppose we could aim to wait 5 days by default.

Thoughts? What is our current process?

Event Timeline

Speaking for the big unified certs we get from commercial vendors: we generally do wait ~24h (usually longer?) between the issue date of new major certs and their deployment, but it's more an unspoken general best practice than a documented policy.

We have two separate vendors we purchase the certs from in duplicate (GlobalSign and Digicert), to avoid a SPOF on their runtime OCSP Stapling services throughout the year, and also in case of other vendor/renewal problems. We should be renewing in ample time in general (~1 month out?), but it gets hard to balance very long ordering lead times with the desire to keep the lifetimes low and not cheat ourselves out of our year's worth of bits. We stagger the renewals as a hedge against vendor-specific renewal issues, which is at least as good a hedge as starting ~3 months out. We deploy both of them in parallel to make sure they're both live and working under normal conditions: GlobalSign to the US sites and Digicert to the non-US sites (to reduce the odds of the cert choice flapping too much for a given user). A puppet commit can switch all the sites to just one or the other in case of a failing vendor.

Data from last time around, digging through commit logs and cert outputs:

| Vendor | NotBefore | NotAfter | Swapped into active use |
| GlobalSign | 2017-11-03 03:42 | 2018-11-22 07:59 | 2017-11-06 16:25 |
| Digicert | 2017-12-21 00:00 | 2019-01-24 00:00 | 2017-12-22 14:45 |

So the NotBefore -> deploy windows for clock skew last time around were ~3.5 days for GlobalSign (decent, could be better) and ~1.5 days for Digicert (acceptable, but not ideal). The staggering between the two issue dates was around 47 days last time. We've been shifting those a bit further apart each year, and the next upcoming expiries are now roughly 63 days apart.
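For reference, the windows quoted above can be recomputed from the timestamps in the table. This is a rough sketch: it assumes all timestamps share one timezone (the thread doesn't say which), and the variable and function names are illustrative only.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M"

# Timestamps copied from the table; assumed to be in a single, common timezone.
certs = {
    "GlobalSign": {"not_before": "2017-11-03 03:42",
                   "not_after": "2018-11-22 07:59",
                   "deployed": "2017-11-06 16:25"},
    "Digicert": {"not_before": "2017-12-21 00:00",
                 "not_after": "2019-01-24 00:00",
                 "deployed": "2017-12-22 14:45"},
}


def days_between(start: str, end: str) -> float:
    """Elapsed time from start to end, in fractional days."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return delta.total_seconds() / 86400


# NotBefore -> deploy window per vendor (the clock-skew buffer).
windows = {v: days_between(c["not_before"], c["deployed"]) for v, c in certs.items()}

# Stagger between the two issue dates and between the two expiry dates.
issue_stagger = days_between(certs["GlobalSign"]["not_before"],
                             certs["Digicert"]["not_before"])
expiry_gap = days_between(certs["GlobalSign"]["not_after"],
                          certs["Digicert"]["not_after"])

for vendor, days in windows.items():
    print(f"{vendor}: {days:.2f} days from NotBefore to deployment")
print(f"issue stagger: {issue_stagger:.2f} days, expiry gap: {expiry_gap:.2f} days")
```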

As for the rest, especially with the one-offs using LetsEncrypt scripting today, we definitely don't have this kind of resiliency, or any kind of deployment lead time built in. We should at least fix the deployment lead-time clock skew problem in the new solution somehow (delay deployment by N days after issuance, unless it's a brand-new cert that replaces only a self-signed cert or nothing at all).
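The "delay deployment by N days after issuance" rule could look something like the sketch below. It is not taken from any existing tooling: `ready_to_deploy`, `MIN_AGE` and the `replaces_trusted_cert` flag are hypothetical names, and the 5-day value is just the figure floated in the task description.

```python
from datetime import datetime, timedelta, timezone

MIN_AGE = timedelta(days=5)  # hypothetical value for "N days"


def ready_to_deploy(not_before: datetime, now: datetime,
                    min_age: timedelta, replaces_trusted_cert: bool) -> bool:
    """Decide whether a newly issued cert may be served to users.

    A brand-new cert (replacing only a self-signed cert, or nothing) is
    deployed right away; a replacement for a working cert waits until it
    has aged at least `min_age` past its NotBefore, to absorb clock skew.
    """
    if not replaces_trusted_cert:
        return True
    return now - not_before >= min_age


issued = datetime(2024, 1, 1, tzinfo=timezone.utc)  # made-up NotBefore for illustration
print(ready_to_deploy(issued, issued + timedelta(days=1), MIN_AGE, True))
print(ready_to_deploy(issued, issued + timedelta(days=5), MIN_AGE, True))
print(ready_to_deploy(issued, issued, MIN_AGE, False))
```

A disaster-recovery override, as mentioned in the description, would be one more condition that bypasses the age check.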

Vvjjkkii renamed this task from TLS certificates renewal process to zrbaaaaaaa. (Jul 1 2018, 1:06 AM)
Vvjjkkii raised the priority of this task from Low to High.
Vvjjkkii updated the task description.
Vvjjkkii removed a subscriber: Aklapper.
BBlack mentioned this in Unknown Object (Task). (Oct 12 2018, 3:48 PM)

FYI I opened a feature request on certbot to propose a delay before deployment as stated here, and will soon propose a patch there.

I don't think we use certbot anywhere except maybe Gerrit.

This ticket hasn't been updated since the acme-chief deployment, which is now being used for the unified cert, as well as the miscellaneous sites.
The unified one is configured to stage new certs from LE for a week before they should be served to users.

Since they're LE certs, they only have 3 months of validity in the first place. It looks like we aim to renew when a cert has 30 days left, and there's currently a DigiCert unified cert that could be used if that broke. (It looks like there wasn't a DigiCert-issued cert available between 6th October and 2nd November, though there is a GlobalSign one good through to 22nd November.)
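Putting the numbers from the last two comments together gives a rough timeline. This sketch assumes a 90-day Let's Encrypt lifetime, that the replacement is issued exactly when the old cert has 30 days left, and that the switch happens right when the one-week staging period ends; the variable names are illustrative, not acme-chief settings.

```python
from datetime import timedelta

validity = timedelta(days=90)         # Let's Encrypt lifetime ("3 months")
renew_when_left = timedelta(days=30)  # renew when the old cert has this much left
staging = timedelta(days=7)           # new cert is staged before being served

# Age of the old cert when its replacement is requested.
renew_at_age = validity - renew_when_left

# Remaining validity at the moment the new cert starts being served.
old_cert_left_at_switch = renew_when_left - staging
new_cert_left_at_switch = validity - staging

print(renew_at_age.days, old_cert_left_at_switch.days, new_cert_left_at_switch.days)
```

Under those assumptions the old cert still has about three weeks of validity left at the switch, which is the margin available if a renewal fails.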

@BBlack Based on the three references you've made to this ticket over the past two years, I guess this has de facto been accepted as-is. Should we document this somewhere in a runbook or other policy of sorts?

BBlack claimed this task.

Added a section about renewal to https://wikitech.wikimedia.org/wiki/HTTPS, which mentions aging out new manually-issued certs.