
Deploy redundant unified certs
Closed, ResolvedPublic

Description

This was originally on our long-term radar as part of the (forever-stalled and in-discussion!) [H]PKP ticket: T92002. The recent GlobalSign issue has highlighted the need to break this out as a higher-priority action we should take on independently of that.

We need to obtain our "unified" cert from two vendors with the same SAN set, in both ECC and RSA forms. Ideally the annual renewal time for each should be at least slightly offset (~1 month?). We'll puppetize the deployment of both keys to all of the cache clusters, including live OCSP staple fetching for both from everywhere.
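The "same SAN set" requirement can be checked mechanically before deployment. Below is a minimal sketch, assuming the SAN lists have already been extracted from each vendor's cert by some other tool. The function name `san_diff` and the example hostnames are illustrative only, not part of our tooling.

```python
def san_diff(sans_a, sans_b):
    """Compare two SAN lists and return (only_in_a, only_in_b).

    Names are compared case-insensitively and without a trailing dot,
    since DNS names are case-insensitive. Both results are empty lists
    when the two certs cover the same set of names.
    """
    a = {name.lower().rstrip(".") for name in sans_a}
    b = {name.lower().rstrip(".") for name in sans_b}
    return sorted(a - b), sorted(b - a)


# Illustrative input only; real lists would come from the issued certs.
vendor_a_sans = ["*.wikipedia.org", "wikipedia.org", "*.wikimedia.org"]
vendor_b_sans = ["WIKIPEDIA.ORG", "*.wikipedia.org"]
missing_from_b, missing_from_a = san_diff(vendor_a_sans, vendor_b_sans)
```

Any non-empty result means one vendor's cert is missing names the other has, which should be resolved before either cert goes live.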

We'll puppetize such that VendorA's certs are live in one set of datacenters and VendorB's are live in another under normal conditions. With both in active use, we can be confident that both are working properly on fine details like browser compatibility, OCSP, PKP, etc. By splitting on regions (rather than other arbitrary splits), we avoid individual clients commonly bouncing between two disparate certs and any performance impact that may have.

If we run into another rare operational issue affecting one of the active cert vendors, a trivial puppet change (just a 2-3 line nginx config change + nginx reload) lets us switch to the remaining functional cert at all datacenters.
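The selection logic described above (regional split under normal conditions, global fallback when one vendor has a problem) can be sketched as follows. This is an illustration in Python rather than the actual puppet code. The site-to-vendor mapping and the names `NORMAL_VENDOR` and `active_vendor` are hypothetical.

```python
# Hypothetical normal-conditions assignment of datacenters to vendors.
NORMAL_VENDOR = {
    "eqiad": "vendor_a",
    "codfw": "vendor_a",
    "ulsfo": "vendor_a",
    "esams": "vendor_b",
}


def active_vendor(site, broken=None):
    """Return the vendor whose certs should be served at `site`.

    Under normal conditions (broken is None) each site uses its
    regional vendor. If `broken` names a vendor with an operational
    issue, every site falls back to the remaining vendor.
    """
    preferred = NORMAL_VENDOR[site]
    if preferred != broken:
        return preferred
    remaining = sorted(set(NORMAL_VENDOR.values()) - {broken})
    if not remaining:
        raise RuntimeError("no functional vendor left")
    return remaining[0]


# Normal operation: the regional split applies.
normal = {site: active_vendor(site) for site in NORMAL_VENDOR}
# vendor_b has an incident: all sites serve vendor_a.
failover = {site: active_vendor(site, broken="vendor_b") for site in NORMAL_VENDOR}
```

The point of the design is that the fallback is a single input change (which vendor is considered broken), which matches the goal of keeping the emergency switch to a few config lines.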

Vendor selection is out of scope in this ticket, but essentially we need to select two separate, independent vendors (no shared trust chain) that we can trust and that meet all of our operational needs (especially: easy issuance of large SAN lists with multiple wildcards, and dual-issue of ECC+RSA certs).

What is in scope for this ticket is making the actual changes necessary to deploy the dual-vendor keys after we've purchased them, and documenting a simple procedure for switching them.

Event Timeline

BBlack mentioned this in Unknown Object (Task). Oct 14 2016, 10:29 AM

I was actually thinking the same for keeping both certs live. One way to get around the subtle differences/coalesce issue etc. is to deploy them in different regions. esams/ulsfo could get vendor A, eqiad/codfw vendor B (or some other combination, it doesn't matter all that much).

Having the "backup" vendor active somewhere would allow us to have it all tested out, with puppet conditionals that switch things out, and to be sure that our SANs are correct, that the cert is valid per our HPKP, that there aren't any strange UA incompatibilities somewhere, etc. It's not huge, but it's a plus, and doing it by region may be simple enough, without any gotchas I can think of.

Yeah, regional split might make sense. We probably don't want to mix within the US, where we might see "bouncy" GeoIP resolution. Perhaps one for all US sites and one for all non-US sites (initially just esams, to include Asia when it comes online)?

Yeah — esams by itself gets enough (and diverse enough) traffic that it should suffice.

Status update - Digicert unified certs (RSA+ECDSA) are now deployed and stapled alongside the GlobalSign ones on all cache terminators. They're not being used for user traffic, but they're ready as a warm standby in the case that we need to deal with another issue like the past GlobalSign OCSP/revocation incident.

We'll defer the reconfiguration to actually make them user-facing (with the US sites on Digicert and the international one(s) on GlobalSign) until after the holiday period is over, in early January.

In case such an incident happens before the changes in January and I'm not around, the procedure to switch GlobalSign to Digicert globally would be:

  1. Commit a change to modules/role/manifests/cache/ssl/unified.pp changing the $certs_active set to the digicert ones listed above it in $certs.
  2. Do a salted puppet run to all caches to deploy the config change quickly (e.g. salt -b 500 -v -t 10 -G "cluster:cache_*" cmd.run "puppet agent -t")

These are now deployed (digicert in esams, globalsign elsewhere). Pending closing this until we document switching off either of the certs...

As part of ops clinic duty, I've been reviewing all high priority tasks with no owner and seeing if we can either assign to someone, or get attention for them.

These are now deployed (digicert in esams, globalsign elsewhere). Pending closing this until we document switching off either of the certs...

I'm guessing this means by @BBlack, though this could be handled by someone else in traffic? @ema or @ayounsi perhaps?

Otherwise this seems to be low hanging fruit, only requiring the documentation of our existing services to resolve.