This was originally on our long-term radar as part of the (forever-stalled and in-discussion!) [H]PKP ticket: T92002. The recent GlobalSign issue has highlighted the need to break this out as a higher-priority action that we should take on independently of that ticket.
We need to obtain our "unified" cert from two vendors with the same SAN set, in both ECC and RSA forms. Ideally the annual renewal time for each should be at least slightly offset (~1 month?). We'll puppetize the deployment of both keys to all of the cache clusters, including live OCSP staple fetching for both from everywhere.
We'll puppetize such that VendorA's certs are live in one set of datacenters and VendorB's are live in another under normal conditions. With both in active use, we can be confident that both are normally working properly on fine details like browser compatibility, OCSP, PKP, etc. By splitting on regions (rather than other arbitrary splits), we avoid individual clients commonly bouncing between two disparate certs, and whatever effect that may have on performance.
If we run into another rare operational issue affecting one of the active cert vendors, a trivial puppet change (just a 2-3 line nginx config change + nginx reload) lets us switch to the remaining functional cert at all datacenters.
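As a rough illustration of the selection logic described above (not the actual puppet implementation), the per-datacenter default and the emergency override could be modeled like this. The datacenter names, vendor labels, and the `failed_vendor` parameter are all hypothetical placeholders.

```python
from typing import Dict, Optional

# Hypothetical vendor labels and datacenter names, for illustration only.
VENDORS = {"vendor_a", "vendor_b"}
DEFAULT_VENDOR_BY_DC: Dict[str, str] = {
    "dc-east": "vendor_a",
    "dc-west": "vendor_b",
}


def active_vendor(dc: str, failed_vendor: Optional[str] = None) -> str:
    """Return the vendor whose certs should be live in the given datacenter.

    Under normal conditions each datacenter serves its regional default.
    If one vendor is marked as failed, every datacenter serves the other.
    """
    if failed_vendor is not None and failed_vendor not in VENDORS:
        raise ValueError(f"unknown vendor: {failed_vendor}")
    default = DEFAULT_VENDOR_BY_DC[dc]
    if failed_vendor is None or default != failed_vendor:
        return default
    # Exactly two vendors exist, so one remains after removing the failed one.
    (remaining,) = VENDORS - {failed_vendor}
    return remaining
```

With this model, `active_vendor("dc-east")` gives `"vendor_a"`, while `active_vendor("dc-east", failed_vendor="vendor_a")` gives `"vendor_b"`. The point of the sketch is that the override is a single global setting, which is what keeps the real switch down to a small config change.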
Vendor selection is out of scope for this ticket, but essentially we need to select two separate, independent vendors (no shared trust chain) that we can trust and that meet all of our operational needs (especially: easy issuance of large SAN lists with multiple wildcards, and dual issuance of ECC+RSA certs).
What is in scope for this ticket is making the actual changes necessary to deploy the dual-vendor keys after we've purchased them, and documenting a simple procedure for switching them.