
Decide/document criteria needed to serve acme-chief LE issued unified certificate to end users
Open, Stalled, Medium, Public

Description

Several people have asked me what we're missing before we can start serving the unified cert that acme-chief generates in production to end users. After speaking to Valentin we think that all the technical stuff is ready (note the puppet stuff is all there AFAIK - we use it in beta to serve wildcard certs for the main text cache stuff).
If we have all the technical stuff, let's talk about what we're still missing to be able to do this.

Event Timeline

Krenair created this task. Aug 18 2019, 12:47 PM
Restricted Application added a subscriber: Aklapper. Aug 18 2019, 12:47 PM
BBlack changed the task status from Open to Stalled. Aug 19 2019, 5:31 PM
BBlack assigned this task to Vgutierrez.

There's perhaps a faulty implicit assumption here that we want to use one cert for the world and that we'd just "switch" everything to LE. We're currently using the GlobalSign cert at all edges due to various problems earlier in the year, but what we did in the past, and would like to continue doing in the future, is use two certs simultaneously from unrelated CAs, split on a per-datacenter basis (in this case, the US sites using GlobalSign and the non-US sites using LE).

The rationale for keeping multiple CAs in live use in general is redundancy for the upstream OCSP services we rely on from the CAs, and general safety against any other CA-level problem (e.g. unintended or undesirable revocations, etc). The reason we want both of them active at the same time is so that we can have confidence that both are functioning properly. The reason we split between US and non-US sites is that geoip resolution is flappier within the US and we don't want some edge-case UAs seeing a constantly-changing cert. The reason we default to choosing GlobalSign on the US side and LE (or DigiCert, historically) on the non-US side is that the LE/DigiCert OCSP responses are much smaller than GlobalSign's, and US clients get better performance out of the site in general and thus can better tolerate the minor perf impact of GS's larger OCSP staples.
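To make the split concrete, here is a minimal Python sketch of the per-datacenter CA choice described above. It is illustrative only and is not the actual puppetization: the site names, the `SITE_REGION` table, and the `ca_for_site` helper are all hypothetical.

```python
from enum import Enum


class CA(Enum):
    GLOBALSIGN = "globalsign"
    LETSENCRYPT = "letsencrypt"


# Hypothetical site names and region labels, for illustration only.
SITE_REGION = {
    "us-site-a": "US",
    "us-site-b": "US",
    "eu-site-a": "non-US",
    "asia-site-a": "non-US",
}


def ca_for_site(site: str) -> CA:
    """Pick the CA for a site: GlobalSign for US sites, LE elsewhere.

    Raises KeyError for a site that is not in SITE_REGION.
    """
    region = SITE_REGION[site]
    return CA.GLOBALSIGN if region == "US" else CA.LETSENCRYPT


# Both CAs stay in live use across the fleet, so a failure of either
# one would be noticed rather than discovered during an emergency.
cas_in_use = {ca_for_site(site) for site in SITE_REGION}
```

Because the choice is keyed on the datacenter rather than on the client, a given edge always presents the same cert, whichever way geoip happens to route a user.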

The existing puppetization for the per-datacenter certificate vendor selection was written in an era when we only had manually-issued vendor certs, and is going to need some updating to handle smoothly switching (per-datacenter, and during emergencies) between the disparate filesystem paths of the manual and LE certs. @Vgutierrez may have some ideas about how to tackle these, but it's behind other priorities at present (we could manually switch in the LE certs globally in an OCSP service emergency, if that were necessary before this puppetization work were done). We'll probably wait to tackle this until our TLS termination has finished switching over to our new ATS implementation, since that's close on the horizon and the existing puppetization is nginx-based (T221594 and related).
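As a rough illustration of the path-switching problem, the following Python sketch resolves a certificate path from a per-site default vendor plus an optional global emergency override. Everything in it is an assumption made for the example: the paths are placeholders rather than the real locations of the manual or acme-chief certs, and `CERT_PATHS` and `cert_path` are hypothetical names.

```python
from typing import Optional

# Placeholder paths only; the real manual and LE cert locations differ.
CERT_PATHS = {
    "globalsign": "/hypothetical/manual-certs/unified.crt",
    "letsencrypt": "/hypothetical/acme-chief/unified.crt",
}


def cert_path(site_default: str, emergency_override: Optional[str] = None) -> str:
    """Return the cert path for a site.

    The emergency override, when set, wins over the per-site default, so
    every site can be moved to one vendor with a single setting.
    """
    vendor = emergency_override or site_default
    if vendor not in CERT_PATHS:
        raise ValueError(f"unknown certificate vendor: {vendor!r}")
    return CERT_PATHS[vendor]
```

Keeping the vendor-to-path mapping in one table means that neither the per-datacenter split nor an emergency switch needs to know where each kind of cert lives on disk.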

ema moved this task from Triage to TLS on the Traffic board. Aug 27 2019, 10:09 AM
ema triaged this task as Medium priority. Sep 5 2019, 3:20 PM

@Vgutierrez may have some ideas about how to tackle these, but it's behind other priorities at present (we could manually switch in the LE certs globally in an OCSP service emergency, if that were necessary before this puppetization work were done). We'll probably wait to tackle this until our TLS termination has finished switching over to our new ATS implementation, since that's close on the horizon and the existing puppetization is nginx-based (T221594 and related).

Updates on this last bit: T234803 is sorting out these technical details, and we probably don't have to wait for the full ATS-TLS transition, either. We're getting close over there, so I'll just make it a dependency of this task (once we're done over there, we can have the final patch to actually turn on LE usage at 1 or 2 edges land over here to close this).