Several people have asked me what we're missing before we can start serving the unified cert that acme-chief generates in production to end users. After speaking to Valentin we think that all the technical stuff is ready (note the puppet stuff is all there AFAIK - we use it in beta to serve wildcard certs for the main text cache stuff).
If the technical pieces are all in place, let's talk about what else we're missing before we can do this.
Description
Details
Status | Subtype | Assigned | Task
---|---|---|---
Stalled | | Vgutierrez | T230687 Decide/document criteria needed to serve acme-chief LE issued unified certificate to end users
Resolved | | Vgutierrez | T234803 Provide an easy way of picking the traffic serving TLS certificate used by ATS
Event Timeline
There's perhaps a faulty implicit assumption here that we desire to use one cert for the world and that we'd just "switch" everything to LE. We're currently using the Globalsign cert at all edges due to various problems earlier in the year, but what we were doing in the past and would like to continue doing in the future is using two certs simultaneously from unrelated CAs, and making the split on a per-datacenter basis (with the US sites using GlobalSign, and the non-US sites using LE, in this case).
The rationale for keeping multiple CAs in live use in general is redundancy on upstream OCSP services we rely on from the CAs, and general safety against any other CA-level problem (e.g. unintended or undesirable revocations, etc). The reason we want both of them active at the same time is so that we can have confidence that both are functioning properly. The reason we split between US and non-US sites is that geoip resolution is flappier within the US and we don't want some edge-case UAs seeing a constantly-changing cert. The reason we default to choosing GlobalSign on the US side and LE (or Digicert, historically) on the non-US side is the LE/Digicert OCSP responses are much smaller than GlobalSign's, and US clients get better performance out of the site in general and thus can better tolerate the minor perf impact of GS's larger OCSP staples.
The existing puppetization for the per-datacenter certificate vendor selection was written in an era when we only had manually-issued vendor certs, and is going to need some updating to handle smoothly switching (per-datacenter, and during emergencies) between the disparate filesystem paths of the manual and LE certs. @Vgutierrez may have some ideas about how to tackle these, but it's behind other priorities at present. (We could manually switch in the LE certs globally in an OCSP service emergency, if that were necessary before this puppetization work were done.) We'll probably wait to tackle this until our TLS termination has finished switching over to our new ATS implementation, since that's close on the horizon and the existing puppetization is nginx-based (T221594 and related).
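To make the selection logic described above concrete, here is a minimal Python sketch of a per-datacenter vendor choice with an emergency override. It is not the actual puppet code: the function names, the vendor keys, and both certificate paths are hypothetical placeholders, and the site list only assumes that eqiad, codfw and ulsfo are the US sites.

```python
# Illustrative sketch only: names and paths below are hypothetical placeholders,
# not the real puppet data model or the real on-disk certificate locations.
CERT_PATHS = {
    "globalsign": "/etc/ssl/example/globalsign-unified.crt",  # manually-issued cert (placeholder path)
    "letsencrypt": "/etc/acme-example/unified.crt",           # acme-chief LE cert (placeholder path)
}

# Assumption: these three sites are the US datacenters.
US_SITES = {"eqiad", "codfw", "ulsfo"}


def pick_vendor(site, emergency_override=None):
    """Return the cert vendor for a site: GlobalSign in the US, LE elsewhere.

    emergency_override forces one vendor everywhere, e.g. during an
    OCSP outage at the other CA.
    """
    if emergency_override is not None:
        if emergency_override not in CERT_PATHS:
            raise ValueError(f"unknown vendor: {emergency_override}")
        return emergency_override
    return "globalsign" if site in US_SITES else "letsencrypt"


def cert_path(site, emergency_override=None):
    """Resolve the vendor choice to the filesystem path the TLS terminator should load."""
    return CERT_PATHS[pick_vendor(site, emergency_override)]
```

For example, `pick_vendor("ulsfo")` yields `"globalsign"`, any site outside `US_SITES` yields `"letsencrypt"`, and `cert_path("ulsfo", emergency_override="letsencrypt")` returns the LE placeholder path. Keeping the vendor decision separate from the path lookup is what lets a single override flip every site at once despite the two vendors living under different paths.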
Updates on this last bit: T234803 is sorting out these technical details, and we probably don't have to wait for the full ATS-TLS transition, either. We're getting close over there, so I'll just make it a dependency of this task (once we're done over there, we can have the final patch to actually turn on LE usage at 1 or 2 edges land over here to close this).
Change 575305 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] ATS: Switch unified cert vendor to Let's Encrypt on ulsfo
Change 575305 merged by Vgutierrez:
[operations/puppet@production] ATS: Switch unified cert vendor to Let's Encrypt on ulsfo
Mentioned in SAL (#wikimedia-operations) [2020-03-02T13:53:33Z] <vgutierrez> Switch from globalsign to LE as unified cert vendor on cp4026 - T230687
Mentioned in SAL (#wikimedia-operations) [2020-03-02T13:55:38Z] <vgutierrez> Switch from globalsign to LE as unified cert vendor on ulsfo - T230687
$ openssl s_client -connect upload-lb.ulsfo.wikimedia.org:443 2>&1 < /dev/null |openssl x509 -noout -issuer -dates
issuer= /C=US/O=Let's Encrypt/CN=Let's Encrypt Authority X3
notBefore=Jan 22 07:03:51 2020 GMT
notAfter=Apr 21 07:03:51 2020 GMT
$ openssl s_client -connect text-lb.ulsfo.wikimedia.org:443 2>&1 < /dev/null |openssl x509 -noout -issuer -dates
issuer= /C=US/O=Let's Encrypt/CN=Let's Encrypt Authority X3
notBefore=Jan 22 07:03:51 2020 GMT
notAfter=Apr 21 07:03:51 2020 GMT
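A check like the one above can be scripted when verifying several load balancers. The following is a rough Python sketch that parses the text printed by `openssl x509 -noout -issuer -dates`. The function name `parse_x509_summary` is made up for illustration, and the parser assumes the older slash-separated issuer format shown in this transcript (newer OpenSSL releases print the issuer differently) and an English/C locale for month names.

```python
from datetime import datetime

# Output copied from the transcript in this task.
SAMPLE = """\
issuer= /C=US/O=Let's Encrypt/CN=Let's Encrypt Authority X3
notBefore=Jan 22 07:03:51 2020 GMT
notAfter=Apr 21 07:03:51 2020 GMT
"""


def parse_x509_summary(text):
    """Parse issuer and validity dates from `openssl x509 -noout -issuer -dates` output.

    Assumes the slash-separated issuer format; issuer values that themselves
    contain "/" are not handled.
    """
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not a key=value line
        key, value = key.strip(), value.strip()
        if key == "issuer":
            # "/C=US/O=Let's Encrypt/CN=..." -> {"C": "US", "O": "Let's Encrypt", ...}
            info["issuer"] = dict(
                part.split("=", 1) for part in value.split("/") if "=" in part
            )
        elif key in ("notBefore", "notAfter"):
            info[key] = datetime.strptime(value, "%b %d %H:%M:%S %Y GMT")
    return info


info = parse_x509_summary(SAMPLE)
served_by_le = info["issuer"].get("O") == "Let's Encrypt"
validity_days = (info["notAfter"] - info["notBefore"]).days
```

With the sample above, `served_by_le` is true and `validity_days` is 90, which matches the 90-day lifetime of Let's Encrypt certificates.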
Change 576188 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] ATS: Switch unified cert vendor to Let's Encrypt on eqiad & codfw
Change 576188 merged by Vgutierrez:
[operations/puppet@production] ATS: Switch unified cert vendor to Let's Encrypt on eqiad & codfw
Mentioned in SAL (#wikimedia-operations) [2020-03-03T06:25:31Z] <vgutierrez> Switch from globalsign to LE as unified cert vendor on codfw - T230687
Mentioned in SAL (#wikimedia-operations) [2020-03-03T06:33:23Z] <vgutierrez> Switch from globalsign to LE as unified cert vendor on eqiad - T230687
The swap of Traffic for Traffic-Icebox in this ticket's set of tags was based on a bulk action for all such tickets that haven't been updated in 6 months or more. This does not imply any human judgement about the validity or importance of the task, and is simply the first step in a larger task cleanup effort. Further manual triage and/or requests for updates will happen this month for all such tickets. For more detail, have a look at the extended explanation on the main page of Traffic-Icebox. Thank you!
@Vgutierrez It looks like the work you've done means that this can be closed. Is that the case?