
Make OCSP Stapling support more generic and robust
Closed, Resolved (Public)

Description

The current update mechanism for OCSP Stapling has some built-in timing constants (fetch intervals, freshness checks in icinga, etc.) which are based on the current 12-hour validity window we're observing from our primary production cert provider, GlobalSign. GlobalSign could change their configuration and alter that time window at any time, or we could need to support other CAs. We need to add a daemon mode to the ocsp-update script which can manage this process better and dynamically choose retry/re-fetch intervals based on the latest received windows (which would also make it more robust against transient issues in general).
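As a rough illustration of what "dynamically choose retry/re-fetch intervals" could mean, here is a minimal Python sketch. It is not the actual script: the function name, the divisors (refresh after a quarter of the window, retry after a twentieth) and the 5-minute floor are all assumptions picked for the example. The only inputs are the thisUpdate/nextUpdate times of the most recently accepted response.

```python
from datetime import datetime, timedelta, timezone


def next_fetch_delay(this_update, next_update, now, ok=True,
                     refresh_divisor=4, retry_divisor=20,
                     min_delay=timedelta(minutes=5)):
    """Return how long to wait before the next OCSP fetch.

    All parameters other than the three timestamps are hypothetical
    tuning knobs, not values taken from the real update script.
    """
    window = next_update - this_update      # validity window of the held response
    remaining = next_update - now           # lifetime left in the held response
    if window <= timedelta(0) or remaining <= timedelta(0):
        # Nonsensical or already-expired data: retry as fast as we allow.
        return min_delay
    # Scale the interval to the window: slower when healthy, faster after a failure.
    delay = window / (refresh_divisor if ok else retry_divisor)
    # Never sleep through more than half of the lifetime that is left.
    delay = min(delay, remaining / 2)
    return max(delay, min_delay)


if __name__ == "__main__":
    t0 = datetime(2016, 1, 1, tzinfo=timezone.utc)
    print(next_fetch_delay(t0, t0 + timedelta(hours=12), t0))            # healthy fetch
    print(next_fetch_delay(t0, t0 + timedelta(hours=12), t0, ok=False))  # after a failure
```

With this shape, a provider changing its window from 12 hours to several days would change the fetch cadence without anyone touching constants in puppet.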

For the time being, with GlobalSign as our only prod cert provider, if they do make a window change before we're ready, we can work around it by modifying various timing-related parameters in puppet.

Event Timeline

BBlack claimed this task.
BBlack raised the priority of this task from to Medium.
BBlack updated the task description.
BBlack added a project: acl*sre-team.
Dzahn raised the priority of this task from Medium to High. May 29 2015, 8:18 PM
Dzahn subscribed.

This follow-up task from an incident report has not been updated recently. If it is no longer valid, please add a comment explaining why. If it is still valid, please prioritize it appropriately relative to your other work. If you have any questions, feel free to ask me (Greg Grossmeier).

This is still technically an outstanding issue that should be addressed, but it's relatively low priority with relatively low risk, at least so long as our prod certs continue to be issued exclusively by GlobalSign.

Note this particular task was already open before the related incident. Other more incident-specific ones were filed (T109737, T109738) and closed with the merge of https://phabricator.wikimedia.org/rOPUP367653244eeba031b6d807e524abbcf66f5908d9 .

Arguably, if the link from the incident to this open ticket of broader scope is annoying, it could be removed from the incident itself.

Only annoying if you think me pinging once/quarter on old open tasks without much activity (in wikimedia-incident) is :)

Change 315670 had a related patch set uploaded (by BBlack):
update-ocsp: check response status

https://gerrit.wikimedia.org/r/315670

Change 315670 merged by BBlack:
update-ocsp: check response status

https://gerrit.wikimedia.org/r/315670

I've re-evaluated nginx's stapling support today. We last evaluated it deeply many versions ago and found that for various reasons we were better off using their ssl_stapling_file directive and managing the fetching/caching/validation of the staple data externally.

The present nginx internal stapling code looks significantly better. In terms of refresh intervals, validity checks, etc., it's at least as good as our external stapling code, if not better. The places where it's better are:

  1. On upstream OCSP fetch failure, it will continue to poll upstream every 5 minutes instead of the usual 1h refresh window for valid data, whereas our current solution always polls once an hour.
  2. Like our current solution it will continue serving the last-known-good OCSP response in the face of upstream failure, but unlike our current solution it will automatically stop doing so once the validity time window in that data expires.
  3. It correctly handles the case of parallel RSA+ECC certs which have distinct OCSP responders and trust chains, whereas the ssl_stapling_file case does not.
  4. Obviously, if we switch to their internal code, we get to kill a bunch of infrastructure code and complexity (which affects us on operational procedures as well), and resolve this task and its new child task without all the further significant work on that infrastructure/code.
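Points 1 and 2 above amount to a small refresh policy, which can be sketched in Python roughly as follows. This models the behaviour as described in this comment, not nginx's actual C code; the `Staple` type and `after_fetch` function are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional, Tuple

REFRESH_OK = timedelta(hours=1)     # normal refresh interval for valid data
RETRY_FAIL = timedelta(minutes=5)   # faster polling after an upstream failure


@dataclass
class Staple:
    data: bytes            # raw OCSP response (placeholder type)
    next_update: datetime  # end of the response's validity window


def after_fetch(current: Optional[Staple], fetched: Optional[Staple],
                now: datetime) -> Tuple[Optional[Staple], timedelta]:
    """Return (staple to serve, delay until the next fetch attempt).

    `fetched` is None when the upstream fetch or its validation failed.
    """
    if fetched is not None:
        return fetched, REFRESH_OK
    # Fetch failed: keep the last-known-good staple only while it is still valid.
    if current is not None and now < current.next_update:
        return current, RETRY_FAIL
    # Nothing valid left to serve: stop stapling, keep retrying quickly.
    return None, RETRY_FAIL
```

The difference from our current solution, as described above, is in the two failure branches: the retry is faster than the hourly poll, and the old staple is dropped once its validity window has passed.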

The downsides to switching to the internal stapling code:

  1. It still doesn't support proxying the upstream fetch through an HTTP proxy. We need this as our servers are in our internal network.
  2. We lose OCSP cache persistence. Our current solution can keep last-known-good OCSP data on disk through nginx restarts or host restarts in the case of an extended upstream OCSP outage.
  3. We lose some of our current visibility into whether upstream OCSP is failing - the current monitoring we do on OCSP staple file freshness, and the ability to look at on-disk files for debugging, etc.

On those points, I think the effort to fix (1) is probably reasonable as an upstream-able nginx patch, and a better use of our coding time than extending our current external Python solution to address all its other needs. (2) is potentially patch-able as well (persisting to disk from nginx), but it's probably not worth the effort and complexity: if we have an upstream OCSP outage, we should be aware of it and hold off on other maintenance, and the likelihood of other concurrent outages should be slim. (3), as well as the awareness mentioned in (2), should be fixable by properly monitoring our OCSP responses from icinga on a per-cache-host basis, like we do for basic SSL cert issues.

So, perhaps we should turn this task around and start figuring out how to transition to nginx's built-in stapling instead.

Change 315982 had a related patch set uploaded (by BBlack):
[WIP] preliminary ssl_stapling_proxy work

https://gerrit.wikimedia.org/r/315982

Change 316352 had a related patch set uploaded (by BBlack):
tlsproxy: experimental support for internal ocsp

https://gerrit.wikimedia.org/r/316352

Change 316352 merged by BBlack:
tlsproxy: experimental support for internal ocsp

https://gerrit.wikimedia.org/r/316352

https://pinkunicorn.wikimedia.org/ is now testing using nginx's internal stapling with the added ssl_stapling_proxy patch, via webproxy.eqiad.wmnet:8080

Change 315982 merged by BBlack:
Add ssl_stapling_proxy patch

https://gerrit.wikimedia.org/r/315982

BBlack subscribed.

Seems like the nginx-internal + ssl_stapling_proxy path is the way to go, so folding the other subtask back in (already taken care of in nginx).

This task has continued to evolve. Basically, the remaining steps on the current path to resolution are:

  1. Deploy https://gerrit.wikimedia.org/r/#/c/318971 (or something like it) which ensures actual OCSP stapling is monitored live from all the caches.
  2. Test and deploy the currently-in-testing do_ocsp_int functionality (use nginx internal stapling)
  3. After sufficient time to ensure we don't need a rollback, remove the existing external stapling scripts/data/cronjobs from the cache hosts.

The downsides to switching to the internal stapling code:

  1. It still doesn't support proxying the upstream fetch through an HTTP proxy. We need this as our servers are in our internal network.
  2. We lose OCSP cache persistence. Our current solution can keep last-known-good OCSP data on disk through nginx restarts or host restarts in the case of an extended upstream OCSP outage.
  3. We lose some of our current visibility into whether upstream OCSP is failing - the current monitoring we do on OCSP staple file freshness, and the ability to look at on-disk files for debugging, etc.

So, we've fixed (3) by actually monitoring our live OCSP outputs (yay!) with the check_ssl_unified work. We've got a patch for (1) that works. However, (2) above is still an outstanding issue, and on re-re-review of the nginx OCSP code there's a fourth issue as well:

  4. It lets several responses go unstapled at the startup of each child process. To recap how that works: each child does its own staple-fetching. When the first request that wants a staple comes into a process with no existing staple data, nginx will synchronously fetch the missing staple for that one client. Any other client requests that want stapling and arrive at this process before the initial fetch completes get responses without staples. This is going to happen every time we do config reloads and such: a burst of hundreds or even thousands of requests will slip through unstapled before the stapled response is ready.
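To get a feel for the scale of that startup gap, here is a toy Python model of a single child process, following the behaviour as recapped above (the first request triggers the fetch; everything else arriving before the fetch completes goes out unstapled). The function name is hypothetical and the arrival rate and fetch duration in the usage example are made-up numbers, not measurements from our caches.

```python
def count_unstapled(arrival_times_ms, fetch_duration_ms):
    """Count requests served without a staple after a child process starts.

    arrival_times_ms: sorted arrival times (ms since process start) of
    requests that ask for a staple.
    fetch_duration_ms: how long the initial staple fetch takes.
    """
    if not arrival_times_ms:
        return 0
    # The first request triggers the fetch; the staple is ready once it completes.
    ready_at = arrival_times_ms[0] + fetch_duration_ms
    # Every later request that arrives before that point gets no staple.
    return sum(1 for t in arrival_times_ms[1:] if t < ready_at)


if __name__ == "__main__":
    # Hypothetical load: one request per millisecond, 300 ms fetch.
    print(count_unstapled(list(range(0, 1000)), 300))
```

The count grows linearly with both the per-process request rate and the fetch latency, and it is paid again by every child on every reload.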

The problems in 2 and 4 are related. Fixing persistence would probably fix the unstapled-startup case. However, actually patching all of that in is pretty unwieldy in terms of complexity.

So I've backtracked a bit on this whole topic and decided we should stick with the existing external-stapling mechanism for now. The only major feature it's missing at this point is supporting the independent stapling of parallel ECDSA+RSA certs which might have different intermediates (and/or vendors and/or OCSP URIs). Our external stapling script can handle that case, but nginx's ssl_stapling_file directive doesn't support multiple separate files (one per certificate). So to that end I'm now testing an nginx patch to make the nginx part work, here: https://gerrit.wikimedia.org/r/#/c/320115/3/debian/patches/0600-stapling-multi-file.patch

In preliminary testing on pinkunicorn it's working fine. I'll probably do more testing and revising before it's ready though, and roll it out together with a batch of other pending nginx updates for other unrelated issues.

Change 320113 had a related patch set uploaded (by BBlack):
remove stapling_proxy patch

https://gerrit.wikimedia.org/r/320113

Change 320115 had a related patch set uploaded (by BBlack):
add stapling-multi-file patch

https://gerrit.wikimedia.org/r/320115

Change 320696 had a related patch set uploaded (by BBlack):
Revert "cp1008: disable do_ocsp_int while experimenting with nginx packages"

https://gerrit.wikimedia.org/r/320696

Change 320697 had a related patch set uploaded (by BBlack):
Revert "tlsproxy: experimental support for internal ocsp"

https://gerrit.wikimedia.org/r/320697

Change 320697 merged by BBlack:
Revert "tlsproxy: experimental support for internal ocsp"

https://gerrit.wikimedia.org/r/320697

Change 320696 merged by BBlack:
Revert "cp1008: disable do_ocsp_int while experimenting with nginx packages"

https://gerrit.wikimedia.org/r/320696

Change 320704 had a related patch set uploaded (by BBlack):
tlsproxy: support multiple independent staples

https://gerrit.wikimedia.org/r/320704

Change 320704 merged by BBlack:
tlsproxy: support multiple independent staples

https://gerrit.wikimedia.org/r/320704

Change 320115 merged by BBlack:
add stapling-multi-file patch

https://gerrit.wikimedia.org/r/320115

Change 320113 merged by BBlack:
remove stapling_proxy patch

https://gerrit.wikimedia.org/r/320113

Mentioned in SAL (#wikimedia-operations) [2016-11-10T00:46:16Z] <bblack> nginx-1.11.4-1+wmf14 uploaded to carbon jessie-wikimedia (only deployed to cp1008 for now) - T93927 - T148917 - T144523

We're in a good place here to close this ticket. To recap all OCSP related things:

  • We fetch OCSP data for our certs (with a separate, async tool on each cache node) once an hour, and our current provider (GlobalSign) gives us data that's valid for 4 days.
  • Our script rejects updates which are not completely correct, so if the provider stops serving good data, we'll continue using the 4-day-TTL existing data.
  • Our script and the termination software (nginx) support fully-independent stapling for ECDSA and RSA (which will be required with DigiCert and other future scenarios).
  • We monitor (on every cache node) whether the files fetched above go stale, which will alert us to the fact that we're using valid-but-stale data during said 4-day window.
  • We're monitoring actual stapled OCSP response data over TLS from all individual cache termination nodes, which will alert us to live OCSP failures. In the stale case, monitoring will also warn if the responses have < 3 days of lifetime left in them, and go critical if only 1 day of lifetime is left. This is somewhat redundant with the above, but it's important to monitor it at both levels just in case.
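The lifetime thresholds in the last bullet can be sketched as a small Python function. This is only an illustration of the warn/critical logic, not the real check: the function name is hypothetical, and it assumes the caller has already extracted the nextUpdate time from the stapled response. The return values follow the usual Nagios/Icinga plugin convention (0 = OK, 1 = WARNING, 2 = CRITICAL).

```python
from datetime import datetime, timedelta, timezone

OK, WARNING, CRITICAL = 0, 1, 2  # conventional Nagios/Icinga plugin states


def staple_status(next_update, now,
                  warn=timedelta(days=3), crit=timedelta(days=1)):
    """Classify a stapled OCSP response by how much lifetime it has left."""
    remaining = next_update - now
    if remaining <= crit:
        # One day or less left (including already expired): critical.
        return CRITICAL
    if remaining < warn:
        # Less than three days left: warning.
        return WARNING
    return OK


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(staple_status(now + timedelta(days=4), now))  # freshly fetched 4-day response
    print(staple_status(now + timedelta(days=2), now))  # fetches have been failing a while
```

With a 4-day response refreshed hourly, a healthy node should always sit well inside the OK range, so a WARNING means fetches have been failing for more than a day.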