Make OCSP Stapling support more generic and robust
Closed, Resolved, Public

Assigned To

Authored By

	BBlack
	Mar 25 2015, 6:42 PM

Description

The current update mechanism for OCSP Stapling has some built-in timing constants (fetch intervals, freshness checks in icinga, etc.) which are based on the 12-hour validity window we currently observe from our primary production cert provider, GlobalSign. GlobalSign could change their configuration and alter that time window at any time, or we could need to support other CAs. We need to add a daemon mode to the update-ocsp script which can manage this process better and dynamically choose retry/refetch intervals based on the most recently received validity windows (which would also make it more robust against transient issues in general).

For the time being, with GlobalSign as our only prod cert provider, if they do make a window change before we're ready, we can work around it by modifying various timing-related parameters in puppet.
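
A minimal sketch of how a daemon mode might derive its sleep interval from the validity window of the last response instead of from fixed constants. The function name, the divisor of 12 (chosen so that the current 12-hour window maps to an hourly fetch) and the 5-minute floor are illustrative assumptions, not part of the existing script.

```python
from datetime import datetime, timedelta, timezone

MIN_DELAY = timedelta(minutes=5)   # assumed floor, also used as the retry delay
WINDOW_DIVISOR = 12                # assumed: a 12h window gives 1h between fetches


def next_fetch_delay(this_update, next_update, now, last_fetch_ok):
    """Return how long to wait before the next OCSP fetch.

    this_update / next_update are the validity bounds of the most recent
    good response; last_fetch_ok says whether the latest attempt succeeded.
    """
    window = next_update - this_update
    remaining = next_update - now
    # Retry quickly after a failure, or if the window itself looks bogus.
    if not last_fetch_ok or window <= timedelta(0):
        return MIN_DELAY
    # Scale with the window, but never sleep through most of what is left.
    delay = min(window / WINDOW_DIVISOR, remaining / 2)
    return max(delay, MIN_DELAY)
```

Under these assumptions a 12-hour window yields an hourly fetch, and a 4-day window would stretch that to 8 hours without any puppet changes.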

Details

Subject	Repo	Branch	Lines +/-
remove stapling_proxy patch	operations/software/nginx	wmf-1.11.4	+0 -226
add stapling-multi-file patch	operations/software/nginx	wmf-1.11.4	+337 -0
tlsproxy: support multiple independent staples	operations/puppet	production	+26 -10
Revert "cp1008: disable do_ocsp_int while experimenting with nginx packages"	operations/puppet	production	+1 -1
Revert "tlsproxy: experimental support for internal ocsp"	operations/puppet	production	+0 -7
Add ssl_stapling_proxy patch	operations/software/nginx	wmf-1.11.4	+234 -0
tlsproxy: experimental support for internal ocsp	operations/puppet	production	+7 -0
update-ocsp: check response status	operations/puppet	production	+6 -0

Related Objects

Status	Assigned	Task
Declined	None	T165455 Go from "E" to "A+" on Securityheaders.io
Declined	None	T92002 implement Public Key Pinning (HPKP) for Wikimedia domains
Resolved	BBlack	T148131 Deploy redundant unified certs
Resolved	BBlack	T93927 Make OCSP Stapling support more generic and robust
Duplicate	None	T148132 OCSP Stapling: support truly-independent ECC/RSA Certs+Staples
Resolved	BBlack	T148490 Extend check_sslxnn to check OCSP Stapling

Event Timeline

BBlack created this task. Mar 25 2015, 6:42 PM

BBlack claimed this task.

BBlack raised the priority of this task from to Medium.

BBlack updated the task description.

BBlack added a project: acl*sre-team.

Restricted Application added a subscriber: Aklapper. Mar 25 2015, 6:42 PM

BBlack added a project: HTTPS. Mar 25 2015, 6:43 PM

BBlack set Security to None.

BBlack mentioned this in T86666: HTTPS performance tuning.

BBlack added a project: Traffic. Apr 22 2015, 2:41 PM

BBlack mentioned this in T97506: Use OCSP Stapling on misc cluster. Apr 29 2015, 1:57 PM

Dzahn raised the priority of this task from Medium to High. May 29 2015, 8:18 PM

Dzahn subscribed.

BBlack moved this task from Backlog to Upcoming on the Traffic board. Jun 3 2015, 2:08 PM

greg added a project: Incident-20150820-OCSP. Aug 20 2015, 5:48 PM

Restricted Application added a subscriber: Matanya. Aug 20 2015, 5:48 PM

BBlack mentioned this in rOPUP367653244eeb: update-ocsp: refactor validation, check cert life. Aug 21 2015, 12:15 AM

Dzahn moved this task from Backlog to Big Picture on the HTTPS board. Dec 4 2015, 8:44 PM

BBlack moved this task from Upcoming to Backlog on the Traffic board. Dec 11 2015, 4:24 PM

BBlack mentioned this in T132835: cronspam from cpXXXX hosts due to update-ocsp-all and zero_fetch. Apr 16 2016, 11:23 AM

ema subscribed. May 30 2016, 2:02 PM

greg added a project: Wikimedia-Incident. Jul 28 2016, 10:17 PM

greg moved this task from Active investigation to Follow-up prevention on the Wikimedia-Incident board. Jul 28 2016, 10:18 PM

This follow-up task from an incident report has not been updated recently. If it is no longer valid, please add a comment explaining why. If it is still valid, please prioritize it appropriately relative to your other work. If you have any questions, feel free to ask me (Greg Grossmeier).

BBlack moved this task from Backlog to TLS on the Traffic board. Sep 30 2016, 1:44 PM

This is still technically an outstanding issue that should be addressed, but it's relatively low priority with relatively low risk, at least so long as our prod certs continue to be issued exclusively by GlobalSign.

Note that this particular task was already open before the related incident. Other, more incident-specific ones were filed (T109737, T109738) and closed with the merge of https://phabricator.wikimedia.org/rOPUP367653244eeba031b6d807e524abbcf66f5908d9 .

Arguably, if the link from the incident to this open ticket of broader scope is annoying, it could be removed from the incident itself.

In T93927#2684299, @BBlack wrote:

Arguably, if the link from the incident to this open ticket of broader scope is annoying, it could be removed from the incident itself.

Only annoying if you think me pinging once/quarter on old open tasks without much activity (in wikimedia-incident) is :)

Change 315670 had a related patch set uploaded (by BBlack):
update-ocsp: check response status

https://gerrit.wikimedia.org/r/315670

gerritbot added a project: Patch-For-Review. Oct 13 2016, 12:44 PM

Change 315670 merged by BBlack:
update-ocsp: check response status

https://gerrit.wikimedia.org/r/315670

BBlack added a parent task: T148132: OCSP Stapling: support truly-independent ECC/RSA Certs+Staples. Oct 14 2016, 10:04 AM

BBlack removed a parent task: T148132: OCSP Stapling: support truly-independent ECC/RSA Certs+Staples. Oct 14 2016, 10:13 AM

BBlack added a subtask: T148132: OCSP Stapling: support truly-independent ECC/RSA Certs+Staples.

I've re-evaluated nginx's stapling support today. We last evaluated it deeply many versions ago and found that for various reasons we were better off using their ssl_stapling_file directive and managing the fetching/caching/validation of the staple data externally.

The present nginx internal stapling code looks significantly better. Basically, in terms of refresh intervals, validity checks, etc, it's all at least as good as our external stapling code if not better. The upsides where it's better are:

On upstream OCSP fetch failure, it will continue to poll upstream every 5 minutes instead of the usual 1h refresh window for valid data, whereas our current solution always polls once an hour.
Like our current solution it will continue serving the last-known-good OCSP response in the face of upstream failure, but unlike our current solution it will automatically stop doing so once the validity time window in that data expires.
It correctly handles the case of parallel RSA+ECC certs which have distinct OCSP responders and trust chains, whereas the ssl_stapling_file case does not.
Obviously, if we switch to their internal code, we get to kill a bunch of infrastructure code and complexity (which affects us on operational procedures as well), and resolve this task and its new child task without all the further significant work on that infrastructure/code.
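
The refresh behaviour described in the first two points can be sketched as a small state holder. This is only a model of the policy as summarized above (1h refresh on success, 5 minute retry on failure, stop serving once the validity window ends); the class and field names are hypothetical and this is not nginx's actual code.

```python
from dataclasses import dataclass
from typing import Optional

REFRESH_OK = 3600    # seconds until the next fetch after a good response
REFRESH_FAIL = 300   # seconds until the next attempt after a failed fetch


@dataclass
class Staple:
    data: bytes
    next_update: float  # end of the validity window, epoch seconds


class StapleCache:
    def __init__(self):
        self.current: Optional[Staple] = None
        self.next_fetch_at = 0.0

    def record_fetch(self, now: float, staple: Optional[Staple]) -> None:
        """Record a fetch attempt; staple is None when the fetch failed."""
        if staple is not None:
            self.current = staple
            self.next_fetch_at = now + REFRESH_OK
        else:
            # Keep the last-known-good staple, but poll again soon.
            self.next_fetch_at = now + REFRESH_FAIL

    def staple_for_handshake(self, now: float) -> Optional[bytes]:
        """Serve the last-known-good staple only while it is still valid."""
        if self.current is not None and now < self.current.next_update:
            return self.current.data
        return None
```

The difference from an hourly cron job shows up in the failure path: the retry interval shrinks, and an expired staple is dropped rather than served indefinitely.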

The downsides to switching to the internal stapling code:

1. It still doesn't support proxying the upstream fetch through an HTTP proxy. We need this as our servers are in our internal network.
2. We lose OCSP cache persistence. Our current solution can keep last-known-good OCSP data on disk through nginx restarts or host restarts in the case of an extended upstream OCSP outage.
3. We lose some of our current visibility into whether upstream OCSP is failing - the current monitoring we do on OCSP staple file freshness, and the ability to look at on-disk files for debugging, etc.

On those points, I think the effort to fix (1) is probably reasonable as an upstream-able nginx patch, and a better use of our coding time than extending our current external python solution to address all its other needs. (2) is potentially patch-able as well (persisting to disk from nginx), but it's probably not worth the effort and complexity. If we have an upstream OCSP outage, we should be aware of it and hold off on other maintenance, and the likelihood of other concurrent outages should be slim. (3), as well as the awareness mentioned in (2), should be fixable by properly monitoring our OCSP responses from icinga on a per-cache-host basis, like we do for basic SSL cert issues.

So, perhaps we should turn this task around and start figuring out how to transition to nginx's built-in stapling instead.

Change 315982 had a related patch set uploaded (by BBlack):
[WIP] preliminary ssl_stapling_proxy work

https://gerrit.wikimedia.org/r/315982

Change 316352 had a related patch set uploaded (by BBlack):
tlsproxy: experimental support for internal ocsp

https://gerrit.wikimedia.org/r/316352

Change 316352 merged by BBlack:
tlsproxy: experimental support for internal ocsp

https://gerrit.wikimedia.org/r/316352

https://pinkunicorn.wikimedia.org/ is now testing using nginx's internal stapling with the added ssl_stapling_proxy patch, via webproxy.eqiad.wmnet:8080

BBlack created subtask T148490: Extend check_sslxnn to check OCSP Stapling. Oct 18 2016, 2:43 AM

Change 315982 merged by BBlack:
Add ssl_stapling_proxy patch

https://gerrit.wikimedia.org/r/315982

Seems like the nginx-internal + ssl_stapling_proxy path is the way to go, so folding the other subtask back in (already taken care of in nginx).

BBlack closed subtask T148490: Extend check_sslxnn to check OCSP Stapling as Resolved. Oct 31 2016, 8:22 PM

BBlack added a parent task: T148131: Deploy redundant unified certs.

This task has continued to evolve. Basically, the remaining steps on the current path to resolution are:

Deploy https://gerrit.wikimedia.org/r/#/c/318971 (or something like it) which ensures actual OCSP stapling is monitored live from all the caches.
Test and deploy the currently-in-testing do_ocsp_int functionality (use nginx internal stapling)
After sufficient time to ensure we don't need a rollback, remove the existing external stapling scripts/data/cronjobs from the cache hosts.

In T93927#2717425, @BBlack wrote:

The downsides to switching to the internal stapling code:

1. It still doesn't support proxying the upstream fetch through an HTTP proxy. We need this as our servers are in our internal network.

2. We lose OCSP cache persistence. Our current solution can keep last-known-good OCSP data on disk through nginx restarts or host restarts in the case of an extended upstream OCSP outage.

3. We lose some of our current visibility into whether upstream OCSP is failing - the current monitoring we do on OCSP staple file freshness, and the ability to look at on-disk files for debugging, etc.

So, we've fixed (3) by actually monitoring our live OCSP outputs (yay!) with the check_ssl_unified work. We've got a patch for (1) that works. However, (2) is still an outstanding issue above, and on re-re-review of the nginx OCSP code there's a fourth issue as well:

It lets several responses go unstapled at the startup of each child process. To recap how that works: Each child does its own staple-fetching. When the first request that wants a staple comes into a process with no existing staple data, nginx will synchronously fetch the missing staple for that one client. Any other client requests that want stapling and arrive at this process before the initial fetch completes get responses without staples. This is going to happen every time we do config reloads and such: a spam of hundreds or even thousands of requests will slip through unstapled before the stapled response is ready.
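
To get a feel for the size of that startup gap, here is a rough model of a single child process, following the description above: the first request triggers the fetch and is stapled, and anything else arriving before the fetch completes is not. The function name, the arrival pattern and the fetch duration are made-up inputs for illustration, not measurements.

```python
def count_unstapled(arrival_ms, fetch_ms):
    """Count responses sent without a staple by one freshly started child.

    arrival_ms: sorted arrival times (in ms) of requests that want a staple.
    fetch_ms:   how long the initial upstream OCSP fetch takes.
    """
    if not arrival_ms:
        return 0
    # The first request kicks off the fetch; the staple is ready after it ends.
    ready_at = arrival_ms[0] + fetch_ms
    # Every later request that arrives before ready_at goes out unstapled.
    return sum(1 for t in arrival_ms[1:] if t < ready_at)


# Hypothetical load: one request per millisecond, 200 ms upstream fetch.
unstapled = count_unstapled(list(range(1000)), 200)
```

In this toy case 199 responses from one child go out without a staple, and the effect repeats in every child after each reload.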

The problems in 2 and 4 are related. Fixing persistence would probably fix the unstapled-startup case. However, actually patching all of that in is pretty unwieldy in terms of complexity.

So I've backtracked a bit on this whole topic and decided we should stick with the existing external-stapling mechanism for now. The only major feature it's missing at this point is supporting the independent stapling of parallel ECDSA+RSA certs which might have different intermediates (and/or vendors and/or OCSP URIs). Our external stapling script can handle that case, but nginx's ssl_stapling_file directive doesn't support multiple separate files (one per certificate). So to that end I'm now testing an nginx patch to make the nginx part work, here: https://gerrit.wikimedia.org/r/#/c/320115/3/debian/patches/0600-stapling-multi-file.patch

In preliminary testing on pinkunicorn it's working fine. I'll probably do more testing and revising before it's ready though, and roll it out together with a batch of other pending nginx updates for other unrelated issues.
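
On the script side, independent staples amount to keeping one staple file per certificate. A minimal sketch of that pairing, assuming staples are named after the certificate file; the directory, the .ocsp extension and the certificate paths here are hypothetical, and this says nothing about how the nginx patch itself selects a file.

```python
from pathlib import PurePosixPath

STAPLE_DIR = "/var/cache/ocsp"  # hypothetical location for staple files


def staple_path_for(cert_path, staple_dir=STAPLE_DIR):
    """Derive a per-certificate staple file path from the certificate name."""
    name = PurePosixPath(cert_path).stem
    return str(PurePosixPath(staple_dir) / f"{name}.ocsp")


def pair_certs_with_staples(cert_paths):
    """Map each certificate to its own staple file, so RSA and ECDSA certs
    with different intermediates or OCSP URIs never share a response."""
    return {cert: staple_path_for(cert) for cert in cert_paths}


pairs = pair_certs_with_staples([
    "/etc/ssl/localcerts/example-ecdsa.crt",  # hypothetical cert paths
    "/etc/ssl/localcerts/example-rsa.crt",
])
```

Each certificate then has exactly one staple file to fetch, validate and monitor, independent of the other.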

Change 320113 had a related patch set uploaded (by BBlack):
remove stapling_proxy patch

https://gerrit.wikimedia.org/r/320113

Change 320115 had a related patch set uploaded (by BBlack):
add stapling-multi-file patch

https://gerrit.wikimedia.org/r/320115

Change 320696 had a related patch set uploaded (by BBlack):
Revert "cp1008: disable do_ocsp_int while experimenting with nginx packages"

https://gerrit.wikimedia.org/r/320696

Change 320697 had a related patch set uploaded (by BBlack):
Revert "tlsproxy: experimental support for internal ocsp"

https://gerrit.wikimedia.org/r/320697

Change 320697 merged by BBlack:
Revert "tlsproxy: experimental support for internal ocsp"

https://gerrit.wikimedia.org/r/320697

Change 320696 merged by BBlack:
Revert "cp1008: disable do_ocsp_int while experimenting with nginx packages"

https://gerrit.wikimedia.org/r/320696

Change 320704 had a related patch set uploaded (by BBlack):
tlsproxy: support multiple independent staples

https://gerrit.wikimedia.org/r/320704

Change 320704 merged by BBlack:
tlsproxy: support multiple independent staples

https://gerrit.wikimedia.org/r/320704

Change 320115 merged by BBlack:
add stapling-multi-file patch

https://gerrit.wikimedia.org/r/320115

Change 320113 merged by BBlack:
remove stapling_proxy patch

https://gerrit.wikimedia.org/r/320113

Mentioned in SAL (#wikimedia-operations) [2016-11-10T00:46:16Z] <bblack> nginx-1.11.4-1+wmf14 uploaded to carbon jessie-wikimedia (only deployed to cp1008 for now) - T93927 - T148917 - T144523

Stashbot mentioned this in T148917: Thumbnails failing to render sporadically (ERR_CONNECTION_CLOSED or ERR_SSL_BAD_RECORD_MAC_ALERT). Nov 10 2016, 12:46 AM

Stashbot mentioned this in T144523: OpenSSL 1.1 deployment for cache clusters.

We're in a good place here to close this ticket. To recap all OCSP related things:

We fetch OCSP data for our certs (with a separate, async tool on each cache node) once an hour, and our current provider (GlobalSign) gives us data that's valid for 4 days.
Our script rejects updates which are not completely-correct, so if the provider stops serving good data, we'll continue using the 4-day-TTL existing data.
Our script and the termination software (nginx) support fully-independent stapling for ECDSA and RSA (which will be required with DigiCert and other future scenarios).
We monitor (on every cache node) whether the files fetched above go stale, which will alert us to the fact that we're using valid-but-stale data during said 4-day window.
We're monitoring actual stapled OCSP response data over TLS from all individual cache termination nodes, which will alert us to live OCSP failures. In the stale case, monitoring will also warn if the responses have < 3 days of lifetime left in them, and go critical if only 1 day of lifetime is left. This is somewhat redundant with the above, but it's important to monitor it at both levels just in case.
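
The lifetime thresholds in the last point can be sketched as a simple classification on the staple's remaining validity. The function name and return strings are illustrative and not taken from the actual check; only the 3-day and 1-day thresholds come from the recap above.

```python
from datetime import datetime, timedelta, timezone

WARN_BELOW = timedelta(days=3)   # warn when less than 3 days of lifetime remain
CRIT_AT = timedelta(days=1)      # critical when only 1 day (or less) remains


def staple_status(next_update, now):
    """Classify a stapled OCSP response by its remaining lifetime."""
    remaining = next_update - now
    if remaining <= CRIT_AT:
        return "CRITICAL"
    if remaining < WARN_BELOW:
        return "WARNING"
    return "OK"
```

With a 4-day validity window and hourly fetches, a healthy node should always report more than 3 days remaining, so a warning means roughly a day of failed updates.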

Krinkle edited projects, added Sustainability (Incident Followup); removed Wikimedia-Incident. Apr 28 2020, 9:50 PM
