Although browsers will buffer NEL reports if they are unable to send them immediately, it's always better to receive them ASAP if we can, as they are often the first indication of an issue.
In the case of T292792, we immediately had both direct user reports and data indicating trouble for eqiad users in the Americas, and the action we took at 16:19 UTC mitigated the issue for them. However, a large number of affected users in Russia and Kazakhstan weren't able to send NEL reports until after 16:19, or possibly even later: their route to esams was broken, causing the primary issue, and their 'next-best' datacenter used for NEL reports was eqiad, towards which their route was also broken. We also had no user reports of the issue until after 17:00 (potentially due to phab routing to eqiad).
It is possible to specify a group of endpoints to receive reports, each with a priority indicating the order in which to attempt transmission. The Network Error Logging working draft recommends this explicitly:
> To improve delivery of NEL reports, the server should set report_to to an endpoint group containing at least one endpoint in an alternative origin whose infrastructure is not coupled with the origin from which the resource is being fetched — otherwise network errors cannot be reported until the problem is solved, if ever — and provide multiple endpoints to provide alternatives if some endpoints are unreachable.
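As a sketch, an endpoint group with a prioritized fallback might look like the following (wrapped here for readability; the real header is a single line, and the group name and both URLs are purely illustrative). Per the Reporting API, the browser prefers the lowest-numbered priority and falls back to higher-numbered endpoints when delivery fails:

```
Report-To: { "group": "wm_nel", "max_age": 604800, "endpoints": [
    { "url": "https://intake-logging.example.org/v1/events", "priority": 1 },
    { "url": "https://nel-fallback.example/v1/events", "priority": 2 }
] }
NEL: { "report_to": "wm_nel", "max_age": 604800 }
```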
Of course, actually receiving and processing the reports elsewhere would introduce many PII concerns.
Given that the problems we're interested in detecting quickly involve intermediate networks and affect only a subset of users -- and that we can declare "complete loss of connectivity" out-of-scope, as we have many other ways of detecting that -- I propose we do the following:
- On VMs in a few different public clouds, host a simple TCP proxy that listens on port 443 and forwards connections to our usual CDN edges (remaining ignorant of anything at the TLS/HTTP level: no private keys, etc.)
- Run those VMs under a different domain name (T292866, T263847) to avoid same-origin / cookie PII concerns. Use an external DNS provider as well.
- Provision our usual CDN with an LE cert for that domain, and map its backend to the same EventGate service as eventgate-logging-external
- In our Report-To header, list that domain as a secondary endpoint in our endpoint group.
The upshot of this:
- In the event that routing or connectivity is broken between some users and our IP space -- but not between our IP space and at least one public cloud, and not between users and at least one public cloud -- we receive high-signal tcp.timed_out or dns.name_not_resolved reports within seconds, rather than only after the outage is resolved. (Furthermore, I conjecture that exactly this flavor of network brokenness is the most likely to be actionable for us specifically, rather than something like a widespread outage at one ISP or in one geographic region.)
- No PII such as URLs, User-Agent strings, or other such data is communicated in plaintext outside our infrastructure. (All that an observer can be sure of when noticing a user IP address accessing the naive forwarder is that a user at said IP address failed a fetch of some Wikimedia URL at some time in the past. I can't say there's zero potential here for leaking information, but it does seem rather limited.)