externally-hosted NEL report forwarders for more timely report reception
Open, Low, Public

Assigned To

None

Authored By

	CDanis
	Oct 8 2021, 7:11 PM

Description

Although browsers will buffer NEL reports if they are unable to send them immediately, it's always better to receive them ASAP if we can, as they are often the first indication of an issue.

In the case of T292792 we immediately had both direct user reports and data available indicating trouble for eqiad users in the Americas, and action we took at 16:19 UTC mitigated the issue for them. However, there were a large number of users in Russia and Kazakhstan who were affected, but who weren't able to send NEL reports until 16:19 or possibly even later -- their route to esams was broken, causing the primary issue, and their 'next-best' datacenter used for NEL reports was eqiad, towards which their route was also broken. We also had no user reports of the issue until after 17:00 (potentially due to phab routing to eqiad).

It is possible to specify a group of endpoints to receive reports, each with a priority indicating the order in which to attempt transmission. The Network Error Logging working draft recommends this explicitly:

To improve delivery of NEL reports, the server should set report_to to an endpoint group containing at least one endpoint in an alternative origin whose infrastructure is not coupled with the origin from which the resource is being fetched — otherwise network errors cannot be reported until the problem is solved, if ever — and provide multiple endpoints to provide alternatives if some endpoints are unreachable.
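As a rough illustration of what such an endpoint group looks like, here is a minimal sketch that builds a Report-To / NEL header pair with a primary and a lower-priority fallback endpoint. The function name, the group name, the max_age value, and both URLs are made up for the example and are not our actual configuration.

```python
import json


def build_nel_headers(endpoints, group="nel", max_age=3600):
    """Build Report-To and NEL header values for one endpoint group.

    `endpoints` is a list of (url, priority) pairs; a lower priority
    number means the browser should try that endpoint first.
    All names and defaults here are illustrative assumptions.
    """
    report_to = {
        "group": group,
        "max_age": max_age,
        "endpoints": [
            {"url": url, "priority": priority} for url, priority in endpoints
        ],
    }
    # The NEL header refers to the endpoint group by name.
    nel = {"report_to": group, "max_age": max_age}
    return {"Report-To": json.dumps(report_to), "NEL": json.dumps(nel)}


headers = build_nel_headers(
    [
        ("https://primary.example.org/nel", 1),    # hypothetical usual endpoint
        ("https://fallback.example.net/nel", 2),   # hypothetical alternative origin
    ]
)
for name, value in headers.items():
    print(f"{name}: {value}")
```

The fallback endpoint only helps if it sits on a different origin whose reachability does not depend on the same network path as the primary.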

Of course, actually receiving and processing the reports elsewhere would introduce many PII concerns.

Given that the issues we're interested in detecting quickly have to do with issues in intermediate networks which affect only a subset of users -- and that we can declare "complete loss of connectivity" to be out-of-scope as we have many other ways of detecting that -- I propose we do the following:

On VMs on a few different public clouds, host a simple TCP proxy that listens on port 443 and proxies connections to our usual CDN edges (while being ignorant of anything at the TLS/HTTP level, no private keys, etc)
Run those VMs under a different domain name (T292866, T263847) to avoid same-origin / cookie PII concerns. Use an external DNS provider as well.
Provision our usual CDN with a LE cert for that domain, and map its backend to the same EventGate service as eventgate-logging-external
In our Report-To header, list that domain as a secondary endpoint in our endpoint group.

The upshot of this:

In the event that routing or connectivity is broken between some users and our IP space -- but not our IP space and at least one public cloud, and not between users and at least one public cloud -- we receive high-signal tcp.timed_out or dns.name_not_resolved reports within seconds, rather than not until after the outage is resolved. (Furthermore, I conjecture that exactly this flavor of network brokenness is most likely to be one that is actionable for us specifically, rather than something like a widespread outage at one ISP or in one geographic region, etc.)
No PII such as URLs, User-Agent strings, or other such data is communicated in plaintext outside our infrastructure. (All that an observer can be sure of when noticing a user IP address accessing the naive forwarder is that a user at said IP address failed a fetch of some Wikimedia URL at some time in the past. I can't say there's zero potential here for leaking information, but it does seem rather limited.)
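To give a sense of how small the "naive forwarder" could be, here is a minimal sketch of a TCP-level proxy using Python's asyncio. It only copies bytes in both directions and never parses TLS or HTTP, so it needs no private keys. The upstream hostname and the function names are placeholders, and a real deployment would likely use an off-the-shelf proxy rather than this.

```python
import asyncio

# Hypothetical upstream: one of the usual CDN edges.
UPSTREAM_HOST = "cdn-edge.example.org"
UPSTREAM_PORT = 443


async def pipe(reader, writer):
    """Copy bytes from reader to writer until EOF, then close the writer."""
    try:
        while data := await reader.read(65536):
            writer.write(data)
            await writer.drain()
    except OSError:
        pass  # peer went away; nothing to forward anymore
    finally:
        writer.close()


async def start_forwarder(listen_host, listen_port, upstream_host, upstream_port):
    """Start a server that forwards each connection to the upstream, byte for byte."""

    async def handle(client_reader, client_writer):
        try:
            up_reader, up_writer = await asyncio.open_connection(
                upstream_host, upstream_port
            )
        except OSError:
            client_writer.close()  # upstream unreachable: drop the client
            return
        # Relay both directions concurrently; TLS stays end-to-end.
        await asyncio.gather(
            pipe(client_reader, up_writer),
            pipe(up_reader, client_writer),
        )

    return await asyncio.start_server(handle, listen_host, listen_port)


async def main():
    server = await start_forwarder("0.0.0.0", 443, UPSTREAM_HOST, UPSTREAM_PORT)
    async with server:
        await server.serve_forever()
```

Running `asyncio.run(main())` on the VM would start the forwarder. Since the TLS session terminates at our own CDN, the VM only ever sees ciphertext plus the client IP address.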

Related Objects

		Status	Subtype	Assigned	Task
		Open		CDanis	T257527 automatically collect network error reports from users' browsers (Network Error Logging API)
		Open		None	T292870 externally-hosted NEL report forwarders for more timely report reception

Event Timeline

CDanis triaged this task as Low priority. Oct 8 2021, 7:11 PM

CDanis created this task.

Restricted Application added a project: Infrastructure-Foundations. Oct 8 2021, 7:11 PM

Restricted Application added a subscriber: Aklapper.

CDanis added a parent task: T257527: automatically collect network error reports from users' browsers (Network Error Logging API). Oct 8 2021, 7:11 PM

CDanis updated the task description. Oct 8 2021, 7:24 PM

BBlack moved this task from Backlog to Icebox-Temp on the Traffic board. Oct 8 2021, 7:58 PM

The swap of Traffic for Traffic-Icebox in this ticket's set of tags was based on a bulk action for all tickets that are neither part of our current planned work nor clearly a recent, higher-priority emergent issue. This is simply one step in a larger task cleanup effort. Further triage of these tickets (and especially, organizing future potential project ideas from them into a new medium) will occur afterwards! For more detail, have a look at the extended explanation on the main page of Traffic-Icebox. Thank you!

I'd be wary of the complexity of the setup.

As I'm not quite familiar with NEL setup, is there a downside of putting all our POPs in the "group of endpoints to receive reports", either randomly or in a fixed order (here latency doesn't matter)?
Or 2 IPs that do round robin DNS between the endpoints, etc.

Using a different domain name to prevent TLS censorship (.wikimedia.org, T292866 might be even better), and with a secondary DNS hosted externally if we worry about DNS being unreachable. That usually wouldn't go unnoticed.

The only use case it won't protect us against is if all our IP spaces become unreachable from a subset of users, which is quite an uncommon case (e.g. censorship) and wouldn't go unnoticed either.
So if the proposal above is doable it might be a good tradeoff.

In T292870#7415867, @ayounsi wrote:

I'd be wary of the complexity of the setup.

Yeah, a fair criticism.

As I'm not quite familiar with NEL setup, is there a downside of putting all our POPs in the "group of endpoints to receive reports", either randomly or in a fixed order (here latency doesn't matter)?
Or 2 IPs that do round robin DNS between the endpoints, etc.

All you get to specify for each endpoint in a group is a URL (along with a numeric priority) -- so we'd either have to have per-site FQDNs (with a working TLS cert), or we'd have to have a round-robin A/AAAA record that resolved to multiple not-primary sites, etc.

(Offhand, I'm actually not sure whether or not Chromium's network stack is smart enough to automatically retry on the next address when connecting to the first address in a round-robin record times out or otherwise fails. If it does, I would feel pretty good about that.)
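For clarity, the retry behavior I have in mind is roughly the following. This is only a sketch of the desired fallback logic in Python, not a description of what Chromium actually does; the function name and the timeout are made up.

```python
import socket


def connect_first_reachable(addresses, timeout=3.0):
    """Try each (host, port) in order and return the first socket that connects.

    Mirrors the hoped-for handling of a round-robin record: a timeout or
    refusal on one address moves on to the next instead of failing outright.
    """
    last_error = None
    for host, port in addresses:
        try:
            return socket.create_connection((host, port), timeout=timeout)
        except OSError as exc:  # includes timeouts and connection refusals
            last_error = exc
    raise ConnectionError(f"no address reachable, last error: {last_error}")
```

If the browser behaves like this, a single round-robin name covering several non-primary sites would be enough, and per-site FQDNs would not be needed.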

Using a different domain name to prevent TLS censorship (.wikimedia.org, T292866 might be even better), and with a secondary DNS hosted externally if we worry about DNS being unreachable. That usually wouldn't go unnoticed.

The only use case it won't protect us against is if all our IP spaces become unreachable from a subset of users, which is quite an uncommon case (e.g. censorship) and wouldn't go unnoticed either.
So if the proposal above is doable it might be a good tradeoff.

This is a really good thought, thanks! It's simple, yet a sizable improvement over the status quo, while also a necessary prereq for my more complex proposal.

Let's proceed with getting a specialized domain (T263847 / T292866) and then we can set up an external DNS provider with records pointing towards our existing IP space & CDN.

cmooney subscribed. Feb 28 2022, 8:18 PM

BBlack moved this task from Backlog to Complicated on the Traffic-Icebox board. Apr 7 2022, 9:05 PM

ayounsi removed a project: netops. Aug 26 2022, 1:09 PM

RhinosF1 subscribed. Aug 26 2022, 1:12 PM
