Avoid extra HTTPS connections for most Event Platform beacons
Open, Low, Public

Description

Writing this in a fairly general form, assuming the title/description will be edited to be more general, since this doesn't exactly relate to Event Platform stuff, even though that's where it arose.

https://phabricator.wikimedia.org/T261340 proposes to resolve some URLs (in this specific case, https://intake-logging.wikimedia.org) to a different datacenter than normal. In particular here, a different one from the legacy /beacon endpoint, the https://intake-analytics.wikimedia.org endpoint, etc.

This ticket is based on an observation @Krinkle made in https://phabricator.wikimedia.org/T226986#6467370

For EventLogging, we specifically moved away from separate domains to using /beacon so that the majority of non-deferred events that are sent off during user interactions don't require separate connections to be established etc. It has been a while since we last quantified the benefits of this choice, so it's certainly worth revisiting.

I see that EventGate, which is not used yet for most events, uses a separate domain again at https://intake-logging.wikimedia.org. That'll involve a DNS query, but given it points to dyna.wikimedia.org same as text-lb, I'm assuming this means it is handled through the same connection and traffic layer as other requests.

Again, per @Krinkle:

Starting to establish more than one primary connection on a majority of page views is something we phased out 5+ years ago and would be great not to bring back without further research and consideration first.

Event Timeline

In https://phabricator.wikimedia.org/T226986#6467482 I wrote:

Huh, I'm pretty sure I discussed this with @BBlack or @ema when we were first setting up eventgate-analytics-external, and they preferred that the intake service got its own unique URL, rather than serving it in the wiki domains.

I can't find a public discussion of this in Phab, just https://phabricator.wikimedia.org/T233629#557637, so we must have discussed it in IRC or elsewhere.

Also relevant: T261340: 'skip_first' feature flag for gdnsd GeoIP plugin
Chris is making use of the fact that there is a separate endpoint to route logging events to the next nearest datacenter in case there is something wrong with the route to the nearest datacenter.

To instrument this, and gauge any background/side impact during page load, I'd recommend creating two speed-test scenarios under wikipedia.org/speed-tests/: one that is simple, like the current Banksy page, and another that performs a handful of sendBeacon() calls from inline scripts against a domain that requires a separate DNS resolution and TLS/TCP connection. That domain could be either intake-logging, if it has already been configured this way by now, or any other production URL that doesn't share the same DNS target and TLS cert with the canonical wiki domains.

Then add both scenarios to our synthetic test config for a few days to compare them side by side:
https://wikitech.wikimedia.org/wiki/Performance/WebPageTest#Add_a_new_URL_to_test
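
As a rough illustration of the second scenario, the inline script could look something like the sketch below. The helper name sendTestBeacons, the payload shape, and the endpoint used in the usage note are all hypothetical; only the use of sendBeacon() comes from the suggestion above. The sender is passed in as a parameter so the same code can be exercised outside a browser.

```typescript
// Signature-compatible with navigator.sendBeacon for string payloads.
type BeaconSender = (url: string, data: string) => boolean;

// Hypothetical helper: fires `count` small beacons at `endpoint` and
// returns how many of them the sender accepted (queued) for delivery.
function sendTestBeacons(
  endpoint: string,
  count: number,
  send: BeaconSender
): number {
  let queued = 0;
  for (let i = 0; i < count; i++) {
    // Illustrative payload only; not a real Event Platform schema.
    const payload = JSON.stringify({ scenario: "separate-domain", seq: i });
    if (send(endpoint, payload)) {
      queued++;
    }
  }
  return queued;
}
```

In the test page this might be invoked as sendTestBeacons("https://beacon-test.example/collect", 5, navigator.sendBeacon.bind(navigator)), with the placeholder URL swapped for whichever production domain ends up requiring its own DNS lookup and TLS/TCP connection.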

Krinkle renamed this task from "Research and consider network connections made due to Event Platform" to "Avoid extra HTTPS connections for most Event Platform beacons". Oct 3 2022, 9:25 PM

Updated the title to recognise that the original one of these (NEL: Network Error Logging) was intentionally set up on a separate domain and DC, so that it is logically separated from the networking problems we are asking the browser to notify us about.

However, for all first-party and in-page use of EventLogging/EventGate, this doesn't apply.

@phuedx @dr0ptp4kt I assume the new beacon endpoint (/beacon/v2/events ?) exists now? If so, I believe this task is do-able by changing wgEventLoggingServiceUri to /beacon/v2/events?hasty=true ?

It exists, although it is presently only intended for processing the "everyone" (i.e. edge cache-varied experiments, typically intended for readers) A/B test-related events.

Presently the same-domain path for the evt-103e beacons disallows events that are missing a proper edge unique with an associated "everyone" A/B test; this is enforced in wikimedia-frontend.vcl.erb and vmod_wmfuniq.c.

Before modifying wgEventLoggingServiceUri I think we would need to make just one change to the wikimedia-frontend.vcl.erb.

This said, I'll loop you @Ottomata onto a call soon with @Vgutierrez, @BBlack, and @phuedx so we can discuss there how we should go about doing this, and doing it in a pretty controlled manner to hopefully sidestep event loss or excess errors.

I should have put the comment on this here ticket instead of the old closed ticket, so doing so now (thanks again @Ottomata for heads up) - question for you @Krinkle . Any guidance appreciated if you've happened to be in the browser codebases or happened to be viewing their real world behaviors around this lately.

@Krinkle nowadays do the items in T226986#6467370 still apply? We're considering implementing a new URL path on the same domain as the webpage. But if browser implementations have shifted since this writeup @phuedx shared, it would be helpful to understand what we may gain, or at least to measure to see what we gain.

One expected side effect of this might be that more events would start showing up as some things become exempt because they're no longer in blocklists (at least for a while); of course implementations of these blocklists vary from plugin/proxy-to-plugin/proxy.

You got me curious, so I spent a little time digging on this -- as far as I can tell not much has changed in the real world since Daniel Stenberg's blog post.

The only meaningful thing I could find that has changed is that, at some point between 2016 and 2021, Safari implemented h2 connection coalescing similar to other browsers, requiring both the IP address and a TLS SAN to match: https://mailarchive.ietf.org/arch/msg/quic/0nqKySdzmXK6CwZ9dKfD5GsQgqE/ although weirdly I couldn't find anything more authoritative or verbose than this post.

The immaterial changes:

  • The QUIC / http3 spec says to coalesce based solely on TLS SANs, and to not consider IP address, but in practice no browsers implement this in either their h2 or h3 protocol stacks.
  • In the internet standards space, there's a proposal for ORIGIN frames, but the real-world support for them is very very limited on both the server and client sides. Cloudflare has tried to get some attention around them.
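
To make the "both IP address and TLS SAN must match" rule concrete, here is a deliberately simplified sketch of the decision a browser makes before reusing an existing h2 connection for a second hostname. The OpenConnection shape and the function names are hypothetical, and real browsers apply further checks that are left out here; this only models the two conditions described above.

```typescript
// Hypothetical, simplified model of an already-open h2 connection.
interface OpenConnection {
  ip: string;         // remote address the connection was established to
  certSans: string[]; // DNS names from the certificate's subjectAltName
}

// True if `host` is covered by `san`. Supports an exact match or a
// wildcard that stands in for exactly one leftmost label.
function sanCovers(san: string, host: string): boolean {
  const s = san.toLowerCase();
  const h = host.toLowerCase();
  if (s === h) {
    return true;
  }
  if (s.slice(0, 2) !== "*.") {
    return false;
  }
  const dot = h.indexOf(".");
  // "*.example.org" covers "a.example.org", but not "example.org"
  // and not "a.b.example.org".
  return dot > 0 && h.slice(dot) === s.slice(1);
}

// The connection is reused only if the new host resolves to the
// connection's IP AND the certificate covers the new host.
function canCoalesce(
  conn: OpenConnection,
  host: string,
  resolvedIps: string[]
): boolean {
  const ipMatches = resolvedIps.indexOf(conn.ip) !== -1;
  const sanMatches = conn.certSans.some((san) => sanCovers(san, host));
  return ipMatches && sanMatches;
}
```

Under this model, a beacon hostname that is deliberately resolved to a different datacenter fails the IP check even when the certificate covers it, so it would need its own connection.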