
Should we host Wikipedia Preview?
Closed, Declined · Public

Description

Current situation

Wikipedia Preview is currently distributed as an npm package (and consequently hosted on unpkg.com). We encourage partners to download it using npm install and self-host it on their own websites.

Issues

While downloading and serving the script is a relatively simple task, it adds friction for partners who want to try it out without being fully committed yet. It also makes upgrading a regular activity for partners, and since they will do it at their own discretion, we will likely have several different versions in use at any given time.

Solution

Host Wikipedia Preview (script, images) from a central location and encourage all partners to load it from there. This would allow us to keep it up to date and control the versions being used.

Option 1: Host on WMF infrastructure

Example URL: https://www.wikipedia.org/static/wikipedia-preview/wikipedia-preview.js

Pros:

  • Fast & reliable infrastructure
  • High level of privacy for readers of partners' websites

Cons:

  • Aggressive caching in the static directory may not work well for a script we update frequently
  • Deployment through a config patch in a swat window is a bit more work than we would like

Notes:

  • There are probably other places better suited for this than the static directory, so the cons above may not hold.

Option 2: Host on Github Pages

Example URL: https://wikimedia.github.io/wikipedia-preview/wikipedia-preview.js

Pros:

  • Fast and easy automated deployment from CI. Can run for every change on main, on version changes, on tags being added, etc.
  • Self-contained in the repo. Keeping things closer together means easier onboarding and maintenance.

Cons:

  • Probably fast & reliable but outside of our control
  • May not be up to our privacy standards for readers

Event Timeline

> May not be up to our privacy standards for readers

Wondering about this line, is that because of getting script from mediawiki domain?

> Wondering about this line, is that because of getting script from mediawiki domain?

We do not want to recommend serving content from someone else's servers where the user is, or can be, tracked. So either they need to serve it themselves from their own servers or get it from our domain.

@Krinkle are you aware of an appropriate place in our infrastructure to host this script? Somewhere similar to the static directory but with less aggressive, or more configurable, caching?

@SBisson I think if we are going to host it, then the static directory would be an appropriate place for it. The caching should not be an issue so long as file names (or their URLs) are properly versioned, e.g. as foo_hash.png or foo.png?hash, which is something one should do regardless for numerous reasons. I've not encountered cases where this is not easy or feasible, but I'm happy to help if it seems like it wouldn't work well here.
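For illustration, a minimal sketch of what content-hash versioning could look like in a Node build step. The dist/ path and naming scheme here are assumptions, not our actual build setup:

```
// Hypothetical build step: derive a short content hash and publish the file
// under a versioned name, so aggressive caching in /static/ stays safe because
// every release gets a new URL (e.g. wikipedia-preview.3f9a1c2b.js).
const crypto = require('crypto');
const fs = require('fs');
const path = require('path');

const src = path.join('dist', 'wikipedia-preview.js'); // assumed build output
const contents = fs.readFileSync(src);

const hash = crypto
  .createHash('sha256')
  .update(contents)
  .digest('hex')
  .slice(0, 8);

const versioned = path.join('dist', `wikipedia-preview.${hash}.js`);
fs.copyFileSync(src, versioned);
console.log(`Publish as: ${versioned}`);
```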

Note that if committing to wmf-config is a barrier in terms of risk or dev ux, then one could also set this up as a microsite, or as something we semi-automatically publish to https://releases.wikimedia.org/. Both of these are similarly well-cached and served through our CDN edges.

As for whether to host at all, I would lean toward not doing it. I can't think of any significant benefits it would bring, and there are tons of negatives to consider in the mid and long term, such as: performance (overhead), security risk (all or nothing), stability (our script could make the vendor site slower; multiplied failure modes), and availability (will we ever remove it? how many broken sites are tolerable?).

I also think it is no longer the industry practice that it once was. You rarely see JavaScript libraries anymore that companies adopt by direct inclusion of a script from a separate URL hosted elsewhere. Of course there are millions of sites out there that do this, and demos and examples sometimes still do this so that the demo doesn't look "wrong" to a developer who might otherwise dislike the particular framework or bundler being used in it. By linking it as a script tag, it's more neutral. But in terms of "real" usage in production, even for simple sites and startups driven by junior engineers, self-hosting is the standard and current best practice in our industry as I see it.

Performance?

Thinking back about why the old trend emerged in the first place, I believe this was largely driven by web performance folks. The idea being that using a CDN had two expected benefits:

  1. In the world of HTTP 1.1, using a CDN meant a separate TCP connection, which could actually help speed up the download of the script. This was due to browsers limiting the number of connections they would open for a single hostname (e.g. 6 at most). This incentive also led to "domain sharding" such as img(\d+).example.org, where you'd spread requests to the same underlying asset group over different URL domains to bypass this limit, so you could request more than 6 assets at a time. This wasn't without drawbacks, because extra connections mean more delays from TLS/HTTPS handshakes being set up for each one, and slower downloads due to each connection having to ramp up its bandwidth congestion window from scratch ("TCP slow start"). But this was deemed worth it, because the alternative was to wait until other requests finished and one of the limited 6 connections became available for your asset.
  2. In the world before Spectre/Meltdown, and before anti-privacy fingerprinting tactics were better understood, it was possible for different sites to request the same third-party URL and thus share the browser's HTTP cache for that URL. The idea being that if two unrelated websites both download jquery-3.4.1.js from https://code.jquery.com, the user will benefit as it is likely to be in their cache already.

Both of these benefits no longer exist. Those same web performance folks now, pretty much universally, recommend strongly against that same use of CDNs.

  1. In HTTP/2 and HTTP/3 the tables have turned. We now have multiplexing on a single connection, which means we no longer need separate connections to request multiple assets concurrently. And we also don't want separate connections even if we could have them, because those would be slower (separate TCP/TLS handshakes first, and "TCP slow start"). For the same reason, self-hosting is always faster than waiting in the critical path for another DNS lookup, a TLS handshake, and then the TCP slow start. There are also significant gains to be had, of course, by bundling, minifying and compressing unrelated scripts together rather than serving them as separate files. Hence it is also increasingly less common to serve JS libraries as merely a self-hosted file; instead you include them in e.g. your "common_vendor.js" build artefact or in some fragment of that (see the sketch after this list).
  2. The use of HTTP cache poisoning tricks to fingerprint people has led browsers to partition their HTTP cache by caller domain (countermeasures popularised by Apple in WebKit, blog 1, blog 2). This means that even if two unrelated sites both use Wikipedia Preview, and the same version of it, they would not share client caches. This, together with Spectre/Meltdown, has driven Google and Mozilla to do the same since then.
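To make the contrast concrete, here is a minimal sketch of the self-hosted, bundled route we already recommend, assuming the partner has a bundler and has run npm install wikipedia-preview. The exact init() options shown are illustrative, not a statement of the full API:

```
// Pulled into the partner's own vendor bundle by their bundler (webpack,
// Rollup, etc.), so it is served from their own origin over the same HTTP/2
// connection as the rest of their assets -- no extra DNS/TLS round trips.
import wikipediaPreview from 'wikipedia-preview';

// Illustrative options only; see the package documentation for the real ones.
wikipediaPreview.init({ lang: 'en' });
```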

Anyway, so all that to say: It's not an industry best practice anymore, and it actually degrades performance and site stability to do so. (Every extra domain to fetch scripts from is another connection that can fail or time out for random reasons unrelated to us, and thus make the vendor's own site slow or unavailable for brief periods of time; involving more domains multiplies those odds.)

Simplicity?

I do recognise that being able to "just" include a script can make it easier to build a quick demo, but we should probably provide our own documentation and demo pages if we aren't already. Plus, this "ease of getting started" assumes that the vendor in question has no bundler, no web application in place that has opinions or requirements around using one, and no technical or social security policy in place that would prevent the injection of third-party executable scripts on their site. Remember that there is a big difference between using a third-party API to display information on your site, and trusting a third-party domain to send you, live at runtime, an executable script to run within your own trust domain. The latter means their cookies, credentials, and everything the user does and has done can leak, as well as referrer activity about their users' reading possibly being sent to us. Plenty of companies don't care about that and even pay extra to leak more (ad networks are funny like that), but there are also plenty of orgs and traditional media that wouldn't allow it. And even those that would, most likely have a technical restriction in place that prevents loading code like that without approval somewhere first.
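As a hedged illustration of that last point: many partners will already send a Content-Security-Policy that only allows scripts from their own origin, in which case a remotely hosted include is simply refused by the browser. The server and header value below are hypothetical, just to show the mechanism:

```
// Hypothetical partner server using Node's built-in http module.
const http = require('http');

http.createServer((req, res) => {
  // Only scripts from the partner's own origin may execute; a <script src>
  // pointing at wikipedia.org or wikimedia.github.io would be blocked.
  res.setHeader('Content-Security-Policy', "script-src 'self'");
  res.setHeader('Content-Type', 'text/html');
  res.end(
    '<script src="https://wikimedia.github.io/wikipedia-preview/wikipedia-preview.js"></script>'
  );
}).listen(8080);
```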

Anecdotally, in my experience as lead of QUnit, I've found that for some younger developers the use of "simple" script tags in documentation actually led to confusion and became a barrier to entry! Their expectation was to add it to a bundler. Their framework didn't even allow for the easy adding of HTML tags to pages – it's all local JavaScript imports these days. It's certainly a tragedy that our industry has made things so complicated, but that's a story for another time.

Anyway, all that to say: I would guess that downloading the JS is a relatively small part of what it takes for a vendor to get started. If we have experiences that say otherwise, we should look at that, but my guess is that it wouldn't help us much if we did this.

Thanks @Krinkle for the write-up. We've found that 90%+ of adoption comes from the WordPress plugin, so we'll let the component be used as a local file for those doing their own technical integration of it.