
Could the registered module manifest be removed from the client?
Open, Needs Triage, Public

Description

Problem
The size of the startup module should be kept small as described in https://www.mediawiki.org/wiki/Wikimedia_Performance_Team/Page_load_performance#Size_of_scripts

The startup module includes a module manifest that is huge. You can see the manifest by accessing mw.loader.moduleRegistry in your browser's console or in the startup module directly: https://test.wikipedia.org/w/load.php?lang=en&modules=startup&only=scripts&raw=1&skin=vector

Question
I'm wondering why the client needs to be aware of the entire module registry. It seems like it only needs to be aware of the modules it is loading or has loaded. Am I missing something?

I suppose a downside would be that the client may attempt to load modules that do not exist, but that seems like a better trade-off. What would break if we were to do this?

Event Timeline

The reason the client has to have the module registry is because of our caching strategy. It needs the version hashes of each module in order to be able to compute the correct URL when requesting modules.

The way RL's caching strategy works, is that the startup module uses a "cache for a short time, no invalidation" strategy: the cache TTL is 5 minutes, so the client has to send revalidation requests frequently, but delivering updates doesn't require invalidation because just waiting for the cache to expire is fast enough. Every other module uses a "cache for a long time, invalidate on change" strategy: the cache TTL is 30 days, so the client almost never has to send revalidation requests, but explicit invalidation is needed because waiting out the TTL is unacceptably slow. These two strategies work together: the version hashes in the manifest delivered in the startup module are used to invalidate the cache for the other modules.

In more detail: when we request a regular module, we send a request like /w/load.php?modules=foo&version=123abc (where 123abc is a hash of the contents of the module that we get from the manifest), and the response has long-TTL caching headers. If we request the module again later, but its contents haven't changed, we'll request it with the same URL (because the version hash is the same), and we'll hit the browser's HTTP cache. When the module's contents change, the version hash will change, which will be reflected in the manifest the next time we re-request the startup module (so we'll know about it at most 5 minutes later), and we'll request /w/load.php?modules=foo&version=456def instead. The browser's HTTP cache considers that a completely different URL which it hasn't seen before (the version= parameter functions as a cache busting parameter), so it will ignore what it has in cache and request it anew, and get the updated version.
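The URL construction described above can be sketched as follows. This is a simplified illustration, not the actual ResourceLoader client code; the module names and hashes are made up.

```javascript
// Hypothetical registry of name -> version hash, as delivered by the
// startup module's manifest (contents are invented for this sketch).
const registry = {
  foo: { version: '123abc' },
  bar: { version: '456def' }
};

// Build a module request URL; the version= parameter acts as a cache
// buster, so a changed hash produces a URL the browser has never seen.
function moduleUrl(base, name) {
  const version = registry[name].version;
  return `${base}?modules=${encodeURIComponent(name)}&version=${version}`;
}

console.log(moduleUrl('/w/load.php', 'foo'));
// → /w/load.php?modules=foo&version=123abc
```

When the manifest later reports a new hash for `foo`, the same function yields a different URL, so the stale HTTP cache entry is simply bypassed rather than explicitly invalidated.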

This dual strategy is how we deliver module updates to all clients within 5 minutes(*) without having lots of revalidation requests caused by short TTLs.

(*) This analysis only considers the client-side HTTP cache and ignores the server-side HTTP cache in Varnish/ATS. Because it also takes up to 5 minutes for the server-side HTTP cache to see the new manifest, the worst-case time until a client gets a module update can be up to 10 minutes, depending on how the server-side HTTP cache is configured (and I don't remember offhand how it's configured currently).

@Catrope Thanks for that explanation, that is really helpful.

Would it be more efficient for the startup module to have a "global" version string? Effectively, if any of the modules change, a new version string would be generated that all of them could use. That would invalidate the modules more often, but avoid having to send the entire manifest every 5 minutes.

Hmm, that's an interesting idea. I'll mull it over and discuss it with @Krinkle. It's probably worth measuring how often such a global version string would change, compared to any individual module's version (for example, some modules are backed by wiki pages that admins can edit, so those can change more often than just the weekly deployment). We may be able to tackle that by excluding wiki page-based modules from the global version (and versioning them individually instead), or by having a "default version" that applies to most modules except those changed after the initial deployment.

Another purpose we use the manifest for is dependency resolution. We could move that server-side, which could result in some modules being transmitted twice, but that's probably not too bad (because most page views only make one or two load.php requests). At minimum, if we got rid of versions, we could drop from the manifest all modules that don't have dependencies (and aren't depended on).

Another solution could be to deliver the module, but instruct the loader to load the dependencies (if they haven't already been loaded). It would increase the latency on the first load, but decrease the amount of data transferred and increase the cachability.
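The client-side dependency resolution that the manifest currently enables can be sketched like this. A simplified illustration with made-up module names, not the real `mw.loader` resolver: it expands a module's transitive dependencies from the manifest, skipping anything already loaded, so the batch request only asks for what's missing.

```javascript
// Hypothetical manifest fragment: name -> dependency list.
const manifest = {
  'ext.foo': { dependencies: ['mediawiki.util', 'oojs'] },
  'mediawiki.util': { dependencies: ['jquery'] },
  'oojs': { dependencies: [] },
  'jquery': { dependencies: [] }
};

// Depth-first expansion: dependencies are emitted before the modules
// that need them, and already-loaded modules are skipped entirely.
function resolve(name, loaded = new Set(), out = []) {
  if (loaded.has(name) || out.includes(name)) return out;
  for (const dep of manifest[name].dependencies) {
    resolve(dep, loaded, out);
  }
  out.push(name);
  return out;
}

console.log(resolve('ext.foo', new Set(['jquery'])));
// → ['mediawiki.util', 'oojs', 'ext.foo']
```

Moving this resolution server-side, as discussed above, would drop the dependency metadata from the client — at the cost of the server no longer knowing what the client already has.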

In a nutshell:

Why large

The registry is large because it grew into a tree of internal modules for code organization, which have little to do with logical entry points for end-user contexts, or with intentionally split bundles for code sharing or other optimisation purposes. Since last year, in an effort largely led by @Ladsgroup and myself, we've been paying off this debt and reducing the registry as much as possible to just logical entry points, with some splits only where they clearly benefit performance.

Why large, really

Until last year, it had become the de facto standard (which I failed to notice in time) to register every tiny piece of code as its own module bundle. In most cases, I suspect this was the result of one or more of these factors:

  1. It is very easy to register a module bundle with RL. In fact, so easy that it doesn't feel like there is any cost or reason not to. Compare this to e.g. configuring a new Rollup or webpack bundle, which is also easy, but presents itself in a way that makes it feel like more of a big deal, and you have to do a little bit of wiring to load each bundle onto the page.
    • As part of T193826, I've pushed for code to be organized in directories by module bundle on disk (with as many files and subdirectories as you see fit inside of that). This helps with a lot of things day-to-day in terms of discovery, file browsing, and understanding the structure of a repo, and is something we'd benefit from no matter what delivery system we use. But the act of creating that directory, I hope, adds just that tiny little extra mental check to think about it — or maybe that's just in my head :)
    • For large code bases like VisualEditor, the extension.json file is generally the only place where one enumerates external dependencies and interface messages. Having to manage these in a single list for the entire repo doesn't feel right to developers, because it makes it harder to know what is used by what area of the code, e.g. when refactoring. This can be addressed by creating the arrays separately and then merging them for registration. However, there isn't a strong incentive for that today; it's easier to just create more modules. T225842 will make a simple "array merge" supported in extension.json without needing to write any code.
  2. We use the term "module" for what the industry now refers to as a "bundle". This technically doesn't matter, but it can subconsciously contribute to treating modules as cheap. Our naming for this predates the rise of Node.js and bundlers. I've updated most of our documentation to reduce use of the word "module" in favour of "module bundle". With things like T225842 we can take it a step further and long-term also rename some of the internal mechanisms.
  3. I think we sometimes prematurely optimize by creating smaller modules, with the expectation that this will only load what's needed and thus help with performance. In the abstract that may be true, but in practice it often isn't. It applies to large external modules, but for most internal code the gains are too small: cancelled out either by the very overhead the extra modules introduce, or by the opportunity cost that could have gone toward improving the application itself. I'll also note that the number of bytes of JS code loaded is often not the bottleneck in terms of download/compile/execution. Such fragmentation also means there are far fewer opportunities for cache re-use or for avoiding network roundtrips.

Why at all

Roan already covered this, but I'll briefly describe it as well in case that helps complete the picture. It is indeed not about module existence. I don't think we'd generally mind if a bad deploy or other broken code with an unknown module reference were included in the network batch, transferring a few more bytes in the response to indicate that one module doesn't exist. It generally wouldn't even slow anything down, given that we use batch requests. I see the reason for the manifest existing as primarily performance, and avoiding the CCI problem.

See also: mw.org

  • Dependencies: If dependencies were only known on the server, it would presumably send all dependencies to the client, even if the client already had them from another request. This doesn't just apply to a single page view; it also applies across pages. Given that batches vary between pages, we would not be able to use the browser's HTTP cache, as modules would be stored under a different URL as part of an unrelated batch. We currently unpack batch responses, and store and retrieve modules individually directly from the cache, before performing any fetch. Thus on later page views, you often end up fetching nothing at all, or only modules unique to that page.
  • Version: In order to know whether a cache entry is fit for use, we key on "name+version". This justifies having some concept of a version. How granular to make it is then a trade-off between bytes we send in the startup module (sent at most once per browsing session, 304'ed until the next deployment) and bytes we send in updates for module source code that turned out not to really have any update.
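The "unpack the batch, cache per module" behaviour described in the Dependencies point, combined with the "name+version" key from the Version point, could be sketched like this (hypothetical names; not the actual module store implementation):

```javascript
// Simplified per-module store: entries are keyed on name@version so a
// module cached from one batch can be reused by any later batch.
const store = new Map();

// Unpack a batch response and cache each module individually.
function cacheBatch(batch) {
  for (const [name, mod] of Object.entries(batch)) {
    store.set(`${name}@${mod.version}`, mod.source);
  }
}

// A lookup with a stale version hash simply misses, forcing a fresh
// fetch of just that module — no explicit invalidation needed.
function getCached(name, version) {
  return store.get(`${name}@${version}`);
}
```

This is why the version hashes in the manifest matter even for modules the current page doesn't request: they decide whether a previously cached copy is still usable.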

RE: Shared module version

I like the idea of versioning multiple module bundles together. There is some prior art on this in the form of "wildcard modules" (T160690). We also use "combined hashes" already for E-Tag validation, and for the version query parameter.

The question is: Can we reuse versions in a way that will notably reduce the size of the startup module?

Taking enwiki startup, it is currently 21K in transferred bytes (gzipped). Of which ~13K is the manifest and ~7K the RL client.

If I find ,"[0-9a-z]{5}" and replace it with ,1, the end result compresses down to 17K. This means a naive approach could cut at most 4K. (In practice, we'd want to keep separate versions for some things, e.g. gadgets, site scripts, and a few major features.)

Also note that a smaller delta between deployments would not help by itself, as HTTP transfers are sent in full, and the frequency of that transfer is governed by deployments: whenever something is deployed, If-None-Match validation fails and the full manifest is sent again. There might be a cleverer way to organise the manifest to optimize for gzip better.

My current opinion is that the manifest isn't an issue. It's effectively our one and only compromise in exchange for one of the world's best-performing sites at scale, with generally no serial roundtrips anywhere in the JS stack (after the site-wide cacheable startup module, we go straight to fetching all code in a single concurrent batch). As such, I haven't focussed on improving it much beyond reducing the number of entries.

Having said that, I have no attachment to keeping it beyond the benefits described above. If we can cut it down or replace it, and improve performance in doing so, I'm all for that!

See also


Nothing to do for TechCom here at this point, but thanks for the heads up.

I was curious and did more reading on Architecture. I see Krinkle dominates the edit history there, so thank you so much, that page is exemplary documentation.

After reading the above, I was still wondering why the whole manifest has to get shipped on that initial page load. The 5 minute/30 day caching strategy is clever and makes a lot of sense, but why not some customized much smaller manifest including just the hashes for modules relevant to the initial page load?

900+ wikis, 400+ languages, 5 skins, and ~1100 modules. Generating all these variants during deployment could take hours. In addition, the wikis are always in flux, with many modules having the capacity to vary their response based on the content of a publicly editable wiki page.

Is that right? Basically if we could somehow say "within the first 10 seconds of any page lifecycle, these 100 modules are all that's ever needed", then we could make two manifests: X and Y. Then X could include the hashes of those 100 modules and get invalidated every 5 minutes. Y would be loaded when a dependency is not found in X.

If not, I'd love to understand what I missed so maybe I can update the docs. If so, then I know how to figure out what X would need to be, by analyzing load.php requests in real-time, but I'm not sure it would be worth the few KB saved. So I'll file that away as a fun thought.