Page MenuHomePhabricator

Versioned load.php url for long caching gets short-cached instead
Closed, ResolvedPublic

Description

Identified via https://developers.google.com/speed/pagespeed/insights/?url=https%3A%2F%2Fen.wikipedia.org%2Fwiki%2FMain_Page&tab=mobile, I was surprised to find a versioned load.php url among the list of urls with no expiry or a short expiry.

Given our current platform, I expect to only find thumbnails (no expiry), skin stylesheet (5 minutes) and the startup module (5 minutes) in that list, but instead also found:

  • https://en.m.wikipedia.org/…er.defaults&skin=minerva&version=15glgu9 (5 minutes)

For that test, the full url was:

https://en.m.wikipedia.org/w/load.php?debug=false&lang=en&modules=...&skin=minerva&version=15glgu9

Which currently responds with:

cache-control: public, max-age=300, s-maxage=300
etag: W/"1e9lqn3"

age: 292
backend-timing: D=95385 t=1519316833509087
content-encoding: gzip
content-length: 99653
content-type: text/javascript; charset=utf-8
date: Fri, 23 Feb 2018 08:27:10 GMT
expires: Fri, 23 Feb 2018 08:27:18 GMT
server: mw1242.eqiad.wmnet

This is expected behaviour when a url is constructed with a version parameter that doesn't match the current version by the server. The server only has the current versions of static assets and will respond with those as such, but we don't want to cache them under the "wrong" url for too long.

This was implemented to address T117587 (6fa1e56), which is the problem where during a deployment, a user can get module meta data from server A (with new code) and then construct its versioned url client-side and fetch (this time to server B; with old code still), at which point we don't want to popular the Varnish edge cache with the "old" content under the "new" url version.

However, I don't understand why the client is requesting the wrong version in this case. It doesn't appear to have changed recently.

Additionally, I did verify to see if this may be a bug with any of the individual modules (e.g. oscillating version, or unmatchable/non-deterministic version hash), but that wasn't the case. If I request the same url with the etag version in the query parameter, then I get the correct response headers.

https://en.m.wikipedia.org/w/load.php?debug=false&lang=en&modules=...&skin=minerva&version=1e9lqn3

cache-control: public, max-age=2592000, s-maxage=2592000

Event Timeline

Krinkle created this task.Feb 23 2018, 8:37 AM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptFeb 23 2018, 8:37 AM
Krinkle triaged this task as High priority.Feb 23 2018, 8:37 AM
Imarlier assigned this task to Krinkle.Feb 26 2018, 9:15 PM
Krinkle added a comment.EditedFeb 28 2018, 12:01 AM

Whatever it is, it not is a generic issue with all mobile load.php requests. It only happens for 1 module load.php request with lots of modules in the batch.

For example:

  • The request for jquery + mediawiki (url), has a matching E-Tag and correct long-term expiry.
  • request for mobile.toc.images, randomly saw it when logged-in (url), has matching E-Tag and long expiry.
  • request for ext.gadget.switcher (url) is also fine.
  • request for modules=user (with user query), is also fine.
  • request for mobile.toc%7Cmobile.toc.images%7Cskins.minerva.tablet.scripts is also fine.
  • request for mobile.loggingSchemas.mobileWebMainMenuClickTracking%7Cschema.MobileWebMainMenuClickTracking is also fine.

The only one failing for my random page view is for: https://en.m.wikipedia.org/w/load.php?debug=false&lang=en&modules=ext.centralNotice...ext.eventLogging..skins.minerva...&skin=minerva&version=0tpfjz9.

Full module lines:

ext.centralNotice.bannerHistoryLogger
ext.centralNotice.choiceData
ext.centralNotice.display
ext.centralNotice.geoIP
ext.centralNotice.impressionDiet
ext.centralNotice.kvStore
ext.centralNotice.kvStoreMaintenance
ext.centralNotice.largeBannerLimit
ext.centralNotice.legacySupport
ext.centralNotice.startUp
ext.centralauth.centralautologin.clearcookie
ext.eventLogging
ext.navigationTiming
ext.wikimediaEvents
ext.eventLogging.Schema
ext.eventLogging.subscriber
ext.relatedArticles.readMore.bootstrap
ext.relatedArticles.readMore.eventLogging
ext.relatedArticles.readMore.gateway
ext.visualEditor.targetLoader
ext.visualEditor.track
ext.visualEditor.ve
ext.wikimediaEvents.loggedin
jquery.accessKeyLabel
jquery.byteLength
jquery.client
jquery.cookie
jquery.textSelection
jquery.throttle-debounce
mediawiki.RegExp
mediawiki.Title
mediawiki.Uri
mediawiki.api
mediawiki.cldr
mediawiki.cookie
mediawiki.experiments
mediawiki.jqueryMsg
mediawiki.language
mediawiki.notify
mediawiki.router
mediawiki.storage
mediawiki.template
mediawiki.user
mediawiki.util
mediawiki.viewport
mediawiki.api.user
mediawiki.language.data
mediawiki.language.init
mediawiki.libs.pluralruleparser
mediawiki.page.startup
mediawiki.template.hogan
mediawiki.template.regexp
mediawiki.ui.anchor
mediawiki.ui.input
mobile.ajax.styles
mobile.editor.api
mobile.init
mobile.issues
mobile.references
mobile.search
mobile.site
mobile.startup
mobile.toggle
mobile.watchstar
mobile.issues.images
mobile.loggingSchemas.mobileWebSearch
mobile.pagelist.scripts
mobile.pagelist.styles
mobile.pagesummary.styles
mobile.references.gateway
mobile.references.images
mobile.search.api
mobile.search.images
mobile.search.util
mobile.startup.images
mobile.startup.images.variants
oojs
oojs-router
schema.MobileWebSearch
skins.minerva.editor
skins.minerva.mainMenu
skins.minerva.notifications
skins.minerva.scripts
skins.minerva.talk
skins.minerva.toggling
skins.minerva.watchstar
skins.minerva.icons.images.scripts
skins.minerva.icons.images.variants
skins.minerva.mainMenu.icons
skins.minerva.mainMenu.styles
skins.minerva.notifications.badge
skins.minerva.scripts.top
user.defaults
Krinkle added a comment.EditedFeb 28 2018, 7:46 PM

When I hacked together some inline scripts to make the client-side JavaScript on en.m.wikipedia.org call mw.loader#getCombinedVersion() with the above array of module names, it actually ends up producing the correct version string. The one that the server expects, instead of the one we see the client sending.

This means somehow the array of module names, or the order of module names is different between the point where we normally call mw.loader#getCombinedVersion() on the client, and the point where we call doRequest() on the client with the compressed module query parameter.

I can actually reproduce this bug locally in my mediawiki-docker-dev environment with regular desktop requests, so it's probably just bad luck that whatever bug we're facing happens to only happen on mobile. In my local environment, just like on mobile in prod, most versioned requests are fine, but the "main" one with lots of modules has a shortened expiry and a version query that doesn't match the E-Tag.

Here's a comparison between what the client-side has as module list, while executing mw.loader#getCombinedVersion, and what the server-side has as module list while executing ResourceLoader::getCombinedVersion.

  [console.log][mw.loader.work] module list for request:  (24)
  > ["ext.eventLogging.subscriber", "ext.navigationTiming", "jquery.accessKeyLabel", "jquery.checkboxShiftClick", "jquery.client", "jquery.cookie", "jquery.getAttrs", "jquery.highlightText", "jquery.mw-jump", "jquery.suggestions", "jquery.tabIndex", "jquery.throttle-debounce", "mediawiki.RegExp", "mediawiki.api",
- > "mediawiki.api.user", "mediawiki.notify", "mediawiki.page.ready", "mediawiki.page.startup", "mediawiki.searchSuggest", "mediawiki.storage", "mediawiki.user",
  > "mediawiki.util", "skins.vector.js", "user.defaults"]


  [PHP] ResourceLoader::getCombinedVersion: (24)
  > ["ext.eventLogging.subscriber", "ext.navigationTiming", "jquery.accessKeyLabel", "jquery.checkboxShiftClick", "jquery.client", "jquery.cookie", "jquery.getAttrs", "jquery.highlightText", "jquery.mw-jump", "jquery.suggestions", "jquery.tabIndex", "jquery.throttle-debounce", "mediawiki.RegExp", "mediawiki.api",
-  > "mediawiki.notify", "mediawiki.searchSuggest", "mediawiki.storage", "mediawiki.user", "mediawiki.util", "mediawiki.api.user", "mediawiki.page.ready", "mediawiki.page.startup",
  > "skins.vector.js", "user.defaults"]

The lines marked with a red minus sign, are where the difference is. I suspect this might be the result of either sort() not behaving the same way between JS and PHP (quite probably in general, but unlikely to be relevant here given our simple ASCII strings), or somehow our expandModuleNames()/makePackedModulesString() implementations change the order of modules, which they're not supposed to.

I did find out that ResourceLoader::makePackedModulesString() in PHP does not preserve the original ordering. When it iterates through, it will append to any group, not just the current group. However, while that might be unexpected, it does lead to better compression and (I think) it shouldn't be a problem because:

  • We only sort before compression on the client, to improve consistency of urls and likelihood of Varnish caching. There is no expectation that sorts between browsers or between a browser and PHP is the same. We don't re-sort on the server (I think?), we only expand the input string and use it as-is. So difference in sorting shouldn't matter.
  • Assuming the not-in-order behaviour is only happening in compression, and not in decompression, why would we end up with a different array on the server after decompression?

The array marked [PHP] in the comment above is out of order. mediawiki.api.* and mediawiki.page.* appear too late in the array (they should appear right after their parents, as they do in the [console.log] one, but they instead appear after mediawiki.util). Either this is using a very strange sort function, or maybe the PHP code doesn't re-sort after dependency resolution is done, or maybe the decompression function on the server doesn't preserve order? I'm leaning towards the latter because the only module names affected are the ones that would use comma syntax in the compressed format.

Krinkle added a comment.EditedMar 1 2018, 12:14 AM

The array marked [PHP] [..] or maybe the decompression function on the server doesn't preserve order? I'm leaning towards the latter because the only module names affected are the ones that would use comma syntax in the compressed format.

(Roan and I figured it out in-person, documenting here.)

The decompression function does preserve order and is fine.

We do sort the array client-side before compression, and (for what its worth) I did verify the sorting for these strings in particular would happen the same way in PHP as in JS. Although that doesn't matter in this case because there is no sorting involved when the server-side decompresses the module list and verifies the version hash. The server does not, and should not, apply any sorting here.

The contract between the client and server is as follows:

  • The client can request one or more modules in the same request. It ends this request with a version query parameter, which is a hash of all the requested modules' hashes, as combined in the same order as the list of modules in the request.
  • The server optionally decompresses the list, then combines its locally known hashes in the same order, hashes the result and verifies whether they match.

In order to increase the likelihood of a cache-hit in the browser and in Varnish, we do sort the list of modules client-side, but otherwise this is insignificant. None of the relevant methods do any sorting themselves, and are supposed to work with any order given.

In my previous comment, I pointed out that the only method involved here that effectively changes order is makePackedModulesString (and its JavaScript equivalent, inlined in mw.loader#batchRequest). The methods basically work as follows (simplified for previty, not complete):

pseudo code (like php/js)
function makePackedModulesString( string[] modules ) {
  groups = [];
  foreach ( modules as module ) {
    prefix = modules[ i ].substr( 0, lastDotIndex );
    suffix = modules[ i ].slice( lastDotIndex + 1 );
    groups[ prefix ][] =  suffix;
  }
  pieces = [];
  foreach ( groups as prefix => suffixes ) {
    prieces[] = prefix . '.' . suffixes.join( ',' );
  }
  return pieces.join( '|' );
}

For the most part, this preserved order. Except that when it uses the current prefix to consider a group to extend, it considers all groups, not just the group seen from the last iteration. Which means a list like a.a, a.b, z.a, a.c will end up packed as a.a,b,c|z.a. Now, granted, that edge case should be eliminated by the coincidence that we sorted the list ahead of time.

However even when sorted, this can still go wrong because sorting applies to the complete module name. So when the (sorted) list is:

mw.a
mw.a.a
mw.a.b
mw.b

This will be compressed as mw.a,b | mw.a.a,b, which expands to:

mw.a
mw.b
mw.a.a
mw.a.b

So the client is violating the earlier specified contract because its version hash is based on modules in an order sorted by full name, whereas the actual module list effectively ends up in an order where the groups are sorted, with their "children" sorted within. Thus leaving the overall list not in the original order.

As mentioned before, it doesn't matter whether either the modules or version parameter uses a sorted list. The only thing that matters is that they use the same order. Which right now they don't because the order of operations is:

  1. get module list, and sort it (as cache optimisation).
  2. build combined version has by combining hashes in order of the module list.
  3. build module string in compressed form, which ends up puts later modules like mw.b ahead of mw.a.a.

Unfortunately, I have to conclude this can't be a recent regression. The order discrepancy has been in ResourceLoader since at least the initial release in 2011. In terms of the order being used for anything, this was introduced about 2-3 years (2015) when we migrated ResourceLoader from using version timestamps to version hashing (f37cee996ee9). But even then it didn't matter as the contract with the server at that point was still only that the version query should differ from any other url with the same module query parameter when it needs to bypass the cache. Its value wasn't used server-side until about a 1-2 years ago (2016) when we added validation of the version parameter in order to fix T117587.

We just haven't noticed before as it just so happens to as often before now, I guess?

I've actually been able to reproduce this consistently on desktop as well. So the impact is potentially quite big.

Fortunately, once we fix this, we should see a significant improvement at Grafana dashboard: ResourceLoader, specifically I'd expect a big shift in "200 OK" responses from short expiry to long expiry, which would be nice :)

Change 415494 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[mediawiki/core@master] resourceloader: Clean up and better document module list (un)packing

https://gerrit.wikimedia.org/r/415494

Change 415497 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[mediawiki/core@master] resourceloader: Make load.mock.php standalone (no includes)

https://gerrit.wikimedia.org/r/415497

Change 415513 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[mediawiki/core@master] [WIP] resourceloader: Failing test for T188076

https://gerrit.wikimedia.org/r/415513

Change 415516 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[mediawiki/core@master] [WIP] resourceloader: Fix T188076

https://gerrit.wikimedia.org/r/415516

Change 415650 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[mediawiki/core@master] resourceloader: Fix mw.loader to compute version for current request only

https://gerrit.wikimedia.org/r/415650

Change 415516 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[mediawiki/core@master] [WIP] resourceloader: Fix client to combine versions in packed order

https://gerrit.wikimedia.org/r/415516

Agabi10 added a subscriber: Agabi10.Mar 1 2018, 9:50 PM

Change 415513 abandoned by Krinkle:
resourceloader: Add failing tests for T188076

Reason:
Split and squashed into I00d6985c and I211523d47816.

https://gerrit.wikimedia.org/r/415513

ori awarded a token.Mar 8 2018, 12:10 AM
ori added a subscriber: ori.

This is really impressive, Krinkle. Kudos.

Change 415494 merged by jenkins-bot:
[mediawiki/core@master] resourceloader: Clean up and better document module list (un)packing

https://gerrit.wikimedia.org/r/415494

Change 415497 merged by jenkins-bot:
[mediawiki/core@master] resourceloader: Simplify load.mock.php by removing includes dependency

https://gerrit.wikimedia.org/r/415497

Change 415650 merged by jenkins-bot:
[mediawiki/core@master] resourceloader: Fix mw.loader to compute version for current request only

https://gerrit.wikimedia.org/r/415650

Change 415516 merged by jenkins-bot:
[mediawiki/core@master] resourceloader: Fix mw.loader to compute combined version in packed order

https://gerrit.wikimedia.org/r/415516