Page MenuHomePhabricator

Investigate possible sticky module cache poisoning bug
Closed, ResolvedPublic

Description

Try to reproduce the circumstances of T308798 or T317429 and test on different browsers to see where in the stack the bug exists. E.g. two temporary modules in WikimediaEvents, deploy, make a change, observe the contents of the other module in a non-fatal way and log result via statsd.

Event Timeline

Krinkle triaged this task as High priority.

Will do an initial investigation soon to reproduce the issue and try to pinpoint where in the stack the cause is (i.e. something we can fix, another team can fix, or upstream-vendor issue we may need to report upstream and/or work around).

In case it's helpful, using the filterable error_context.version field it seems we've got errors in last 7 days for extremely old deployment branches e.g. 1.40.0-wmf.10
https://logstash.wikimedia.org/goto/62d8724bf628210ffb0f741d80e7feed

In case it's helpful, using the filterable error_context.version field it seems we've got errors in last 7 days for extremely old deployment branches e.g. 1.40.0-wmf.10
https://logstash.wikimedia.org/goto/62d8724bf628210ffb0f741d80e7feed

As I understand it, this task is about mixed versioning, where a particular module is served from a (to be identified) cache layer where old code got permanently stuck or renewed despite other changes coming through. The assumption being that during an update, the old code somehow got stored under a new cache key (aka "cache poisoning").

I expect that in the examples Bartosz described, the errors do not have an old mwversion value in the client-error beacons. The mwversion value comes from mw.config which is set by the startup and mediawiki.base modules. However, the bug we see isn't about old startup data or old mw.config data. It is about (one) regular core or extenson module containing old code while other modules on the page are newer.

The mw-client-error beacons linked above, I assume, are from e.g. browser tabs that have been open since then, or from mirror sites, or from the resumed page view after restoring a browser session. And specifically, those errors would have been happening back then for other people as well, and may still be happening today on newer versions. Probably unrelated.

@Jdlrobson Do you have a specific example of an entry in mw-client-error that carries an old mwversion label that did happened neither when that version was current nor today? I.e. where it appears to be happening due to mixed version or other cache poisioning?

Change 915977 had a related patch set uploaded (by Krinkle; author: Krinkle):

[mediawiki/core@master] ResourceLoader: Log when MAXAGE_RECOVER is detected

https://gerrit.wikimedia.org/r/915977

I did some emperical testing on mediawiki.org (MediaWiki:Gadget-T321394-main.js), and at least anecdotally at the scale of mediawiki.org, changes consistently proparate through our caches within 5min.

https://grafana-rw.wikimedia.org/explore?left=%7B%22datasource%22:%22000000001%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22datasource%22:%7B%22type%22:%22graphite%22,%22uid%22:%22000000001%22%7D,%22target%22:%22T321394_testing.%2A.%2A.sum%22%7D%5D,%22range%22:%7B%22from%22:%221683247858163%22,%22to%22:%221683253506630%22%7D%7D&orgId=1

(In the hour before this query's date range is a handful of mismatches, caused by me previewing an edit. We preview edits to JS by executing them.)

Next step, I've added some instrumentation to the ResourceLoader server to check whether (our own) caching may be playing a rule.

Change 915977 had a related patch set uploaded (by Krinkle; author: Krinkle):

[mediawiki/core@master] ResourceLoader: Log when MAXAGE_RECOVER is detected

https://gerrit.wikimedia.org/r/915977

Change 915977 merged by jenkins-bot:

[mediawiki/core@master] ResourceLoader: Log when MAXAGE_RECOVER is detected

https://gerrit.wikimedia.org/r/915977

Change 915720 had a related patch set uploaded (by Krinkle; author: Krinkle):

[mediawiki/core@wmf/1.41.0-wmf.7] ResourceLoader: Log when MAXAGE_RECOVER is detected

https://gerrit.wikimedia.org/r/915720

Change 919208 had a related patch set uploaded (by Krinkle; author: Krinkle):

[mediawiki/core@master] ResourceLoader: Fix "out of sync" message to ignore $errors case

https://gerrit.wikimedia.org/r/919208

Change 919208 merged by jenkins-bot:

[mediawiki/core@master] ResourceLoader: Fix "out of sync" message to ignore $errors case

https://gerrit.wikimedia.org/r/919208

There's a good amount of data coming in. About 500K entries per day (after Varnish cache).

The majority reflect the existence of a minor bug in how legacy PNG image rasterizations are served. It seems the logic we use to produce the version= URL for those images differs from the validation logic, which means we are serving them with a shortened cache TTL all the time, even after the deployment window is completed. This doesn't cause any visible problems beyond inefficient caching, but should be fixed of course.

Filtering those out for the moment, leaves about 100K per day under:

type:mediawiki AND channel:resourceloader AND url:version AND NOT (url:image AND url:format)

I'll investigate those further this week to see whether I can reproduce them and what they might have in common.

There's a good amount of data coming in. About 500K entries per day (after Varnish cache).

The majority reflect the existence of a minor bug in how legacy PNG image rasterizations are served. […]

Filtering those out for the moment, leaves about 100K per day under:

type:mediawiki AND channel:resourceloader AND url:version AND NOT (url:image AND url:format)

The majority of those have third-party referrers set, i.e. mirrors like Wikiwand.com which seems to be serving cached content longer than our caching headers promise freshness for. Thus it falls back to getting current version, which is good thing, but also outside our control either way.

Filtering out that as well, we're left with ~0K per day (46 entries to be exact, from test2wiki).

type:mediawiki AND channel:resourceloader AND url:version AND NOT (url:image AND url:format) AND referrer:wikipedia.org

These entries seem genuine in that they show requests for user scripts around the time an edit was made, where the cross-wiki import of that is briefly out of sync while cache propagation takes place. I don't see any prolonged peaks where things don't self-correct as they should.

I'll leave this open for a few more days to see how it behaves around train deploys. If nothing comes up I'll close it after that, and create a new task for the PNG image version bug.

The majority reflect the existence of a minor bug in how legacy PNG image rasterizations are served. It seems the logic we use to produce the version= URL for those images differs from the validation logic, […]

I can't reproduce this locally. I used http://mw.localhost:8080/w/load.php?modules=oojs-ui.styles.icons-moderation&skin=vector&lang=en as example, and patched ImageModule.php as follows to generate the rasterized URLs in an easily accessible way:

diff --git a/includes/ResourceLoader/ImageModule.php b/includes/ResourceLoader/ImageModule.php
index c90b9757ce..f51a890f7d 100644
--- a/includes/ResourceLoader/ImageModule.php
+++ b/includes/ResourceLoader/ImageModule.php
@@ -378,7 +378,8 @@ class ImageModule extends Module {
 		$imageDataUri = $this->useDataURI ? $image->getDataUri( $context, $variant, 'original' ) : false;
 		$primaryUrl = $imageDataUri ?: $image->getUrl( $context, $script, $variant, 'original' );
 		$declarations = $this->getCssDeclarations(
-			$primaryUrl
+			$primaryUrl,
+			$image->getUrl( $context, $script, $variant, 'rasterized' )
 		);
 		return implode( "\n\t", $declarations );
 	}
@@ -389,9 +390,10 @@ class ImageModule extends Module {
 	 * @param string $primary Primary URI
 	 * @return string[] CSS declarations
 	 */
-	protected function getCssDeclarations( $primary ): array {
+	protected function getCssDeclarations( $primary, $fallback ): array {
 		$primaryUrl = CSSMin::buildUrlValue( $primary );
 		return [
+			"krinkle: $fallback;",
 			"background-image: $primaryUrl;",
 		];
 	}

Example output:

.mw-ui-icon-bookmarkOutline:before{krinkle:/w/load.php?modules=oojs-ui.styles.icons-moderation&image=bookmarkOutline&format=rasterized&skin=vector&version=1vew8

Where http://mw.localhost:8080/w/load.php?modules=oojs-ui.styles.icons-moderation&image=bookmarkOutline&format=rasterized&skin=vector&version=1vew8 responds with:

Cache-Control: public, max-age=2592000, s-maxage=2592000, stale-while-revalidate=60
Content-Type: image/png
ETag: W/"1vew8"

Where version= and ETag match exactly.

Change 937561 had a related patch set uploaded (by Krinkle; author: Krinkle):

[mediawiki/core@master] ResourceLoader: Remove unused rasterized URL in ImageModule::getCssDeclarations

https://gerrit.wikimedia.org/r/937561

Change 937561 merged by jenkins-bot:

[mediawiki/core@master] ResourceLoader: Remove unused rasterized URL in ImageModule::getCssDeclarations

https://gerrit.wikimedia.org/r/937561

The majority reflect the existence of a minor bug in how legacy PNG image rasterizations are served. It seems the logic we use to produce the version= URL for those images differs from the validation logic, […]

I can't reproduce this locally.
[…]
version= and ETag match exactly.

Okay... found one now:

https://fa.m.wikipedia.org/w/load.php?lang=fa&modules=ext.cite.styles%7Cext.relatedArticles.styles%7Cext.wikimediaBadges%7Cmediawiki.hlist%7Cmediawiki.ui.button%2Cicon%7Cmobile.init.styles%7Cskins.minerva.base.styles%7Cskins.minerva.content.styles.images%7Cskins.minerva.icons.wikimedia%7Cskins.minerva.mainMenu.icons%2Cstyles%7Cwikibase.client.init&only=styles&skin=minerva

As of writing, this contains:

.mw-ui-icon-minerva-settings:before{
 background-image:url(/w/load.php?modules=skins.minerva.mainMenu.icons&image=settings&format=original&skin=minerva&version=p3sbw)
}
.mw-ui-icon-minerva-mapPin:before{
 background-image:url(/w/load.php?modules=skins.minerva.mainMenu.icons&image=mapPin&format=original&skin=minerva&version=p3sbw)
}

Whereas https://fa.m.wikipedia.org/w/load.php?modules=skins.minerva.mainMenu.icons&image=mapPin&format=original&skin=minerva&version=p3sbw responds with a mismatching version:

cache-control: public, max-age=60, s-maxage=60
etag: W/"ji8ek"

The majority reflect the existence of a minor bug in how legacy PNG image rasterizations are served. It seems the logic we use to produce the version= URL for those images differs from the validation logic, […]

Okay... found one now:

https://fa.m.wikipedia.org/w/load.php?lang=fa&modules=ext.cite.styles%7Cext.relatedArticles.styles%7Cext.wikimediaBadges%7Cmediawiki.hlist%7Cmediawiki.ui.button%2Cicon%7Cmobile.init.styles%7Cskins.minerva.base.styles%7Cskins.minerva.content.styles.images%7Cskins.minerva.icons.wikimedia%7Cskins.minerva.mainMenu.icons%2Cstyles%7Cwikibase.client.init&only=styles&skin=minerva

As of writing, this contains:

.mw-ui-icon-minerva-settings:before{
 background-image:url(/w/load.php?modules=skins.minerva.mainMenu.icons&image=settings&format=original&skin=minerva&version=p3sbw)
}
.mw-ui-icon-minerva-mapPin:before{
 background-image:url(/w/load.php?modules=skins.minerva.mainMenu.icons&image=mapPin&format=original&skin=minerva&version=p3sbw)
}

Whereas https://fa.m.wikipedia.org/w/load.php?modules=skins.minerva.mainMenu.icons&image=mapPin&format=original&skin=minerva&version=p3sbw responds with a mismatching version:

cache-control: public, max-age=60, s-maxage=60
etag: W/"ji8ek"

Narrowing it down a bit to a single module:

https://fa.m.wikipedia.org/w/load.php?lang=fa&modules=skins.minerva.mainMenu.icons&only=styles&skin=minerva
https://fa.wikipedia.org/w/load.php?lang=fa&modules=skins.minerva.mainMenu.icons&only=styles&skin=minerva
https://en.wikipedia.org/w/load.php?lang=fa&modules=skins.minerva.mainMenu.icons&only=styles&skin=minerva

These all respond the same, predicting the version as p3sbw (version query), and with the actual image responding with ji8ek (etag).

.mw-ui-icon-minerva-mapPin:before{
 background-image:url(/w/load.php?modules=skins.minerva.mainMenu.icons&image=mapPin&format=original&skin=minerva&version=p3sbw)
}

For localisation, ResourceLoader supports icons varying by direction or lang code (RL\Image::varyOnLanguage). However, the above icon lacks a lang= parameter, indicating that it was defined without such variance.

Interestingly, though, if we request the lang=en variant of the Minerva stylesheet, we get a different version query:

https://en.wikipedia.org/w/load.php?lang=en&modules=skins.minerva.mainMenu.icons&only=styles&skin=minerva

mw-ui-icon-minerva-mapPin:before{
 background-image:url(/w/load.php?modules=skins.minerva.mainMenu.icons&image=mapPin&format=original&skin=minerva&version=ji8ek)
}

This different, and has the benefit of being "correct" (ji8ek, as per the ETag). Meaning, this variant permit CDN caching for the full 30 days, rather than only a few minutes on the expectation that error correction will happen.

So to summarize, the issue is:

  1. The module/image version hash is computed based on all of the current URL's parameters
  2. The load.php URLs for modules always contain the 'lang' parameter
  3. The load.php URLs for images usually don't contain it (if the image doesn't vary by language)
  4. Therefore, when viewing a module and computing the hash for an image, we get a different value than when viewing the image itself

The relevant code is here: https://gerrit.wikimedia.org/g/mediawiki/core/+/8dbe4c04ac19728bd728347098fe319f5bd846e3/includes/ResourceLoader/Image.php#243

Which gives us two ways to fix this:

  1. Rework the logic for computing versions of images, to ignore the 'lang' parameter if it has no effect on the image, even when included in the URL
  2. Just always include 'lang' in the URL, even if it has no effect on the image

The second seems like the simpler way. It would basically mean reverting https://gerrit.wikimedia.org/r/c/mediawiki/core/+/519384. I suggest that we do that. (cc @Fomafix as the author of that change, in case you have a better idea)


Also, just to clarify: @Krinkle This issue that you've found is not actually related to the issues in T308798 or T317429 that prompted this task, right?

Change 951122 had a related patch set uploaded (by Bartosz Dziewoński; author: Bartosz Dziewoński):

[mediawiki/core@master] resourceloader: Always include 'lang' in ImageModule image URLs

https://gerrit.wikimedia.org/r/951122

Change 951122 merged by jenkins-bot:

[mediawiki/core@master] resourceloader: Always include 'lang' in ImageModule image URLs

https://gerrit.wikimedia.org/r/951122

This had a significant effect, but there are still some items being logged. I didn't investigate them.

https://logstash.wikimedia.org/goto/0b1e84c8cf5967c555b3221f5669f30d

image.png (535×2 px, 128 KB)

Change 949611 had a related patch set uploaded (by Krinkle; author: Krinkle):

[mediawiki/core@master] ResourceLoader: Reduce MAXAGE_RECOVER log severity

https://gerrit.wikimedia.org/r/949611

Quoting the above commit:

We found no evidence of any cache poisoning. The bulk of the warnings from this line of instrumentation ended up due to logic bugs in RL\ImageModule that results in a consistently different version hash, which didn't cause any corruption or poisoning. It merely caused a cache miss more often than it theoretically needed, and has since been fixed on the task in another commit.

What remains is simply version mismatches from outdated URLs that RL automatically replaces with the correct versions, and we already correctly shorten TTL to avoid any poisoning.

Change 949611 merged by jenkins-bot:

[mediawiki/core@master] ResourceLoader: Reduce MAXAGE_RECOVER log severity

https://gerrit.wikimedia.org/r/949611