
Make index.php?action=raw CDN cacheable for both logged-in and logged-out for import script use case
Open, Medium, Public

Description

When scripts load cross-domain from load.php (e.g. enwiki loading a gadget from mediawiki.org and a GlobalCssJs script from Meta-Wiki), everything works great. It is well-cached. It's secure (only public stuff is served). And it emits the proper headers to allow browsers to report errors with full details: the CORS headers we emit from load.php basically tell browsers that we ignored any cookies the user might have and that they do not affect the source code. It is thus okay to tell en.wikipedia.org (or third-party domains) about any strings or other content within the scripts we serve.

But, for index.php?action=raw this is not the case.

When loaded plainly as a <script src=…> across domains, the script is allowed to run but is treated as authenticated cross-domain content, similar to how hotlinked images are treated by default. Such resources are allowed to download and run, but no details about the network transfer or the script content are meant to leak.

Also, for logged-in users, the scripts are never cached by the CDN, as a precaution, since on private wikis these could contain non-public details.

Objective

In the common case of any user (logged-in or not) on public wiki A importing a known publicly-visible revision from public wiki B, as a script, the server should emit Access-Control-Allow-Origin: * in the response so that stack traces are not redacted by the browser.
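The rule above could be sketched roughly as follows. This is a minimal illustration in Python, not MediaWiki's RawAction code: RawRequest, its fields, and cors_headers are hypothetical names, and limiting the header to JavaScript and CSS content types is an assumption drawn from the "as a script" wording.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class RawRequest:
    """Hypothetical summary of an action=raw request (not a MediaWiki class)."""
    wiki_is_public: bool
    revision_is_public: bool
    content_type: str  # e.g. "text/javascript" or "text/css"


def cors_headers(req: RawRequest) -> Dict[str, str]:
    """Return the CORS headers to add to the response, if any."""
    # Assumption: only script-like content is in scope for this task.
    importable = req.content_type in ("text/javascript", "text/css")
    if req.wiki_is_public and req.revision_is_public and importable:
        # The body is identical for every user, so cookies do not affect it
        # and any origin may be allowed to read it.
        return {"Access-Control-Allow-Origin": "*"}
    # Private wiki or non-public revision: emit no CORS header at all.
    return {}
```

Note that whether the user is logged in is deliberately not an input: the point of the objective is that the decision depends only on the content being public.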

It'd be nice if we could also make these cacheable at the CDN for logged-in users, but that's a stretch goal and probably not feasible without VCL changes in the WMF traffic edge, which probably isn't worth it. We do still allow the browser to cache them locally and use If-Modified-Since/304 checks to reduce transfers.
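For reference, the browser-side revalidation mentioned above works roughly like this. The sketch below is generic HTTP conditional-request handling in Python, not the actual OutputPage::checkLastModified logic; conditional_response is a hypothetical name and the Cache-Control value is only an illustrative assumption.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime
from typing import Dict, Optional, Tuple


def conditional_response(
    last_modified: datetime, if_modified_since: Optional[str]
) -> Tuple[int, Dict[str, str]]:
    """Return (status, headers) for a GET, given a timezone-aware revision timestamp."""
    # HTTP dates have one-second resolution and are expressed in GMT.
    last_modified = last_modified.astimezone(timezone.utc).replace(microsecond=0)
    headers = {
        "Last-Modified": format_datetime(last_modified, usegmt=True),
        # Assumed policy: only the browser may store the response, and it
        # must revalidate with the server before each reuse.
        "Cache-Control": "private, max-age=0, must-revalidate",
    }
    if if_modified_since:
        try:
            since = parsedate_to_datetime(if_modified_since)
        except (TypeError, ValueError):
            since = None  # Unparseable header: fall back to a full response.
        if since is not None and since.tzinfo is not None and last_modified <= since:
            # The browser's copy is still current; send no body.
            return 304, headers
    return 200, headers
```

A 304 still costs a round trip to the server, which is why this reduces transfer size but is not a substitute for CDN caching.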

Event Timeline

Anecdotally, it seems we briefly considered removing this endpoint in 2011 (r89398, reverted in r89399). Given its necessity for importing user scripts (at least until Gadgets 3.0), this is probably still not viable. However, we could consider stripping it down significantly.

For example, we could remove:

  • support for maxage and smaxage parameters. (We already cache by default, limit the amount of caching, and purge canonical importScript-made URLs. There is no need for anyone to shorten their own cache, and requests to lengthen it are already ignored.)
  • support for content besides js or css. (Use API for general wikitext/source retrieval.)
  • support for non-public revisions. (Use API for general wikitext/source retrieval.)
  • support for oldid parameter. (Use API for general wikitext/source retrieval.)
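To make the proposal concrete, here is a rough Python sketch of how a stripped-down endpoint might treat its query parameters. It is illustrative only: normalize_raw_query is a hypothetical helper, rejecting with an error (rather than silently ignoring) is an assumption, and the non-public revision check is left out because it cannot be decided from the query string alone.

```python
from typing import Dict

# Content types that would remain supported under the proposal.
ALLOWED_CTYPES = {"text/javascript", "text/css"}
# Parameters that would be accepted for compatibility but have no effect.
IGNORED_PARAMS = {"maxage", "smaxage"}


def normalize_raw_query(params: Dict[str, str]) -> Dict[str, str]:
    """Return the parameters a stripped-down action=raw would honor.

    Raises ValueError for requests that should go to the API instead.
    """
    if "oldid" in params:
        raise ValueError("oldid is not supported; use the API for old revisions")
    if params.get("ctype") not in ALLOWED_CTYPES:
        raise ValueError("only JS and CSS are supported; use the API for wikitext")
    # Drop cache-tuning parameters so they cannot fragment or alter caching.
    return {k: v for k, v in params.items() if k not in IGNORED_PARAMS}
```

Dropping maxage and smaxage before building a cache key would also mean that URLs differing only in those parameters map to the same cached object.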

Change 676646 had a related patch set uploaded (by Krinkle; author: Krinkle):

[mediawiki/core@master] RawAction: Simplify and stricten the $privateCache check

https://gerrit.wikimedia.org/r/676646

Change 676647 had a related patch set uploaded (by Krinkle; author: Krinkle):

[mediawiki/core@master] RawAction: Ignore 'gen' query parameter for Content-Type

https://gerrit.wikimedia.org/r/676647

Change 676648 had a related patch set uploaded (by Krinkle; author: Krinkle):

[mediawiki/core@master] RawAction: Handle 40x errors earlier in the code

https://gerrit.wikimedia.org/r/676648

Change 676649 had a related patch set uploaded (by Krinkle; author: Krinkle):

[mediawiki/core@master] RawAction: Use CDN maxage, remove wgForcedRawSMaxage, ignore maxage/smaxage

https://gerrit.wikimedia.org/r/676649

Random note:

… at WMF, about 30% of action=raw is spent computing EchoSeenTime, because of RawAction re-using OutputPage::checkLastModified, which Echo hooks into to decide whether to miss the browser cache to show you an Echo bell - except those don't appear on things like action=raw or action=render.

It would be good to have usage statistics and go through the API deprecation process for any breaking changes because action=raw effectively is an API. There are some legacy bots (including a few run by yours truly :() that still use it because at the time they were written, it was the most straightforward way to get wikitext.

@Legoktm Sounds good yeah, let's do that. To confirm - this bot uses it to obtain the latest public plain unparsed wikitext, not e.g. for private wikis, deleted revisions, or with preprocessed template expansion? (Those are the features I'd like to deprecate, but not sure it is actually needed for this task specifically.)

Krinkle renamed this task from "Misc changes to index.php?action=raw for perf and error logging" to "Make index.php?action=raw CDN cacheable for both logged-in and logged-out for import script use case". Oct 4 2022, 8:39 PM
Krinkle triaged this task as Medium priority.