Page MenuHomePhabricator

Clean up Cache-Control handling in MediaWiki
Open, Needs TriagePublic

Description

MediaWiki controllers which do not use the normal header handling of OutputPage (call OutputPage::disable(), or do not use OutputPage at all) do their own implementation of "don't cache if the user is authenticated". There are many ways to get that wrong, and several controllers do. This task is about cleaning them up.

AjaxResponse (see T42787: Remove legacy ajax interface) also has code for enabling caching, but it does not seem reachable.

TODO: add other stuff here from T256395#6298796

Event Timeline

Not sure if this is the right task to mention the issue, but I've observed this:

$ curl -v -H "Cookie: centralauth_User=REDACTED; centralauth_Token=REDACTED" -H "Host: it.wikipedia.org" 'https://appservers-ro.discovery.wmnet/w/index.php?title=MediaWiki:Templateslist.js&action=raw&ctype=text/javascript ' 2>&1 | grep Cache-Control
< Cache-Control: private, s-maxage=300, max-age=1209600

Here s-maxage=300 does not make much sense: Cache-Control: private means that the object must not be stored by a shared cache. The s-maxage directive applies to shared caches, so it should be either absent or 0 for objects with CC:private.

I would add that the private / anon-public-user-private / public distinction that api.php has (and which was mentioned as a potential example to follow in T264631 for the REST API) is also not really working as described. One would assume that specifying public as the cache mode will cache the resource on the CDN for logged-in users and anonymous users as well. However, because the Vary header returned for the response is obtained via OutputPage::getVaryHeader, the response will Vary on Cookie if the wiki has any cache-varying cookies (i.e. every wiki that has user sessions), public ends up being almost identical to anon-public-user-private with the added "benefit" of creating one object variant per user on the CDN, and it won't reuse cached responses between users.

In T155314 it was suggested that some kind of mechanism be implemented that would track whether the session user was accessed at all while processing the request. That could be an useful way of determining whether or not the current response depended on user-specific data.

In T155314 it was suggested that some kind of mechanism be implemented that would track whether the session user was accessed at all while processing the request. That could be an useful way of determining whether or not the current response depended on user-specific data.

Vary headers always have to be the same for a given URL (otherwise logged-in users would get content cached for anonymous users). At a minimum you'd have to mark the entire API module as not being allowed to ever use the session (like we do for load.php), and then you could still get into all kinds of trouble because of the modular structure of the action API where e.g. the error formatter is a separate module which might check the user language. For the REST API, marking whole endpoints / URL prefixes as not using the session seems a bit more feasible.

In T155314 it was suggested that some kind of mechanism be implemented that would track whether the session user was accessed at all while processing the request. That could be an useful way of determining whether or not the current response depended on user-specific data.

Vary headers always have to be the same for a given URL (otherwise logged-in users would get content cached for anonymous users). At a minimum you'd have to mark the entire API module as not being allowed to ever use the session (like we do for load.php), and then you could still get into all kinds of trouble because of the modular structure of the action API where e.g. the error formatter is a separate module which might check the user language. For the REST API, marking whole endpoints / URL prefixes as not using the session seems a bit more feasible.

Yeah, to clarify I am not proposing to change the api.php behavior due to the exact caveats you suggested—just to not adopt the same handling that the action API used for the REST handlers because it's not necessarily working as one would expect.

In T155314 it was suggested that some kind of mechanism be implemented that would track whether the session user was accessed at all while processing the request. That could be an useful way of determining whether or not the current response depended on user-specific data.

Vary headers always have to be the same for a given URL (otherwise logged-in users would get content cached for anonymous users). At a minimum you'd have to mark the entire API module as not being allowed to ever use the session (like we do for load.php), and then you could still get into all kinds of trouble because of the modular structure of the action API where e.g. the error formatter is a separate module which might check the user language. For the REST API, marking whole endpoints / URL prefixes as not using the session seems a bit more feasible.

For REST, I think the main hurdle right now is that the user session is eagerly initialized for every web request anyways due to (1) $wgUser autovivification and (2) checks for potential user autocreation. It may be possible to defer (1) by injecting the context rather than the session user into StubGlobalUser, and likewise delay (2) by performing it on the first attempt to obtain the session user rather than Setup.php. This way, we could toggle a flag to "dirty" the request the first time it attempts to obtain the session user, and use that information in the REST API to determine whether we allow the response to be cached publicly. What do you think?

More broadly, we may need to revisit the practice of eagerly setting up the session itself in Setup.php as well, since that itself may set cookies even if nothing attempts to interact with the session user during the lifetime of the request.

More broadly, we may need to revisit the practice of eagerly setting up the session itself in Setup.php as well, since that itself may set cookies even if nothing attempts to interact with the session user during the lifetime of the request.

I had the same conclusion when analysing the related task T285210#9415537, and defering session management would make the warning Cookies set on {url} with Cache-Control "{cache-control}" more useful for developers to track actual code paths where both session is used and cache-control is declared as public.

At the same time, some session-independent API-like requests (API with action=query, index.php with action=raw, Wikibase’s Special:EntityData…) might also be marked as "should not use $wgUser" to extend a bit more the mechanism MW_NO_SESSION, and these would not set Set-Cookie if it don’t use $wgUser (and emit a log if they do).

More broadly, we may need to revisit the practice of eagerly setting up the session itself in Setup.php as well, since that itself may set cookies even if nothing attempts to interact with the session user during the lifetime of the request.

I had the same conclusion when analysing the related task T285210#9415537, and defering session management would make the warning Cookies set on {url} with Cache-Control "{cache-control}" more useful for developers to track actual code paths where both session is used and cache-control is declared as public.

At the same time, some session-independent API-like requests (API with action=query, index.php with action=raw, Wikibase’s Special:EntityData…) might also be marked as "should not use $wgUser" to extend a bit more the mechanism MW_NO_SESSION, and these would not set Set-Cookie if it don’t use $wgUser (and emit a log if they do).

Yeah, and now that $wgUser is a StubObject, that might be a good place to perform this.

That would mean the session is not available during hooks that run between the beginning of Setup.php and wherever session initialization would happen. Checking for registered or temp users would return the wrong result, messages would be rendered in the wrong language etc. Currently the only hook that runs before session initialization is MediaWikiServices and there the services are not set up yet so it's very unlikely that something session-dependent would be attempted.

The opposite approach would be to have some sort of URL routing step very early in Setup.php or the entry point which would determine whether this request needs a session. I think that's how a more modern web application would work: determining which controller handles the request is one of the first steps. MediaWiki doesn't really have controllers but it could be approximated for REST URLs.

That would mean the session is not available during hooks that run between the beginning of Setup.php and wherever session initialization would happen. Checking for registered or temp users would return the wrong result, messages would be rendered in the wrong language etc. Currently the only hook that runs before session initialization is MediaWikiServices and there the services are not set up yet so it's very unlikely that something session-dependent would be attempted.

The opposite approach would be to have some sort of URL routing step very early in Setup.php or the entry point which would determine whether this request needs a session. I think that's how a more modern web application would work: determining which controller handles the request is one of the first steps. MediaWiki doesn't really have controllers but it could be approximated for REST URLs.

Could deferring session initialization until something calls RequestContext::getUser / unstubs $wgUser be a good compromise? That seems to me it should cover the case where an early hook handler attempts to do something session dependent.

Having something in place for REST that would allow handlers to specify whether they depend on a session SGTM irregardless—but we should consider the case where a handler unknowingly ends up depending on the session because of some deep dependency.

Could deferring session initialization until something calls RequestContext::getUser / unstubs $wgUser be a good compromise?

Making session initialization happen at a predictable point in time was one of the design goals of the current session system, and IMO it resulted in much easier to reason about behavior, so it would be nice to keep that.

we should consider the case where a handler unknowingly ends up depending on the session because of some deep dependency.

The point of MW_NO_SESSION is to make sure such cases break hard and force the developer to fix the dependency, instead of silently leaking user state. If you depend on the session, and you emit cache headers, you need to vary by Cookie. If you don't depend on the session, not varying probably results in better user experience. So it's kind of important for the handler to know whether it needs the session, and be accurate about it.

Making session initialization happen at a predictable point in time was one of the design goals of the current session system, and IMO it resulted in much easier to reason about behavior, so it would be nice to keep that.

Makes sense, in that case deferring this only in the rest.php case should be okay (which should still result in the session being setup in a single place)

we should consider the case where a handler unknowingly ends up depending on the session because of some deep dependency.

The point of MW_NO_SESSION is to make sure such cases break hard and force the developer to fix the dependency, instead of silently leaking user state.

That sounds good to me :)