Clean up Cache-Control handling in MediaWiki
Open, Needs TriagePublic
Actions

Assigned To

None

Authored By

	Tgr
	Jul 10 2020, 7:55 PM

Description

MediaWiki controllers which do not use the normal header handling of OutputPage (call OutputPage::disable(), or do not use OutputPage at all) do their own implementation of "don't cache if the user is authenticated". There are many ways to get that wrong, and several controllers do. This task is about cleaning them up.

provide a helper method (along the lines of OutputPage::sendCacheControl(), presumably) that any controller can call instead of trying to come up with their own logic
MediaWiki-extensions-CentralNotice Special:BannerLoader should not set Cache-Control: public when there's a session
MediaWiki-extensions-CentralNotice Special:HideBanners should not set Cache-Control: public for anonymous sessions
MediaWiki-extensions-WikibaseRepository Special:EntityData should not set Cache-Control: public when there's a session
action=raw should not set Cache-Control: public for anonymous sessions. (While we are there also clean up the harmless but nonsensical private + smaxage>0 combination.)
PageHistoryCountHandler should not set Cache-Control: public when there's a session
- rest.php should probably have some kind of generic facility for this

AjaxResponse (see T42787: Remove legacy ajax interface) also has code for enabling caching, but it does not seem reachable.

TODO: add other stuff here from T256395#6298796

Related Objects
Search...

Status	Subtype	Assigned	Task
Open		None	T257703 Clean up Cache-Control handling in MediaWiki
Open		None	T264631 Core REST API marks responses as publicly cacheable while issuing set-cookie
Open	Spike	None	T256447 Special:HideBanners is not really cacheable

Event Timeline

Tgr created this task.Jul 10 2020, 7:55 PM

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptJul 10 2020, 7:55 PM

Tgr updated the task description. (Show Details)Jul 10 2020, 8:11 PM

Tgr updated the task description. (Show Details)Jul 11 2020, 12:11 AM

Tgr updated the task description. (Show Details)Jul 11 2020, 12:57 AM

Tgr updated the task description. (Show Details)Jul 11 2020, 8:59 PM

Not sure if this is the right task to mention the issue, but I've observed this:

$ curl -v -H "Cookie: centralauth_User=REDACTED; centralauth_Token=REDACTED" -H "Host: it.wikipedia.org" 'https://appservers-ro.discovery.wmnet/w/index.php?title=MediaWiki:Templateslist.js&action=raw&ctype=text/javascript ' 2>&1 | grep Cache-Control
< Cache-Control: private, s-maxage=300, max-age=1209600

Here s-maxage=300 does not make much sense: Cache-Control: private means that the object must not be stored by a shared cache. The s-maxage directive applies to shared caches, so it should be either absent or 0 for objects with CC:private.

Tgr updated the task description. (Show Details)Jul 13 2020, 1:53 PM

sbassett added a project: FR-CentralNotice-Caching.Jul 13 2020, 9:12 PM

Tgr added a subtask: T264631: Core REST API marks responses as publicly cacheable while issuing set-cookie.Oct 6 2020, 8:26 PM

Tgr added a subtask: T256447: Special:HideBanners is not really cacheable.Oct 6 2020, 8:28 PM

Krinkle mentioned this in T279205: Harden and improve HTTP cache headers and purging in MediaWiki core (Sprint placeholder) .Apr 2 2021, 9:25 PM

Krinkle mentioned this in T285210: High frequency warning logged in production: Cookies set on {url} with Cache-Control "{cache-control}".Jun 21 2021, 6:23 PM

Krinkle added a project: Sustainability (Incident Followup).Jun 6 2022, 12:18 PM

I would add that the private / anon-public-user-private / public distinction that api.php has (and which was mentioned as a potential example to follow in T264631 for the REST API) is also not really working as described. One would assume that specifying public as the cache mode will cache the resource on the CDN for logged-in users and anonymous users as well. However, because the Vary header returned for the response is obtained via OutputPage::getVaryHeader, the response will Vary on Cookie if the wiki has any cache-varying cookies (i.e. every wiki that has user sessions), public ends up being almost identical to anon-public-user-private with the added "benefit" of creating one object variant per user on the CDN, and it won't reuse cached responses between users.

In T155314 it was suggested that some kind of mechanism be implemented that would track whether the session user was accessed at all while processing the request. That could be an useful way of determining whether or not the current response depended on user-specific data.

In T257703#8400305, @TK-999 wrote:

In T155314 it was suggested that some kind of mechanism be implemented that would track whether the session user was accessed at all while processing the request. That could be an useful way of determining whether or not the current response depended on user-specific data.

Vary headers always have to be the same for a given URL (otherwise logged-in users would get content cached for anonymous users). At a minimum you'd have to mark the entire API module as not being allowed to ever use the session (like we do for load.php), and then you could still get into all kinds of trouble because of the modular structure of the action API where e.g. the error formatter is a separate module which might check the user language. For the REST API, marking whole endpoints / URL prefixes as not using the session seems a bit more feasible.

In T257703#8401202, @Tgr wrote:

In T257703#8400305, @TK-999 wrote:

In T155314 it was suggested that some kind of mechanism be implemented that would track whether the session user was accessed at all while processing the request. That could be an useful way of determining whether or not the current response depended on user-specific data.

Vary headers always have to be the same for a given URL (otherwise logged-in users would get content cached for anonymous users). At a minimum you'd have to mark the entire API module as not being allowed to ever use the session (like we do for load.php), and then you could still get into all kinds of trouble because of the modular structure of the action API where e.g. the error formatter is a separate module which might check the user language. For the REST API, marking whole endpoints / URL prefixes as not using the session seems a bit more feasible.

Yeah, to clarify I am not proposing to change the api.php behavior due to the exact caveats you suggested—just to not adopt the same handling that the action API used for the REST handlers because it's not necessarily working as one would expect.

In T257703#8401202, @Tgr wrote:

In T257703#8400305, @TK-999 wrote:

In T155314 it was suggested that some kind of mechanism be implemented that would track whether the session user was accessed at all while processing the request. That could be an useful way of determining whether or not the current response depended on user-specific data.

Vary headers always have to be the same for a given URL (otherwise logged-in users would get content cached for anonymous users). At a minimum you'd have to mark the entire API module as not being allowed to ever use the session (like we do for load.php), and then you could still get into all kinds of trouble because of the modular structure of the action API where e.g. the error formatter is a separate module which might check the user language. For the REST API, marking whole endpoints / URL prefixes as not using the session seems a bit more feasible.

For REST, I think the main hurdle right now is that the user session is eagerly initialized for every web request anyways due to (1) $wgUser autovivification and (2) checks for potential user autocreation. It may be possible to defer (1) by injecting the context rather than the session user into StubGlobalUser, and likewise delay (2) by performing it on the first attempt to obtain the session user rather than Setup.php. This way, we could toggle a flag to "dirty" the request the first time it attempts to obtain the session user, and use that information in the REST API to determine whether we allow the response to be cached publicly. What do you think?

More broadly, we may need to revisit the practice of eagerly setting up the session itself in Setup.php as well, since that itself may set cookies even if nothing attempts to interact with the session user during the lifetime of the request.

Zabe subscribed.Jan 21 2023, 2:56 AM

In T257703#8424310, @TK-999 wrote:

More broadly, we may need to revisit the practice of eagerly setting up the session itself in Setup.php as well, since that itself may set cookies even if nothing attempts to interact with the session user during the lifetime of the request.

I had the same conclusion when analysing the related task T285210#9415537, and defering session management would make the warning Cookies set on {url} with Cache-Control "{cache-control}" more useful for developers to track actual code paths where both session is used and cache-control is declared as public.

At the same time, some session-independent API-like requests (API with action=query, index.php with action=raw, Wikibase’s Special:EntityData…) might also be marked as "should not use $wgUser" to extend a bit more the mechanism MW_NO_SESSION, and these would not set Set-Cookie if it don’t use $wgUser (and emit a log if they do).

In T257703#9416780, @Seb35 wrote:

In T257703#8424310, @TK-999 wrote:

More broadly, we may need to revisit the practice of eagerly setting up the session itself in Setup.php as well, since that itself may set cookies even if nothing attempts to interact with the session user during the lifetime of the request.

I had the same conclusion when analysing the related task T285210#9415537, and defering session management would make the warning Cookies set on {url} with Cache-Control "{cache-control}" more useful for developers to track actual code paths where both session is used and cache-control is declared as public.

At the same time, some session-independent API-like requests (API with action=query, index.php with action=raw, Wikibase’s Special:EntityData…) might also be marked as "should not use $wgUser" to extend a bit more the mechanism MW_NO_SESSION, and these would not set Set-Cookie if it don’t use $wgUser (and emit a log if they do).

Yeah, and now that $wgUser is a StubObject, that might be a good place to perform this.

That would mean the session is not available during hooks that run between the beginning of Setup.php and wherever session initialization would happen. Checking for registered or temp users would return the wrong result, messages would be rendered in the wrong language etc. Currently the only hook that runs before session initialization is MediaWikiServices and there the services are not set up yet so it's very unlikely that something session-dependent would be attempted.

The opposite approach would be to have some sort of URL routing step very early in Setup.php or the entry point which would determine whether this request needs a session. I think that's how a more modern web application would work: determining which controller handles the request is one of the first steps. MediaWiki doesn't really have controllers but it could be approximated for REST URLs.

In T257703#9417249, @Tgr wrote:

That would mean the session is not available during hooks that run between the beginning of Setup.php and wherever session initialization would happen. Checking for registered or temp users would return the wrong result, messages would be rendered in the wrong language etc. Currently the only hook that runs before session initialization is MediaWikiServices and there the services are not set up yet so it's very unlikely that something session-dependent would be attempted.

The opposite approach would be to have some sort of URL routing step very early in Setup.php or the entry point which would determine whether this request needs a session. I think that's how a more modern web application would work: determining which controller handles the request is one of the first steps. MediaWiki doesn't really have controllers but it could be approximated for REST URLs.

Could deferring session initialization until something calls RequestContext::getUser / unstubs $wgUser be a good compromise? That seems to me it should cover the case where an early hook handler attempts to do something session dependent.

Having something in place for REST that would allow handlers to specify whether they depend on a session SGTM irregardless—but we should consider the case where a handler unknowingly ends up depending on the session because of some deep dependency.

In T257703#9417698, @TK-999 wrote:

Could deferring session initialization until something calls RequestContext::getUser / unstubs $wgUser be a good compromise?

Making session initialization happen at a predictable point in time was one of the design goals of the current session system, and IMO it resulted in much easier to reason about behavior, so it would be nice to keep that.

In T257703#9417699, @TK-999 wrote:

we should consider the case where a handler unknowingly ends up depending on the session because of some deep dependency.

The point of MW_NO_SESSION is to make sure such cases break hard and force the developer to fix the dependency, instead of silently leaking user state. If you depend on the session, and you emit cache headers, you need to vary by Cookie. If you don't depend on the session, not varying probably results in better user experience. So it's kind of important for the handler to know whether it needs the session, and be accurate about it.

In T257703#9417729, @Tgr wrote:

Making session initialization happen at a predictable point in time was one of the design goals of the current session system, and IMO it resulted in much easier to reason about behavior, so it would be nice to keep that.

Makes sense, in that case deferring this only in the rest.php case should be okay (which should still result in the session being setup in a single place)

In T257703#9417699, @TK-999 wrote:

we should consider the case where a handler unknowingly ends up depending on the session because of some deep dependency.

The point of MW_NO_SESSION is to make sure such cases break hard and force the developer to fix the dependency, instead of silently leaking user state.

That sounds good to me :)

Clean up Cache-Control handling in MediaWikiOpen, Needs TriagePublicActions

Description

Related ObjectsSearch...

Event Timeline

Clean up Cache-Control handling in MediaWiki
Open, Needs TriagePublic
Actions

Related Objects
Search...