Sort out HTTP caching issues for fixcopyright wiki
Closed, Resolved, Public

Description

Noted incidentally at the bottom of T203095#4544592 was this:

There are some comments about accept-language header via ULS. Keep in mind this is disabled on Wikimedia Wikis (via $wgULSAnonCanChangeLanguage) and generally requires specialized varnish config to work. Have you checked with Traffic that this language workflow is workable?

Since we've never turned on $wgULSAnonCanChangeLanguage here before, I don't actually know what's involved in supporting it. From the brief text above, I roughly assume wikis that turn this on will probably emit a Vary: Accept-Language header for affected URLs, and that we'll probably need to normalize clients' Accept-Language according to rules similar to whatever ULS uses (e.g. pick first supported entry) for performance reasons. Someone will have to dig into this a bit and figure it out, and apparently there's not much time left to do so!

Event Timeline

BBlack triaged this task as High priority. Aug 30 2018, 4:28 PM
BBlack created this task.

There are two similar but different things, and I want to clarify first exactly which one we plan on enabling. $wgULSAnonCanChangeLanguage allows anonymous users to set their preferred interface language in a cookie and then have that cookie read for future requests. $wgULSLanguageDetection reads Accept-Language headers and uses that for the interface language.

I think what we really want is $wgULSLanguageDetection, because the workflow of using the site is a single page (as far as I can tell), so I don't see much value in letting them save their preferred language. But automatically picking the language based on the Accept-Language header seems useful. As far as I can tell, ULS outputs no header when relying upon Accept-Language; it sounds like you're saying that it should be outputting Vary: Accept-Language?

As for the implementation,

  • WebRequest::getAcceptLang() parses the header, and returns a sorted array of lowercase lang code -> priority
  • UniversalLanguageSelectorHooks::getDefaultLanguage() makes two passes:
    • Iterate in priority order, if the language code matches one in MediaWiki, use that language code
    • Iterate a second time in priority order, for codes like de-xx, use the first part (de), and see if that matches one in MediaWiki, then use that language code
  • Fallback to default content language if none were found.

In total it's about 50 lines of PHP. If we had to reimplement it in varnish, we'd need to hardcode the list of languages that MediaWiki supports (I guess OK for a one-off thing).
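The two-pass lookup described above can be sketched in a few lines. This is a minimal Python sketch and not the ULS code: the names parse_accept_language and resolve_language, the supported set, and the default are all hypothetical, and q-value handling is simplified (malformed weights are skipped, ties keep header order).

```python
def parse_accept_language(header):
    """Return {lowercased language code: q-value}, highest priority first."""
    weights = {}
    for item in header.split(","):
        code, _, param = item.strip().partition(";")
        code = code.strip().lower()
        param = param.strip()
        q = 1.0  # entries without an explicit weight count as q=1
        if param.startswith("q="):
            try:
                q = float(param[2:])
            except ValueError:
                continue  # assumption: skip entries with a malformed weight
        if code and code not in weights:
            weights[code] = q
    # sorted() is stable, so equal weights keep their order in the header
    return dict(sorted(weights.items(), key=lambda kv: -kv[1]))


def resolve_language(header, supported, default="en"):
    """Pick a language using the two passes described in the thread."""
    prefs = parse_accept_language(header)
    # Pass 1: exact match, in priority order
    for code in prefs:
        if code in supported:
            return code
    # Pass 2: for codes like de-xx, try the first part (de)
    for code in prefs:
        base = code.split("-", 1)[0]
        if base in supported:
            return base
    # Fallback: default content language
    return default


if __name__ == "__main__":
    supported = {"en", "de", "fr"}  # hypothetical list of supported codes
    print(resolve_language("de-AT", supported))            # de
    print(resolve_language("de-AT, fr;q=0.8", supported))  # fr
```

In this sketch the exact-match pass runs over every entry before any prefix is tried, so "de-AT, fr;q=0.8" resolves to fr rather than de.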

I did notice that because this setup is using special page transclusion, it will only be cached in varnish and parser cache for 1 hour (AIUI).

CC'd some people who are also familiar with this stuff and can correct me if I'm wrong.

I'm not familiar with VCL, but parsing the accept-language header (correctly) is more complex than most headers.

Are people going to be directed to this page via a CentralNotice banner? If so, we already know the language and could just add uselang to the URL in the banner and skip all this. Given the tight time constraints, I'd be in favour of that if it meets the requirements.

As far as I can tell, ULS outputs no header when relying upon Accept-Language; it sounds like you're saying that it should be outputting Vary: Accept-Language?

Right. If MW varies its output text based on the UA's Accept-Language, but emits no Vary: Accept-Language header, the default outcome (with any cache) will be that caching will make a mess of the language outputs. The first user to visit /foo might get it in German, and then everyone else will get the cached German copy regardless of their Accept-Language header until the object expires, because the application layer isn't consulted for a cache hit. The Vary: Accept-Language header is the standards-based way to deal with this sort of problem, and instructs the cache that it should store separate cache objects for every unique Accept-Language value seen from users, only sharing cache hits between separate UAs when the UAs' AL values match. Adding that should probably be addressed in the ULS code for the general case.
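The failure mode and the effect of Vary can be illustrated with a toy cache. This is only a sketch under stated assumptions: TinyCache and make_backend are hypothetical names, the backend stands in for MediaWiki, and a real cache such as Varnish handles expiry, normalization, and much more.

```python
class TinyCache:
    """Toy HTTP cache that keys objects on the URL plus any Vary headers."""

    def __init__(self, backend):
        self.backend = backend  # callable(url, headers) -> (body, vary names)
        self.vary = {}          # url -> header names the backend varies on
        self.store = {}         # (url, variant values) -> cached body

    def _key(self, url, headers, names):
        return (url, tuple(headers.get(name, "") for name in names))

    def get(self, url, headers):
        names = self.vary.get(url)
        if names is not None:
            key = self._key(url, headers, names)
            if key in self.store:
                # Cache hit: the application layer is not consulted
                return self.store[key]
        body, names = self.backend(url, headers)
        names = tuple(names)
        self.vary[url] = names
        self.store[self._key(url, headers, names)] = body
        return body


def make_backend(vary_names):
    """Hypothetical application that renders in the requested language."""
    def backend(url, headers):
        lang = headers.get("Accept-Language", "en")
        return f"{url} rendered in {lang}", vary_names
    return backend


if __name__ == "__main__":
    broken = TinyCache(make_backend(()))  # no Vary emitted
    broken.get("/foo", {"Accept-Language": "de"})
    print(broken.get("/foo", {"Accept-Language": "fr"}))  # /foo rendered in de

    fixed = TinyCache(make_backend(("Accept-Language",)))
    fixed.get("/foo", {"Accept-Language": "de"})
    print(fixed.get("/foo", {"Accept-Language": "fr"}))   # /foo rendered in fr
```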

In the meantime, if it's more expedient we can also hack that in manually in Varnish, effectively faking the Vary: Accept-Language output for some or all pages on this domain fixcopyright.wikimedia.org with a fairly trivial change to our VCL. With just this change, things will work correctly, even if they're not as efficient as they could be. That takes us into this next bit:

As for the implementation,

  • WebRequest::getAcceptLang() parses the header, and returns a sorted array of lowercase lang code -> priority
  • UniversalLanguageSelectorHooks::getDefaultLanguage() makes two passes:
    • Iterate in priority order, if the language code matches one in MediaWiki, use that language code
    • Iterate a second time in priority order, for codes like de-xx, use the first part (de), and see if that matches one in MediaWiki, then use that language code
  • Fallback to default content language if none were found.

In total it's about 50 lines of PHP. If we had to reimplement it in varnish, we'd need to hardcode the list of languages that MediaWiki supports (I guess OK for a one-off thing).

When it comes to the performance of the Vary: AL scheme, this part comes into play. UAs are not very consistent in what they send in the AL string.

So for example, we might have 5 different classes of UA which give us these distinct AL headers:

Accept-Language: en
Accept-Language: EN
Accept-Language: en, de;q=0.8, fr, fr-CH
Accept-Language: Zh-cn, en, fr, de
Accept-Language: en-US,en;q=0.5

But the PHP code in ULS ends up resolving all of these, for a particular page with particular available CLs, as the English output. With just Vary: A-L the English output will be cached in 5 distinct cache slots separately, one for each unique possible AL string from UAs that ends up mapped to the English output. This doesn't hurt correctness, but it does hamper hitrates and performance. If ULS rules are relatively simple, we can put a chunk of A-L normalization code in Varnish's VCL that matches ULS and replaces all these variants with a consistent Accept-Language: en which matches what ULS would've chosen, for both Vary-slotting and the ULS code to see (at which point its own parsing becomes somewhat redundant in our setup).
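To make the hit-rate argument concrete, the following sketch counts cache slots for the five example headers with and without normalization. It is an illustration only, written in Python rather than VCL: normalize_accept_language and the SUPPORTED set are hypothetical, and the set deliberately omits zh so that the fourth header falls through to English as in the example above.

```python
SUPPORTED = {"en", "de", "fr"}  # hypothetical; a real list would be far longer


def normalize_accept_language(header, supported=SUPPORTED, default="en"):
    """Collapse an Accept-Language header to the single code it resolves to."""
    entries = []
    for position, item in enumerate(header.split(",")):
        code, _, param = item.strip().partition(";")
        code = code.strip().lower()
        param = param.strip()
        try:
            q = float(param[2:]) if param.startswith("q=") else 1.0
        except ValueError:
            continue  # assumption: ignore entries with a malformed weight
        if code:
            entries.append((-q, position, code))
    # Highest weight first; ties keep their position in the header
    ordered = [code for _, _, code in sorted(entries)]
    for code in ordered:                      # exact matches first
        if code in supported:
            return code
    for code in ordered:                      # then the part before the hyphen
        base = code.split("-", 1)[0]
        if base in supported:
            return base
    return default


headers = [
    "en",
    "EN",
    "en, de;q=0.8, fr, fr-CH",
    "Zh-cn, en, fr, de",
    "en-US,en;q=0.5",
]

raw_slots = set(headers)
normalized_slots = {normalize_accept_language(h) for h in headers}

if __name__ == "__main__":
    print(len(raw_slots), "slots without normalization")      # 5
    print(len(normalized_slots), "slot after normalization")  # 1
```

With only Vary: Accept-Language, each raw string is its own cache key; after normalization all five requests share one object.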

I did notice that because this setup is using special page transclusion, it will only be cached in varnish and parser cache for 1 hour (AIUI).

We can live with 1h, but it's not ideal. The hotter this is expected to be, the more we'd like the values to be ~24h.

Change 456650 had a related patch set uploaded (by Legoktm; owner: Legoktm):
[mediawiki/extensions/UniversalLanguageSelector@master] Vary caching on Accept-Language header if it is used

https://gerrit.wikimedia.org/r/456650

Change 456656 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] cache_text: inject Vary:AL for fixcopyrightwiki

https://gerrit.wikimedia.org/r/456656

^ I'm going to merge this up shortly. It's pretty un-dangerous to other traffic and it ensures the ULS Accept-Language stuff won't have any functional/correctness problems. After that there shouldn't be any functional blocker for getting the new wiki up here, but we may want to later follow up on other general issues for supporting ULS better.

Change 456656 merged by BBlack:
[operations/puppet@production] cache_text: inject Vary:AL for fixcopyrightwiki

https://gerrit.wikimedia.org/r/456656

Change 458071 had a related patch set uploaded (by Legoktm; owner: Legoktm):
[operations/mediawiki-config@master] Enable $wgULSLanguageDetection for fixcopyrightwiki

https://gerrit.wikimedia.org/r/458071

[15:43:27] <legoktm> bblack: hmm, I'm seeing weird stuff re: A-L & fixcopyrightwiki
[15:43:33] <legoktm> km@km-pt ~> curl -I "https://fixcopyright.wikimedia.org/wiki/Main_Page" | grep vary
[15:43:33] <legoktm> vary: Accept-Encoding,Cookie,Accept-Language,Accept-Language,Accept-Language
[15:43:40] <legoktm> is it supposed to be there 3 times?
[15:43:54] <legoktm> note that MW isn't even outputting vary: A-L yet, so I guess that would add a fourth

Change 458077 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] fixcopyright: avoid duplicate Vary:A-L

https://gerrit.wikimedia.org/r/458077

Change 458077 merged by BBlack:
[operations/puppet@production] fixcopyright: avoid duplicate Vary:A-L

https://gerrit.wikimedia.org/r/458077

Should be fixed now, pending caches clearing out old results. I don't think it actually harms anything in the meantime.
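The idea behind avoiding the duplicates is to add the token only when it is not already present. The sketch below shows that logic in Python; it is not the VCL from the patch above, and add_vary is a hypothetical helper name. Header field names are compared case-insensitively.

```python
def add_vary(existing, name):
    """Return a Vary value that contains `name` exactly once."""
    tokens = [t.strip() for t in existing.split(",") if t.strip()]
    seen = set()
    result = []
    for token in tokens + [name]:
        key = token.lower()  # header field names are case-insensitive
        if key not in seen:
            seen.add(key)
            result.append(token)
    return ",".join(result)


if __name__ == "__main__":
    observed = "Accept-Encoding,Cookie,Accept-Language,Accept-Language,Accept-Language"
    print(add_vary(observed, "Accept-Language"))
    # Accept-Encoding,Cookie,Accept-Language
```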

Change 456650 merged by jenkins-bot:
[mediawiki/extensions/UniversalLanguageSelector@master] Vary caching on Accept-Language header if it is used

https://gerrit.wikimedia.org/r/456650

Change 458071 merged by jenkins-bot:
[operations/mediawiki-config@master] Enable $wgULSLanguageDetection for fixcopyrightwiki

https://gerrit.wikimedia.org/r/458071

I didn't read the ULS code properly: it looks like we need $wgULSAnonCanChangeLanguage to be enabled... but I don't think we want to be setting 'language' cookies for people, right? This will need a bit of refactoring in ULS to work.

Great :) And the special page TTL was fixed in https://gerrit.wikimedia.org/r/c/mediawiki/extensions/EUCopyrightCampaign/+/457097 to be 24h.

I mentioned this on IRC, but I misunderstood what that does; my mistake. Since this page loads extremely quickly, the adaptive parser cache logic comes into play, and it ends up being parser cached for only 15 seconds (probably fine; I doubt the parser cache helps much here in any case, as the PHP part is dead simple).

Varnish seems to be caching it fine (Assuming you don't have cookies, which I imagine pretty much nobody would as not SUL). However, if you use the language selector you end up at https://fixcopyright.wikimedia.org/wiki/Fix_copyright?title=Fix_copyright&uselang=fr which seems to have an X-Cache-Status of pass, so I guess that is not varnish cached.

Which is because of the following snippet in MediaWiki.php

if ( $this->config->get( 'UseSquid' ) &&
        in_array(
                // Use PROTO_INTERNAL because that's what getCdnUrls() uses
                wfExpandUrl( $request->getRequestURL(), PROTO_INTERNAL ),
                $requestTitle->getCdnUrls()
        )
) {
        $output->setCdnMaxage( $this->config->get( 'SquidMaxage' ) );
}

The URL is neither canonical nor one of the getCdnUrls(), so it's not cached, for fear that we won't purge it upon edit.

Varnish seems to be caching it fine (Assuming you don't have cookies, which I imagine pretty much nobody would as not SUL).

Almost everyone has cookies. If nothing else, GeoIP cookies and WMF-Last-Access cookies by the time they reach their second pageview. In a quick look, all outputs seem to contain Vary: Cookie.

It seems like there are 3 basic possible ways one could vary the language output of a URL for a user:

  1. Send them to a different URL (which is effectively what happens with &uselang=fr, but on top of that we don't purge it and don't allow it to be cached)
  2. Set some kind of language cookies and Vary:Cookie to split the outputs (more complicated, because Caches won't be effective unless they also have code put in them to understand your language-cookie and separate it from other cookies for variance purposes. We already have some custom Vary:Cookie hacks to handle MW session cookies in Varnish, which this would all have to sanely blend with...)
  3. Set Vary:A-L and vary on the client's A-L, and optionally have the caches normalize the incoming A-L headers for performance.

But what we have in our output at present seems to be the worst possible blend of all 3? We're changing the URL and making it uncacheable, and also sending Vary:Cookie (but without a language-carrying cookie anyways and without the cache-level support for it), and also hacking in Vary:A-L until that part goes live.

However, if we use ULS to set the language, the assumption might be that people will touch the language selector rarely, so it's not a common case to get to the other URL.

Change 458323 had a related patch set uploaded (by Legoktm; owner: Legoktm):
[mediawiki/extensions/UniversalLanguageSelector@master] Allow $wgULSLanguageDetection to work if $wgULSAnonCanChangeLanguage is false

https://gerrit.wikimedia.org/r/458323

Varnish seems to be caching it fine (Assuming you don't have cookies, which I imagine pretty much nobody would as not SUL).

Almost everyone has cookies. If nothing else, GeoIP cookies and WMF-Last-Access cookies by the time they reach their second pageview. In a quick look, all outputs seem to contain Vary: Cookie.

It seems like there are 3 basic possible ways one could vary the language output of a URL for a user:

  1. Send them to a different URL (which is effectively what happens with &uselang=fr, but on top of that we don't purge it and don't allow it to be cached)
  2. Set some kind of language cookies and Vary:Cookie to split the outputs (more complicated, because Caches won't be effective unless they also have code put in them to understand your language-cookie and separate it from other cookies for variance purposes. We already have some custom Vary:Cookie hacks to handle MW session cookies in Varnish, which this would all have to sanely blend with...)
  3. Set Vary:A-L and vary on the client's A-L, and optionally have the caches normalize the incoming A-L headers for performance.

Ideally #3 will take care of the majority of users, and #1 will be a fallback for the rest.

But what we have in our output at present seems to be the worst possible blend of all 3? We're changing the URL and making it uncacheable, and also sending Vary:Cookie (but without a language-carrying cookie anyways and without the cache-level support for it), and also hacking in Vary:A-L until that part goes live.

As far as I can tell, it looks like MediaWiki is always setting Vary: Cookie? I don't think that is specific to the stuff deployed on fixcopyright.wm.o.

Change 458346 had a related patch set uploaded (by Legoktm; owner: Legoktm):
[mediawiki/extensions/UniversalLanguageSelector@wmf/1.32.0-wmf.20] Allow $wgULSLanguageDetection to work if $wgULSAnonCanChangeLanguage is false

https://gerrit.wikimedia.org/r/458346

Change 458347 had a related patch set uploaded (by Legoktm; owner: Legoktm):
[mediawiki/extensions/UniversalLanguageSelector@wmf/1.32.0-wmf.20] Vary caching on Accept-Language header if it is used

https://gerrit.wikimedia.org/r/458347

Change 458323 merged by jenkins-bot:
[mediawiki/extensions/UniversalLanguageSelector@master] Allow $wgULSLanguageDetection to work if $wgULSAnonCanChangeLanguage is false

https://gerrit.wikimedia.org/r/458323

Change 458346 merged by jenkins-bot:
[mediawiki/extensions/UniversalLanguageSelector@wmf/1.32.0-wmf.20] Allow $wgULSLanguageDetection to work if $wgULSAnonCanChangeLanguage is false

https://gerrit.wikimedia.org/r/458346

Change 458347 merged by jenkins-bot:
[mediawiki/extensions/UniversalLanguageSelector@wmf/1.32.0-wmf.20] Vary caching on Accept-Language header if it is used

https://gerrit.wikimedia.org/r/458347

After those ULS patches, the current status is that MW is setting Vary: Accept-Language unconditionally (I'm pretty sure, but since varnish is unconditionally setting it I'm not sure how to verify that MW is setting it as well), and will change language based on the header (tested with curl -H "Accept-Language: de" "https://fixcopyright.wikimedia.org/").

The language selectors are generating URLs with ?uselang=XX, which is not cached. We don't expect most people to actually use the lang selector so this should be OK. We can check usage of these URLs in the web request logs (I don't have access, just assuming those would have this data) after a day to make sure this expectation is correct.

We're also setting Vary: Accept-Encoding and Cookie. It sounds like Cookie might be problematic?

The language selectors are generating URLs with ?uselang=XX

Why are you not using (jquery.)ULS? I was very confused why "Allow $wgULSLanguageDetection to work if $wgULSAnonCanChangeLanguage is false" is even a thing, because it makes no sense to force a language on people and not allow them to change it. It's not "universal language selector" if it is not used universally :)

CCicalese_WMF claimed this task.

I'm going to mark this as resolved, since the campaign proper is over (although a static message will remain on the site for the time being). If there are any open issues resulting from the discussion above, they should be opened as separate tasks.

Change 519606 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] vcl: remove Vary:AL workaround for fixcopyright.wm.org

https://gerrit.wikimedia.org/r/519606

Change 519606 merged by Ema:
[operations/puppet@production] vcl: remove Vary:AL workaround for fixcopyright.wm.org

https://gerrit.wikimedia.org/r/519606