
Localisation cache must be purged after or during train deploy, not (just) before
Open, MediumPublic

Description

Background

This problem was noticed when Anti-Harassment deployed Special:Investigate to svwiki. Swedish translations were included in 1.36.0-wmf.10, but the English text still appeared on the page even after svwiki was running wmf.10.

One of the new messages was 'checkuser-investigate-tour-targets-desc'. Running echo wfMessage('checkuser-investigate-tour-targets-desc')->text(); on the server gave the Swedish translation, but ResourceLoader was serving the English translation.

@Krinkle fixed the problem by running:

MessageBlobStore::clearGlobalCacheEntry( MediaWiki\MediaWikiServices::getInstance()->getMainWANObjectCache() );

Ongoing problem

Apparently the localisation caches of downstream services are being purged just before the new translations are synced to the servers, effectively delaying new translations until just before the next rollout.

Event Timeline

Restricted Application added a subscriber: Aklapper. · Sep 25 2020, 5:55 PM

@Krinkle - feel free to modify or add to the task description!

Krinkle renamed this task from "New message translations don't appear on production sites in the release they were added" to "Localisation cache must be purged after or during train deploy, not (just) before". · Sep 25 2020, 9:01 PM
Krinkle added a comment. · Edited · Sep 25 2020, 9:53 PM

The problem here is with the LocalisationCache service in MediaWiki. We generate this cache (currently in CDB files, but that's not relevant) during the creation of a wmf branch for each week's deployment train.

When Scap generates this cache on the deployment server (prior to syncing it out to servers), the LocalisationCache service dispatches a "purge" instruction toward any services that consume LocalisationCache (LC).

Where it goes wrong is that this purge event goes to external services such as Memcached, but by that time we've not yet started, or have only just begun, rsyncing the localisation cache files to the mw appserver fleet. This means MW servers have ample opportunity to serve traffic asking for these messages, read them out of their currently stale LC files, and put them right back into the caches we just purged. Then, after a little while, rsync replaces the LC files, but the external service will not hear another purge event until the next branch cut or scap-world deploy.
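The ordering problem described above can be reproduced in a toy model. The following is only a sketch: `AppServer`, `SharedCache`, the message key and the message texts are all hypothetical stand-ins, not MediaWiki or Scap APIs. It assumes a read-through cache that refills from whatever LC data the server currently has on disk.

```python
# Toy model of the purge-before-sync race. All names are hypothetical
# stand-ins, not MediaWiki or Scap APIs.

class AppServer:
    """An app server that reads messages from its local LC files."""

    def __init__(self, lc_file):
        self.lc_file = dict(lc_file)

    def sync(self, new_lc_file):
        """Stand-in for rsync replacing the LC files on this server."""
        self.lc_file = dict(new_lc_file)


class SharedCache:
    """Stand-in for an external cache such as Memcached."""

    def __init__(self):
        self.store = {}

    def purge(self):
        self.store.clear()

    def get_message(self, key, server):
        # On a miss, refill from whatever the server has on disk right now.
        if key not in self.store:
            self.store[key] = server.lc_file[key]
        return self.store[key]


old_lc = {"tour-desc": "English text"}
new_lc = {"tour-desc": "Svensk text"}

server = AppServer(old_lc)
cache = SharedCache()
cache.get_message("tour-desc", server)   # cache is warm with the old text

cache.purge()                            # 1. purge is sent first
cache.get_message("tour-desc", server)   # 2. traffic re-caches the stale text
server.sync(new_lc)                      # 3. new LC files arrive afterwards

print(server.lc_file["tour-desc"])             # Svensk text
print(cache.get_message("tour-desc", server))  # English text
```

The final state mirrors the symptom in the task description: reading directly from the server's localisation data gives the new translation, while the shared cache keeps serving the old one until something purges it again.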

This race condition has to some extent likely always existed. It was however fairly unlikely to be noticed due to a number of things:

  • We used to do nightly localisation update deployments. This meant any issues would likely rectify within hours.
  • We used to do scap-world deploys more often, I think? I'm not sure if we stopped doing them or not, but I get the sense that we used to do a full scap deploy several times a week (at each group for some reason? or at other times/reasons?), whereas now it seems we do them once on Tuesday and then never again. This further highlighted the issue. Even a single scap-world deploy right after the initial rollout would suffice.

It's also possible that our recent change to automatically prepare wmf branches 12-24 hours before the actual Tuesday group0 promotion has made it all but certain that stale messages get repopulated into downstream caches between purge and deploy. But I'm not 100% sure about that; I haven't yet looked into what we do ahead of time vs. what we do during the first group0 promotion. I'm fairly sure we (no longer) do full scaps on group1 and group2, but maybe we still do during group0.

One other thing to note: there are two purges that happen as part of the scap sync-world, and one seems to happen way too soon. @dduvall recently pointed out to me that rebuildLocalisationCache.php does a purge:

https://github.com/wikimedia/mediawiki/blob/master/maintenance/rebuildLocalisationCache.php#L110

This happens before any code has been synced to any of the servers for a new version on Tuesday.

Then, also only on Tuesday or during scap sync-world, after the new versions are in place scap calls refreshMessageBlobs.php (https://github.com/wikimedia/mediawiki-extensions-WikimediaMaintenance/blob/master/refreshMessageBlobs.php) which seems to do the same purge that happens in rebuildLocalisationCache.php.

It's also possible that our recent change to automatically prepare wmf branches 12-24 hours before the actual Tuesday group0 promotion has made it all but certain that stale messages get repopulated into downstream caches between purge and deploy. But I'm not 100% sure about that; I haven't yet looked into what we do ahead of time vs. what we do during the first group0 promotion. I'm fairly sure we (no longer) do full scaps on group1 and group2, but maybe we still do during group0.

We only do a full scap sync-world on Tuesdays. For the past 5 or so years, all that's been needed on Wednesday and Thursday is a scap sync-wikiversions (which does nothing with l10n), although practice probably varied more widely among train deployers until the past year or two, when it became very standardized.

Krinkle triaged this task as Medium priority. · Edited · Tue, Nov 3, 8:38 PM

Came up again. What's the minimum we can do to make this not require manual effort potentially every week?

[…] only on Tuesday or during scap sync-world, after the new versions are in place scap calls refreshMessageBlobs.php (https://github.com/wikimedia/mediawiki-extensions-WikimediaMaintenance/blob/master/refreshMessageBlobs.php) which seems to do the same purge that happens in rebuildLocalisationCache.php.

[…] all that's needed on Wednesday and Thursday is a scap sync-wikiversions (which does nothing with l10n); […]

It does something with l10n in that it swaps out the entire localisation cache for a given wiki with that of the next version, and thus requires running the LocalisationCache purge method. I don't know off-hand, but if for WMF the only thing that purge does is the line of code that refreshMessageBlobs runs, then perhaps we can make sync-wikiversions include this little step as well, after it is done. It should take less than a second in practice and only needs to run once on one wiki (the script is wiki-agnostic).
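The ordering proposed here, switching versions first and purging afterwards, can be sketched with plain dictionaries. This is an illustration under assumptions, not Scap's actual behaviour: `read_through` and `switch_version_then_purge` are hypothetical helpers, and clearing a dictionary merely stands in for the purge that refreshMessageBlobs performs.

```python
# Hypothetical sketch: purge the shared cache only after the new
# localisation data is live. Plain dicts stand in for the LC files
# and the external cache.

def read_through(cache, lc_files, key):
    """Return a message, filling the shared cache from LC data on a miss."""
    if key not in cache:
        cache[key] = lc_files[key]
    return cache[key]


def switch_version_then_purge(cache, lc_files, new_lc):
    """Swap in the next version's LC data, then purge the cache once."""
    lc_files.clear()
    lc_files.update(new_lc)   # stand-in for the version switch
    cache.clear()             # stand-in for the message blob purge


lc_files = {"tour-desc": "English text"}
cache = {}

read_through(cache, lc_files, "tour-desc")   # cache warmed with the old text
switch_version_then_purge(cache, lc_files, {"tour-desc": "Svensk text"})

after = read_through(cache, lc_files, "tour-desc")
print(after)  # Svensk text
```

Because the purge runs after the swap, the next cache miss can only refill from the new data. In a real fleet the purge would also have to wait until every server has switched, which this single-server sketch does not model.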