
Localisation cache must be purged after or during train deploy, not (just) before
Closed, Resolved (Public)

Description

Background

This problem was noticed when Anti-Harassment-Team deployed Special:Investigate to svwiki. Swedish translations were included in 1.36.0-wmf.10, but the English text still appeared on the page even after svwiki was running wmf.10.

One of the new messages was 'checkuser-investigate-tour-targets-desc'. Running echo wfMessage('checkuser-investigate-tour-targets-desc')->text(); on the server gave the Swedish translation, but ResourceLoader was serving the English translation.

@Krinkle fixed the problem by running:

MessageBlobStore::clearGlobalCacheEntry( MediaWiki\MediaWikiServices::getInstance()->getMainWANObjectCache() );

Ongoing problem

Apparently the localisation caches of downstream services are being purged just before the new translations are synced to the servers, effectively delaying new translations until just before the next rollout.

Event Timeline

@Krinkle - feel free to modify or add to the task description!

Krinkle renamed this task from "New message translations don't appear on production sites in the release they were added" to "Localisation cache must be purged after or during train deploy, not (just) before". Sep 25 2020, 9:01 PM

The problem here is with the LocalisationCache service in MediaWiki. We generate this cache (currently in CDB files, but that's not relevant) during the creation of a wmf branch for each week's deployment train.

When Scap generates this cache on the deployment server (prior to syncing it out to servers), the LocalisationCache service dispatches a "purge" instruction toward any services that consume LocalisationCache (LC).

Where it goes wrong is that this purge event goes to external services like Memcached and such, but by that time we've not yet started, or have only just begun, rsyncing the localisation cache files to the mw appserver fleet. This means MW servers have ample opportunity to serve traffic asking for these messages, read them out of their currently stale LC files, and put them right back into the caches we just purged. Then after a little while, rsync replaces the LC files, but the external service will not hear another purge event until the next branch cut or scap-world deploy.
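The ordering problem described above can be reproduced with a toy model. This is only a sketch, not MediaWiki or Scap code: `SharedCache`, `AppServer`, `deploy` and the message key and values are hypothetical stand-ins for an external cache such as Memcached, an appserver with its local LC file, and a translated message.

```python
class SharedCache:
    """Hypothetical stand-in for an external cache such as Memcached."""

    def __init__(self):
        self.store = {}

    def purge(self):
        self.store.clear()


class AppServer:
    """Hypothetical appserver that reads messages from its local LC file."""

    def __init__(self, lc_file, cache):
        self.lc_file = dict(lc_file)
        self.cache = cache

    def rsync(self, new_lc_file):
        # Models rsync replacing the local LC file with the new version.
        self.lc_file = dict(new_lc_file)

    def get_message(self, key):
        # Read-through: on a miss, fill the shared cache from the local file.
        if key not in self.cache.store:
            self.cache.store[key] = self.lc_file[key]
        return self.cache.store[key]


def deploy(purge_before_sync):
    old_lc = {"tour-desc": "English text"}
    new_lc = {"tour-desc": "Swedish text"}
    cache = SharedCache()
    server = AppServer(old_lc, cache)
    server.get_message("tour-desc")  # cache is warm before the deploy

    if purge_before_sync:
        cache.purge()
        server.get_message("tour-desc")  # traffic arrives before rsync is done
        server.rsync(new_lc)
    else:
        server.rsync(new_lc)
        cache.purge()

    # What a user sees after the deploy has finished.
    return server.get_message("tour-desc")


print(deploy(purge_before_sync=True))   # English text
print(deploy(purge_before_sync=False))  # Swedish text
```

In the first call, a single request between the purge and the rsync is enough to put the stale value back into the shared cache, where it stays until something purges again.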

This race condition has to some extent likely always existed. It was however fairly unlikely to be noticed due to a number of things:

  • We used to do nightly localisation update deployments. This meant any issues would likely rectify within hours.
  • We used to do scap-world deploys more often, I think? I'm not sure if we stopped doing them or not, but I get the sense that we used to do a full scap deploy several times a week (at each group for some reason? or at other times/reasons?). Whereas now it seems we do them once on Tuesday and then never again. This again further highlighted the issue. Even a single scap-world deploy right after the initial roll out would suffice.

It's also possible that our recent change to automatically prepare wmf branches 12-24 hours before the actual Tuesday group0 promotion, may have effectively cemented the probability of it definitely being repopulated into downstream caches between purge and deploy. But I'm not 100% sure on that, I haven't yet looked into what we do ahead of time vs what we do during the first group0 promotion. I'm fairly sure we (no longer) do full scaps on group1 and group2, but maybe we still do during group0.

One other thing to note: there are two purges that happen as part of the scap sync-world, and one seems to happen way too soon. @dduvall recently pointed out to me that rebuildLocalisationCache.php does a purge:

https://github.com/wikimedia/mediawiki/blob/master/maintenance/rebuildLocalisationCache.php#L110

This happens before any code has been synced to any of the servers for a new version on Tuesday.

Then, also only on Tuesday or during scap sync-world, after the new versions are in place scap calls refreshMessageBlobs.php (https://github.com/wikimedia/mediawiki-extensions-WikimediaMaintenance/blob/master/refreshMessageBlobs.php) which seems to do the same purge that happens in rebuildLocalisationCache.php.

It's also possible that our recent change to automatically prepare wmf branches 12-24 hours before the actual Tuesday group0 promotion, may have effectively cemented the probability of it definitely being repopulated into downstream caches between purge and deploy. But I'm not 100% sure on that, I haven't yet looked into what we do ahead of time vs what we do during the first group0 promotion. I'm fairly sure we (no longer) do full scaps on group1 and group2, but maybe we still do during group0.

We only do a full scap sync-world on Tuesdays. It's been true for the past 5 or so years that all that's needed on Wednesday and Thursday is a scap sync-wikiversions (which does nothing with l10n), although practice probably varied more broadly among train deployers until the past year or two, when it's been very standardized.

Krinkle triaged this task as Medium priority. Edited Nov 3 2020, 8:38 PM

Came up again. What's the minimum we can do to make this not require manual effort potentially every week?

[…] only on Tuesday or during scap sync-world, after the new versions are in place scap calls refreshMessageBlobs.php (https://github.com/wikimedia/mediawiki-extensions-WikimediaMaintenance/blob/master/refreshMessageBlobs.php) which seems to do the same purge that happens in rebuildLocalisationCache.php.

[…] all that's needed on Wednesday and Thursday is a scap sync-wikiversions (which does nothing with l10n); […]

It does something with l10n in that it swaps out the entire localisation cache for a given wiki with that of the next version, and thus requires running the LocalisationCache purge method. I don't know off-hand, but if for WMF the only thing that does is the line of code that refreshMessageBlobs runs, then perhaps we can make sync-wikiversions include this little step as well after it is done. Should take less than a second in practice and only needs to run once on one wiki (the script is wiki-agnostic).
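The suggestion above amounts to an ordering rule: finish swapping versions first, then purge once. A minimal sketch of that rule follows; `run_deploy` and the step names are hypothetical and do not reflect Scap's actual internals.

```python
def run_deploy(sync_steps, purge):
    """Run all sync steps first, then purge the shared cache exactly once."""
    for step in sync_steps:
        step()
    # Purging last means any cache refill reads the new localisation data.
    # A single call is assumed to be enough because the purge is wiki-agnostic.
    purge()


events = []
run_deploy(
    sync_steps=[
        lambda: events.append("sync wikiversions"),  # hypothetical step name
    ],
    purge=lambda: events.append("purge message blobs"),  # hypothetical step name
)
print(events)
```

Keeping the purge as a separate callable that always runs last makes it harder for a later change to the sync steps to move it back in front of them by accident.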

Change 677318 had a related patch set uploaded (by Thcipriani; author: Ahmon Dancy):

[mediawiki/tools/scap@master] Pass --offline to rebuildLocalisationCache.php

https://gerrit.wikimedia.org/r/677318

Change 677375 had a related patch set uploaded (by Krinkle; author: Krinkle):

[mediawiki/core@master] rebuildLocalisationCache: Add --skip-message-purge and accompanying script

https://gerrit.wikimedia.org/r/677375

Change 677318 merged by jenkins-bot:

[mediawiki/tools/scap@master] Pass --offline to rebuildLocalisationCache.php

https://gerrit.wikimedia.org/r/677318

Change 677616 had a related patch set uploaded (by Ahmon Dancy; author: Ahmon Dancy):

[mediawiki/tools/scap@master] Use --skip-message-purge instead of --offline

https://gerrit.wikimedia.org/r/677616

Change 677375 merged by jenkins-bot:

[mediawiki/core@master] rebuildLocalisationCache: Add --skip-message-purge and accompanying script

https://gerrit.wikimedia.org/r/677375

Change 678987 had a related patch set uploaded (by Legoktm; author: Krinkle):

[mediawiki/core@REL1_36] rebuildLocalisationCache: Add --skip-message-purge and accompanying script

https://gerrit.wikimedia.org/r/678987

Change 678987 merged by jenkins-bot:

[mediawiki/core@REL1_36] rebuildLocalisationCache: Add --skip-message-purge and accompanying script

https://gerrit.wikimedia.org/r/678987

Change 677616 merged by jenkins-bot:

[mediawiki/tools/scap@master] Use --skip-message-purge instead of --offline

https://gerrit.wikimedia.org/r/677616

Change 679391 had a related patch set uploaded (by Ahmon Dancy; author: Ahmon Dancy):

[operations/mediawiki-config@master] MWScript.php: Add purgeMessageBlobStore.php to the wikiless list

https://gerrit.wikimedia.org/r/679391

Change 679391 merged by jenkins-bot:

[operations/mediawiki-config@master] MWScript.php: Add purgeMessageBlobStore.php to the wikiless list

https://gerrit.wikimedia.org/r/679391

Change 679518 had a related patch set uploaded (by Ahmon Dancy; author: Ahmon Dancy):

[mediawiki/tools/scap@master] Feature flag for T263872

https://gerrit.wikimedia.org/r/679518

Change 679522 had a related patch set uploaded (by Ahmon Dancy; author: Ahmon Dancy):

[operations/puppet@production] enable delay_messageblobstore_purge feature flag in beta scap.cfg

https://gerrit.wikimedia.org/r/679522

Change 679518 merged by jenkins-bot:

[mediawiki/tools/scap@master] Feature flag for T263872

https://gerrit.wikimedia.org/r/679518

Change 679522 merged by Effie Mouzeli:

[operations/puppet@production] enable delay_messageblobstore_purge feature flag in beta scap.cfg

https://gerrit.wikimedia.org/r/679522

dancy removed dancy as the assignee of this task. Aug 17 2022, 4:20 PM
dancy subscribed.

Change 825835 had a related patch set uploaded (by Ahmon Dancy; author: Ahmon Dancy):

[mediawiki/tools/scap@master] Call tasks.clear_message_blobs after restarting php-fpm

https://gerrit.wikimedia.org/r/825835

Change 825835 merged by jenkins-bot:

[mediawiki/tools/scap@master] Call tasks.clear_message_blobs after restarting php-fpm

https://gerrit.wikimedia.org/r/825835

Func added subscribers: H78c67c, Func.

I can only reproduce this intermittently, but for example when trying to add a new topic on talk pages through DiscussionTools, the placeholder for the Title field (which should be 標題) is sometimes shown in the original English form. And when creating new pages with the 2010 editor, the toolbar items (進階, 特別字元, 幫手 and 引用) are sometimes shown as "Advanced", "Special characters", "Help" and "Cite".

We added the yue-hans and yue-hant languages in commit 504c1a9, within 1.41.0-wmf.10, but the messages served by ResourceLoader didn't follow the fallback chain to yue; instead they did a shallow fallback straight to en. It seems the message blob cache wasn't rebuilt properly after the new LocalisationCache and/or MessageCache were populated.
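To make the difference between the two behaviours concrete, here is a small sketch. The fallback table, message key and function names are hypothetical and simplified; the real chains are defined by MediaWiki's language data and may contain more steps.

```python
# Hypothetical, simplified fallback chains for illustration only.
FALLBACKS = {
    "yue-hant": ["yue", "en"],
    "yue": ["en"],
}

# Hypothetical message store: yue-hant has no translation of its own yet.
MESSAGES = {
    "yue": {"title-placeholder": "標題"},
    "en": {"title-placeholder": "Title"},
}


def lookup_with_chain(lang, key):
    """Try the language itself, then each fallback language in order."""
    for code in [lang] + FALLBACKS.get(lang, ["en"]):
        text = MESSAGES.get(code, {}).get(key)
        if text is not None:
            return text
    return None


def lookup_shallow(lang, key):
    """Try the language itself, then go straight to English."""
    for code in (lang, "en"):
        text = MESSAGES.get(code, {}).get(key)
        if text is not None:
            return text
    return None


print(lookup_with_chain("yue-hant", "title-placeholder"))  # 標題
print(lookup_shallow("yue-hant", "title-placeholder"))     # Title
```

The second function reproduces the symptom: the English text appears even though a translation exists one step up the chain.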

I'm not sure whether the fact that this deployment train was reverted once is why we're seeing this issue. Did the cache rebuild after the train was reverted, and then persist when the wikis were promoted to wmf.10 again?

Maybe rebuilding the message blob cache only with scap sync-world is not enough?

dancy removed dancy as the assignee of this task. Sep 18 2023, 5:14 PM
matmarex assigned this task to dancy.
matmarex subscribed.

@Func I don't know what the problem you ran into was, but it was almost a year after the work here was done, so I would say that's probably something different. If you see it again, please file a task.