Page MenuHomePhabricator

Scap deployments are not purging MessageBlobStore (was: Stale localized messages)
Closed, ResolvedPublic

Description

It seems like nowiki does not use w:no:MediaWiki:Rcfilters-show-new-changes, compare nowiki with enwiki.

To clarify; this is enwiki with English sysmsg (note that it has a dtg) and this is enwiki with Norwegian Bokmål sysmsg (note no dtg).

Event Timeline

Restricted Application added subscribers: Danmichaelo, Aklapper. · View Herald Transcript
Aklapper changed the task status from Open to Stalled.May 5 2019, 4:53 PM

Cannot reproduce. Your link shows Vis de nyeste endringene, but it takes a long time until it is displayed.
Does the problem also happen with safemode?

Screenshot from 2019-05-05 18-51-32.png (466×960 px, 100 KB)

Actually, you did reproduce the bug.

The sysmsg is Vis nye endringer siden $1, but the shown message is Vis de nyeste endingene. If you open the same special page at enwiki you will get the correct message.

Note that using uselang=qqx will give (Rcfilters-show-new-changes:…

Tried to go to enwiki, which has the timestamp, and then change the English language to Norwegian Bokmål, and the timestamp goes away.

@Nikerabbit Could this be an error due to the translate update being turned off for a short time?

Aklapper renamed this task from Special page recent changes failes to use MediaWiki:Rcfilters-show-new-changes to Special:RecentChanges on no.wp ignores localized MediaWiki:Rcfilters-show-new-changes message.May 5 2019, 7:39 PM
Aklapper changed the task status from Stalled to Open.

As I wrote as clarification, this also happen on enwiki, i follows the user language. Just click on the links and wait a few minutes. Most likely the message cache is somehow stuck in an outdated state.

It also exists at several other languages; da, nn, sv, fi…

Nikerabbit added a subscriber: Krinkle.

It's not translatewiki.net: https://github.com/wikimedia/mediawiki/blob/wmf/1.34.0-wmf.3/languages/i18n/nb.json#L1451

It's unlikely to be stale MessageCache, because the message is not overridden in the wiki.

It could be stale LocalisationCache, but it is not:

nikerabbit@deploy1001:~$ mwscript eval.php nowiki
> echo wfMessage( 'rcfilters-show-new-changes' )->plain();
Vis nye endringer siden $1

So it is likely stale resource loader blob cache (or whatever is the correct name for it). We have had this before I think.

mw.loader.implement( "mediawiki.rcfilters.filters.ui@0hljuy5", {
...
{...,"rcfilters-show-new-changes":"Vis de nyeste endringene",...} );
}

For some reason, this particular message has failed to update on every wiki that it's been deployed to. I'll force-update it in about half an hour, and will investigate why this happened after.

@Catrope We're watching this on the radar for now, but after your investigation feel free to ping or move back to Inbox for re-triage if we need to patch/CR something.

It turns out MessageBlobStore's normal update mechanism for when messages change was disabled in the WMF config by rOMWC7f1a3bc742ed: Disable MessageBlobStore::clear() via hook, because of T29320: MessageBlobStore::clear() causes scaling problems on multi-server setups with CDB l10ncache. However, that bug might be moot since MessageBlobStore no longer uses the database. I think this probably broke when we stopped running LocalisationUpdate, because the only thing that appears to run refreshMessageBlobs.php (referenced in the comment in that config patch) is l10nupdate-1, which I don't believe we run anymore.

Mentioned in SAL (#wikimedia-operations) [2019-05-06T22:43:05Z] <RoanKattouw> Running refreshMessageBlobs.php on all wikis for T222539

Since the reason that clearing the MessageBlobStore from LocalisationCache was originally disabled is now moot (MBS isn't DB-based anymore), and since MessageBlobStore::clear() is just a touchCheckKey() call nowadays, I think we can safely revert rOMWC7f1a3bc742ed: Disable MessageBlobStore::clear() via hook. I will upload a patch to do that now.

Wow, nice catch!

Indeed, refreshMessageBlobs.php normally ran as part of the localisation update cron, which is disabled at the moment.

However, while we don't update localisation cache from cron anymore, we do still update it as part of Scap deployments (naturally, to update localisation caches). But, we don't run refreshMessageBlobs.php in Scap. We should either run it there, too, or (my recommendation) re-enable $blobStore->clear() and remove support for disabling it.

One thing to keep in mind there is timing. We need to make sure the purge starts after (or lasts until) all servers have been updated as they might otherwise repopulate from stale files. As such, it might still be worth re-using the maintenance script instead for that purpose as last step in Scap. Or, we can check whether there is a tombstone involved and whether it lasts long-enough, but I'm not sure we can rely on that (~ 11 seconds, last I checked, which seems risky).

Krinkle renamed this task from Special:RecentChanges on no.wp ignores localized MediaWiki:Rcfilters-show-new-changes message to Scap deployments are not purging MessageBlobStore (was: Stale localized messages).May 6 2019, 11:23 PM
Krinkle edited projects, added Regression; removed Edit-Review-Improvements-RC-Page.

As I was talking this through with @Krinkle on IRC, he pointed out that in WMF production, we only run rebuildLocalisationCache.php once, on the deployment server, and then distribute the CDB files it generates to the application servers(*). rebuildLocalisationCache.php doesn't run on each app server. Together with $wgLocalisationCacheConf['manualRecache'] = true, this means that LocalisationCache::recache() only runs once, on the deployment server, many seconds (even minutes) before any of the app servers have the new CDB files. Reverting 7f1a3bc742 will cause MessageBlobStore::clear() to be called at this time, but that's not very useful. The MBS cache will be invalidated, but then it'll just be repopulated with stale data.

So instead, we need scap to run refreshMessageBlobs.php near the deployment process, after the new CDB files are in place. Since the MBS cache is shared in WANCache, it only needs to be run once, and running it on the deployment server is OK; it just needs to be run after all the app servers have the new CDB files in place.

(*) Technically what happens is that the generated CDB files get converted to JSON, rsynced to all the app servers, and then converted back to CDB locally on each app server, but for our purposes this is equivalent to just syncing the CDB files directly.

Change 508488 had a related patch set uploaded (by Catrope; owner: Catrope):
[mediawiki/tools/scap@master] Clear MessageBlobStore after syncing i18n data

https://gerrit.wikimedia.org/r/508488

Change 508476 had a related patch set uploaded (by Catrope; owner: Catrope):
[operations/mediawiki-config@master] Revert "Disable MessageBlobStore::clear() via hook"

https://gerrit.wikimedia.org/r/508476

Change 508488 merged by jenkins-bot:
[mediawiki/tools/scap@master] Clear MessageBlobStore after syncing i18n data

https://gerrit.wikimedia.org/r/508488

Change 508488 merged by jenkins-bot:
[mediawiki/tools/scap@master] Clear MessageBlobStore after syncing i18n data

https://gerrit.wikimedia.org/r/508488

Merged in scap master, will have to create a new version of the scap deb and upload it to get it to production. I can do that next week if no one beats me to it.

Change 508488 merged by jenkins-bot:
[mediawiki/tools/scap@master] Clear MessageBlobStore after syncing i18n data

https://gerrit.wikimedia.org/r/508488

Merged in scap master, will have to create a new version of the scap deb and upload it to get it to production. I can do that next week if no one beats me to it.

This version of scap is currently deployed to beta. I noticed that the build time trend ticked up by a bit there -- 30 seconds on top of a 2 minute (most of the time) job:

beta-scap-eqiad-time.png (432×514 px, 15 KB)

I then ran this command in prod (https://tools.wmflabs.org/sal/log/AWq8jYzXEHTBTPG-CVWR) -- I started my stop watch a little after I started the command, I stopped the clock right when the command ended. It took 11m 21s +- 10-15 seconds where I was fumbling with my watch. scap sync for a new wikiversion (worse-case as far as time is concerned) takes about 35 minutes (https://tools.wmflabs.org/sal/production?p=0&q=%22finished+scap%22&d=) so an extra 11 minutes would be a significant increase in the amount of time it takes to sync a new version.

Also, this is meant to run when you run scap pull which currently takes about 30 seconds -- 11 minutes would make SWAT quite a hassle.

Is it possible to refreshMessageBlobs in parallel for each wiki rather than in serial? Is there any other way to avoid the 11 minute penalty?

  1. Yes, it can go in parallel.
  2. Also, I can improve core so that we don't need to run it foreach-wiki in the first place. Literally all this script does is issue one single memcached command. The overhead is all in initialising the PHP process and switching between wikis. The actual command, while it only takes a millisecond to make and send, is currently per-wiki. But doesn't even need to be that. I'll patch that up for next week.
Krinkle triaged this task as Medium priority.

Change 511611 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[mediawiki/core@master] resourceloader: Make MessageBlobStore global check key actually global

https://gerrit.wikimedia.org/r/511611

Change 512046 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[mediawiki/core@wmf/1.34.0-wmf.6] resourceloader: Make MessageBlobStore global check key actually global

https://gerrit.wikimedia.org/r/512046

Change 511611 merged by jenkins-bot:
[mediawiki/core@master] resourceloader: Make MessageBlobStore global check key actually global

https://gerrit.wikimedia.org/r/511611

Change 512048 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[mediawiki/core@wmf/1.34.0-wmf.5] resourceloader: Make MessageBlobStore global check key actually global

https://gerrit.wikimedia.org/r/512048

Change 512046 merged by jenkins-bot:
[mediawiki/core@wmf/1.34.0-wmf.6] resourceloader: Make MessageBlobStore global check key actually global

https://gerrit.wikimedia.org/r/512046

Change 512048 merged by jenkins-bot:
[mediawiki/core@wmf/1.34.0-wmf.5] resourceloader: Make MessageBlobStore global check key actually global

https://gerrit.wikimedia.org/r/512048

Mentioned in SAL (#wikimedia-operations) [2019-05-22T21:47:49Z] <krinkle@deploy1001> Synchronized php-1.34.0-wmf.6/includes/resourceloader/MessageBlobStore.php: T222539 / rMW3cb01cc73ce9 (duration: 00m 56s)

Mentioned in SAL (#wikimedia-operations) [2019-05-22T21:51:06Z] <krinkle@deploy1001> Synchronized php-1.34.0-wmf.5/includes/resourceloader/MessageBlobStore.php: T222539 / rMW734b3d84f752 (duration: 00m 56s)

Change 512219 had a related patch set uploaded (by Thcipriani; owner: Thcipriani):
[mediawiki/tools/scap@master] clear_message_blobs: call once per wikiversions

https://gerrit.wikimedia.org/r/512219

Thanks for cleaning this up, @Krinkle and @thcipriani. I heard about this problem during the hackathon, but forgot to work on it and then went on vacation. The way it was solved is exactly what I had in mind,

Change 513174 had a related patch set uploaded (by Catrope; owner: Catrope):
[operations/mediawiki-config@master] MWScript.php: Mark refreshMessageBlobs.php as a global script

https://gerrit.wikimedia.org/r/513174

Change 513174 merged by jenkins-bot:
[operations/mediawiki-config@master] MWScript.php: Mark refreshMessageBlobs.php as a global script

https://gerrit.wikimedia.org/r/513174

Change 512219 merged by jenkins-bot:
[mediawiki/tools/scap@master] clear_message_blobs: call once

https://gerrit.wikimedia.org/r/512219

Deployed on beta, the old method took ~25 seconds, new method takes ~5 seconds. I'll prep a new prod scap release.

This should now be resolved in production as of Tue, Jun 25, 7:45 AM when the new scap version (3.10.0-1) was released.

Change 508476 merged by jenkins-bot:
[operations/mediawiki-config@master] Revert "Disable MessageBlobStore::clear() via hook"

https://gerrit.wikimedia.org/r/508476

Mentioned in SAL (#wikimedia-operations) [2019-09-30T21:40:08Z] <jforrester@deploy1001> Synchronized wmf-config/CommonSettings.php: T222539 Drop no-op hacky disablement of MessageBlobStore::clear() (duration: 05m 13s)