Page MenuHomePhabricator

Save times for changes to translation variable text in centralnotice paralysingly slow
Closed, ResolvedPublic

Description

On average it is taking 3.5-4 minutes to make a 1 character change to the text in translation variables within CentralNotice.

It has just taken me nearly and hour and a half to systematically create 8 variables and copy and paste some text into them.

Making a change triggers a server error after waiting ~2 minutes upon which you have to reload the page which then takes another 2 minutes.

For comparison, to make a change without using Translate Extension, it takes ~15 seconds.

Related Objects

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

Message group recache is needed if/when you add/delete/change message groups.

Message index rebuild is needed if/when you add/delete message keys. Before the it is rebuilt, Special:Translate will show the message but saving a translation can fail (it is the same for translatable pages). If I remember correctly, MessageIndexRebuildJobs have been prioritized, so there shouldn't be very long delays.

Change 459834 had a related patch set uploaded (by Mepps; owner: Mepps):
[mediawiki/extensions/CentralNotice@master] Move from clearCache which deletes everything under the key 'translate-groups' to recache, which just restores the definitions Hope for performance improvements

https://gerrit.wikimedia.org/r/459834

Change 460012 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/mediawiki-config@master] CommonSettings.php: Disable translation at centralnotice

https://gerrit.wikimedia.org/r/460012

Change 460012 merged by jenkins-bot:
[operations/mediawiki-config@master] CommonSettings.php: Disable translation at centralnotice

https://gerrit.wikimedia.org/r/460012

As I have no knowledge of the codebase, disabling the entire thing was my only viable proposal.

This strikes me as incredibly dangerous, seeing as one needs to know the codebase of an extension to determine the impact and potential for damage should it be disabled.

Hi! Some notes here:

  • As far as I can tell, by requesting that CentralNotice administrators refrain from creating or editing banners with translatable messages, we're preventing any affectation of cluster performance due to the specific problem described in this bug (incredibly slow save times for banners with translatable messages).
  • Nonetheless, we do need a careful review of CentralNotice-Translate interaction, to improve performance and code sanity in general, which includes this specific bug.
  • The option suggested by @Nikerabbit is the immediate action we're pursuing, though that should also not be deployed until we're clear about its impact on all bits and pieces of this complex system.
  • @Krenair, indeed we wouldn't disable CN-Translate integration without looking carefully at the impact, and, also, that doesn't look like the right option at this point :)

Thanks all!!!!

@AndyRussG Perf team will keep an eye on this, but please give us a shout if there's help needed here.

Looks like I've found the general area of code causing this. :)

tl;dr: It's very slow calls to MessageCache::updateMessageOverride(), and is the same issue as T158084. The two problems have the same cause because CentralNotice translatable messages are all saved as Mediawiki namespace pages (as well as being saved as CentralNoticeBanners namespace pages).

Pages in the Mediawiki namespace are cached as i18n messages. MessageCache::updateMessageOverride() is called by PageUpdater::doModify(), in a deferred update, when changes are made to Mediawiki namespace pages.

Verified using the debugger shell on production. On mwmaint2001, using a message in a testing banner:

$ mwrepl metawiki
hphpd> $title = Title::makeTitle( NS_MEDIAWIKI, 'test-message-T203925' )
hphpd> $content = ContentHandler::makeContent( 'test-message-T203925-test2', $title )
hphpd> MessageCache::singleton()->updateMessageOverride( $title, $content )

The last call takes about 30 seconds to return.

This is consistent with what I'm seeing on Meta. There, saving a change to banner content for a banner with no translatable messages takes 20-30 seconds, as does saving an edit to a Mediawiki namespace page directly. A CentralNotice banner with a single translatable message takes about double that to save. (In that case, the banner is one i18n message, and the translatable message is another.) Save time goes up by about the same increment for every translatable message added to the banner. The only banner saves that are actually fast are changes to banner settings for banners with no translatable messages (and there are no MessageCache calls in that case).

Finally, it looks like T204026 (huge queries on translate_metadata) is unrelated. In fact, the same query by the Translate extension is triggered by CentralNotice in several contexts. (New task for optimizing this: T205118.)

Since the evidence seems pretty strong that MessageCache is the cause of this, I'd suggest we assume for now that CentralNotice's recaching of the Translate message groups (mentioned above, distinct from the i18n message cache) isn't the problem. Still, we should look into how that's being done, and optimize/fix stuff. (I'll make a task for that tomorrow.)

Thanks, all!! @Imarlier, I think we could definitely use a hand fixing this... Many thx in advance! :)

For contrast, the same mwrelp test on enwiki took about 2 seconds:

$ mwrepl enwiki
hphpd> $title = Title::makeTitle( NS_MEDIAWIKI, 'Gadget-section-test' )
hphpd> $content = ContentHandler::makeContent( 'Testing and development', $title )
hphpd> MessageCache::singleton()->updateMessageOverride( $title, $content )
Imarlier added a comment.EditedSep 25 2018, 8:30 PM

Wrapped @AndyRussG's example above in xhprof, just running in shell:

$title = Title::makeTitle( NS_MEDIAWIKI, 'test-message-T203925' ); 
$content = ContentHandler::makeContent( 'test-message-T203925-test2', $title );
xhprof_enable(XHPROF_FLAGS_NO_BUILTINS);
MessageCache::singleton()->updateMessageOverride( $title, $content );
$xhprof_data = xhprof_disable();

Because it's a little easier to search through, I went ahead and wrote this out to a file:

$o = fopen('/tmp/shell.json', 'w');
fwrite($o, json_encode($xhprof_data, JSON_PRETTY_PRINT));
fclose($o);

MessageCache::replace creates a DeferredUpdate, which is then executed effectively immediately:

"MessageCache::replace==>DeferredUpdates::addCallableUpdate": {
    "ct": 1,
    "wt": 25505482
},
"DeferredUpdates::addCallableUpdate==>DeferredUpdates::addUpdate": {
    "ct": 1,
    "wt": 25505406
},
"DeferredUpdates::tryOpportunisticExecute==>DeferredUpdates::doUpdates": {
    "ct": 1,
    "wt": 25505179
},
"DeferredUpdates::execute==>DeferredUpdates::runUpdate": {
    "ct": 1,
    "wt": 25504901
},

Couple of things to notice here:

  • This isn't a clean stack, xhprof isn't strict about it's ordering.
  • 'ct' is the call count -- how many times the particular function was called.
  • 'wt' is the wall time -- how many seconds were spent in the function
  • XHProf isn't perfect. In particular, if a function gets called more than once via the same path, it appears to just aggregate the wall time.

So, what's taking so long? We know that DeferredUpdates::runUpdate is the caller of the long-running function, so let's find something that is called by that method (scanning for large values in the wt element makes this much easier -- they stand out):

"DeferredUpdates::runUpdate==>MWCallableUpdate::doUpdate": {
    "ct": 1,
    "wt": 25501604
},

MWCallableUpdate::doUpdate calls....

"MWCallableUpdate::doUpdate==>MessageCache::__invoke": {
    "ct": 1,
    "wt": 25501599
},

MessageCache::__invoke calls...:

"MessageCache::__invoke==>MessageCache::loadFromDB": {
    "ct": 1,
    "wt": 25114890
},

MessageCache::loadFromDB calls a number of things (a few of which take up to about half a second to return), but one stands out. In particular the number of calls that are being made -- everything above this was called a single time, this function was called more than 30,000

"MessageCache::loadFromDB==>Revision::getRevisionText": {
    "ct": 30974,
    "wt": 22692983
},

That then calls

"Revision::getRevisionText==>MediaWiki\\Storage\\SqlBlobStore::expandBlob": {
    "ct": 30974,
    "wt": 20763025
},

expandBlob calls getCacheKey() and getCacheTTL(), before calling

"MediaWiki\\Storage\\SqlBlobStore::expandBlob==>WANObjectCache::getWithSetCallback": {
    "ct": 30937,
    "wt": 19065836
},

which in turn calls

"WANObjectCache::getWithSetCallback==>WANObjectCache::doGetWithSetCallback": {
    "ct": 30938,
    "wt": 18369514
},

calls

"WANObjectCache::doGetWithSetCallback==>WANObjectCache::get": {
    "ct": 30938,
    "wt": 17162639
},

calls

"WANObjectCache::get==>WANObjectCache::getMulti": {
    "ct": 30939,
    "wt": 17072192
},

calls

"WANObjectCache::getMulti==>MemcachedPeclBagOStuff::getMulti": {
    "ct": 30939,
    "wt": 16182243
},

which....well, actually everything that MemcachedPeclBagOStuff calls returns in not-insane amounts of time. (Ironically, the only thing that seems to take a particularly egregious amount of time is the debugLog statement, which takes 0.541 seconds to return.). https://github.com/wikimedia/mediawiki/blob/master/includes/libs/objectcache/MemcachedPeclBagOStuff.php#L247 is the only call in that function that's not reflected here (because I configured xhprof to ignore system calls, and this line is calling an extension).

So, best guess, the issue is that we're pulling in nearly 31,000 separate revisions, one at a time.

Let's see how many times MemcachedPeclBagOStuff::getMulti is called in the enwiki example...

"WANObjectCache::getMulti==>MemcachedPeclBagOStuff::getMulti": {
    "ct": 1479,
    "wt": 814252,
    "cpu": 168000,
    "mu": 1075784,
    "pmu": 514352
},

Way less, which makes it seem like this may be a good theory...

I'd suggest focusing in on MessageCache::loadFromDB (and/or Revision::getRevisionText) to see if it's possible to avoid pulling in every revision. Or pull them in batches.

(FWIW, the time for each call being made to memcache is less than 1ms. I ran a couple more tests, and I was seeing response times that were generally in the range of 500-850us per memcache call -- there isn't really much that we're going to be able to do to speed that side of this process up.)

I had a glance at this this evening. My main observation is that 30305 of the 30974 pages that are being loaded here seem to belong to CentralNotice (they begin with "Centralnotice-").

It's possible that MCR changes somehow negatively affected the revision loading here, although I'm not sure about that since either way it should be having to load from ExternalStore. Maybe @daniel has some idea along that line.

When exactly did it start being slow? It'd help if we could try to figure out which changes might have broken it.

MessageCache::loadFromDB is proportional to the number of pages in NS_MEDIAWIKI. MetaWiki has a lot more pages there compared to other wikis. Translatewiki.net has even more, and there we had to implement a workaround of not loading all messages ($wgAdaptiveMessageCache) many years ago, and it's still quite slow.

The task which was duped to this was reported in July 2017, so this looks like organic growth rather than a recent regression, although I could be wrong.

So, best guess, the issue is that we're pulling in nearly 31,000 separate revisions, one at a time.

Does external store support batch queries? If so, making MessageCache batch the revision loading is a low hanging fruit. But that is still a huge amount of revisions to load, so also delaying this via JobQueue or DeferredUpdates seems a good idea.

Does external store support batch queries?

From a pure DB perspective (SQL queries), it should by design not allow range queries- it is a point select, append only store. It would be ok to batch it at application side, but I still don't understand why 30000 rows have to be fully loaded into memory. If this important (honestly, I don't fully understand how this works at high level), this should have its own dedicated, separate in-memory storage/database? MW abstractions do not really work well at large scales. I would involve Performance-Team at the very least.

Does external store support batch queries?

It does not, but it would not be terribly hard to add that. A batch losing interface would have to be added to all layers though - ExternalStore, BlobStore, RevisionStore (as well as the not-yet-existing PageRecordStore and and ContentStore).

One thing that is unclear to me is why the message cache is re-loaded at all. When a message is updated, it should just replace that message in the cache, using the updateMessageOverride() method, no?

Anomie added a subscriber: Krinkle.Sep 26 2018, 4:35 PM

Does external store support batch queries?

ExternalStore::batchFetchFromURLs() exists.

but I still don't understand why 30000 rows have to be fully loaded into memory.

It's trying to copy all messages less than a configurable size (1024 bytes on WMF sites) into memcached or the like, instead of having to hit the database for every message used on every page view.

this should have its own dedicated, separate in-memory storage/database

That's sort of what it's doing. The pages on the wiki, stored in the normal manner as any other page, are being copied into the message cache (in APC and Memcached on WMF sites).

One thing that is unclear to me is why the message cache is re-loaded at all. When a message is updated, it should just replace that message in the cache, using the updateMessageOverride() method, no?

updateMessageOverride() is what it's using. But it looks like MessageCache->replace() reloads all the messages from the DB rather than just updating the one that was changed. Fixing that seems like it could be an easy win here.

It looks like the current logic was added in rMWc962b480568e: Avoid races in MessageCache::replace(). Prior to that it used $this->load() instead of $this->loadFromDB() and so would reuse existing data from memcached instead of reloading everything from the database each time. I haven't dug into it, but it looks like @aaron and @Krinkle were running into a bug on PS13 there where messages weren't getting updated in APC so it was having to wait for the APC cache to expire before the change would show up. The solution in PS15 was to reload everything from the DB, we may need to go back and figure out a different fix.

If I understand correctly, there aren't 3000 hits to ExternalStore (MySQL), rather, there are 3000 hits to memcached keys for values originally from ExternalStore.

And that is simply because Memcached is fast (1ms RTT), but calling it 3000 times is still a problem.

When viewing a regular wiki page, the interface can contain 100s or even 1000s of unique interface messages, which load fine because we don't even do N memcached lookups for interface messages normally. Normally, the Git-based localisation cache is batch-loaded from disk into Memc and APC as a single key. And the layer over top of that (per-wiki overrides) is also batch-loaded from a single Memc key (except for pages bigger than the threshold).

But the populating of that "per-wiki overrides batch key in Memcached" is using a loop that calls Revision::getRevisionText() which typically hits Memcached here, so that's where the issue comes from. Aaron and I previously investigated a similar issue, which is that the "per-wiki overrides batch key" has a value that is almost too big on Meta-Wiki to store in a single key. It's currently at about 85% capacity for how much a single key can store.

The real problem here is that none of these keys are considered "interface message" keys. They are not used in the cases where this cache was built for. They're just in the way taking space. In the specific cases where banners are used, they can totally afford a 1ms roundtrip to Memcached to retrieve it. Not to mention the fact that we definitely don't need to preload the last 10 years of CentralNotice banners, variations, and sub messages thereof.

We're soon gonna reach that quota and then we'll need to deal with fixing CentralNotice and/or core's MessageCache to not preload these in the first place. Instead, it should probably only preload overrides for message that "exist" (e.g. correlate to a localisation cache key).

I guess for right now we're concerned with how we can quickly compute this huge unused value. However, we shouldn't (only) use batchFetchFromURLs(). We first need to expose a getMulti for the Memc keys of getRevisionText.

So apparently, I was confusing this with T204026, but I guess that is a different issue (ES vs metawiki translation_metadata), so disregard my comments.

Change 463123 had a related patch set uploaded (by Anomie; owner: Anomie):
[mediawiki/core@master] MessageCache: replace should actually replace, not reload

https://gerrit.wikimedia.org/r/463123

The problems with CentralNotice's (ab)use of MessageCache are well documented at https://www.mediawiki.org/wiki/Requests_for_comment/CentralNotice_backend_improvements#Message_caching

AndyRussG added a comment.EditedSep 27 2018, 5:45 PM

Hi! Thanks so much, everyone!!

It's possible that MCR changes somehow negatively affected the revision loading here

I think the problem started before that (see below).

When exactly did it start being slow?

It looks like the current logic was added in rMWc962b480568e: Avoid races in MessageCache::replace().

That matches the timeline for T158084: Mediawiki namespace pages, including CentralNotice banners, are slow to save, where @spatton reported on Feb 14, 2017 that saving banners had been slow for about a week. rMWc962b480568e: Avoid races in MessageCache::replace() was deployed to Meta with MW 1.29/wmf_10 on February 1, 2017. (We only recently figured out that the issue is related to much worse save times for banners with translatable messages, first reported in July, 2017, in T170591).

The real problem here is that none of these keys are considered "interface message" keys. They are not used in the cases where this cache was built for. They're just in the way taking space. In the specific cases where banners are used, they can totally afford a 1ms roundtrip to Memcached to retrieve it. Not to mention the fact that we definitely don't need to preload the last 10 years of CentralNotice banners, variations, and sub messages thereof.

The problems with CentralNotice's (ab)use of MessageCache are well documented at https://www.mediawiki.org/wiki/Requests_for_comment/CentralNotice_backend_improvements#Message_caching

Yes, for sure. Also see T44488 and T33595. There's also quite a lot more cruft and tech debt in CentralNotice (almost all the backend code as well as the administration UI could use re-doing). At least, we need to work on a new RFC about all of this, and prioritize it.

Currently, CentralNotice uses Revision::getContent() in the administration UI to show the content of a banner's translatable messages. So, as far as that UI is concerned, it's not essential to update MessageCache before the form returns. (Note: that part of the UI has been broken-ish for a while--see T72939.)

For displaying the banner, CentralNotice does retrieve message content as a message, using IContextSource::msg().

For any possible solution, I think the top priority for now is stability at scale, since high-volume fundraising campaigns are coming up soon. Wkimedia Fundraising does not make heavy use of translatable messages in banners, though WMDE does, I think, as do some community campaigns.

@AndyRussG Using the message interface in that way is totally fine.

My intention with T193271 is to change MessageCache to only pre-cache MW-pages that represent interface message overrides for message keys recognised by the software. See that task for more details.

The pages will still exist, and they will still be available from wfMessage(). It will still be possible for users to create custom MediaWiki namespace pages by any arbitrary name and obtain them via magic words like {{int:}}, or inside the sidebar processing logic, or indeed in the way that CentralNotice does. The only difference will be that those messages get fetched from a dedicated key in Memcached for each CentralNotice message, instead of from the combined blob that currently has to contain all pages from the MediaWiki namespace on Meta-Wiki, including 10 years of CentralNotice banner history.

My intention with T193271 is to change MessageCache to only pre-cache MW-pages that represent interface message overrides for message keys recognised by the software. See that task for more details.

Interesting!

Here's a limited, temporary workaround for banners using translatable messages: use only two or three messages. Though save times will be long (1 minute or more), I think it'll work. You can include HTML fragments in the messages, so it should be doable, though cumbersome.

Here's a limited, temporary workaround for banners using translatable messages: use only two or three messages. Though save times will be long (1 minute or more), I think it'll work. You can include HTML fragments in the messages, so it should be doable, though cumbersome.

BTW, just to clarify, I meant, a temporary workaround for banner authors' difficulties using translatable messages, for cases when they're really needed.

T108736: Deleting a banner should delete everything related to such banner might also be helpful for reducing the vast number of messages.

I just realized that MCR might case Revision::getRevisionText to return null. This would effectively disable the message cache.

That should not be the case on meta.wikimedia.org (where reading the new schema isn't enabled yet), and shouldn't happen with the code in MessageCache as far as I can see. But I have filed a ticket for putting safeguards in place, see T205808: Ensure Revision::getRevisionText continues to function with MCR read-new mode.

Agabi10 added a subscriber: Agabi10.Oct 1 2018, 8:25 AM
Imarlier moved this task from Inbox to Doing on the Performance-Team board.Oct 1 2018, 8:29 PM

@aaron - Please take a look and speak up if there's obvious ways to refactor/address this.

DStrine lowered the priority of this task from Unbreak Now! to High.Oct 1 2018, 9:17 PM

Changing this from UBN to high. It was only UBN for fr-tech when we thought the issues lied solely with CentralNotice.

Change 464713 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@master] messagecache: use MergeableUpdate for the deferred replace() update

https://gerrit.wikimedia.org/r/464713

Change 464713 merged by jenkins-bot:
[mediawiki/core@master] messagecache: use MergeableUpdate for the deferred replace() update

https://gerrit.wikimedia.org/r/464713

Thanks much for this!!!! I debugged through the merged patch locally. Unfortunately... it looks like it might not fix the problem, since, at least locally, the mergeable updates are never merged. This is because each is added to DeferredUpdates within the execution of a different AtomicSectionUpdate. (See the Gerrit change for more details.)

Change 463123 merged by jenkins-bot:
[mediawiki/core@master] MessageCache: replace should actually replace, not reload

https://gerrit.wikimedia.org/r/463123

What is the status of this with the recent changes merged? Is it still a high priority?

What is the status of this with the recent changes merged? Is it still a high priority?

Hi! To be sure, I think we need to wait to see how it works on production. The user-facing issue doesn't occur on the beta cluster even. I think the changes will go out on this week's train, so they should be deployed to Meta on Wednesday. I think it's pretty sure it'll resolve the issue, though!!! :)

Thanks!!!!!!

Change 459834 merged by jenkins-bot:
[mediawiki/extensions/CentralNotice@master] Move from clearCache which deletes everything under the key 'translate-groups' to recache, which just restores the definitions

https://gerrit.wikimedia.org/r/459834

@Anomie, @aaron, @Imarlier and @AndyRussG - save time for banners has plummeted in the last couple weeks and it's amazing. I'm going from an average of 30 seconds - 1 minute to clone or save changes, down to more like 5 seconds. It noticeably improves my ability to complete tasks, as the old delay fostered ineffective multitasking.

It's noticeable and awesome. TY very much.

@Anomie, @aaron, @Imarlier and @AndyRussG - save time for banners has plummeted in the last couple weeks and it's amazing. I'm going from an average of 30 seconds - 1 minute to clone or save changes, down to more like 5 seconds. It noticeably improves my ability to complete tasks, as the old delay fostered ineffective multitasking.

It's noticeable and awesome. TY very much.

@spatton that's great to hear!

@Anomie @aaron @AndyRussG Any reason not to resolve at this point?

aaron closed this task as Resolved.Nov 1 2018, 7:51 PM