Page MenuHomePhabricator

Remove direct access to the text table from the Translate extension.
Closed, ResolvedPublic3 Story Points

Description

The Translate extension relies on direct access to the text table, though joining against rev_text_id, in several places. This will no longer be possible with the new MCR schema, and should be replaced as described in T198341, T198342, and T198343.

The Translate extension is a somewhat special case as the transition to the new schema may cause problems with respect to backwards compatibility with older versions of MediaWiki, as well as performance issues, since bulk queries of page content are no longer possible.

Details

Related Gerrit Patches:

Related Objects

StatusAssignedTask
OpenNone
OpenNone
OpenNone
OpenNone
OpenNone
ResolvedNone
ResolvedCCicalese_WMF
Resolveddaniel
ResolvedPchelolo
Resolveddaniel
ResolvedBPirkle
ResolvedPchelolo
ResolvedPchelolo
ResolvedPchelolo
ResolvedPchelolo
ResolvedPchelolo
Resolveddaniel
Resolveddaniel

Event Timeline

daniel created this task.Jul 22 2019, 3:11 PM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptJul 22 2019, 3:11 PM

Change 524593 had a related patch set uploaded (by Daniel Kinzler; owner: Daniel Kinzler):
[mediawiki/extensions/Translate@master] Remove direct access to text table

https://gerrit.wikimedia.org/r/524593

Tagging the language team for input on performance implications and backwards compatibility. @Nikerabbit, any thoughts?

"Bulk queries of page content are no longer possible" -- a performance regression like this this is unacceptable in my view.

WDoranWMF lowered the priority of this task from High to Normal.Jul 24 2019, 9:47 PM
WDoranWMF moved this task from MCR to mop on the Core Platform Team board.Jul 26 2019, 6:38 PM

Change 537483 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[mediawiki/extensions/Translate@master] WIP: use batch access for revisions

https://gerrit.wikimedia.org/r/537483

WDoranWMF set the point value for this task to 3.Sep 17 2019, 4:30 PM

The patch above handles usages in production code. There's 3 usages left in maintenance scripts and import action.

Those are quite fixable as well, but I'm inclined to separate those into it's own patch. Also, it would be nice to know if those scripts are still used/useful and whether maintaining pre-1.34 compatibility is useful (are those expected to be used by third-party wikis?)

There's 3 usages left in maintenance scripts and import action.

Could you name the files?

There's 3 usages left in maintenance scripts and import action.

Could you name the files?

scripts/fuzzy.php, scripts/populateFuzzy.php and MessageWebImporter

MessageWebImporter is used by Special:ImportTranslations which is used by translators that prefer to work with offline translation tools. It needs to keep BC per MLEB release policy.

fuzzy.php is used by translation admins in translatewiki.net, often as our only option for cleaning up bad translations. It is a command line script so it is less performance sensitive and may not require batch loading of contents. I'm fine forgoing BC code here if it makes things simpler, but would be nice to have it.

populateFuzzy.php is an old maintenance script to populate revtag table. It can be removed, unless there is a trivial fix to keep it working (even slowly or without BC).

Change 537736 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[mediawiki/extensions/Translate@master] Replace Revision::getRevisionText for importer and maintenance scripts

https://gerrit.wikimedia.org/r/537736

Change 538107 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[mediawiki/extensions/Translate@master] Don't use rev_text_id to update translation tag for null revision

https://gerrit.wikimedia.org/r/538107

daniel added a comment.EditedSep 23 2019, 11:48 AM

I just realized that old_text and old_flags are also referenced directly in a couple of places. That needs to be fixed as well. Search:
https://codesearch.wmflabs.org/search/?q=%5B%20%27%22%3E.%3D%5Dold_(id%7Cflags%7Ctext)&i=nope&files=&repos=Extension:Translate

All of the above are addressed by the existing patches.

All of the above are addressed by the existing patches.

Thank you for double checking!

Change 538107 merged by jenkins-bot:
[mediawiki/extensions/Translate@master] Don't use rev_text_id to update translation tag for null revision

https://gerrit.wikimedia.org/r/538107

Here is a profiling visualization. Do note that this was taken with Ib33b4afc14f1098 cherry-picked on top of my optimization with reduces use of Title::makeTitleSafe (T230100). Still it manages to be about 20x slower.

Compare with before both patches:

Change 524593 abandoned by Daniel Kinzler:
Remove direct access to text table

Reason:
superseded

https://gerrit.wikimedia.org/r/524593

Here is a profiling visualization. Do note that this was taken with Ib33b4afc14f1098 cherry-picked on top of my optimization with reduces use of Title::makeTitleSafe (T230100). Still it manages to be about 20x slower.

The added layers of abstraction will make things a bit slower, but I'm surprised by the slowdown. I would expect this to be I/O-bound. Does the profiling reflect CPU time or wall-clock time?

My guess is that there must be something wrong with the DB queries - either we are running way to many, or one of them got really inefficient. The first thing I'd try is to count the number of queries. It's going to be larger, but should not be x20 larger. Maybe x2.

What's the test setup? What do I have to do to reproduce this?

...looking at the picture, I see a lot of calls to Title::newFromId, and RevisionStore::loadSlotRecords. Both should not be called much when RevisionStore::newRevisionsFromBatch(). It's designed to avoid exactly this. So either there is more code that needs to be changed to use newRevisionsFromBatch(), or there is a bug that prevents the optimization from being effective.

Here is a profiling visualization. Do note that this was taken with Ib33b4afc14f1098 cherry-picked on top of my optimization with reduces use of Title::makeTitleSafe (T230100). Still it manages to be about 20x slower.

The added layers of abstraction will make things a bit slower, but I'm surprised by the slowdown. I would expect this to be I/O-bound. Does the profiling reflect CPU time or wall-clock time?

I don't know honestly. Default xdebug output. It is definitely not IO bound with profiling enabled, and judging by the 20x increase, I don't think that is explained by 20x more IO.

What's the test setup? What do I have to do to reproduce this?

FWIW we chatted about this on IRC briefly. I was measuring our production (twn that is) mediawiki core export that we are optimizing anyway in T230100.

...looking at the picture, I see a lot of calls to Title::newFromId, and RevisionStore::loadSlotRecords. Both should not be called much when RevisionStore::newRevisionsFromBatch(). It's designed to avoid exactly this. So either there is more code that needs to be changed to use newRevisionsFromBatch(), or there is a bug that prevents the optimization from being effective.

Does this depend on any recent core changes? I was testing with about week old core code.

Pchelolo added a comment.EditedSep 25 2019, 4:07 PM

Does this depend on any recent core changes? I was testing with about week old core code.

Only newRevisionsFromBatch introduced in T228988 and a further optimization in T233173, so anything after 1.24 should be good.

I've uploaded a new version of the patch that attempts to mitigate the performance degradation. The problem was that we were still using the anti-pattern of creating a batch of revisions looping through db rows, just indirectly via ThinMessage

Change 539324 had a related patch set uploaded (by Daniel Kinzler; owner: Daniel Kinzler):
[mediawiki/core@master] RevisionSTore: Introduce getSlotRowsForBatch

https://gerrit.wikimedia.org/r/539324

daniel raised the priority of this task from Normal to High.Sep 30 2019, 2:38 PM

Bumping to high, since this blocks 1.34

Change 539324 merged by jenkins-bot:
[mediawiki/core@master] RevisionStore: Introduce getContentBlobsForBatch

https://gerrit.wikimedia.org/r/539324

Change 537483 merged by jenkins-bot:
[mediawiki/extensions/Translate@master] Use batch access for revisions instead of Revision::getRevisionText

https://gerrit.wikimedia.org/r/537483

Change 537736 merged by jenkins-bot:
[mediawiki/extensions/Translate@master] Replace Revision::getRevisionText for importer and maintenance scripts

https://gerrit.wikimedia.org/r/537736

Pchelolo closed this task as Resolved.Oct 7 2019, 7:07 PM

All access has been either removed or gated. Resolving.

Change 547327 had a related patch set uploaded (by Jforrester; owner: Ppchelko):
[mediawiki/extensions/Translate@REL1_34] Replace Revision::getRevisionText for importer and maintenance scripts

https://gerrit.wikimedia.org/r/547327

Change 547327 abandoned by Jforrester:
Replace Revision::getRevisionText for importer and maintenance scripts

https://gerrit.wikimedia.org/r/547327