
Regression: Maintenance/importDump.php is 20 times slower than before
Closed, Resolved · Public

Description

Hi there!

After upgrading from 1.24.x to 1.26.2+ (including the latest 1.27.1) we noticed that importing XML dumps took about 20 times longer than before.

Debugging shows that the code added to "WikiImporter->finishImportPage" in https://phabricator.wikimedia.org/rMW341dfa2587220c8e9dff5866036b3092ceb682c4 (lines 368 to 387) slows down imports massively: when calling "maintenance/importDump.php", we get between 8 and 11 revisions/s imported instead of ~250 revisions/s.

Some context:
We use "importDump.php" to synchronize our main wiki (read-write, restricted access) to a number of read-only, "front-line" wikis. The main wiki has about 45,000 pages, and synchronization should run hourly. Users are not amused, as this process now takes more than 100 minutes (instead of the previous 6 minutes).
Calling "initSiteStats.php" once afterwards would take less than two seconds -- if we even cared about page counter bling-bling...
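
As a rough sanity check on these figures, the sketch below converts a revision count and an import rate into minutes. The revision count of 90000 is an assumption derived from the numbers above (6 minutes at ~250 revisions/s); it is not stated anywhere in this report, and `minutes_for` is a hypothetical helper name.

```shell
# minutes_for REVISIONS RATE: minutes needed to import REVISIONS at RATE revisions/s.
minutes_for() {
    awk -v n="$1" -v r="$2" 'BEGIN { printf "%.1f\n", n / r / 60 }'
}

# Assumption: a 6-minute run at ~250 revisions/s implies
# roughly 250 * 6 * 60 = 90000 revisions per synchronization.
minutes_for 90000 250   # old rate
minutes_for 90000 11    # new rate, best case
minutes_for 90000 8     # new rate, worst case
```

Under that assumption, the slower rates give run times in the same range as the observed "more than 100 minutes", so the measured throughput and the wall-clock times are roughly consistent with each other.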

Best regards,
Thomas

Event Timeline

Tvoigt2 created this task. Sep 2 2016, 2:09 PM
Restricted Application added a subscriber: Aklapper. Sep 2 2016, 2:09 PM
TTO added a comment. Edited Sep 2 2016, 2:19 PM

Hmm, I really thought we'd seen the end of this issue...

I'll have a look and see what can be done. Certainly if you want a quick fix you can just remove this code from your installations, as it is really meant for transwiki imports rather than dump imports.

I'm surprised no-one else has reported such slowness. You'd think someone else would have reported a 20 times slowdown, particularly since 1.26 has been out for a long time. Edit: In fact it was even in 1.25. Are you sure it is not your DB or app server(s) that are slow?

Tvoigt2 added a comment. Edited Sep 2 2016, 3:18 PM

> Certainly if you want a quick fix you can just remove this code from your installations, as it is really meant for transwiki imports rather than dump imports.

Sure, already done. But an upstream fix never hurts ;-)

> Edit: In fact it was even in 1.25.

We skipped 1.25, started migrating to 1.26.2 and decided to wait for 1.27 LTS due to time constraints.

> Are you sure it is not your DB or app server(s) that are slow?

Removing the new callback code brings import speed back to ~250 revs/s in 1.27.1.

Regardless of whether our servers are fast or slow: Does one buggy informational counter (and its implied ego boost) justify tens of thousands of additional page instantiations and database lookups?

Maybe "WikiImporter" needs an additional property or subtype to differentiate between bulk import and SpecialPage scenarios: When called via "maintenance/importDump.php", it would revert to the previous broken-counter behaviour and warn that the page counter may be wrong (just as importDump.php already urges you to run "rebuildRecentChanges.php" afterwards). Running "importDump.php" implies shell access, so suggesting an additional run of "initSiteStats.php" -- which takes no time at all -- seems acceptable.
I'd rather not have importDump.php update the counters itself, as we batch some hundred XML import files containing 1000 revisions each to mitigate PHP's shortcomings in parsing large XML documents...
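
A minimal sketch of the batch workflow described here, assuming the import files sit in one directory and the follow-up scripts should run only once at the end. The function name `import_batches`, the `PHP_BIN` variable, and the two path arguments are hypothetical names chosen for this example; the three maintenance scripts are the ones named in this task, called without extra options.

```shell
#!/bin/sh
# Sketch: import every XML batch file in a directory, then refresh
# RecentChanges and the site statistics once, after all batches.
# Usage: import_batches MW_ROOT DUMP_DIR
# PHP_BIN (hypothetical) overrides the PHP binary; it defaults to "php".
import_batches() {
    mw_root=$1
    dump_dir=$2
    php_bin=${PHP_BIN:-php}

    for dump in "$dump_dir"/*.xml; do
        # Skip the unexpanded pattern when the directory has no .xml files.
        [ -e "$dump" ] || continue
        "$php_bin" "$mw_root/maintenance/importDump.php" "$dump" || return 1
    done

    # Follow-up scripts run once per synchronization, not once per file.
    "$php_bin" "$mw_root/maintenance/rebuildrecentchanges.php" || return 1
    "$php_bin" "$mw_root/maintenance/initSiteStats.php"
}
```

A call would look like `import_batches /path/to/mediawiki /path/to/batches`, with both paths standing in for the real locations. Stopping at the first failed import is a design choice of this sketch, so that the statistics are not rebuilt on top of a partial synchronization.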

Thanks and best regards,
Thomas

Change 330223 had a related patch set uploaded (by Subins2000):
Disable statistics update on import with maintenance/importDump.php

https://gerrit.wikimedia.org/r/330223

Change 330223 merged by jenkins-bot:
Disable statistics update on import with maintenance/importDump.php

https://gerrit.wikimedia.org/r/330223

TTO closed this task as Resolved. Jan 4 2017, 1:35 AM
TTO assigned this task to subins2000.

The statistics update is now disabled when using maintenance/importDump.php. The script displays the message:

You might want to run rebuildrecentchanges.php to regenerate RecentChanges,
and initSiteStats.php to update page and revision counts

I think this resolves the issue... I can't think of a better way of doing things that isn't going to be terribly slow.

I think a new parameter could be added to the importDump.php script to control whether statistics are updated. For importing small dumps (not full dumps, but a list of pages grabbed from somewhere else), it may still be useful to update statistics.