
maintenance/importDump.php fails for wikidatawiki XML incremental dump files
Open, Normal, Public

Description

Dear Sir,

The CLI utility maintenance/importDump.php fails to process XML incremental data dump files for wikidatawiki.

mediawiki version: wmf/1.24wmf8
dataset URL: https://dumps.wikimedia.org/other/incr/wikidatawiki/
datasets tested: wikidatawiki-20140706-pages-meta-hist-incr.xml.bz2, through
wikidatawiki-20140803-pages-meta-hist-incr.xml.bz2

Even after the incremental dump file for 20140706 is split into smaller dump files, each containing a single page, only about one in a hundred of these single-page dump files is processed successfully.


Version: master
Severity: major
Whiteboard: u=dev c=backend p=0
See Also:
https://bugzilla.wikimedia.org/show_bug.cgi?id=63228

Details

Reference
bz70898

Event Timeline

bzimport raised the priority of this task from to Normal.
bzimport set Reference to bz70898.
bzimport added a subscriber: Unknown Object (MLST).

If it 'fails', what is the error? And what are exact steps to reproduce?

To reproduce:

0) set up wiki farm using wikimedia method

See https://www.mediawiki.org/wiki/Manual:Wiki_family#Wikimedia_Method

1) write helper script:

(rootshell)# cat /usr/share/mediawiki/maintenance/importDump_farm.php
<?php
# importDump_farm.php script
# Usage: /usr/bin/php /usr/share/mediawiki/maintenance/importDump_farm.php \
#        zuwiki-20121002-pages-articles-p000001000-c000001000.xml \
#        zu.wikipedia.site
# $argv[1] is the xchunk file-name

$_SERVER['SERVER_NAME'] = $argv[2];
#$_SERVER['DOCUMENT_ROOT'] = $argv[3]; #optional
define( 'IMPORTDUMP_FARM', true);
include('importDump.php');

2) download incremental XML data dump files (xincr)s

(rootshell)# /usr/bin/wget https://dumps.wikimedia.org/other/incr/simplewiki/20140803/simplewiki-20140803-pages-meta-hist-incr.xml.bz2
(rootshell)# /usr/bin/wget https://dumps.wikimedia.org/other/incr/wikidatawiki/20140803/wikidatawiki-20140803-pages-meta-hist-incr.xml.bz2

3) import into database

(rootshell)# /usr/bin/php /usr/share/wp-mirror-mediawiki/maintenance/importDump_farm.php simplewiki-20140803-pages-meta-hist-incr.xml.bz2 simple.wikipedia.site
100 (8.01 pages/sec 12.58 revs/sec)
100 (7.40 pages/sec 11.69 revs/sec)
200 (10.14 pages/sec 15.42 revs/sec)
Done!
You might want to run rebuildrecentchanges.php to regenerate RecentChanges

(rootshell)# /usr/bin/php /usr/share/wp-mirror-mediawiki/maintenance/importDump_farm.php wikidatawiki-20140803-pages-meta-hist-incr.xml.bz2 www.wikidata.site
[4a104de5] [no req] Exception from line 1324 of /usr/share/wp-mirror-mediawiki/extensions/Wikidata/extensions/Wikibase/repo/Wikibase.hooks.php: To avoid ID conflicts, the import of Wikibase entities is currently not supported.
Backtrace:
#0 [internal function]: Wikibase\RepoHooks::onImportHandleRevisionXMLTag(WikiImporter, array, array)
#1 /usr/share/wp-mirror-mediawiki/includes/Hooks.php(206): call_user_func_array(string, array)
#2 /usr/share/wp-mirror-mediawiki/includes/GlobalFunctions.php(4056): Hooks::run(string, array, NULL)
#3 /usr/share/wp-mirror-mediawiki/includes/Import.php(690): wfRunHooks(string, array)
#4 /usr/share/wp-mirror-mediawiki/includes/Import.php(654): WikiImporter->handleRevision(array)
#5 /usr/share/wp-mirror-mediawiki/includes/Import.php(507): WikiImporter->handlePage()
#6 /usr/share/wp-mirror-mediawiki/maintenance/importDump.php(298): WikiImporter->doImport()
#7 /usr/share/wp-mirror-mediawiki/maintenance/importDump.php(256): BackupReader->importFromHandle(resource)
#8 /usr/share/wp-mirror-mediawiki/maintenance/importDump.php(102): BackupReader->importFromFile(string)
#9 /usr/share/wp-mirror-mediawiki/maintenance/doMaintenance.php(109): BackupReader->execute()
#10 /usr/share/wp-mirror-mediawiki/maintenance/importDump.php(303): require_once(string)
#11 /usr/share/wp-mirror-mediawiki/maintenance/importDump_farm.php(12): include(string)
#12 {main}

Hmm, Wikibase entities... adding the mailing list to the CC field.

hoo added a comment. Sep 23 2014, 12:56 PM

We prevent this in Wikibase as importing Wikibase content usually doesn't work because entities are being referred to by entity ids, which probably don't exist or don't contain the wanted content (see bug 63228). That of course doesn't apply in case you have *all* other entities from the Wiki you're importing from (Wikidata) already...

Maybe we want to make it possible to import Wikibase content via shell?

Lydia_Pintscher removed a subscriber: Unknown Object (MLST).
daniel added a subscriber: daniel. Feb 9 2015, 9:02 PM

Please read the actual error message: "To avoid ID conflicts, the import of Wikibase entities is currently not supported." This is not a bug, it's working (or rather, failing) as designed.

That said, we do plan to make importing possible: T85133

Restricted Application added a subscriber: Aklapper. Aug 26 2015, 10:49 PM
wpmirrordev updated the task description. Aug 30 2015, 1:43 PM
wpmirrordev set Security to None.

I repeated the attempt to import wikidatawiki with importDump.php using more recent software and dumps.

mediawiki version: wmf/1.26wmf10
Dumps: from wikidatawiki-20150604-pages-meta-hist-incr.xml.bz2 through wikidatawiki-20150628-pages-meta-hist-incr.xml.bz2 inclusive.

Backtrace now looks like:

[5a85b066] [no req] MWException from line 326 of /usr/share/wp-mirror-mediawiki/includes/content/ContentHandler.php: No handler for model 'wikibase-item' registered in $wgContentHandlers
Backtrace:
#0 /usr/share/wp-mirror-mediawiki/includes/Import.php(1427): ContentHandler::getForModelID(string)
#1 /usr/share/wp-mirror-mediawiki/includes/Import.php(818): WikiRevision->getContentHandler()
#2 /usr/share/wp-mirror-mediawiki/includes/Import.php(793): WikiImporter->processRevision(array, array)
#3 /usr/share/wp-mirror-mediawiki/includes/Import.php(742): WikiImporter->handleRevision(array)
#4 /usr/share/wp-mirror-mediawiki/includes/Import.php(566): WikiImporter->handlePage()
#5 /usr/share/wp-mirror-mediawiki/maintenance/importDump.php(299): WikiImporter->doImport()
#6 /usr/share/wp-mirror-mediawiki/maintenance/importDump.php(257): BackupReader->importFromHandle(resource)
#7 /usr/share/wp-mirror-mediawiki/maintenance/importDump.php(102): BackupReader->importFromFile(string)
#8 /usr/share/wp-mirror-mediawiki/maintenance/doMaintenance.php(103): BackupReader->execute()
#9 /usr/share/wp-mirror-mediawiki/maintenance/importDump.php(304): require_once(string)
#10 /usr/share/wp-mirror-mediawiki/maintenance/importDump_farm.php(12): include(string)
#11 {main}

Can you confirm that you have the Wikibase repository extension installed and enabled? It should show up in Special:Version.

Also, you will have to set allowEntityImport to true in $wgWBRepoSettings, to enable the import of Wikibase entities. Failing to do that would result in a different error message, though.
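
For reference, a minimal sketch of that setting as it might appear in LocalSettings.php. The setting name and the $wgWBRepoSettings array are the ones named above; the file placement, and the assumption that the Wikibase repository extension is already loaded earlier in the file, are illustrative and may differ on a wiki farm setup.

```php
// LocalSettings.php (fragment, illustrative).
// Assumption: the Wikibase repository extension has already been loaded above this point.
// Allows importDump.php to import Wikibase entities; it is off by default to avoid ID conflicts.
$wgWBRepoSettings['allowEntityImport'] = true;
```

Enabling this only makes sense on a wiki that mirrors the source repository, so that the imported entity IDs do not collide with locally created entities.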

I see. When I do install the Wikibase repository extension, I get:

[5e304d8b] [no req] MWException from line 1059 of /usr/share/wp-mirror-mediawiki/extensions/Wikidata/extensions/Wikibase/repo/Wikibase.hooks.php: To avoid ID conflicts, the import of Wikibase entities is not supported. You can enable imports using the allowEntityImport setting.
Backtrace:
#0 [internal function]: Wikibase\RepoHooks::onImportHandleRevisionXMLTag(WikiImporter, array, array)
#1 /usr/share/wp-mirror-mediawiki/includes/Hooks.php(204): call_user_func_array(string, array)
#2 /usr/share/wp-mirror-mediawiki/includes/Import.php(780): Hooks::run(string, array)
#3 /usr/share/wp-mirror-mediawiki/includes/Import.php(742): WikiImporter->handleRevision(array)
#4 /usr/share/wp-mirror-mediawiki/includes/Import.php(566): WikiImporter->handlePage()
#5 /usr/share/wp-mirror-mediawiki/maintenance/importDump.php(299): WikiImporter->doImport()
#6 /usr/share/wp-mirror-mediawiki/maintenance/importDump.php(257): BackupReader->importFromHandle(resource)
#7 /usr/share/wp-mirror-mediawiki/maintenance/importDump.php(102): BackupReader->importFromFile(string)
#8 /usr/share/wp-mirror-mediawiki/maintenance/doMaintenance.php(103): BackupReader->execute()
#9 /usr/share/wp-mirror-mediawiki/maintenance/importDump.php(304): require_once(string)
#10 /usr/share/wp-mirror-mediawiki/maintenance/importDump_farm.php(12): include(string)
#11 {main}

This is similar to what I saw last year.
I shall wait a year and try again.
Thanks.

aude added a comment. Sep 8 2015, 6:46 PM

@wpmirrordev you can set allowEntityImport = true, but then you might run into T111787 and T108544 :/ Once those are fixed, I think this should work.