Page MenuHomePhabricator

importDump.php --uploads crashes on wikitech-static
Closed, ResolvedPublic

Description

importDumps.php works, importDumps.php --uploads does not.

The first sign of trouble is

PHP Warning:  XMLReader::open(): Unable to open source data in /srv/mediawiki/w/includes/libs/mime/XmlTypeCheck.php on line 158

On that line it's trying to read a file in /tmp that doesn't exist.

It's possible that we're passing in a missing image file or otherwise bad input, but even in that case I'd like the script to report and continue rather than error out. Or at the very least report something useful.

Event Timeline

Andrew triaged this task as Medium priority.Oct 2 2018, 7:01 PM
Andrew created this task.
root@wikitech-static-ord:/srv/mediawiki/w# git log -1
commit 8641df974d1a030f7b7fad1e8e45e0f51f9816e1
Author: Kunal Mehta <legoktm@member.fsf.org>
Date:   Tue Sep 25 21:44:50 2018 -0700

    Document removal of CologneBlue & Modern from the tarball
    
    Worth noting as a pre-upgrade thing, as trying to run the updater will fail
    unless it's disabled or downloaded separately.
    
    Reported by <https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=909589>.
    
    Change-Id: I5ea6dc65d683dfceaa25104ca59f2f56c1ab1884
root@wikitech-static-ord:/srv/mediawiki/w# php maintenance/importDump.php --uploads /srv/imports/labswiki-20181003.xml  
100 (278.49 pages/sec 278.49 revs/sec)
200 (328.52 pages/sec 328.52 revs/sec)
upload: Gerrit_branch_permissions_wrong.png
PHP Warning:  XMLReader::open(): Unable to open source data in /srv/mediawiki/w/includes/libs/mime/XmlTypeCheck.php on line 158
[e19cd5cae9135f646799735b] [no req]   Error from line 1447 of /srv/mediawiki/w/includes/filerepo/file/LocalFile.php: Call to a member function getId() on boolean
Backtrace:
#0 /srv/mediawiki/w/includes/filerepo/file/LocalFile.php(1364): LocalFile->recordUpload2(string, string, string, array, string, boolean, array)
#1 /srv/mediawiki/w/includes/import/ImportableUploadRevisionImporter.php(103): LocalFile->upload(string, string, string, integer, array, string, boolean)
#2 /srv/mediawiki/w/includes/import/WikiRevision.php(659): ImportableUploadRevisionImporter->import(WikiRevision)
#3 [internal function]: WikiRevision->importUpload()
#4 /srv/mediawiki/w/includes/libs/rdbms/database/Database.php(3191): call_user_func_array(array, array)
#5 /srv/mediawiki/w/maintenance/importDump.php(218): Wikimedia\Rdbms\Database->deadlockLoop(array)
#6 /srv/mediawiki/w/includes/import/WikiImporter.php(1049): BackupReader->handleUpload(WikiRevision)
#7 /srv/mediawiki/w/includes/import/WikiImporter.php(997): WikiImporter->processUpload(array, array)
#8 /srv/mediawiki/w/includes/import/WikiImporter.php(803): WikiImporter->handleUpload(array)
#9 /srv/mediawiki/w/includes/import/WikiImporter.php(608): WikiImporter->handlePage()
#10 /srv/mediawiki/w/maintenance/importDump.php(345): WikiImporter->doImport()
#11 /srv/mediawiki/w/maintenance/importDump.php(278): BackupReader->importFromHandle(resource)
#12 /srv/mediawiki/w/maintenance/importDump.php(114): BackupReader->importFromFile(string)
#13 /srv/mediawiki/w/maintenance/doMaintenance.php(94): BackupReader->execute()
#14 /srv/mediawiki/w/maintenance/importDump.php(350): require_once(string)
#15 {main}

At the moment, wikitech-static is running 1.31.1 (8641df9)

The first sign of trouble is
PHP Warning: XMLReader::open(): Unable to open source data in /srv/mediawiki/w/includes/libs/mime/XmlTypeCheck.php on line 158
On that line it's trying to read a file in /tmp that doesn't exist.

Judging by the name, the file is being created by ImportableUploadRevisionImporter::downloadSource(), called from ImportableUploadRevisionImporter::import(). That caller uses TempFSFile to automatically delete it when the caller returns.

XmlTypeCheck is being called from MimeAnalyzer, which is being called from MWFileProps, which is being called from LocalFile::upload(). The file must exist at that point, because otherwise the fopen() on MimeAnalyzer.php line 647 would have returned false and so line 651 would have returned.

I don't know what else would make XMLReader::open() complain. Locally and on mwmaint1001 it doesn't seem to complain if passed a png file, but maybe on your system it does? Or maybe it's something special about that particular png file?

Error from line 1447 of /srv/mediawiki/w/includes/filerepo/file/LocalFile.php: Call to a member function getId() on boolean

As you noted, this is happening because the passed $user is false.

I suspect what's going on is that rMW96bd79b4a36a: Avoid DB rows with usable names but ID = 0 by introducing "interwiki" usernames missed updating WikiRevision.php line 789 (which is ImportableUploadRevisionImporter.php line 88 in 8641df9, and line 103 in master). Or else the upload being imported is attributed to an IP. Either way, try changing that line to

$user = $importableRevision->getUserObj() ?: User::newFromName( $importableRevision->getUser(), false );

That change fixes the crash and stack dump! I still get a lot of these though:

PHP Warning:  XMLReader::open(): Unable to open source data in /srv/mediawiki/w/includes/libs/mime/XmlTypeCheck.php on line 158

That seems to happen to everything that isn't a .pdf (which doesn't shock me since I wouldn't expect an XML parser to handle a binary file... I must not understand what it's trying to do though)

On IRC you said it's even happening with SVGs, so there's probably something "interesting" going wrong here. I'd probably start by capturing one of the SVG files (i.e. somehow preventing its deletion, or inserting a copy() in the code somewhere to grab a copy), double-checking it's really a valid SVG, and then using mwrepl or maintenance/shell.php to see whether $xml = new XmlTypeCheck( $filename ); gives the same error or not.

I hacked the import script to leak and log all temp files. Here's a snippet:

upload: Dialog-warning.svg
source:
source: /tmp/downloadHsKUtT
PHP Warning:  XMLReader::open(): Unable to open source data in /srv/mediawiki/w/includes/libs/mime/XmlTypeCheck.php on line 158
PHP Warning:  XMLReader::open(): Unable to open source data in /srv/mediawiki/w/includes/media/SVGMetadataExtractor.php on line 83
PHP Warning:  XMLReader::setParserProperty(): Invalid parser property in /srv/mediawiki/w/includes/media/SVGMetadataExtractor.php on line 95

And yet...

root@wikitech-static-ord:/srv/mediawiki/w# ls -ltrah /tmp/downloadHsKUtT
-rw------- 1 root root 6.5K Oct  3 21:03 /tmp/downloadHsKUtT

root@wikitech-static-ord:/srv/mediawiki/w# php ./maintenance/shell.php 
Psy Shell v0.8.11 (PHP 7.0.30-0+deb9u1 — cli) by Justin Hileman
>>> $xml = new XmlTypeCheck( "/tmp/downloadHsKUtT" );
=> XmlTypeCheck {#310
     +wellFormed: true,
     +filterMatch: false,
     +filterMatchType: false,
     +rootElement: "http://www.w3.org/2000/svg:svg",
     +"filterCallback": null,
   }

@Anomie, any thoughts about what to look for next?

Change 467714 had a related patch set uploaded (by Anomie; owner: Anomie):
[mediawiki/core@master] Update ImportableUploadRevisionImporter for interwiki usernames

https://gerrit.wikimedia.org/r/467714

Unfortunately no, I'm out of ideas. :( It seems like we might have to debug PHP and/or libxml2 itself here.

Change 467714 merged by jenkins-bot:
[mediawiki/core@master] Update ImportableUploadRevisionImporter for interwiki usernames

https://gerrit.wikimedia.org/r/467714

Change 467728 had a related patch set uploaded (by Legoktm; owner: Anomie):
[mediawiki/core@REL1_31] Update ImportableUploadRevisionImporter for interwiki usernames

https://gerrit.wikimedia.org/r/467728

Change 467728 merged by jenkins-bot:
[mediawiki/core@REL1_31] Update ImportableUploadRevisionImporter for interwiki usernames

https://gerrit.wikimedia.org/r/467728

I won't be working on this task in the short/mid term, cleaning assignation.

I see some of these same errors in the latest importDump from yesterday. Example:

[DBQuery] SELECT  page_id,page_namespace,page_title,page_restrictions,page_is_redirect,page_is_new,page_random,page_touched,page_links_updated,page_latest,page_len,page_content_model  FROM 
`page`    WHERE page_namespace = '6' AND page_title = 'Gerrit_branch_permissions_wrong.png'  LIMIT 1  
Title::getRestrictionTypes: applicable restrictions to [[File:Gerrit branch permissions wrong.png]] are {edit,move,upload}
[DBQuery] SELECT  1  FROM `revision`    WHERE rev_page = '295' AND rev_timestamp = '20120314152320' AND rev_sha1 = 'hnt7uqusr9t3cey0jc2wfv14embrlhp'  LIMIT 1  
[OldRevisionImporter] ImportableOldRevisionImporter::import: skipping existing revision for [[File:Gerrit branch permissions wrong.png]], timestamp 20120314152320

[DBQuery] BEGIN
[DBQuery] SELECT  img_name,img_size,img_width,img_height,img_metadata,img_bits,img_media_type,img_major_mime,img_minor_mime,img_timestamp,img_sha1,comment_img_description.comment_text AS `img_description_text`,comment_img_description.comment_data AS `img_description_data`,comment_img_description.comment_id AS `img_description_cid`,actor_img_user.actor_user AS `img_user`,actor
_img_user.actor_name AS `img_user_text`,img_actor,img_metadata  FROM `image` JOIN `comment` `comment_img_description` ON ((comment_img_description.comment_id = img_description_id)) JOIN `actor` `actor_img_user` ON ((actor_img_user.actor_id = img_actor))   WHERE img_name = 'Gerrit_branch_permissions_wrong.png'  LIMIT 1  
[UploadRevisionImporter] ImportableUploadRevisionImporter::importImporting new file as Gerrit_branch_permissions_wrong.png

[http] GET: https://upload.wikimedia.org/wikipedia/labs/6/6b/Gerrit_branch_permissions_wrong.png
[Mime] MimeAnalyzer::loadFiles: loading mime types from /srv/mediawiki/w/includes/libs/mime/mime.types

[Mime] MimeAnalyzer::loadFiles: loading mime info from /srv/mediawiki/w/includes/libs/mime/mime.info

[Mime] MimeAnalyzer::doGuessMimeType: analyzing head and tail of /tmp/downloadS1LAcm for magic numbers.

[error] [d9cc131c7ffab2ed16ab0448] [no req]   ErrorException from line 158 of /srv/mediawiki/w/includes/libs/mime/XmlTypeCheck.php: PHP Warning: XMLReader::open(): Unable to open source data
#0 [internal function]: MWExceptionHandler::handleError(integer, string, string, integer, array)
#1 /srv/mediawiki/w/includes/libs/mime/XmlTypeCheck.php(158): XMLReader->open(string, NULL, integer)
#2 /srv/mediawiki/w/includes/libs/mime/XmlTypeCheck.php(110): XmlTypeCheck->validateFromInput(string, boolean)
#3 /srv/mediawiki/w/includes/libs/mime/MimeAnalyzer.php(758): XmlTypeCheck->__construct(string)
#4 /srv/mediawiki/w/includes/libs/mime/MimeAnalyzer.php(608): MimeAnalyzer->doGuessMimeType(string, boolean)
#5 /srv/mediawiki/w/includes/utils/MWFileProps.php(73): MimeAnalyzer->guessMimeType(string, boolean)
#6 /srv/mediawiki/w/includes/filerepo/file/LocalFile.php(1333): MWFileProps->getPropsFromPath(string, boolean)
#7 /srv/mediawiki/w/includes/import/ImportableUploadRevisionImporter.php(125): LocalFile->upload(string, string, string, integer, boolean, string, User, array, boolean)
#8 /srv/mediawiki/w/includes/import/WikiRevision.php(652): ImportableUploadRevisionImporter->import(WikiRevision)
#9 /srv/mediawiki/w/includes/libs/rdbms/database/Database.php(3377): WikiRevision->importUpload()
#10 /srv/mediawiki/w/maintenance/importDump.php(220): Wikimedia\Rdbms\Database->deadlockLoop(array)
#11 /srv/mediawiki/w/includes/import/WikiImporter.php(1045): BackupReader->handleUpload(WikiRevision)
#12 /srv/mediawiki/w/includes/import/WikiImporter.php(993): WikiImporter->processUpload(array, array)
#13 /srv/mediawiki/w/includes/import/WikiImporter.php(802): WikiImporter->handleUpload(array)
#14 /srv/mediawiki/w/includes/import/WikiImporter.php(607): WikiImporter->handlePage()
#15 /srv/mediawiki/w/maintenance/importDump.php(347): WikiImporter->doImport()
#16 /srv/mediawiki/w/maintenance/importDump.php(280): BackupReader->importFromHandle(resource)
#17 /srv/mediawiki/w/maintenance/importDump.php(116): BackupReader->importFromFile(string)
#18 /srv/mediawiki/w/maintenance/doMaintenance.php(96): BackupReader->execute()
#19 /srv/mediawiki/w/maintenance/importDump.php(352): require_once(string)
#20 {main}
[Mime] MimeAnalyzer::doGuessMimeType: getimagesize detected /tmp/downloadS1LAcm as image/png

[Mime] MimeAnalyzer::guessMimeType: guessed mime type of /tmp/downloadS1LAcm: image/png

[Mime] MimeAnalyzer::improveTypeFromExtension: improved mime type for .: image/png

[DBQuery] startAtomic: entering level 0 (LocalFile::lockingTransaction)
[FileOperation] FileBackendStore::getFileStat: File mwstore://local-backend/local-public/archive/6/6b/20190607040114!Gerrit_branch_permissions_wrong.png does not exist
[DBQuery] INSERT IGNORE INTO `actor` (actor_user,actor_name) VALUES (NULL,'Imported>Saper')
[DBQuery] SELECT  actor_id  FROM `actor`    WHERE actor_user IS NULL AND actor_name = 'Imported>Saper'  LIMIT 1   LOCK IN SHARE MODE
[DBQuery] startAtomic: entering level 1 (LocalFile::recordUpload2)
[DBQuery] SELECT  comment_id  FROM `comment`    WHERE comment_hash = '1280672277' AND comment_text = 'Shouldn\'t happen?' AND comment_data IS NULL  LIMIT 1  
[DBQuery] INSERT IGNORE INTO `image` (img_name,img_size,img_width,img_height,img_bits,img_media_type,img_major_mime,img_minor_mime,img_timestamp,img_metadata,img_sha1,img_description_id,img
_actor) VALUES ('Gerrit_branch_permissions_wrong.png','37851','461','305','8','BITMAP','image','png','20120314152320','a:6:{s:10:\"frameCount\";i:0;s:9:\"loopCount\";i:1;s:8:\"duration\";d:0;s:8:\"bitDepth\";i:8;s:9:\"colorType\";s:10:\"truecolour\";s:8:\"metadata\";a:4:{s:11:\"XResolution\";s:8:\"2835/100\";s:11:\"YResolution\";s:8:\"2835/100\";s:14:\"ResolutionUnit\";i:3;s:15:\"_MW_PNG_VERSION\";i:1;}}','ko3uiqjdnppvhv640fe7m7swa50l4cv','157','1567')
[DBQuery] SELECT  img_timestamp,img_sha1  FROM `image`    WHERE img_name = 'Gerrit_branch_permissions_wrong.png'  LIMIT 1   LOCK IN SHARE MODE
[DBQuery] endAtomic: leaving level 1 (LocalFile::recordUpload2)
LocalFile::recordUpload2: File 6/6b/Gerrit_branch_permissions_wrong.png already exists!
[DBQuery] endAtomic: leaving level 0 (LocalFile::lockingTransaction)
Unstubbing $wgParser on call of $wgParser::firstCallInit from MessageCache->parse
Parser: using preprocessor: Preprocessor_DOM
Unstubbing $wgLang on call of $wgLang::_unstub from ParserOptions->__construct
[GlobalTitleFail] MessageCache::parse called by BackupReader->handleUpload/Wikimedia\Rdbms\Database->deadlockLoop/WikiRevision->importUpload/ImportableUploadRevisionImporter->import/Status-
>getHTML/MessageCache->parse with no title set.
[DBQuery] SELECT  img_name,img_size,img_width,img_height,img_metadata,img_bits,img_media_type,img_major_mime,img_minor_mime,img_timestamp,img_sha1,comment_img_description.comment_text AS `img_description_text`,comment_img_description.comment_data AS `img_description_data`,comment_img_description.comment_id AS `img_description_cid`,actor_img_user.actor_user AS `img_user`,actor_img_user.actor_name AS `img_user_text`,img_actor,img_metadata  FROM `image` JOIN `comment` `comment_img_description` ON ((comment_img_description.comment_id = img_description_id)) JOIN `ac
tor` `actor_img_user` ON ((actor_img_user.actor_id = img_actor))   WHERE img_name = 'Gerrit_branch_permissions_wrong.png'  LIMIT 1  
[objectcache] Rejected set() for wikitech:file:6cac9943ce8f59d9c11ba919f9a65257f3104454 due to pending writes.
[DBQuery] SELECT  rd_namespace,rd_title  FROM `page`,`redirect`    WHERE page_namespace = '6' AND page_title = 'Gerrit_branch_permissions_wrong.png' AND (rd_from = page_id)  LIMIT 1  
[objectcache] Rejected set() for wikitech:image_redirect:6b6a2e437121de4f70866346c6d6e113 due to pending writes.
ForeignAPIRepo: HTTP GET: https://commons.wikimedia.org/w/api.php?titles=File%3AGerrit_branch_permissions_wrong.png&iiprop=timestamp%7Cuser%7Ccomment%7Curl%7Csize%7Csha1%7Cmetadata%7Cmime%7
Cmediatype%7Cextmetadata&prop=imageinfo&iimetadataversion=2&iiextmetadatamultilang=1&format=json&action=query&redirects=true&uselang=en
[UploadRevisionImporter] ImportableUploadRevisionImporter::import: failed: <p>The upload is an exact duplicate of the current version of <strong><a href="/wiki/File:Gerrit_branch_permissions_wrong.png" title="File:Gerrit branch permissions wrong.png">File:Gerrit branch permissions wrong.png</a></strong>.
</p>

But I see also that the file exists: https://wikitech-static.wikimedia.org/wiki/File:Gerrit_branch_permissions_wrong.png
Here's a more recent one that made it in: https://wikitech-static.wikimedia.org/wiki/File:RPKI_validation_infra.png

I rather guess that the rebuildImages.ph, followed by retrieval of the images themselves via https://gerrit.wikimedia.org/r/plugins/gitiles/operations/wikitech-static/+/master/wikitechsync/get_images.py and another rebuildImages.php does the trick. (See https://gerrit.wikimedia.org/r/plugins/gitiles/operations/wikitech-static/+/master/wikitechsync/import-wikitech.sh for the cron job that calls all these.)

Should we leave this open? Is it worth debugging the underlying issue or should we just ignore?

It'd be nice to figure out why XMLReader::open() seems unable to open files that seem like they do in fact exist. Is there some sort of multiprocess work going on and the process doing the open() is racing with the process writing the file? Or is it possible multiple dump processes are trying to use the same temporary file names somehow and stomping on each other? But I don't know whether it's worth the time/effort if we have a workaround.

I'll have a look this week then.

Found this. And it was silly. Just like Camelot.

It's gotten a copy of the image and stashed it in a temp dir, then it's copied (moved? whatever) that into a nice archive name (that's another issue we need to fix in the workflow but later). And now it's going to get file properties so it can fill out the row in the oldimage (in our case) table with yummy things like bit depth, color type and stuff.

So here's how it gets properties.

  • Does it have one of these magic headers?
  • Is it WebM?
  • Maybe it's php?
  • Oh, let me try opening it as XML in case it's an svg.

Right here the XMLReader method fails, naturally, to open the blasted thing because it's a blob of binary crap, as png files are.
And so it throws a warning, which is treated as fatal on the wikitech-static setup, and there we have it.

The fix is two little lines, patch incoming shortly. But it's against master. I've live-hacked the one file on wikitech-static in the meantime.

Change 516787 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[mediawiki/core@master] when getting file (img) properties, suppress whines that it's not xml

https://gerrit.wikimedia.org/r/516787

Testing was done using a tiny xml file, running:

root@wikitech-static:/srv/mediawiki/w# rm /var/log/debug-wikitech.log
root@wikitech-static:/srv/mediawiki/w# rm /srv/mediawiki/images/wikitech/archive/4/4c/20130119210037\!20130119-2158-PuTTY_Configuration.png
root@wikitech-static:/srv/mediawiki/w# php maintenance/importDump.php --uploads /root/testimport.xml

and looking at the log.

Change 516787 merged by jenkins-bot:
[mediawiki/core@master] when getting file (img) properties, suppress whines that it's not xml

https://gerrit.wikimedia.org/r/516787

Reedy assigned this task to ArielGlenn.

Change 550040 had a related patch set uploaded (by Reedy; owner: ArielGlenn):
[mediawiki/core@REL1_33] when getting file (img) properties, suppress whines that it's not xml

https://gerrit.wikimedia.org/r/550040

Change 550041 had a related patch set uploaded (by Reedy; owner: ArielGlenn):
[mediawiki/core@REL1_32] when getting file (img) properties, suppress whines that it's not xml

https://gerrit.wikimedia.org/r/550041

Change 550042 had a related patch set uploaded (by Reedy; owner: ArielGlenn):
[mediawiki/core@REL1_31] when getting file (img) properties, suppress whines that it's not xml

https://gerrit.wikimedia.org/r/550042

Change 550042 merged by jenkins-bot:
[mediawiki/core@REL1_31] when getting file (img) properties, suppress whines that it's not xml

https://gerrit.wikimedia.org/r/550042

Change 550041 merged by jenkins-bot:
[mediawiki/core@REL1_32] when getting file (img) properties, suppress whines that it's not xml

https://gerrit.wikimedia.org/r/550041

Change 550040 merged by jenkins-bot:
[mediawiki/core@REL1_33] when getting file (img) properties, suppress whines that it's not xml

https://gerrit.wikimedia.org/r/550040