Page MenuHomePhabricator

XMP errors when using FileImporter on jpeg images
Closed, ResolvedPublic

Description

This error seems to be happening c. 10 times per day.

For example, with this file: https://en.wikisource.org/wiki/File:Zika-Virus-Medical-Countermeasure-Development-Challenges-pntd.0004530.t002.jpg

#0 /srv/mediawiki/php-1.36.0-wmf.14/includes/media/BitmapMetadataHandler.php(179): Wikimedia\XMPReader\Reader->parse(string)
#1 /srv/mediawiki/php-1.36.0-wmf.14/includes/media/JpegHandler.php(105): BitmapMetadataHandler::Jpeg(string)
#2 /srv/mediawiki/php-1.36.0-wmf.14/includes/media/ExifBitmapHandler.php(181): JpegHandler->getMetadata(FSFile, string)
#3 /srv/mediawiki/php-1.36.0-wmf.14/includes/utils/MWFileProps.php(89): ExifBitmapHandler->getImageSize(FSFile, string, string)
#4 /srv/mediawiki/php-1.36.0-wmf.14/includes/filerepo/file/LocalFile.php(1393): MWFileProps->getPropsFromPath(string, string)
#5 /srv/mediawiki/php-1.36.0-wmf.14/includes/import/ImportableUploadRevisionImporter.php(127): LocalFile->upload(string, string, string, integer, boolean, string, User, array, boolean)
#6 /srv/mediawiki/php-1.36.0-wmf.14/extensions/FileImporter/src/Operations/FileRevisionFromRemoteUrl.php(197): ImportableUploadRevisionImporter->import(WikiRevision)
#7 /srv/mediawiki/php-1.36.0-wmf.14/extensions/FileImporter/src/Data/ImportOperations.php(133): FileImporter\Operations\FileRevisionFromRemoteUrl->commit()
#8 /srv/mediawiki/php-1.36.0-wmf.14/extensions/FileImporter/src/Data/ImportOperations.php(82): FileImporter\Data\ImportOperations->FileImporter\Data\{closure}(FileImporter\Operations\FileRevisionFromRemoteUrl)
#9 /srv/mediawiki/php-1.36.0-wmf.14/extensions/FileImporter/src/Data/ImportOperations.php(134): FileImporter\Data\ImportOperations->runOperations(integer, integer, boolean, Closure)
#10 /srv/mediawiki/php-1.36.0-wmf.14/extensions/FileImporter/src/Services/Importer.php(310): FileImporter\Data\ImportOperations->commit()
#11 /srv/mediawiki/php-1.36.0-wmf.14/extensions/FileImporter/src/Services/Importer.php(174): FileImporter\Services\Importer->commitImportOperations(FileImporter\Data\ImportOperations)
#12 /srv/mediawiki/php-1.36.0-wmf.14/extensions/FileImporter/src/SpecialImportFile.php(335): FileImporter\Services\Importer->import(User, FileImporter\Data\ImportPlan)
#13 /srv/mediawiki/php-1.36.0-wmf.14/extensions/FileImporter/src/SpecialImportFile.php(236): FileImporter\SpecialImportFile->doImport(FileImporter\Data\ImportPlan)
#14 /srv/mediawiki/php-1.36.0-wmf.14/extensions/FileImporter/src/SpecialImportFile.php(201): FileImporter\SpecialImportFile->handleAction(string, FileImporter\Data\ImportPlan)
#15 /srv/mediawiki/php-1.36.0-wmf.14/includes/specialpage/SpecialPage.php(600): FileImporter\SpecialImportFile->execute(NULL)
#16 /srv/mediawiki/php-1.36.0-wmf.14/includes/specialpage/SpecialPageFactory.php(1018): SpecialPage->run(NULL)
#17 /srv/mediawiki/php-1.36.0-wmf.14/includes/MediaWiki.php(307): MediaWiki\SpecialPage\SpecialPageFactory->executePath(Title, RequestContext)
#18 /srv/mediawiki/php-1.36.0-wmf.14/includes/MediaWiki.php(940): MediaWiki->performRequest()
#19 /srv/mediawiki/php-1.36.0-wmf.14/includes/MediaWiki.php(543): MediaWiki->main()
#20 /srv/mediawiki/php-1.36.0-wmf.14/index.php(53): MediaWiki->run()
#21 /srv/mediawiki/php-1.36.0-wmf.14/index.php(46): wfIndexMain()
#22 /srv/mediawiki/w/index.php(3): require(string)
#23 {main}

Discovered during T243220: Create logstash monitoring dashboard for Technical Wishes products.

Event Timeline

thiemowmde subscribed.

The fascinating piece we can see at https://logstash.wikimedia.org is that the method \Wikimedia\XMPReader\Reader::parse() is called with something that – at first glance – looks like it should be parseable:

<?xpacket begin='' id='W5M0MpCehiHzreSzNTczkc9d'?
      <x:xmpmeta xmlns:x='adobe:ns:meta/' x:xmptk='Image::ExifTool 8.60'>
      <rdf:RDF xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'>
      <rdf:Description rdf:about=''
       xmlns:dc='http://purl.org/dc/elements/1.1/'>
        <dc:date>
         <rdf:Seq>
         <rdf:li>2016-3-2</rdf:li>
         </rdf:Seq>
        </dc:date>
        <dc:description>
         <rdf:Alt>
          <rdf:li xml:lang='x-default'></rdf:li>
         </rdf:Alt>
        </dc:description>
        <dc:identifier>10.1371/journal.pntd.0004530.t001</dc:identifier>
        <dc:publisher>
         <rdf:Bag>
         <rdf:li>Public Library of Science</rdf:li>
         </rdf:Bag>
        </dc:publisher>
        <dc:title>
         <rdf:Alt>
          <rdf:li xml:lang='x-default'>Table 1</rdf:li>
         </rdf:Alt>
        </dc:title>
        <dc:rights>
         <rdf:Alt>
          <rdf:li xml:lang='x-default'>Creative Commons Attribution License</rdf:li>
         </rdf:Alt>
        </dc:rights>
       </rdf:Description>
       <rdf:Description rdf:about=''
        xmlns:photoshop='http://ns.adobe.com/photoshop/1.0/'>
        <photoshop:Source>10.1371/journal.pntd.0004530</photoshop:Source>
       </rdf:Description>
      </rdf:RDF>
      </x:xmpmeta>
      <?xpacket end='w'?>

Note the first line doesn't end with ?>, as it should. This is actually in the file.

Possible ways forward:

  1. Add a magic auto-fix to XMPReader that detects such cases and adds the missing > back.
  2. Change \BitmapMetadataHandler::Jpeg() to skip failing parse attempts, instead of letting the exception bubble up.
  3. Change \MWFileProps::getPropsFromPath() to catch exceptions when it calls getMetadata(), and leave the metadata element empty. Code calling getPropsFromPath() can't and shouldn't expect this field to contain something. I checked all callers, and the vast majority does not even use the metadata field. The few that use metadata are fine with it being empty.

There is nothing we can do in the FileImporter codebase, as far as I can see.

The worst effects I can imagine:

  • The database field img_metadata might be empty or miss fields. But this is expected, as not all files contain XMP, and secondary storage anyway.
  • Some code calls \MediaHandler::getContentHeaders() with the unserialized metadata array. But this method doesn't do anything, at the moment, and needs to expect the array to be empty anyway.
  • We will allow the upload of technically broken .jpg files that could potentially contain anything in the XMP segment. Is this relevant in terms of Security? However, note that this check is optional anyway, and depends on function_exists( 'xml_parser_create_ns' ) && class_exists( 'XMLReader' ).

Personally, I would implement both the suggestions 1 and 3.

It looks like there is another FileImporter-related error happening in XMPReader:

#0 [internal function]: Wikimedia\XMPReader\Reader->char(resource, string)
#1 /srv/mediawiki/php-1.36.0-wmf.16/vendor/wikimedia/xmp-reader/src/Reader.php(373): xml_parse(resource, string, boolean)
#2 /srv/mediawiki/php-1.36.0-wmf.16/includes/media/BitmapMetadataHandler.php(214): Wikimedia\XMPReader\Reader->parse(string)
#3 /srv/mediawiki/php-1.36.0-wmf.16/includes/media/PNGHandler.php(39): BitmapMetadataHandler::PNG(string)
#4 /srv/mediawiki/php-1.36.0-wmf.16/includes/utils/MWFileProps.php(87): PNGHandler->getMetadata(FSFile, string)
#5 /srv/mediawiki/php-1.36.0-wmf.16/includes/filerepo/file/LocalFile.php(1393): MWFileProps->getPropsFromPath(string, string)
#6 /srv/mediawiki/php-1.36.0-wmf.16/includes/import/ImportableUploadRevisionImporter.php(127): LocalFile->upload(string, string, string, integer, boolean, string, User, array, boolean)
#7 /srv/mediawiki/php-1.36.0-wmf.16/extensions/FileImporter/src/Operations/FileRevisionFromRemoteUrl.php(197): ImportableUploadRevisionImporter->import(WikiRevision)
…

Similar to above:

  • In BitmapMetadataHandler::PNG() is a parse() call. We could wrap this in a try-catch and just not extract metadata if they can't be parsed, but not mark the entire file as "broken" just because of this.
  • There is a third one in BitmapMetadataHandler::GIF().

The relevant XML is:

<x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="XMP Core 5.4.0">
   <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
      <rdf:Description rdf:about=""
            xmlns:exif="http://ns.adobe.com/exif/1.0/">
         <exif:UserComment>Screenshot</exif:UserComment>
      </rdf:Description>
   </rdf:RDF>
</x:xmpmeta>

Just from looking at this I can't tell what is causing the error.

When the XMP is user input and already broken in the file it should be the best to ignore it when doing anything with it. Broken exifs also ignored and should not trigger issues on the server which are not fixable, because files are not changed after upload. Rejecting broken XMP on upload would be another way, but than the existing files are still there and needs a script to be ? deleted ? fixed ?

The magic fix is only needed when Photoshop or GIMP can understand the "broken" files already.

The patch https://gerrit.wikimedia.org/r/693298 (T275268) touched some of the relevant try-catch. The issue might be resolved with this change.

WMDE-Fisch subscribed.

Just looked into that briefly today and we now have about 6-7 distinct cases over 3 months where this type of error shows up in the logs. On the other hand we have >10k successful imports during that time frame. I guess we could consider the main issue to be resolved for the time being.