
Certain TIFF files get no thumbnail and misidentified dimensions ("0 × 0 pixels") and number of pages (0 pages)
Open, Needs Triage · Public · BUG REPORT

Description

Hello!

I encountered a few TIFF files on Commons for which the software generates no thumbnail. Their width and height are also identified as 0 × 0 and their number of pages as 0. When you click through to the files themselves, however, they display as expected, like any ordinary file. The files are at https://commons.wikimedia.org/wiki/User:Jonteemil/sandbox3. Discussion has also taken place at https://commons.wikimedia.org/wiki/User_talk:Fæ#Empty_files.

Event Timeline

Aklapper renamed this task from Certain TIFF files get no thumbnail and misidentified dimensions and page amount to Certain TIFF files get no thumbnail and misidentified dimensions ("0 × 0 pixels") and number of pages (0 pages).Sep 7 2021, 7:46 AM

@Jonteemil: Thanks for reporting this. In the future, please use the bug report form (linked from the top of the task creation page) to create bug reports. Thanks.

Do you see any pattern, for example regarding file or pixel size?

Aklapper changed the subtype of this task from "Task" to "Bug Report".Sep 7 2021, 7:52 AM

@Aklapper I'll try to remember to use the bug report form in the future. I can't seem to find any pattern, apart from the fact that 99% of the files were uploaded by only two users. I wasn't sure about creating a task for this since I didn't know whether the problem lay with MediaWiki or the files themselves, but @AntiCompositeNumber recommended a task be created so that someone with Logstash access could check whether there are any useful error logs.

The pattern seems to be files that are big in size and/or resolution, in TIFF format. We have other TIFF files that are equally large in dimensions and file size, so it is not necessarily a hard limit there, but there could be a bug in the rendering library or in metadata handling, or we could be hitting some memory limit or timeout when extracting structured data. They also seem to be old files, so I wouldn't be surprised if it is an old bug, since solved with stricter limits, whose errors were kept. I also wouldn't be surprised if TIFF libraries don't handle edge cases as well as PNG or JPEG ones.

Edit: Not all file sizes are "large" in bytes- this is relatively small but large in resolution: https://commons.wikimedia.org/wiki/File:Hacienda_Azucarera_La_Concepcion,_Sugar_Mill_Ruins,_.3_Mi._W._of_Junction_of_Rts._418_and_111,_Victoria,_Aguadilla_Municipio,_PR_HAER_PR,11-VICT,1A-_(sheet_3_of_5).tiff

A bit off-topic: one thing I have observed lately is that many TIFF files do not use compression, taking hundreds of megabytes in cases where a few dozen would yield the same lossless image. I wonder if we should recommend compression in the upload manual; it would make life easier for the uploader, the processing software, and downloaders/reusers.

Not a Thumbor problem; thumbnailing works fine. @Aklapper, please test whether thumbnail generation is actually affected before tagging Thumbor, by rewriting the URL as follows:

upload.wikimedia.org/wikipedia/commons/a/ab/filename.ext -> upload.wikimedia.org/wikipedia/commons/thumb/a/ab/filename.ext/220px-filename.ext.jpg

(for TIFF and WebP add .jpg or .png to the thumbnail filename, for SVG add .png, for .png and .jpg don't add an extension)
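The rewrite above can be sketched as a small helper (a hypothetical illustration, not MediaWiki code; the extension rules follow the parenthetical above, picking .jpg for TIFF/WebP):

```python
# Hypothetical helper illustrating the thumbnail URL rewrite described above.
# Not MediaWiki code; the extension-appending rules are taken from the comment,
# choosing .jpg for TIFF/WebP (the comment allows .jpg or .png).
def thumb_url(orig_url: str, width: int = 220) -> str:
    """Rewrite an original upload.wikimedia.org URL into a thumbnail URL."""
    prefix = "upload.wikimedia.org/wikipedia/commons/"
    assert orig_url.startswith(prefix)
    hashed_path = orig_url[len(prefix):]           # e.g. "a/ab/filename.ext"
    filename = hashed_path.rsplit("/", 1)[1]
    ext = filename.rsplit(".", 1)[1].lower()
    # TIFF and WebP thumbnails are served as JPEG, SVG as PNG;
    # PNG and JPEG keep their own extension.
    suffix = {"tif": ".jpg", "tiff": ".jpg", "webp": ".jpg", "svg": ".png"}.get(ext, "")
    return f"{prefix}thumb/{hashed_path}/{width}px-{filename}{suffix}"
```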


When I tested this locally, I couldn't get it to break. Core MediaWiki uses the PHP exif parser for TIFF metadata (including size), but we have MediaWiki-extensions-PagedTiffHandler installed which gets image size a different way. By default it uses ImageMagick identify, which did try to use more memory than allowed by the limits included in the Debian package (as expected). For this reason, we use tiffinfo instead. It had no trouble reading the files. We've been using tiffinfo since before the initial commit of CommonSettings.php in 2012, bar the 18 months of T240455. The maximum TIFF metadata size has been 1 MiB since 2012 as well, but since these are photographic TIFFs (no giant text layer) and many aren't or are barely above 1 MiB anyway, I don't think it's the issue.


We can write whatever we want in the guidelines, but most TIFFs are not produced by the uploaders. They come from sources like the Library of Congress or the Internet Archive (who may get them from yet another source). Typically no changes are made between downloading from the source and uploading to Commons. While it would be nice to have all the image data compressed, even losslessly, I don't think it'll happen.

While it would be nice to have all the image data compressed

It was a side thought, I proposed it on wiki, but it should be independent of this.

The maximum TIFF metadata size has been 1 MiB since 2012.

Note that some of the reported files are very old (relative to MediaWiki); I saw some from 2008. So it could be that the issue has long been fixed but the error was kept. The only thing I can think of is to try to re-upload a few of them to see if we can reproduce it, or otherwise force a "reparsing" of metadata manually from the list. What do you think?

If there's a way to reset the img_metadata without re-uploading files

I believe https://www.mediawiki.org/wiki/Manual:RefreshImageMetadata.php will do exactly that. Could you do a test run on an intentionally broken image on a test installation (I think you mentioned having a local test site) to verify it works as expected? Some of those maintenance scripts are run very infrequently and not all are well maintained; I would prefer to be sure it wouldn't make things worse before running it in production :-).

I tested by doing the following:
My test setup is mediawiki-docker, with a mariadb database. PagedTiffHandler is installed, with the following configuration:

$wgDBprefix = "mw_";
wfLoadExtension( 'PagedTiffHandler' );
$wgTiffUseTiffinfo = true;
$wgTiffMaxMetaSize = 1048576;

I uploaded https://commons.wikimedia.org/wiki/File:Hacienda_Azucarera_La_Concepcion,_Sugar_Mill_Ruins,_.3_Mi._W._of_Junction_of_Rts._418_and_111,_Victoria,_Aguadilla_Municipio,_PR_HAER_PR,11-VICT,1A-_(sheet_4_of_5).tiff and, as expected, the metadata was read fine by tiffinfo. I also uploaded another non-broken TIFF and a PNG.

To break the file, I ran the following SQL commands to overwrite img_metadata with the error data from https://quarry.wmcloud.org/query/58620.

$ php maintenance/sql.php
> SELECT img_metadata FROM mw_image WHERE img_name = "Hacienda_Azucarera_La_Concepcion,_Sugar_Mill_Ruins,_.3_Mi._W._of_Junction_of_Rts._418_and_111,_Victoria,_Aguadilla_Municipio,_PR_HAER_PR,11-VICT,1A-_(sheet_4_of_5).tiff";
stdClass Object
(
    [img_metadata] => a:6:{s:9:"page_data";a:1:{i:1;a:5:{s:5:"width";i:9632;s:6:"height";i:14452;s:4:"page";i:1;s:5:"alpha";s:5:"false";s:6:"pixels";i:139201664;}}s:10:"page_count";i:1;s:10:"first_page";i:1;s:9:"last_page";i:1;s:4:"exif";a:15:{s:10:"ImageWidth";i:9632;s:11:"ImageLength";i:14452;s:13:"BitsPerSample";i:1;s:11:"Compression";i:4;s:25:"PhotometricInterpretation";i:0;s:11:"Orientation";i:1;s:15:"SamplesPerPixel";i:1;s:12:"RowsPerStrip";i:6;s:11:"XResolution";s:17:"838860800/2097152";s:11:"YResolution";s:17:"838860800/2097152";s:19:"PlanarConfiguration";i:1;s:14:"ResolutionUnit";i:2;s:8:"DateTime";s:19:"2000:10:26 10:30:01";s:6:"Artist";s:19:"Library of Congress";s:22:"MEDIAWIKI_EXIF_VERSION";i:2;}s:21:"TIFF_METADATA_VERSION";s:3:"1.4";}
)

> START TRANSACTION;
Query OK, 0 row(s) affected
> UPDATE mw_image SET img_metadata = "a:1:{s:6:\"errors\";a:1:{i:0;s:166:\"identify command failed: '/usr/bin/identify' -format \"[BEGIN]page=%p\\nalpha=%A\\nalpha2=%r\\nheight=%h\\nwidth=%w\\ndepth=%z[END]\" '/tmp/localcopy_991b2775dbb7.tiff' 2>&1\";}}" WHERE img_name = "Hacienda_Azucarera_La_Concepcion,_Sugar_Mill_Ruins,_.3_Mi._W._of_Junction_of_Rts._418_and_111,_Victoria,_Aguadilla_Municipio,_PR_HAER_PR,11-VICT,1A-_(sheet_4_of_5).tiff";
Query OK, 1 row(s) affected
> COMMIT;
Query OK, 0 row(s) affected
> SELECT * FROM mw_image WHERE img_name = "Hacienda_Azucarera_La_Concepcion,_Sugar_Mill_Ruins,_.3_Mi._W._of_Junction_of_Rts._418_and_111,_Victoria,_Aguadilla_Municipio,_PR_HAER_PR,11-VICT,1A-_(sheet_4_of_5).tiff";
stdClass Object
(
    [img_name] => Hacienda_Azucarera_La_Concepcion,_Sugar_Mill_Ruins,_.3_Mi._W._of_Junction_of_Rts._418_and_111,_Victoria,_Aguadilla_Municipio,_PR_HAER_PR,11-VICT,1A-_(sheet_4_of_5).tiff
    [img_size] => 1088628
    [img_width] => 9632
    [img_height] => 14452
    [img_metadata] => a:1:{s:6:"errors";a:1:{i:0;s:166:"identify command failed: '/usr/bin/identify' -format "[BEGIN]page=%p\nalpha=%A\nalpha2=%r\nheight=%h\nwidth=%w\ndepth=%z[END]" '/tmp/localcopy_991b2775dbb7.tiff' 2>&1";}}
    [img_bits] => 0
    [img_media_type] => BITMAP
    [img_major_mime] => image
    [img_minor_mime] => tiff
    [img_description_id] => 1
    [img_actor] => 1
    [img_timestamp] => 20210915160957
    [img_sha1] => mfu1pqc20dg9cz8izurkut1e2agaf0g
)

The file broke in the expected manner, with fileicon.png replacing the thumbnail and the summary reporting 0x0 px, 0 pages.

Now onto fixing it. First I tried

$ php maintenance/refreshImageMetadata.php --mediatype BITMAP --mime image/tiff --broken-only --verbose
Processing next 2 row(s) starting with Hacienda_Azucarera_La_Concepcion,_Sugar_Mill_Ruins,_.3_Mi._W._of_Junction_of_Rts._418_and_111,_Victoria,_Aguadilla_Municipio,_PR_HAER_PR,11-VICT,1A-_(sheet_4_of_5).tiff.
Skipping File:Hacienda_Azucarera_La_Concepcion,_Sugar_Mill_Ruins,_.3_Mi._W._of_Junction_of_Rts._418_and_111,_Victoria,_Aguadilla_Municipio,_PR_HAER_PR,11-VICT,1A-_(sheet_4_of_5).tiff.
Skipping File:Newport_Steam_Factory,_449_Thames_Street,_Newport,_Newport_County,_RI_HABS_RI,3-NEWP,75-_(sheet_2_of_4).tif.

Finished refreshing file metadata for 2 files. 0 were refreshed, 2 were already up to date, and 0 refreshes were suspicious.

That did a whole lot of nothing. Both core and PagedTiffHandler willfully ignore errors in metadata so they don't have to keep checking if it's actually still broken. That means we'll have to run it with --force instead. The best way to specify the files appears to be --metadata-contains "command failed". That should produce a query equivalent to https://quarry.wmcloud.org/query/58636, which includes only the 52 files I found with Search.
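A rough sketch of the selection --metadata-contains performs (our own illustration, not the script's actual code): a plain substring match on img_metadata, equivalent to SQL `img_metadata LIKE '%command failed%'`:

```python
# Hypothetical sketch of the --metadata-contains selection: a substring
# match on img_metadata, i.e. SQL "img_metadata LIKE '%command failed%'".
def matches_metadata_contains(img_metadata: str, needle: str = "command failed") -> bool:
    return needle in img_metadata

# Abbreviated samples from this task: a broken row and a healthy one.
broken = 'a:1:{s:6:"errors";a:1:{i:0;s:166:"identify command failed: ...";}}'
healthy = 'a:6:{s:9:"page_data";a:1:{i:1;a:5:{s:5:"width";i:9632;}}}'
```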

$ php maintenance/refreshImageMetadata.php --mediatype BITMAP --mime image/tiff --metadata-contains "command failed" --force --verbose
Processing next 1 row(s) starting with Hacienda_Azucarera_La_Concepcion,_Sugar_Mill_Ruins,_.3_Mi._W._of_Junction_of_Rts._418_and_111,_Victoria,_Aguadilla_Municipio,_PR_HAER_PR,11-VICT,1A-_(sheet_4_of_5).tiff.
Forcibly refreshed File:Hacienda_Azucarera_La_Concepcion,_Sugar_Mill_Ruins,_.3_Mi._W._of_Junction_of_Rts._418_and_111,_Victoria,_Aguadilla_Municipio,_PR_HAER_PR,11-VICT,1A-_(sheet_4_of_5).tiff.

Finished refreshing file metadata for 1 files. 0 needed to be refreshed, 1 did not need to be but were refreshed anyways, and 0 refreshes were suspicious.

The file now works as expected, but just to be sure let's check the database entry:

$ php maintenance/sql.php 
> SELECT * FROM mw_image WHERE img_name = "Hacienda_Azucarera_La_Concepcion,_Sugar_Mill_Ruins,_.3_Mi._W._of_Junction_of_Rts._418_and_111,_Victoria,_Aguadilla_Municipio,_PR_HAER_PR,11-VICT,1A-_(sheet_4_of_5).tiff";

stdClass Object
(
    [img_name] => Hacienda_Azucarera_La_Concepcion,_Sugar_Mill_Ruins,_.3_Mi._W._of_Junction_of_Rts._418_and_111,_Victoria,_Aguadilla_Municipio,_PR_HAER_PR,11-VICT,1A-_(sheet_4_of_5).tiff
    [img_size] => 1088628
    [img_width] => 9632
    [img_height] => 14452
    [img_metadata] => a:6:{s:9:"page_data";a:1:{i:1;a:5:{s:5:"width";i:9632;s:6:"height";i:14452;s:4:"page";i:1;s:5:"alpha";s:5:"false";s:6:"pixels";i:139201664;}}s:10:"page_count";i:1;s:10:"first_page";i:1;s:9:"last_page";i:1;s:4:"exif";a:15:{s:10:"ImageWidth";i:9632;s:11:"ImageLength";i:14452;s:13:"BitsPerSample";i:1;s:11:"Compression";i:4;s:25:"PhotometricInterpretation";i:0;s:11:"Orientation";i:1;s:15:"SamplesPerPixel";i:1;s:12:"RowsPerStrip";i:6;s:11:"XResolution";s:17:"838860800/2097152";s:11:"YResolution";s:17:"838860800/2097152";s:19:"PlanarConfiguration";i:1;s:14:"ResolutionUnit";i:2;s:8:"DateTime";s:19:"2000:10:26 10:30:01";s:6:"Artist";s:19:"Library of Congress";s:22:"MEDIAWIKI_EXIF_VERSION";i:2;}s:21:"TIFF_METADATA_VERSION";s:3:"1.4";}
    [img_bits] => 0
    [img_media_type] => BITMAP
    [img_major_mime] => image
    [img_minor_mime] => tiff
    [img_description_id] => 1
    [img_actor] => 1
    [img_timestamp] => 20210915160957
    [img_sha1] => mfu1pqc20dg9cz8izurkut1e2agaf0g
)

I don't have useJsonMetadata enabled for LocalRepo, so img_metadata stayed PHP serialized. In production, files with refreshed metadata should have img_metadata serialized as JSON, like newly-uploaded files do. This shouldn't cause any problems. --force may not be necessary because of the reserialization, but I think it is safer to leave it in to ensure that the files do get their metadata replaced instead of just reserialized.
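The two serializations can be told apart mechanically; a minimal sketch (our own heuristic, not MediaWiki's): legacy PHP-serialized arrays start with "a:", while the new blobs parse with a JSON parser.

```python
import json

# Heuristic sketch (not MediaWiki code): classify an img_metadata blob
# as JSON or legacy PHP-serialized.
def metadata_format(blob: str) -> str:
    try:
        json.loads(blob)
        return "json"          # new-style serialization
    except ValueError:
        pass
    if blob.startswith("a:"):  # PHP serialized arrays begin "a:<count>:{...}"
        return "php-serialized"
    return "unknown"
```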

Thank you very much for your tests, this gives me the confidence to run it in production (I will do a similar test on testwiki first).

I will do it soon, as I am invested in fixing image metadata (see the similar, but different, T289996).

Mentioned in SAL (#wikimedia-operations) [2021-09-24T10:44:18Z] <jynus> corrupting and fixing image metadata on testwiki before running script on commons T290462

So I uploaded Tiff_test.tiff to testwiki.

This was the original metadata:

{"data":{"page_data":{"1":{"width":100,"height":100,"page":1,"alpha":"false","pixels":10000}},"page_count":1,"first_page":1,"last_page":1,"exif":{"ImageWidth":100,"ImageLength":100,"BitsPerSample":[8,8,8],"Compression":1,"PhotometricInterpretation":2,"StripOffsets":8,"Orientation":1,"SamplesPerPixel":3,"RowsPerStrip":128,"StripByteCounts":30000,"XResolution":"72/1","YResolution":"72/1","PlanarConfiguration":1,"ResolutionUnit":2,"MEDIAWIKI_EXIF_VERSION":2},"TIFF_METADATA_VERSION":"1.4"}}
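Since this blob is JSON, the stored dimensions can be pulled out directly; a sketch using an abbreviated copy of the metadata above (the exif block is omitted for brevity):

```python
import json

# Parse an abbreviated copy of the testwiki metadata shown above and
# recover the page-1 dimensions PagedTiffHandler stored.
meta = json.loads(
    '{"data":{"page_data":{"1":{"width":100,"height":100,"page":1,'
    '"alpha":"false","pixels":10000}},"page_count":1,'
    '"first_page":1,"last_page":1,"TIFF_METADATA_VERSION":"1.4"}}'
)
page1 = meta["data"]["page_data"]["1"]
print(page1["width"], page1["height"])  # prints: 100 100
```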

Then I corrupted it with:

(testwiki)> UPDATE image SET img_width = 0, img_height = 0, img_metadata = "a:1:{s:6:\"errors\";a:1:{i:0;s:166:\"identify command failed: '/usr/bin/identify' -format \"[BEGIN]page=%p\\nalpha=%A\\nalpha2=%r\\nheight=%h\\nwidth=%w\\ndepth=%z[END]\" '/tmp/localcopy_991b2775dbb7.tiff' 2>&1\";}}" WHERE img_name='Tiff_test.tiff';
Query OK, 1 row affected (0.002 sec)
Rows matched: 1  Changed: 1  Warnings: 0

Then I ran:

mwscript maintenance/refreshImageMetadata.php --wiki=testwiki --start="Tiff_test.tiff" --end="Tiff_test.tiff" --force --verbose
Processing next 1 row(s) starting with Tiff_test.tiff.
Forcibly refreshed File:Tiff_test.tiff.

Finished refreshing file metadata for 1 files. 0 needed to be refreshed, 1 did not need to be but were refreshed anyways, and 0 refreshes were suspicious.

Which corrected the image dimensions, but didn't fix the img_metadata. This is the result:

(testwiki)> select * FROM image where img_name='Tiff_test.tiff'\G
*************************** 1. row ***************************
          img_name: Tiff_test.tiff
          img_size: 227712
         img_width: 100
        img_height: 100
      img_metadata: {"data":{"page_data":[],"errors":["no page data found in tiff directory!"],"exif":[],"TIFF_METADATA_VERSION":"1.4"}}
          img_bits: 0
    img_media_type: BITMAP
    img_major_mime: image
    img_minor_mime: tiff
img_description_id: 192325
         img_actor: 29763
     img_timestamp: 20210924103506
          img_sha1: spdeo7s7ic56z83eaw0b9281u2pzr21
1 row in set (0.000 sec)

The size was corrected, but the img_metadata now contains an error. This makes me think that running this script in production is not safe and that it must have some bug: there is no reason for a metadata refresh to fail in this case, so this looks to me like a bug in the maintenance script. Could this be related to recent work on metadata refactoring for certain files? The original metadata used PHP serialization and the new one uses JSON.

The way in which this fails makes no sense: it does something (it fixes the dimensions), but it doesn't recover the metadata that was extracted successfully on first upload!
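One way to catch this kind of half-failed refresh automatically would be a sanity check like the following (a hypothetical sketch, not part of any MediaWiki script): flag a refreshed row whose new JSON metadata still carries a non-empty "errors" list, as happened on testwiki.

```python
import json

# Hypothetical sanity check for a refreshed row: flag it when the new
# JSON metadata still carries a non-empty "errors" list.
def refresh_ok(img_metadata: str) -> bool:
    meta = json.loads(img_metadata)
    return not meta.get("data", {}).get("errors")

# The metadata produced by the failed refresh above:
bad = ('{"data":{"page_data":[],"errors":["no page data found in tiff '
       'directory!"],"exif":[],"TIFF_METADATA_VERSION":"1.4"}}')
```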

Ladsgroup told me that there is another script that was run against other files, that will likely work better (rebuildFileMetadata.php). I will try that next, if I can use it on specific files.

Edit: It was clarified to me that that won't work either, as it doesn't load the file, and I was recommended to just re-upload. Waiting for Structured Data comments now.

@jcrespo Structured Data doesn't really have any expertise on this - the "metadata" we deal with is user-added, and not related to metadata embedded in the image

@Cparle apologies if you are the wrong team. I reached out to SD Eng because of what I interpreted from the maintainers list documented on MediaWiki.org ("Management of uploaded files (images, thumbnails, etc.)"). Because of the docs there, I understood that your team did not develop new features in the generic image pipeline, but that page led me to think SD was the right one for high-impact file-related bug fixes (which I believe this is). The wiki may need updating.

Apologies again; from Technology we don't have much visibility into the Product organization. Do you happen to know who to contact (someone who may have the expertise/ownership)?

MediaWiki-File-management lists Platform Engineering as maintainers, so something is out of sync between mw.o and Phab.
This is reflective of a larger problem around maintenance of media-related features though, especially post-Multimedia. Thumbor has no clear maintainer, and various other parts of the ecosystem are split between PE, SDE, PI, and individual core devs.

something is out of sync between mw.o and Phab

:-( I will try to raise this issue, to at the very least clarify the current status. Please @Cparle help me do the same, so I don't ping the wrong people again.

@jcrespo no apology necessary! We evolved out of the multimedia team, and I see we are listed as maintainers - if @MarkTraceur agrees this falls under our purview then we're certainly happy to take a look, but we're down to 3 engineers, one of whom is going on paternity leave in 10 days, and none of us have any recent experience (in the last ~4 years) with metadata embedded in images

none of us have any recent experience (in the last ~4 years) with metadata embedded in images

If my debugging work is right, the software already works well on standalone MediaWiki installations (see @AntiCompositeNumber's tests), so no real fix is needed there. What I think is failing is that on remote/distributed storage installations (like Wikipedia's Swift), there is a missing step: downloading the image locally again to re-analyze it. Although I may be wrong, because it fixes the dimensions correctly, and I am unsure how that can happen without access to the original :-/.

Yeah, as mentioned above, I don't think the Structured Data team has the requisite expertise to handle this, especially given our reduced capacity; I believe even the 10-day estimate above proved to be overly optimistic. It's always sad for me to say that we can't help out when Commons, or media support in general, has an issue, but at the moment I don't believe we can do anything. It's possible that someone with more experience debugging Swift, or working with it, might be able to do something.

There are also DjVu and PDF files that have the same issue, see https://commons.wikimedia.org/wiki/User:Jonteemil/sandbox2. Are they also unfixable except by re-upload?

Also GIF files (that I just uploaded) seem to have this issue.

https://commons.wikimedia.org/wiki/File:Mwcli_v0.6_v0.7_version.gif
https://commons.wikimedia.org/wiki/File:Mwcli_v0.6_v0.7_Installation.gif
https://commons.wikimedia.org/wiki/File:Mwcli_v0.6_v0.7_docker_dev_create_part_1_(init).gif
etc.
(you can find the others in https://commons.wikimedia.org/wiki/Category:Mwcli )

I uploaded 15 gif files all at the same time and they all appear to have this issue.
None of these files are "big".

They all say 0 pixels

image.png (521×1 px, 47 KB)

and including them in pages doesn't appear to work,
but the raw file is there and fine:
https://upload.wikimedia.org/wikipedia/commons/6/6f/Mwcli_v0.6_v0.7_version.gif

If this is unrelated I'll go and file another ticket?

Also GIF files (that I just uploaded) seem to have this issue.

Please file a separate task; anything occurring with new, non-TIFF uploads is unrelated.

@jcrespo @AntiCompositeNumber Given that nothing has happened since October, should I just ask the uploaders or any user to re-upload?

Same issue found on betawikiversity: https://beta.wikiversity.org/wiki/File:Joonitud_EMH_3196_2.tif
I'm not sure whether this task is related to T297942.

image.png (316×953 px, 22 KB)

I have re-saved the smaller files using CropTool, which was a successful workaround, but the ones over 100 MB were too big, so they still remain unrendered.