Metadata of a PDF in image table dump does not match the website
Open, Medium, Public

Description

There is one strange file I found: https://commons.wikimedia.org/wiki/File:Verzeichni%C3%9F_einiger_Br%C3%BCderschaften,_die_im_katholischen_Franken_im_Gange_sind.pdf

The image table SQL dump (I checked through multiple, including the most recent one) has its width and height equal to 0, but API returns width and height: https://commons.wikimedia.org/w/api.php?action=query&prop=imageinfo&iiprop=mime|size|metadata|bitdepth&titles=File:Verzeichni%C3%9F_einiger_Br%C3%BCderschaften,_die_im_katholischen_Franken_im_Gange_sind.pdf&format=json

Could the image table SQL dump be fixed?
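
For anyone who wants to repeat the comparison, here is a minimal Python sketch of it. It builds the same kind of `imageinfo` query as the URL above and compares the dimensions in the response against a row from the dump. The function names and the dict-shaped dump row are assumptions made for illustration; the sketch only builds the URL and inspects an already-parsed JSON response, so fetching the response and parsing the SQL dump are left out.

```python
from urllib.parse import urlencode

API_ENDPOINT = "https://commons.wikimedia.org/w/api.php"


def build_imageinfo_url(title):
    """Build an imageinfo query URL for one file title (hypothetical helper)."""
    params = {
        "action": "query",
        "prop": "imageinfo",
        "iiprop": "mime|size",  # "size" also returns width and height
        "titles": title,
        "format": "json",
    }
    return API_ENDPOINT + "?" + urlencode(params)


def api_dimensions(response):
    """Return (width, height) from a parsed imageinfo JSON response.

    Assumes the default JSON layout, where query.pages is a dict keyed
    by page ID and the query asked for a single title.
    """
    page = next(iter(response["query"]["pages"].values()))
    info = page["imageinfo"][0]
    return info["width"], info["height"]


def dimensions_mismatch(dump_row, response):
    """True if the dump row and the API disagree on width or height.

    dump_row is assumed to be a dict with img_width and img_height keys,
    i.e. one row of the image table after parsing the dump.
    """
    in_dump = (dump_row["img_width"], dump_row["img_height"])
    return in_dump != api_dimensions(response)
```

With a dump row of 0x0 and an API response that carries real dimensions, `dimensions_mismatch` returns `True`, which is the situation reported for this file.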

Event Timeline

ArielGlenn triaged this task as Medium priority. Feb 28 2022, 1:38 PM

Thanks for the report!

I checked the image table directly, and indeed the image height and width are 0. I don't know what populates them, but the dumps simply grab all the rows (via mysqldump) and write down the results. So if we want to see why there are 0's, other investigation must be done.

I'm not sure who ought to look at this further, though someone should :-)

Given that this is the only row where this is the case (I went through the whole dump), could I suggest that somebody just writes the numbers in and that's it? :-) Investigating how this happened, and why the website still reports the correct numbers (maybe it is some cache?), might be more work.

I've had a look at the image dumps across all the wikis (in lieu of running expensive queries against the database for all of them) and have found a small number (a few hundred) of files with the same problem, mostly PDFs. I'll look into it a little further to see if they have something in common, and post the results here. If there is some common bug that is responsible for this (for example, they were all uploaded around the same time), then it's possible someone could write a maintenance script to fix the entries for that period of time. I would definitely not be comfortable myself just sticking my hand in there and updating isolated fields manually.
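
A rough Python sketch of the kind of scan described above, in case it helps make the search condition explicit. It is not the check that was actually run. It assumes the dump rows have already been parsed into dicts keyed by the image table column names (`img_name`, `img_width`, `img_height`, `img_major_mime`, `img_minor_mime`, `img_timestamp`), and the function names are made up for illustration. It flags rows where both dimensions are 0 and tallies them by MIME type and by upload month, which is one way to see whether the affected files cluster around a period of time.

```python
from collections import Counter


def has_zero_dimensions(row):
    """True if both width and height are recorded as 0 in the dump row."""
    return row["img_width"] == 0 and row["img_height"] == 0


def summarize_zero_dimension_rows(rows):
    """Tally 0x0 rows by MIME type and by upload month.

    rows: iterable of dicts, one per image table row (assumed to be
    parsed from the dump already). img_timestamp is assumed to be a
    string in MediaWiki's YYYYMMDDHHMMSS format.
    Returns (names, by_mime, by_month).
    """
    names = []
    by_mime = Counter()
    by_month = Counter()
    for row in rows:
        if not has_zero_dimensions(row):
            continue
        names.append(row["img_name"])
        by_mime[f'{row["img_major_mime"]}/{row["img_minor_mime"]}'] += 1
        by_month[row["img_timestamp"][:6]] += 1  # YYYYMM
    return names, by_mime, by_month
```

Note that a 0x0 row is not necessarily wrong on its own: some media types may legitimately have no dimensions, and some files show 0x0 in the web interface as well, so the tally by MIME type is only a starting point for narrowing things down.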

More info soon when we have it.

What was the condition you searched for? I ask because there are PDFs which have 0x0 in the database but also in the web interface. See T301291. At least on English Wikipedia and Wikimedia Commons, I could not find any other PDF or DjVu file that has 0x0 in the database but reasonable dimensions in the web interface.