
Regression in processing PDFs on Wikimedia Commons: No width, height, no metadata
Open, Lowest | Public | BUG REPORT

Description

List of steps to reproduce:

What happens?:

The 20220220 dump has the following entry:

('1988_810810_Aielo_de_Malferit.pdf',71798,1239,1754,'{\"data\":{\"Producer\":\"iText 1.4.7 (by lowagie.com)\",\"CreationDate\":\"Mon May  5 17:54:06 2008 UTC\",\"ModDate\":\"Mon May  5 17:54:06 2008 UTC\",\"Tagged\":\"no\",\"UserProperties\":\"no\",\"Suspects\":\"no\",\"Form\":\"none\",\"JavaScript\":\"no\",\"Pages\":\"1\",\"Encrypted\":\"no\",\"pages\":{\"1\":{\"Page size\":\"595 x 842 pts (A4)\",\"Page rot\":\"0\"}},\"File size\":\"71798 bytes\",\"Optimized\":\"no\",\"PDF version\":\"1.4\",\"mergedMetadata\":{\"pdf-Producer\":\"iText 1.4.7 (by lowagie.com)\",\"pdf-Encrypted\":\"no\",\"pdf-PageSize\":[\"595 x 842 pts (A4)\"],\"pdf-Version\":\"1.4\"},\"text\":[\"\",\"\"]}}',0,'OFFICE','application','pdf',44,543926,'20121213213138','643t3fa39pqaw1zecd7ybn3r3g99isu')

Observe 1239 and 1754 for width and height, respectively, and "Pages": "1" in the metadata.

The 20220401 dump has the following entry:

('1988_810810_Aielo_de_Malferit.pdf',71798,0,0,'',0,'OFFICE','application','pdf',44,543926,'20121213213138','643t3fa39pqaw1zecd7ybn3r3g99isu')

Observe that width and height are both 0 and the metadata field is empty.
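The comparison between the two entries can be sketched in Python. This is only an illustration, assuming that width, height, and metadata sit at positions 2, 3, and 4 of each row (as the two entries above suggest); the helper names `summarize` and `lost_metadata` are hypothetical, and the metadata string is abbreviated to the one field discussed here.

```python
import json

# Assumed column positions, inferred only from the two rows quoted in this report.
WIDTH, HEIGHT, METADATA = 2, 3, 4


def summarize(row):
    """Return (width, height, page_count); page_count is None when metadata is empty."""
    raw = row[METADATA]
    pages = None
    if raw:
        pages = json.loads(raw).get("data", {}).get("Pages")
    return row[WIDTH], row[HEIGHT], pages


def lost_metadata(old_row, new_row):
    """True if the old row had dimensions and a page count but the new row has neither."""
    old_w, old_h, old_pages = summarize(old_row)
    new_w, new_h, new_pages = summarize(new_row)
    had_data = old_w > 0 and old_h > 0 and old_pages is not None
    has_none = new_w == 0 and new_h == 0 and new_pages is None
    return had_data and has_none


# Rows abbreviated from the two dump entries above (metadata trimmed to "Pages").
old = ("1988_810810_Aielo_de_Malferit.pdf", 71798, 1239, 1754, '{"data":{"Pages":"1"}}')
new = ("1988_810810_Aielo_de_Malferit.pdf", 71798, 0, 0, "")

print(lost_metadata(old, new))  # True
```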

What should have happened instead?:

Both dumps should have the same PDF metadata, because the PDF has not changed since the earlier dump. Instead, the entry no longer has width, height, or page count information.

Recently, the metadata of many files was updated because of the change to JSON for PDF and DjVu files.
Confirmed: no metadata, no width, and no height.

Event Timeline

TheDJ updated the task description.
TheDJ triaged this task as Lowest priority. May 20 2022, 6:14 PM
TheDJ added a project: User-TheDJ.

I will look in Logstash to see if I can find any details.

Aklapper renamed this task from "Regression in processing PDFs on Wikimedia Commons" to "Regression in processing PDFs on Wikimedia Commons: No width, height, no metadata". Nov 22 2022, 1:55 PM

Can someone with shell access rerun the refresh on this specific file and see if that fixes it? https://commons.wikimedia.org/wiki/File:1988_810810_Aielo_de_Malferit.pdf

mwscript refreshImageMetadata.php --wiki=commonswiki --verbose --force --start 1988_810810_Aielo_de_Malferit.pdf --end 1988_810810_Aielo_de_Malferit.pdf

I suspect that one of the refresh runs of T275268 caused some problems and it wasn't noticed at the time. If this helps, we can run it on more PDFs with page size 0. If it doesn't help, then hopefully the output will include some information that indicates what the problem could be.
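If the single-file refresh helps, picking out further candidates could look roughly like the Python sketch below. It assumes rows shaped like the dump entries quoted in the description, with the minor MIME type at position 8 (an assumption based on where 'pdf' appears in those entries). The helper names and the second example row are hypothetical, and the command string simply mirrors the invocation above.

```python
import shlex

# Assumed column positions, inferred from the rows quoted in the description.
NAME, WIDTH, HEIGHT, MINOR_MIME = 0, 2, 3, 8


def zero_size_pdfs(rows):
    """Return names of PDF rows whose width and height are both 0."""
    return [
        row[NAME]
        for row in rows
        if row[MINOR_MIME] == "pdf" and row[WIDTH] == 0 and row[HEIGHT] == 0
    ]


def refresh_command(name):
    """Build the same single-file refresh invocation as above, shell-quoted."""
    quoted = shlex.quote(name)
    return (
        "mwscript refreshImageMetadata.php --wiki=commonswiki "
        f"--verbose --force --start {quoted} --end {quoted}"
    )


rows = [
    # Affected row from the 20220401 dump.
    ("1988_810810_Aielo_de_Malferit.pdf", 71798, 0, 0, "", 0, "OFFICE",
     "application", "pdf", 44, 543926, "20121213213138",
     "643t3fa39pqaw1zecd7ybn3r3g99isu"),
    # Hypothetical healthy row, made up for illustration only.
    ("Example_ok.pdf", 1000, 1239, 1754, "{}", 0, "OFFICE",
     "application", "pdf", 1, 1, "20200101000000", "examplehash"),
]

for name in zero_size_pdfs(rows):
    print(refresh_command(name))
```

Printing the commands rather than executing them keeps the sketch side-effect free; someone with shell access would still review and run them by hand.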