
Consider our options for fixing existing files affected by T97253
Closed, Resolved · Public


We have an unknown, but large, number of JPG and TIF files affected by T97253 (Exif values retrieved incorrectly if they appear before IFD): they have bogus data in the image.img_metadata database field. When T140419 is fixed, newly uploaded files will no longer have the problem, but old ones will not be magically fixed.

We can identify some of the problematic files easily if img_metadata contains binary data, but it's not a perfect test (some affected files just have wrong, but plausible values; some unaffected files have binary stuff which actually exists in the file).

Some stats for the 100,000 most recently uploaded files on Commons (there are 32,500,000 files there, including 28,000,000 JPG files):

mysql:research@analytics-store.eqiad.wmnet [commonswiki]>
  select img_major_mime, img_minor_mime, count(*) from
  (select * from image order by img_timestamp desc limit 100000) image2
  where img_metadata rlike concat('[^\\t\\r\\n -~', X'802dff', ']')
  group by img_major_mime, img_minor_mime;
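+----------------+----------------+----------+
| img_major_mime | img_minor_mime | count(*) |
+----------------+----------------+----------+
| application    | ogg            |        1 |
| audio          | wav            |       26 |
| audio          | x-flac         |        1 |
| image          | jpeg           |     4756 |
| image          | png            |       10 |
+----------------+----------------+----------+
5 rows in set (32.36 sec)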
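For anyone wanting to run the same check outside the database, here is a rough Python sketch of what the rlike pattern in the query appears to test. It assumes the regex engine treats \t, \r and \n as escapes and matches byte-wise, so that X'802dff' expands to the range 0x80-0xFF. The names SUSPICIOUS_BYTE and looks_binary are made up for illustration and are not part of MediaWiki.

```python
import re

# Byte-level reading of the character class used in the query:
# flag any byte that is NOT tab, CR, LF, printable ASCII (0x20-0x7E),
# or a high byte (0x80-0xFF, e.g. part of a UTF-8 sequence).
# In practice that means control characters such as NUL, plus DEL (0x7F).
SUSPICIOUS_BYTE = re.compile(rb"[^\t\r\n -~\x80-\xff]")


def looks_binary(metadata: bytes) -> bool:
    """Return True if the metadata blob contains a suspicious control byte."""
    return SUSPICIOUS_BYTE.search(metadata) is not None


# Hypothetical sample values, not taken from real img_metadata rows.
clean = b'a:1:{s:4:"Make";s:5:"Canon";}'
broken = b'a:1:{s:4:"Make";s:5:"\x00\x01\x02\x03\x04";}'
print(looks_binary(clean), looks_binary(broken))
```

As noted above, this is only a heuristic: it cannot catch affected files whose wrong values happen to be plain text, and it will flag unaffected files whose metadata legitimately contains such bytes.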

Let's say that the files with binary metadata are around 5% of all files. The number of broken files which are not easily identifiable is unknown.
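The "around 5%" figure can be checked against the counts in the query output. The sketch below also extrapolates to all of Commons, which assumes that the 100,000 most recent uploads are representative of the whole collection. Nothing in this task verifies that, so treat the result as a ballpark only.

```python
# Counts copied from the query output above (files flagged per MIME type).
flagged = {
    "application/ogg": 1,
    "audio/wav": 26,
    "audio/x-flac": 1,
    "image/jpeg": 4756,
    "image/png": 10,
}
sample_size = 100_000        # most recently uploaded files examined
total_files = 32_500_000     # all files on Commons, per the description

total_flagged = sum(flagged.values())
share_percent = 100 * total_flagged / sample_size

# Assumption: the flagged share in the sample holds for the whole wiki.
estimated_flagged = total_flagged * total_files // sample_size

print(f"{total_flagged} flagged in sample ({share_percent:.2f}%)")
print(f"rough estimate for all of Commons: {estimated_flagged:,}")
```

Under that assumption the sample gives 4,794 flagged files (about 4.8%), or roughly 1.56 million files across Commons, nearly all of them JPEGs.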

So… what can we do to fix them? Options I see:

  • Do nothing. Allow users to manually correct broken files with action=purge. This requires a little development, and sucks for our users.
  • Run a maintenance script for the JPG/TIF files with binary metadata. Allow users to manually correct others with action=purge. This requires some more development, but doesn't suck as badly for our users.
  • Run a maintenance script for all JPG/TIF files. No development (we already have /maintenance/refreshImageMetadata.php) and users are happy.

I would prefer the last option :) But I don't know if our media storage stuff and database stuff can handle it, or how long it would take. @aaron @jcrespo What do you think?

(As a side note, we could also invalidate the metadata with ExifBitmapHandler::isMetadataValid() or Exif::version(). But since the bug depends on the version of PHP/HHVM, non-WMF users of MediaWiki would need some extra version checks around any of this. That sounds pretty awful and I would rather not do it.)

Event Timeline

Restricted Application added subscribers: Zppix, Steinsplitter, Aklapper.

Since the script processes one file at a time and uses DB query batching, the last option seems simple and OK. It might take a while though...

I just found an old task which also requested that refreshImageMetadata.php be run on all wikis (the run was even started, but then killed and never finished). Let's just do this. Please continue on T32961 :)