Page MenuHomePhabricator

EXIF handling ignores tags with Count >= 1 (such as GPS coordinates)
Closed, ResolvedPublic

Description

Author: overlordq

Description:
I noticed this today when a user came into MediaWiki-General and missing EXIF information.

The tags in question were the GPSLatitude, GPSLongitude, etc. After looking through the code, and then looking through the EXIF spec, it appears that the MW Exif handler does not gracefully handle tags which are arrays.

Pulling up the spec: http://www.digicamsoft.com/exif22/exif22/html/exif22_22.htm?gInitialPosX=10px&gInitialPosY=10px&gZoomValue=100

And looking at for example at BitsPerSample, with type short and count of 3. AFAIK, this means that BitsPerSample should be an array of 3 shorts. Unfortunately Exif.php doesn't handle this.

After turning on debugging and uploading a GPS tagged image, this output was obtained:

Exif::validate: tag is 'GPSLatitude' (type: array; content: 'Array
(

[0] => 36/1
[1] => 3/1
[2] => 5896/100

)
')
Exif::isRational: fed a non-fraction value (type: array; content: 'Array
(

[0] => 36/1
[1] => 3/1
[2] => 5896/100

)
')
Exif::makeFilteredData: 'GPSLatitude' contained invalid data (type: array; content: 'Array
(

[0] => 36/1
[1] => 3/1
[2] => 5896/100

)

')

Looking at the the source on line 470: http://svn.wikimedia.org/viewvc/mediawiki/trunk/phase3/includes/Exif.php?annotate=23531

it explicitly disallows arrays.

Marking critical as although minor, EXIF tagged data in images can and is being lost.


Version: 1.12.x
Severity: normal
URL: http://commons.wikimedia.org/wiki/File:2005-09-17_10-01_Provence_641_St_R%C3%A9my-de-Provence_-_Glanum.jpg

Details

Reference
bz13172

Event Timeline

bzimport raised the priority of this task from to Low.Nov 21 2014, 10:02 PM
bzimport set Reference to bz13172.
bzimport added a subscriber: Unknown Object (MLST).

overlordq wrote:

Patch to Exif.php and Bitmap media handler

It's rather big, tested it on local wiki and new image uploads and seems to be working fine. Feedback appreciated.

attachment exif_patch.txt ignored as obsolete

overlordq wrote:

And of course I just realized that I forgot to remove the "return $tags;" line at the very end of the patch.

*** Bug 15546 has been marked as a duplicate of this bug. ***

This bug completely breaks Exif GPS coordinate support, as can be seen at the linked image description page. It really should be fixed ASAP, or else we shouldn't even be pretending to support GPS data at all.

Committed the patch (with slight modifications) in r49677. We really ought to rewrite the FormatExif class so that we can merge the "GPS*Ref" values with their parent values (e.g. "Longitude = 43° 46' North" rather than separate entries for "Longitude = 43° 46', Longitude reference = North"), which is something the current design makes difficult, but at least the values are displayed now.

...and I'm too tired to tell latitude from longitude anymore. I should go get some sleep. :)

Reverted, needs complete backwards compatibility with stored metadata to avoid the potential for a file storage backend overload due to heavy recache traffic.

(In reply to comment #8)

Reverted, needs complete backwards compatibility with stored metadata to avoid
the potential for a file storage backend overload due to heavy recache traffic.

Okay, fair enough. (Should we tag the metadata version field with a /* do not change, ever */ comment? ;)

I could try to rework the patch to remain as backwards-compatible as possible, but there's a problem: the current implementation simply throws out some useful metadata fields (including the GPS location data this bug was initially about) and there's no way for code loading the serialized metadata to know (without rescanning the file) if that might've happened or not.

For the GPS fields specifically, this could perhaps be kluged around, since the old implementation did retain the "Ref" fields and the presence of those should be a pretty good indicator that a coordinate value should also have been there. But in general, we'd need some way of gradually regenerating the metadata for all files without overloading the backend. Ideas, anyone? Perhaps a maintenance script?

Anyway, I probably won't have time to do anything about this for this month at least, and possibly longer, so anyone else who's interested is welcome to pick up the ball.

Patch to Exif.php to fix this for GPS coordinates only

Since doing this the proper way (as in Brent's patch) and solving it for all array-valued tags apparently is not possible (if metadata format is changed, that may always break older installations since the formatter doesn't check the version), let's at least fix this for the GPS coordinates.

This patch solves this bug for the GPS coordinates only without any change to the metadata format and leaving Exif::version() at 1. It simply re-encodes a GPS coordinate given as "degrees-minutes-seconds.fraction" as a single value "degrees.fraction". Other fields that should be arrays are not corrected.

Also fixes formatting for GPSAltitudeRef and GPSDestDistanceRef as Ilmari did in r49677.

This fixes the problem with the GPS coordinates for all new uploads. For already uploaded (and also deleted?) images, Ilmari's suggestion seems a good one.

The query

SELECT * FROM /*$wgDBprefix*/image
WHERE img_metadata LIKE '%GPSLatitudeRef%'
AND img_metadata NOT LIKE '%GPSLatitude%'

should find all entries in the image table that do have a problem with GPS coordinates. For these files the metadata would need to be re-computed, possibly by a batch job.

attachment 13172.patch ignored as obsolete

Bah, forget the select; shouldn't post shortly around midnight. Still, looking for metadata that has GPSLatitudeRef, but not GPSLatitude should give all entries where the metadata should be recomputed.

(In reply to comment #10)

This patch solves this bug for the GPS coordinates only without any change to
the metadata format and leaving Exif::version() at 1. It simply re-encodes a
GPS coordinate given as "degrees-minutes-seconds.fraction" as a single value
"degrees.fraction". Other fields that should be arrays are not corrected.

I took a quick look at the patch and installed it on my own wiki. It seems to work OK, in the sense that the latitude and longitude are now shown included in the metadata and displayed on the description page. I also found that, apparently, action=purge is now enough to fix existing images (although running a maintenance script at some point still seems like a good idea).

The use of the toRational() method is a really ugly workaround, though. I'm not saying there's anything wrong with the method itself -- it seems to be quite nice and well written -- but calculating a rational approximation of a floating point value just so that we can pass it through a broken validator method that expects a rational and then converting it back to floating point seems ridiculous. It would seem easier to just change the datatypes for those tags to, say, Exif::ASCII and just store the float as a string. Or add a new Exif::FLOAT datatype. Or perhaps Exif::SPECIAL or Exif::ANYTHING or Exif::THEVALIDATORISBROKENANYWAY.

Modified patch to Exif.php to fix this for GPS coordinates only

Okay, I took Lupo's patch and modified it a bit to get rid of the toRational() hack. What I'm doing is just setting the datatype to Exif::ASCII and passing the float as is: the validation only checks that values of type Exif::ASCII don't contain any non-ASCII characters, which is true for numeric values. Yes, it could break if someone made the validation stricter, but I very much doubt anyone ever will, since the whole thing is both broken and pointless. Nobody sane is going to touch it except either to work around its brokenness or to scrap and rewrite it completely.

I also added a method to format the coordinates nicely as degrees, minutes and seconds, rather than presenting them as just a bare float. This part doesn't affect the stored metadata in any way, just the presentation.

If nobody spots any obvious problems with this, I'm planning to commit it shortly.

(Ps. In the future, I think the right way forward would be to design a saner metadata storage format, rewrite the whole thing (both the back and the front ends) and only retain enough of the current code in a deprecated compatibility class to be able to keep displaying old metadata blobs. Then we could write a maintenance script to regenerate the metadata for old files and, later, when most files on most wikis have been converted to use the new format, get rid of the compatibility class.)

attachment bug13172.diff ignored as obsolete

Modified patch to Exif.php to fix this for GPS coordinates only

Oops, I seem to have mangled some Unicode characters in a comment. Patch fixed.

attachment bug13172.diff.1 ignored as obsolete

Are there any tools that might break if they encountered a float instead of a rational in these GPS fields? I don't know, and that's why I reconverted the value into a rational: to make sure the stored metadata looks exactly the same as it did before. (There's at least one bot at the Commons, run by Dschwen, that generates and adds {{location}} templates from the GPS information in the metadata.) If you store a float there, tools must now be ready to expect either a rational or a float there. Note that the original Exif from the file as returned by exif_read_data may already contain a single value as a rational if the raw Exif data contained a fractional degree. exif_read_data appears to return an array for these fields only if the degrees are an integral value (denominator 1).

I'm quite sure that exif_read_data() will never return a single rational for the GPS latitude/longitude fields, at least not for images with valid Exif metadata. The Exif standard (JEITA CP-3451, section 4.6.6) says those fields have type=rational, count=3, which means their values will always consist of exactly 3 rationals.

Your point about bots receiving metadata via the API is valid, though, and I should look into that before deploying this. Then again, no bot has ever received Exif GPS coordinates from MediaWiki before, since those fields have always been thrown away by the broken validator. Dschwen's bot AFAIK works by downloading the images themselves and using a non-broken Exif parser on them.

I just ran an API query on a Commons image with GPS metadata, uploaded to my own wiki with the patch applied, and it seems to work well enough:

http://vyznev.net/lupwiki/File:Panorama_Mount_Sinai_south.jpg?uselang=en

http://vyznev.net/lupwiki.d/api.php?action=query&prop=imageinfo&iiprop=metadata&titles=File:Panorama_Mount_Sinai_south.jpg

An alternative approach that comes to mind, which would also completely eliminate any precision loss, would be to just convert the original rationals to strings and concatenate them to something like "28/1 32/1 2065/100". That would make the pretty-printing a little more complicated, but not very much so.

..or we could just convert the values to a single rational in a much simpler way: instead of going via floats, just multiply the denominators by 1, 60 and 3600 respectively, convert them to a common denominator (we already have a gcd method) and add them together.

re: exif_read_data returning a single rational:

Yes, I also understand the EXIF spec to say that GPS coordinates are always three rationals. It appears, however, that actual implementations apparently treat it as "at most three rationals". Note that the current implementation works for files that have the coordinates as single fractional degrees. An example file is http://commons.wikimedia.org/wiki/File:DubocePark.jpg

The file linked contains in the GPS IFD the sequence

00 02 00 05 00 00 00 01 00 00 0E DF

which is "00 02"->(tag: GPSLatitude), "00 05"->(type: Rational), "00 00 00 01"->(count: 1), "00 00 0E DF"->(offset for value > 4 bytes). The value at the offset is then "02 40 51 8C 00 0F 42 40", which is 37769612/1000000, i.e. 37.769612 degrees.

So it appears that either the Nikon D80 used for that image generates incorrect EXIF, or our understanding of the standard is wrong. The interpretation of "up to three" appears to be consistent with the TIFF 6.0 standard http://partners.adobe.com/public/developer/en/tiff/TIFF6.pdf to which the EXIF standard explicitly defers to in its section 4.6.2.

In either case it appears that our code should be able to handle up to three rational values, not just exactly three values. It looks as if exif_read_data does this, returning a single rational for File:DubocePark.jpg, which has only one value in the EXIF, but returning an array if there are several values in the EXIF. Other EXIF viewers also can cope with "up to three" rationals; they don't require exactly three.

OK, in that case just converting to a single rational seems like the best thing to do. I'll try to write a new implementation later today that does this without an intermediate conversion to floating point.

Patch to Exif.php to fix this for GPS coordinates only (using rational math)

OK, here it is. All GPS coordinates are converted to a single rational without using an intermediate floating-point representation. Tested and seems to work. The new pretty-printer can actually handle either rationals or floats.

attachment bug13172.v3.diff ignored as obsolete

Patch to Exif.php to fix this for GPS coordinates only (using rational math; fixed)

OK, it seems the previous version of the patch had some bugs and issues with numerical overflow. I've hopefully fixed them in this version, although at the cost of some extra complexity.

Even though all the values should be integers, I'm now using fmod() and sprintf("%.0f") in order to get more more precision and better overflow behavior. Even so, I expect that there are technically valid inputs that can cause loss of precision -- after all, even IEEE doubles only have a 52-bit mantissa. Still, the code should hopefully return approximately correct values even in that case, though I haven't yet stress-tested it very thoroughly.

Also, I made the pretty-printed output localizable. The default format is still $1° $2' $3", but now it can be changed for languages in which some other format is normally used.

attachment bug13172.v3.diff ignored as obsolete

Test harness for addFractions/toRational

Cool. I like the localizability of the final output.

However, for some pathological (but legal) input the result may still overflow the 32bit range.

toRational has the same problem on 64bit systems where PHP_INT_MAX is larger than 2**31. This can be solved easily by not using PHP_INT_MAX but 2147483647 directly.

Since EXIF is limited to 32bits for these values, I think it's important that we don't overflow this range. Otherwise we may get incompatibilities between MediaWiki installations running on 32bit or 64bit systems...

Attached is a test program that tests both approaches for correctness, 32bit overflows, and speed. Interestingly, it looks as if toRational (with the above fix for 64bit systems) is faster. All these gcd calls in addFractions appear to take quite some time.

Please take a look at the attached test program.

attachment toRationalTest.php ignored as obsolete

Sample output of test program

Tests were run on an iMac (64bit, i5). A full test would take about an hour; but I don't expect any surprises in the omitted parts. Will let it run tonight, when the machine is idle otherwise. Right now the timing might be skewed due to other activity on this machine.

attachment toRationalTestResults.txt ignored as obsolete

As expected, the full test results showed no surprises. Both variants (addFractions/toRational) handle "normal" GPS input of the form deg/1 min/1 sec/100 fine. In my test run, addFraction consistently was one third slower than toRational (addFraction taking 100sec per 3600000 iterations, toRational 75sec).

There's an even simpler way to convert to a single fraction: given that apparently the conversion via float is faster, and given that we're dealing with values in the range 0.0 .. 180.0 only, we could just accumulate the value in a float, round that float to 7 fractional digits, set the numerator to intval (float * 10000000) and fix the denominator at 10000000.

The values are all be within the 32bit range, and a precision of 7 fractional digits is sufficient: that corresponds to about 1cm. That is, AFAIK, better than what's possible with current GPS, and in any case certainly good enough to pinpoint a photographer's location.

An implementation of this on my private wiki is twice as fast as the toRational method, passes all the tests from my test program, and has far less code that is easier to understand to boot.

Ilmari, would it be OK if I obsoleted your patch with a new one doing all the formatting stuff from your patch, but using this method to calculate the single rational value?

Sure, go ahead. Actually, I'd thought of doing it that way too, but hadn't got around to trying it yet.

(For numerical accuracy, it might be slightly better to work internally in units of 1/10000000 degrees and then just set the value to sprintf("%.0f/10000000", num). Or, better yet, make the denominator a class constant or something.)

Patch to Exif.php to fix this for GPS coordinates only (using fixed denominator 10000000)

Ok, here it is. Also limits the fractional digits in coordinate output (it makes no sense displaying fractional seconds with more than 3 fractional digits).

Additionally, the Formatter now can format coordinates by e.g. suffixing the direction, so that you can get 47° 15' 24.123" N. (Configurable through messages exif-coordinate-format-{nesw}.)

Attached:

While this does fix the issue for GPS coordinates as advertised, I don't think we should apply it. It's a hack where the full functionality of the original patch applied in r49677 should be used.

Tim's concerns with regards to back-compatibility should be addressed. IMHO: files with no arrays in their exif data should be left untouched. Brent has commit access now (he didn't when the bug was reported), so maybe he'd like to poke at this again :)

Of course it's a hacky work-around that tries to shoehorn array-valued GPS data into the broken current stored metadata format, which doesn't have array-valued fields.

Maybe I'm missing something, but I think we cannot change the stored metadata format. (And that's independent of Tim's worry about metadata being recomputed from the file if Exif::version changes.) If we change the format of the stored metadata, we'll break Exif display all older MW installations that get served the new format (e.g. through InstantCommons), but that don't know that new format yet. Even if we changed BitmapHandler and Exif::formatData to understand both the old and the new format, there would still be old installations out there not having that change. See http://www.mediawiki.org/wiki/Special:Code/MediaWiki/49677#c2475

With the current format, I don't see a way to correctly handle the more esoteric array-valued Exif fields such as the white point or YCbCrSubSampling. But at least we could fix it for the most visible cases: the GPS attributes. Incidentally, I've only seen complaints about GPS data not being handled correctly, but so far I've not seen a single complaint about missing white point data.

overlordq wrote:

Looking at it again is in my plans, but relatively busy IRL atm so it's off to the side. It's been a while since I looked at it, but the main problem is parsing the data from the files, once that's done it shouldn't be too hard to shoehorn it into the old serialized format, I just took the past of least resistance.

overlordq wrote:

*** Bug 24423 has been marked as a duplicate of this bug. ***

overlordq wrote:

*** Bug 24888 has been marked as a duplicate of this bug. ***

bugzilla wrote:

Having upgraded mediaWiki to 1.16.0, I still find that GPS Latitude and Longitude simply are NOT displayed for an image from a Nikon D300.

What am I missing here?

(In reply to comment #34)

Having upgraded mediaWiki to 1.16.0, I still find that GPS Latitude and
Longitude simply are NOT displayed for an image from a Nikon D300.

What am I missing here?

The bug isn't fixed yet. (I have done work towards that in a branch, so if you used the code in the branch it would work, but that code is still experimental/not done yet)

(In reply to comment #35)

(In reply to comment #34)

Having upgraded mediaWiki to 1.16.0, I still find that GPS Latitude and
Longitude simply are NOT displayed for an image from a Nikon D300.

What am I missing here?

The bug isn't fixed yet. (I have done work towards that in a branch, so if you
used the code in the branch it would work, but that code is still
experimental/not done yet)

My branch is now merged into trunk. Assuming no one finds any major issues with it, fixed as of r86169.