Expose EXIF data to search engine
Open, Medium, Public

Description

File type search has been warmly welcomed by the communities. Perhaps there's more metadata we could expose to CirrusSearch to improve multimedia search.

EXIF data offers additional potential fields. The most obvious are camera model/type, focal length, aperture, ISO, and orientation.

http://www.sno.phy.queensu.ca/~phil/exiftool/TagNames/EXIF.html

Event Timeline

Restricted Application added a subscriber: Aklapper. · Nov 15 2016, 10:09 PM
Restricted Application added a project: Discovery-Search. · Nov 15 2016, 10:09 PM

@Smalyshev could you provide some feedback on the feasibility of this task?

I'd start with figuring out which fields we want. Once we know that, it's not too hard to implement, the same way as the file info fields we just did, but it would require reindexing again. The infrastructure for it is there, so it shouldn't be too hard IMHO.

That is, if we want to take the route of representing it via ElasticSearch fields and not structured data fields for the in-progress Structured Commons project. Both are possible. We could even do both, but that sounds like duplicating the effort.
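To make the ElasticSearch-fields route a bit more concrete, here is a minimal sketch of what such a mapping could look like for the fields named in the task description. The field names (`exif_camera_model` and so on) and both helper functions are hypothetical, not CirrusSearch's actual schema, and the type names are those of current Elasticsearch releases rather than whatever version is deployed.

```python
from typing import Any, Dict

# Hypothetical field names; not the actual CirrusSearch schema.
EXIF_FIELD_TYPES: Dict[str, str] = {
    "exif_camera_model": "keyword",   # exact-match text
    "exif_focal_length": "float",     # millimetres
    "exif_aperture": "float",         # f-number
    "exif_iso": "integer",
    "exif_orientation": "short",      # EXIF orientation code
}


def build_exif_mapping() -> Dict[str, Any]:
    """Return an Elasticsearch-style mapping fragment for the EXIF fields."""
    return {
        "properties": {
            name: {"type": es_type} for name, es_type in EXIF_FIELD_TYPES.items()
        }
    }


def to_search_doc(exif: Dict[str, Any]) -> Dict[str, Any]:
    """Keep only mapped fields that have a value, dropping everything else."""
    return {
        name: value
        for name, value in exif.items()
        if name in EXIF_FIELD_TYPES and value is not None
    }
```

Adding or changing any of these fields would still mean reindexing, as noted above.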

I think we need to get some feedback from the community on this: which ones are important.

There is also something we need to be aware of: the EXIF code in PHP is not too great, and applying it to random user-provided images may expose some security issues. I'm not sure what the status of this is in HHVM, or whether it's even the same code; this needs to be checked. But it's something we definitely need to be aware of when implementing this.

Deskana triaged this task as Medium priority. · Nov 17 2016, 11:15 PM
Deskana added a subscriber: Deskana.

As @Smalyshev notes, the next action here is to decide which fields from the EXIF data we want. Thoughts?

<tongueInCheek>
Hmm. Maybe if we had someone who could liaise with the community on Commons? They could post a question to the community there and see if there's any strong support for indexing particular fields? We could call this person a populace interfacer.

Ok, maybe we need to work on the name a bit. :)
</tongueInCheek>

In all seriousness, if it sounds agreeable to you both, I'd be happy to write something up to share with Commons about what they might wish to include.

Thanks! That'd be great. They'd know better than us what would be useful.

I just posted a question to Commons about the usefulness of this feature and to spur discussion:

https://commons.wikimedia.org/wiki/Commons:Village_pump#Exif_metadata_and_search

Small update: we had a few responses on the Commons Village pump. Summarizing:

  • There are concerns over some of the data stored in EXIF being messy - text vs numbers, duplicate and possibly contradicting information in fields, etc. However, folks do seem to want to expose some of this information to search, regardless of it not being perfect.
  • Some use cases:
    • Finding files with unexpected copyright statements in description fields
    • Incorrect color profiles (metadata intended for print and not on-screen display)
    • Dates (such as the date the photo was taken) not matching what was reported when uploading to Commons
    • Discovering files with odd EXIF elements indicating that the files may have other compressed files hidden within them (perhaps to hide a file within a file for illicit sharing purposes)
  • We should also consider XMP and IPTC data. ExifTool (already used by MediaWiki) provides an excellent means to extract this data from most media file types.
  • Some tags make sense as a number rather than as text (e.g. exposure length, aperture, ISO, exposure compensation).
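As an illustration of the "number rather than text" point and the concern about messy values, here is a rough sketch of normalizing raw EXIF values into floats. The function name `exif_to_number` is hypothetical, and the sample inputs are assumptions about what messy values might look like, not data from Commons. It should only be applied to tags that are expected to be numeric, since it will happily pull a digit out of any text.

```python
import re
from typing import Any, Optional

# First integer, decimal, or rational ("1/250") found in the text.
_NUMBER = re.compile(r"[-+]?\d+(?:\.\d+)?(?:/\d+(?:\.\d+)?)?")


def exif_to_number(raw: Any) -> Optional[float]:
    """Best-effort conversion of a raw EXIF value to a float, or None."""
    if isinstance(raw, bool):
        return None
    if isinstance(raw, (int, float)):
        return float(raw)
    if not isinstance(raw, str):
        return None
    match = _NUMBER.search(raw)
    if match is None:
        return None
    token = match.group(0)
    if "/" in token:
        numerator, denominator = token.split("/")
        if float(denominator) == 0:
            return None  # malformed rational such as "1/0"
        return float(numerator) / float(denominator)
    return float(token)
```

Values that cannot be parsed come back as `None`, so an indexer could leave the numeric field empty instead of failing on the whole file.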

Another recommendation was to perhaps create something like Special:EXIF that "indexes and collates all fields in an easy way to navigate".

Thanks for gathering the comments, Chris! I want to unpack a few of these comments so I'm sure I understand them.

help greatly with finding files with unexpected copyright statements in description fields

I don't understand this one. Is this about trying to detect uploaders putting copyright statements in the EXIF data of their photos that differ from the data on-wiki? Do photographers really put copyright statements in EXIF data? I'm not a photographer, so I don't know.

Incorrect color profiles (metadata intended for print and not on-screen display)

Sounds reasonable.

dates (such as date photo was taken) not matching what was reported when uploading to Commons

That's an interesting one. Sounds reasonable.
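For what it's worth, a check like this could be sketched as follows, assuming the EXIF date arrives in the standard `YYYY:MM:DD HH:MM:SS` form. The function name `dates_disagree` and the one-day tolerance are hypothetical choices for illustration only.

```python
from datetime import date, datetime
from typing import Optional

EXIF_DATETIME_FORMAT = "%Y:%m:%d %H:%M:%S"


def dates_disagree(
    exif_datetime: str, reported: date, tolerance_days: int = 1
) -> Optional[bool]:
    """Compare the EXIF capture date with the date reported at upload.

    Returns None when the EXIF value cannot be parsed (a common case
    with messy metadata), otherwise True if the dates differ by more
    than the tolerance.
    """
    try:
        taken = datetime.strptime(exif_datetime.strip(), EXIF_DATETIME_FORMAT).date()
    except ValueError:
        return None
    return abs((taken - reported).days) > tolerance_days
```

The tolerance is there because camera clocks and time zones often put the EXIF date a day off from what the uploader reports.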

discovering files with odd EXIF elements showing the files may have other compressed files hidden within it (to perhaps hide a file within a file for illicit sharing purposes)

A security feature. Interesting! There may be better ways to solve this use-case, such as checking this data at upload time. Perhaps that already happens. Regardless, that may be useful.

It sounds like it'd be very useful to expose EXIF data with some structure on, e.g., a special page, but I'm not sure yet how it fits the search story.

Raymond rescinded a token.
Raymond added a subscriber: Raymond.
CKoerner_WMF added a subscriber: CKoerner_WMF.