
[Story] Expose image meta-data to the SearchEngine
Closed, Resolved, Public

Description

It would be useful to be able to search files by meta-data. For instance:

  • size:<200kb to find files smaller than 200kb
  • type:video to find only videos
  • mime:image/png to find only png files
  • resolution:>800 to find files with sqrt(width*height) >800
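
As a rough illustration of the resolution criterion in the last item: the value compared against the threshold is the geometric mean of the two pixel dimensions. The function names below are hypothetical and not part of MediaWiki, and rounding down to an integer is an assumption; this is only a sketch of the arithmetic.

```php
<?php
// Hypothetical helper, not a MediaWiki API: resolution as defined in the
// list above, i.e. sqrt(width * height), rounded down to whole pixels.
function fileResolution( int $width, int $height ): int {
    return (int)floor( sqrt( $width * $height ) );
}

// Hypothetical predicate for a query such as "resolution:>800":
// true only when the resolution is strictly greater than the threshold.
function matchesMinResolution( int $width, int $height, int $threshold ): bool {
    return fileResolution( $width, $height ) > $threshold;
}
```

Under this definition a 1024x768 image would match resolution:>800, while a 640x480 image would not.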

In order to achieve this, additional fields should be exposed to the SearchEngine, based on information provided by the File object and MediaHandler associated with a page in the file namespace.

Note: Indexing files is currently bound to the WikitextContentHandler. It would be nice to have a way to define search index fields independently of the content model, perhaps based on page type (article, category, image) or namespace.

For now, it would be sufficient to add the desired information in WikitextContentHandler::getDataForSearchIndex and getFieldsForSearchIndex, in the same way the file_text field is defined and populated.
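
A minimal sketch of what populating such fields could look like. The FileInfo interface, the addFileSearchData function and the field names (file_size, file_mime and so on) are assumptions made for illustration and are not the names CirrusSearch ends up using. Only the getter names mirror MediaWiki's File class, and in a real patch the logic would live in getDataForSearchIndex.

```php
<?php
// Stand-in for the subset of MediaWiki's File class used here. The interface
// itself is hypothetical and exists only to keep the sketch self-contained.
interface FileInfo {
    public function getSize();
    public function getMediaType();
    public function getMimeType();
    public function getWidth();
    public function getHeight();
}

// Hypothetical helper: merges file metadata into the per-page search data.
// Existing entries (such as text_bytes) are left untouched.
function addFileSearchData( array $fieldData, FileInfo $file ): array {
    $fieldData['file_size'] = $file->getSize();
    $fieldData['file_media_type'] = $file->getMediaType();
    $fieldData['file_mime'] = $file->getMimeType();
    $fieldData['file_width'] = $file->getWidth();
    $fieldData['file_height'] = $file->getHeight();
    // Resolution as proposed in the description: sqrt(width * height).
    $fieldData['file_resolution'] =
        (int)floor( sqrt( $file->getWidth() * $file->getHeight() ) );
    return $fieldData;
}
```

Keeping the extraction in one small function would also make it easy to move into a dedicated file handler later, should the file-specific code be split out of the wikitext handler.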

See also: T101089: [GTWL] Epic: Search for images by colour, size and format

Event Timeline

daniel created this task. Aug 31 2016, 9:45 PM
Restricted Application added a subscriber: Aklapper.
Restricted Application added subscribers: Poyekhali, Matanya. Aug 31 2016, 9:51 PM

I think we already have size, but the rest is probably not indexed.

I think we already have size

I don't see size in WikitextContentHandler. Where does it come from?

matmarex added a subscriber: matmarex.

All of the metadata you mentioned is already exposed in the File class (getSize(), getMediaType(), getMimeType(), getWidth(), getHeight()). All of it is also in the 'image' table (img_size, img_media_type, img_major_mime/img_minor_mime, img_width, img_height). I don't think there's anything to do on the multimedia side here.

Tgr added a subscriber: Tgr. Aug 31 2016, 11:13 PM

Duplicate of T15370?

I don't see size in WikitextContentHandler. Where does it come from?

From ContentHandler:

$fieldData['text_bytes'] = $content->getSize();

I am assuming it's the same size? (need to check)

Duplicate of T15370

Possibly :)

Tgr added a comment. Sep 1 2016, 12:09 AM

Also sort-of duplicate: T78490: Image search by file size

ContentHandler probably returns the size of the image description page; file handling happens completely outside of it.

Yeah looks like text_bytes is not what we want, see e.g.: https://en.wikipedia.org/wiki/File:ZZ_Top_Stages.jpg?action=cirrusdump

I'm thinking maybe it's time to think about a separate content handler that handles files...

daniel added a comment. Sep 1 2016, 9:57 AM

I don't think there's anything to do on the multimedia side here.

I agree. I added the file-handling tag as a heads-up.

I'm thinking maybe it's time to think about a separate content handler that handles files...

Would be nice, but files are not "Content" in the MediaWiki sense. Once we have MCR, we could make them Content, but that will be a while.

Hm... maybe we can define FileContent and FileContentHandler, but not use them for any pages directly for now. WikitextContent/Handler would create and use them internally when on a file page.

maybe we can define FileContent and FileContentHandler, but not use them for any pages directly for now. WikitextContent/Handler would create and use them internally when on a file page.

I was just thinking there would be a bunch of file-specific code that is not really in a good place in WikiTextHandler as it has nothing to do with wikitext. So yes, I think this could be a good way.

debt triaged this task as Normal priority. Sep 1 2016, 10:08 PM
debt moved this task from needs triage to Up Next on the Discovery-Search board.
debt added a subscriber: debt.

Anyone (mostly within Discovery) can add in keyword based filters for this type of search filtering if that is what is needed.

Note: this would require a full reindex, since existing pages have to be re-parsed to pick up the new data.

Smalyshev renamed this task from expose image meta-data to the SearchEngine to [Story] Expose image meta-data to the SearchEngine. Sep 20 2016, 7:09 PM
Smalyshev moved this task from Needs triage to Search on the Discovery board.

We're just waiting for a reindex on Commons to finish this up.

dcausse added a subscriber: dcausse.

The file index for commonswiki has been reindexed and the new search keywords can now be used on commonswiki: e.g. file:filemime:image filesize:>1000000.

Deskana closed this task as Resolved. Nov 14 2016, 7:08 PM

Yay! I've added this to the upcoming weekly update for Discovery.

The documentation for what can be searched is here: https://www.mediawiki.org/wiki/Help:CirrusSearch#File_properties_search