
ContentHandler should expose the content-model to search engines.
Closed, Resolved · Public

Description

To allow searches to be restricted by content model (e.g. to javascript pages, or to geo-shapes), ContentHandler::getFieldsForSearchIndex() and getDataForSearchIndex() should expose the content model name to the search engine. A keyword like "contentmodel" or "pagetype" could be used to filter by it.
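As a rough illustration of the idea, the sketch below models a field-declaration hook and a data hook that both carry the content model. It is written in Python rather than MediaWiki's PHP, and every name in it (`Page`, `get_fields_for_search_index`, `get_data_for_search_index`, the `content_model` field name, the `"keyword"` type label) is a hypothetical stand-in, not the actual ContentHandler API or the merged patch.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Page:
    """Hypothetical stand-in for a wiki page; not a MediaWiki class."""
    title: str
    content_model: str
    text: str


def get_fields_for_search_index() -> Dict[str, Dict[str, str]]:
    """Declare the fields the search engine should index.

    The content model is declared as an exact-match field (assumed here to
    be labelled "keyword") so that it can be used as a filter.
    """
    return {
        "text": {"type": "text"},
        "content_model": {"type": "keyword"},
    }


def get_data_for_search_index(page: Page) -> Dict[str, str]:
    """Build the per-page document, including the content model name."""
    return {
        "text": page.text,
        "content_model": page.content_model,
    }


# Usage: the indexed document for a JavaScript page carries its model name.
page = Page(title="MediaWiki:Common.js", content_model="javascript", text="/* js */")
document = get_data_for_search_index(page)
```

With the model name stored as its own exact-match field, a search keyword only needs to translate into a filter on that field.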

QUESTION: should we expose the internal distinction between content model and media type, between uploaded media and page content? Or should we rather hide that distinction?

Event Timeline

daniel created this task. Jan 26 2017, 2:31 PM
Restricted Application added projects: Discovery, Discovery-Search. Jan 26 2017, 2:31 PM
daniel updated the task description. Jan 26 2017, 2:35 PM
daniel removed Yurik as the assignee of this task. Jan 26 2017, 6:56 PM
daniel added a subscriber: Yurik.

Change 334412 had a related patch set uploaded (by Smalyshev):
Add content model indexing

https://gerrit.wikimedia.org/r/334412

Once pages can have multiple content objects, we will probably have to put an array there instead of a single item, but ES supports that.
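To make the array point concrete: in Elasticsearch any field may hold one value or several, and an exact-match term filter matches if any of the values matches. The sketch below imitates that behaviour in plain Python; `field_values` and `term_matches` are hypothetical helpers for illustration, not part of any Elasticsearch client.

```python
from typing import Any, Dict, List


def field_values(doc: Dict[str, Any], field: str) -> List[Any]:
    """Return the field's values as a list, whether it holds one item or many."""
    value = doc.get(field)
    if value is None:
        return []
    if isinstance(value, (list, tuple)):
        return list(value)
    return [value]


def term_matches(doc: Dict[str, Any], field: str, term: str) -> bool:
    """Exact-match filter: true if any value of the field equals the term."""
    return term in field_values(doc, field)


# A page with a single content model, and a hypothetical multi-content page.
single = {"content_model": "javascript"}
multi = {"content_model": ["wikitext", "json"]}
```

The same filter works on both documents, so moving from a single item to an array would not require a different query shape.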

debt assigned this task to Smalyshev.

Change 334412 merged by jenkins-bot:
Add content model indexing

https://gerrit.wikimedia.org/r/334412

Yay for the patch!

@Smalyshev does this automatically introduce a keyword? Can I use content_model:javascript in the search box now?

Re multiple content objects (MCR): it's probably sufficient to make the model of the main content object searchable. Other slots tend to be auxiliary, like license info, categories, quality assessment, etc. We should index the content of all the slots, but the model of the other slots isn't so interesting.

zhuyifei1999 moved this task from Incoming to Backlog on the Commons board. Jan 27 2017, 4:35 PM

No, keywords are handled in different code (CirrusSearch, not core). I can add a patch for that too.
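For illustration only, keyword handling of this kind amounts to pulling a `contentmodel:<name>` token out of the query string and turning it into a filter on the indexed field. The Python sketch below shows that split; the regular expression and the `split_content_model` helper are assumptions made for this example and do not reflect how CirrusSearch actually parses queries.

```python
import re
from typing import Optional, Tuple

# Matches "contentmodel:<name>" only at the start of a whitespace-separated token.
KEYWORD_RE = re.compile(r"(?<!\S)contentmodel:(\S+)")


def split_content_model(query: str) -> Tuple[Optional[str], str]:
    """Return (content model or None, remaining free-text query)."""
    match = KEYWORD_RE.search(query)
    if match is None:
        return None, " ".join(query.split())
    remaining = query[:match.start()] + query[match.end():]
    # Collapse the whitespace left behind by the removed keyword.
    return match.group(1), " ".join(remaining.split())


# Usage: the model name becomes a filter, the rest stays a text query.
model, text_query = split_content_model("contentmodel:javascript importScript")
```

The extracted model name would then be applied as an exact-match filter on the content model field, while the remaining text is searched as usual.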

As for indexing secondary slots, I think we need to have a bigger discussion on how MCR will be indexed. Current indexing assumes each field has only one instance, but that may not be true for MCR. Let's have a dedicated task and discuss it.

Smalyshev triaged this task as Normal priority. Jan 27 2017, 8:58 PM

Also, I think you'd need a reindex for this to actually work (it should probably be coordinated with the other reindexes via T147505).

Change 334715 had a related patch set uploaded (by Smalyshev):
Add contentmodel: query feature

https://gerrit.wikimedia.org/r/334715

daniel moved this task from Inbox to Push on the User-Daniel board.

Change 334715 merged by jenkins-bot:
Add contentmodel: query feature

https://gerrit.wikimedia.org/r/334715

Deskana closed this task as Resolved. Feb 2 2017, 12:34 AM
Deskana added a subscriber: Deskana.

Neat!