
Allow ContentHandler to expose structured data to the search engine.
Closed, Resolved, Public

Description

Status approved by Tim Starling in 2015-03-04 IRC RFC discussion, requesting some followup (see task comments).

It would be useful to give the ContentHandler (or the Content object, respectively) for a specific content model the ability to control what fields are exposed to the search engine index. For wikitext, this would be the main "text" field, but could also include "html" for rendered HTML, "links" for outgoing links, etc. For Wikibase entities, this would include fields like "label" (which would be a multi-lingual field), "alias", "description", "sitelink", "property-value", etc.

Currently, Content::getTextForSearchIndex() exposes flat text to the search index, assuming word based full text indexing is applied.

This RFC proposes to add the following:

Content::getFieldsForSearchIndex(): This would return an associative array mapping field names to index values. The type and structure of the index value must correspond to the type of the field as declared by getSearchIndexFieldDefinitions, see below. At least the "text" field should be returned. It would be populated by calling the old getTextForSearchIndex() method.

ContentHandler::getSearchIndexFieldDefinitions(): This returns a list of SearchIndexFieldDefinition objects, representing the fields that Content::getFieldsForSearchIndex() may return for the handler's content model. This information should be used by the search engine when defining indexes. TextContent would implement this to return a definition for the "text" field, defining it to be plain text eligible for word-based full text indexing.

ContentHandler::getAllSearchIndexFieldDefinitions(): Static method that calls getSearchIndexFieldDefinitions() on all registered content handlers, and combines the results. If two content handlers declare the same field with different types, an exception is thrown.

class SearchIndexFieldDefinition: This is a value object with the following methods:

  • getName(): returns the field name. Fields with the same name may be used by different content models, but they must have the same declaration.
  • getIndexType(): returns the index type (see below)
  • isMultiValue(): returns true if the field is a list of values of the said type.
  • getWeight(): (between 0 and 99? or between 0 and 1? Or between 0 and PHP_INT_MAX?) (wouldn't recommend PHP_INT_MAX, since it can vary by build and may change in the future).

Index types:

  • INDEX_TYPE_TEXT: String. Allow (word based) full text search if possible.
  • INDEX_TYPE_MULTILINGUAL: Associative array of language code mapping to INDEX_TYPE_TEXT values. Allow (word based) full text search if possible.
  • INDEX_TYPE_IDENTIFIER: String. Allow prefix matches if possible.
  • INDEX_TYPE_QUANTITY: Signed float. Allow range queries if possible.
  • INDEX_TYPE_GEOPOINT: A pair of longitude and latitude, represented as floats. Allow special queries if feasible.
  • INDEX_TYPE_DATETIME: A timestamp (in a format wfTimestamp understands). Allow range queries if possible.

Search engines may ignore fields that have unsupported types, or may treat values of such types as plain strings or text.

So, for text, TextContentHandler::getSearchIndexFieldDefinitions() would return

array(
  new SearchIndexFieldDefinition( 'text', INDEX_TYPE_TEXT )
)

And TextContent::getFieldsForSearchIndex() would return

array(
  'text' => $this->getTextForSearchIndex()
)

(Alternatively, getTextForSearchIndex() would call getFieldsForSearchIndex())
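The proposal above can be sketched in plain PHP. This is a minimal illustration under stated assumptions, not MediaWiki code: the class is a simplified stand-in for the proposed value object, the string values of the `INDEX_TYPE_*` constants and the constructor signature are invented for the example, and the free function `combineFieldDefinitions()` is a hypothetical stand-in for what the static getAllSearchIndexFieldDefinitions() would do after collecting definitions from each handler.

```php
<?php
// Simplified stand-in for the value object proposed in this RFC.
// Constant values and the constructor signature are assumptions.
final class SearchIndexFieldDefinition {
    const INDEX_TYPE_TEXT = 'text';
    const INDEX_TYPE_IDENTIFIER = 'identifier';

    private $name;
    private $indexType;
    private $multiValue;

    public function __construct( $name, $indexType, $multiValue = false ) {
        $this->name = $name;
        $this->indexType = $indexType;
        $this->multiValue = $multiValue;
    }

    public function getName() {
        return $this->name;
    }

    public function getIndexType() {
        return $this->indexType;
    }

    public function isMultiValue() {
        return $this->multiValue;
    }
}

/**
 * Hypothetical helper: merge the definitions declared by several handlers.
 * Fields sharing a name must share a declaration, otherwise we throw.
 *
 * @param SearchIndexFieldDefinition[][] $definitionsPerHandler
 * @return SearchIndexFieldDefinition[] keyed by field name
 */
function combineFieldDefinitions( array $definitionsPerHandler ) {
    $combined = [];
    foreach ( $definitionsPerHandler as $definitions ) {
        foreach ( $definitions as $def ) {
            $name = $def->getName();
            if ( !isset( $combined[$name] ) ) {
                $combined[$name] = $def;
                continue;
            }
            $known = $combined[$name];
            if ( $known->getIndexType() !== $def->getIndexType()
                || $known->isMultiValue() !== $def->isMultiValue()
            ) {
                throw new RuntimeException( "Conflicting declarations for field '$name'" );
            }
        }
    }
    return $combined;
}

// Two handlers that agree on "text"; the second also declares "label".
$textHandlerFields = [
    new SearchIndexFieldDefinition( 'text', SearchIndexFieldDefinition::INDEX_TYPE_TEXT ),
];
$entityHandlerFields = [
    new SearchIndexFieldDefinition( 'text', SearchIndexFieldDefinition::INDEX_TYPE_TEXT ),
    new SearchIndexFieldDefinition( 'label', SearchIndexFieldDefinition::INDEX_TYPE_IDENTIFIER ),
];
$all = combineFieldDefinitions( [ $textHandlerFields, $entityHandlerFields ] );
```

Comparing both the index type and the multi-value flag is one reading of "they must have the same declaration"; a real implementation might compare more attributes.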

TBD: In the next step, SearchEngine should be modified to make use of the new information. In particular, SearchEngine::getTextFromContent should be deprecated, and replaced by a getFieldsFromContent method.

TBD: To make full use of having multiple fields indexed for search, these fields should be accessible in the SearchResult. This ties in with Brion's proposal for SearchResult::getMetadata() T78011.

TBD: We may want to expose a "widget type" or "data type" that can be used to pick formatters or widgets for showing or inputting values for a field. These types would be related to, but distinct from, the index types.

Event Timeline


The RFC was approved.
Notes from the IRC meeting:

  • spagewmf and others suggest the rfc should say how it would change wikitext processing (DanielK_WMDE, 21:53:27)
  • manybubbles suggests the rfc should list examples of fields that could be exposed for different kinds of content (DanielK_WMDE, 21:53:52)
  • RFC approved by manybubbles, please continue with design/implementation (TimStarling, 22:00:58)
  • gwicke would prefer a metadata schema over a procedural interface (gwicke, 22:02:02)
  • propose a JSON binding for the field metadata (DanielK_WMDE, 22:04:39)

Afterthought: SearchIndexFieldDefinition should also declare in what context a given field is intended to be used. The following types of usage should be possible:

  • use for type-ahead prefix search, i.e. should pop up as a suggestion when typing in the top-right search box
  • use in default full text search, on Special:Search
  • advanced search, triggered e.g. by using a special syntax, like incategory:Foo or linksto:acme.com (TBD: how would the syntax be defined and localized?)
  • internal only, not exposed via the UI
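One way to declare the usage contexts listed above is a bitmask of flags per field. The sketch below is only an illustration: the `FieldUsage` class, its constant names, the helper `fieldsForContext()`, and the sample field names are all hypothetical and do not come from the RFC.

```php
<?php
// Hypothetical flags, one per usage context from the list above.
final class FieldUsage {
    const PREFIX_SEARCH = 1; // type-ahead suggestions in the search box
    const FULL_TEXT = 2;     // default full text search on Special:Search
    const ADVANCED = 4;      // special syntax such as incategory:Foo
    const INTERNAL = 8;      // indexed, but not exposed via the UI
}

/**
 * Return the names of all fields that are flagged for a given context.
 *
 * @param int[] $fieldUsage map of field name => bitmask of FieldUsage flags
 * @param int $context a single FieldUsage flag
 * @return string[]
 */
function fieldsForContext( array $fieldUsage, $context ) {
    $names = [];
    foreach ( $fieldUsage as $name => $flags ) {
        if ( ( $flags & $context ) !== 0 ) {
            $names[] = $name;
        }
    }
    return $names;
}

// Sample declarations (field names are made up for the example).
$usage = [
    'label' => FieldUsage::PREFIX_SEARCH | FieldUsage::FULL_TEXT,
    'text' => FieldUsage::FULL_TEXT,
    'category' => FieldUsage::ADVANCED,
    'statement_count' => FieldUsage::INTERNAL,
];
$suggestFields = fieldsForContext( $usage, FieldUsage::PREFIX_SEARCH );
$fullTextFields = fieldsForContext( $usage, FieldUsage::FULL_TEXT );
```

A bitmask lets one field serve several contexts at once, which a single enumerated "usage" value would not.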

More thoughts:

There should be a field exposing rendered HTML, provided per default

This may be expensive to generate, though. So perhaps, instead of getFieldsForSearchIndex(), we should have a method called getFieldForSearchIndex( $name ). That way, the search engine could ask just for the fields it is interested in.
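A per-field accessor like this could compute each value lazily and cache it, so that the expensive rendering only happens when a search engine asks for it. The following is a rough sketch, assuming a stand-in class (`LazyFieldContent` is hypothetical, not the MediaWiki Content class) and a trivial placeholder in place of real HTML rendering.

```php
<?php
// Hypothetical stand-in for a Content object; not the MediaWiki class.
class LazyFieldContent {
    /** @var callable[] field name => callback producing the value */
    private $providers;
    /** @var array values computed so far */
    private $cache = [];
    /** @var int counts how often the "expensive" rendering ran */
    public $renderCount = 0;

    public function __construct( $wikitext ) {
        $this->providers = [
            'text' => function () use ( $wikitext ) {
                return $wikitext;
            },
            'html' => function () use ( $wikitext ) {
                // Placeholder for real (expensive) parsing and rendering.
                $this->renderCount++;
                return '<p>' . htmlspecialchars( $wikitext ) . '</p>';
            },
        ];
    }

    /**
     * @param string $name
     * @return mixed|null the field value, or null for unknown fields
     */
    public function getFieldForSearchIndex( $name ) {
        if ( !isset( $this->providers[$name] ) ) {
            return null;
        }
        if ( !array_key_exists( $name, $this->cache ) ) {
            $this->cache[$name] = call_user_func( $this->providers[$name] );
        }
        return $this->cache[$name];
    }
}

$content = new LazyFieldContent( 'Hello & welcome' );
$text = $content->getFieldForSearchIndex( 'text' );
$rendersAfterText = $content->renderCount; // still 0: HTML was never requested
$html = $content->getFieldForSearchIndex( 'html' );
$htmlAgain = $content->getFieldForSearchIndex( 'html' ); // served from the cache
```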

In the API, search index fields should be exposed as pageprops. Field info should be available via the siteinfo module.

getFieldsForSearchIndex and getSearchIndexFieldDefinitions should call hooks to allow extensions to provide additional fields for any content model.

Use case: the Geodata extension could expose geo-coordinates scraped from wikitext.
Use case: Wikibase could expose a list of data items used in wikitext.

Spage updated the task description.
Spage edited projects, added MediaWiki-ContentHandler; removed TechCom.

Example for exposing structured data in search: media files on commons.

Core could expose some meta-data like the file type, size and resolution as separate search fields (this is in high demand).
The CommonsMetaData extension could expose more metadata, like the license or creator.

> In the API, search index fields should be exposed as pageprops. Field info should be available via the siteinfo module.

including field info in the siteinfo module sounds good

regarding pageprops, do you mean include all field_name (e.g. source_text) => value ("{{Infobox kerk | naam = Dorpskerk | afbeelding = Dorpskerk-Bloemendaal.JPG | onderschrift = ....") that are indexed by search in page props? don't think that's a good idea at all or maybe I don't understand...

> More thoughts:
>
> There should be a field exposing rendered HTML, provided per default

This is already generated in Cirrus, but not indexed as-is.

then the tags are stripped and "auxiliary" content (e.g. captions, infobox, authority control template, etc. contents) are indexed / weighted separately.

> This may be expensive to generate, though.

we already generate this :) suppose it's a question if we really want to store and index the full html (without stripping tags), in addition to the version with stripped tags. For elastic, stripped is better unless we want an additional mode (like insource: for wikitext) for full html.

> So perhaps, instead of getFieldsForSearchIndex(), we should have a method called getFieldForSearchIndex( $name ). That way, the search engine could ask just for the fields it is interested in.

not sure about this. also, would be nice to have some standardization of field names when/where it makes sense for things that can be handled in a common way in search, across multiple content models.

> regarding pageprops, do you mean include all field_name (e.g. source_text) => value ("{{Infobox kerk | naam = Dorpskerk | afbeelding = Dorpskerk-Bloemendaal.JPG | onderschrift = ....") that are indexed by search in page props? don't think that's a good idea at all or maybe I don't understand...

Not sure yet how exactly, but I think the fields that get exposed to the search engine externally should also be accessible via the API. Maybe by asking for specific fields individually, using searchfield-foo|searchfield-source_text|...

>> More thoughts:
>>
>> There should be a field exposing rendered HTML, provided per default
>
> This is already generated in Cirrus, but not indexed as-is.
>
> then the tags are stripped and "auxiliary" content (e.g. captions, infobox, authority control template, etc. contents) are indexed / weighted separately.

I think some of these should be exposed explicitly via the ContentHandler, instead of scraping them from HTML. For some, we may still want scraping, or stripping, at the search engine's discretion.

>> So perhaps, instead of getFieldsForSearchIndex(), we should have a method called getFieldForSearchIndex( $name ). That way, the search engine could ask just for the fields it is interested in.
>
> not sure about this. also, would be nice to have some standardization of field names when/where it makes sense for things that can be handled in a common way in search, across multiple content models.

Yes, absolutely, there should be some "well known" field names for things like wikitext, html, sections, categories, etc.

Congratulations! This is one of the 52 proposals that made it through the first deadline of the Wikimedia-Developer-Summit-2016 selection process. Please pay attention to the next one: > By 6 Nov 2015, all Summit proposals must have active discussions and a Summit plan documented in the description. Proposals not reaching this critical mass can continue at their own path out of the Summit.

This task is still missing clear Summit goals, topics that could be discussed here before, and active discussion. It would be good to sort out these problems before the next deadline on November 6.

@Qgil I have discussed this RFC with Aude, Stas and David quite a bit over the last couple of days, in the context of improving search for wikidata. Some notes can be found at https://etherpad.wikimedia.org/p/Wikidata_Meeting_Berlin_10262015 (but mixed in with notes from other sessions). I'll try to update the RFC soon with the relevant info from the notes.

Not sure if this is in the scope of the RfC/summit proposal, but one thing to think about is how to expose structured data in media files. The problem there is that files are currently not really considered content (as far as ContentHandler is aware a file page is just wikitext). At present CirrusSearch handles files in a custom (and poor) way via FileDataBuilder, but it would be nice to reuse the same interface.

Status update: After some discussion and experimentation, it seems that it's rather hard to define the field declarations in an engine-agnostic way. It's probably best to just ask the ContentHandler for the field declarations, and field data, for a specific search engine, such as elastic.

Also, it's a bit unclear whether the field declarations can also be used to generate option/filter fields on Special:Search, or to determine which fields should be used for completion searches (for the on-page navigational search box).

@Smalyshev listed this as a nominee for a "must have" in T119593, and it's currently categorized in T119029: WikiDev 16 working area: Content access and APIs, but doesn't seem to be on track as a must-have conversation. @daniel + others that are interested in this: can you make the case for this one in T119029?

@RobLa-WMF T117548 is related and is what i'm working on towards getting stuff implemented. as well, there are related issues like how to get the appropriate display text (e.g. labels for some content types) in places like the search results page or search suggestions.

i hope to avoid too many hacks in cirrus and wikibase, and would like to figure out how we could generalize this and integrate better search in core for multiple content types.

see T117520 regarding the more specific task of indexing Wikibase labels, descriptions and aliases in cirrus / elastic, and how to handle multilingual content.

Paste of my somewhat limited notes:

Katie (aude): I work for Wikimedia Germany on Wikidata. Our search isn't really very good, and I'm excited to talk to the search folks about how to make it better.
Katie: If you're on mobile and you search for "Barack", you get zero results. You need to search for Q76, which isn't very helpful!
Katie: There's multiple search APIs in core. This particular one is prefixsearch, which is why you need to know the title. Because of Wikidata IDs, that's not helpful for users.
Katie: Now, when I'm on desktop, and I search for "Obama", and it works, because it's using the terms table. If I'm faster than the Javascript, then it can take me to special search!
Katie: The search box at the top right on Wikipedia, that uses opensearch. We want to show the label instead of the title, which is hard. There's a lot of different places to fix this in core.
Deskana: So if you hit enter too fast, then it takes you directly to Special:Search?
Katie: Yes.
Katie: If you do a full-text search for "life", then the "life" item is not one of the top results.
Leila: Do you think it should show the exact match?
Tomasz: If there is an exact match, it will take you there.
Leila: Is it better to show the user the list, or take them to what you think is right?
Tomasz: Some users find value in the list.
Katie: There's some hidden features in Cirrus which let you see how search is indexing things. ?action=cirrusdump
Katie: Part of this is the way that Wikidata exposes data to Cirrus. With "half-life" the name in most languages is "half-life" so the text that's sent to search has the word "life" in it much more than the "life" item does. This affects the ranking.
Tomasz: How are you measuring your user satisfaction for people on search?
Katie: We're not really.
Tomasz: Is it just user feedback?
Katie: Yes, mostly.
DanielK: It's just broken.
Tomasz: In some cases yes.
Dan: Sometimes you pull a valve and one thing gets better and something else gets worse. Your small sets of searches that are obviously broken might get better, but it might make others worse. This is why broader metrics are useful.
JoshM: Yes, flattening the entire thing into a string is probably part of the problem. If it could be annotated then it might be better because you could weight bits differently.
Katie: We want to get the data into a structured format.
JoshM: There's a lot more autocompletion done in Wikidata. When you're filling in a statement, it can autocomplete the item name. That doesn't really happen on Wikipedia.
DanielK: The question is, do we need a separate index or is one enough?

[ At this point, the discussion turns into a working session in which Katie, Erik, and David and others begin talking about lots of technical details which the scribe is not good at taking notes for :-) ]

DanielK: What does "full text search" mean for Wikidata? There is no text.
Deskana: Perhaps we should just scrap fulltext on Wikidata? And make it the same as the typeahead?
Erik: Do you know what queries people are typing in to the search box?
Katie: No
Erik: We can show you how to pull that data out of Hadoop

David: There is a lot of refactoring that needs to happen in the Searcher class in Cirrus to make custom mappings possible.
Tomasz: We're going to need to do this for things like media and image search any way.
David: Yes, it's complicated.
David: We also need to take namespaces into account.
Katie: Yes, if you're searching for "Help:Contents" are you looking for the Wikidata item about that page, or the page itself?
DanielK: Need to indicate to the user that both exist somehow.
Stas: There's nothing in the title that can't also appear in the label.
Tomasz: Who's this meant to support? Individual users? Bots? These have different profiles and different needs.
Katie: Meant to replace the search box.
Tomasz: Sounds like individual users.
DanielK / Stas: Yes.
DanielK: Also want to support people searching for things like template parameters. But that doesn't work right now.
Deskana: Interesting, intitle: (advanced search syntax) works on Wikidata, in spite of the fact that it really shouldn't!
DanielK: That's odd, because the way it's working really doesn't search the title in any sense.
Deskana: Yeah, it shouldn't work, but it does!

DanielK: Page views isn't too helpful for Wikidata. Page rank might be more helpful. Or, page views of the pages linked on the individual Wikipedias?
Deskana: That's a loooot of work. But sounds interesting.
DanielK: Yes!
Stas: Perhaps we could use the page dumps?
Erik: Calculating it isn't that hard, the algorithm is well studied and there are lots of implementations. Getting your data into the graph format is hard.
DanielK: And keeping it up-to-date, because you need to recompute the whole thing.
David: Yes

Wikimedia Developer Summit 2016 ended two weeks ago. This task is still open. If the session in this task took place, please make sure 1) that the session Etherpad notes are linked from this task, 2) that followup tasks for any actions identified have been created and linked from this task, 3) to change the status of this task to "resolved". If this session did not take place, change the task status to "declined". If this task itself has become a well-defined action which is not finished yet, drag and drop this task into the "Work continues after Summit" column on the project workboard. Thank you for your help!

I've looked into how we map now in CirrusSearch and we have these broad types:

  • Date - a match for INDEX_TYPE_DATETIME
  • Integer - may be a match for INDEX_TYPE_QUANTITY, but I think we need separate types for floats and integers. ElasticSearch has support for both.
  • String - that would be INDEX_TYPE_TEXT.
  • Keyword string, which disables analysing, variant: case-folded keyword string. Probably INDEX_TYPE_IDENTIFIER?
  • Composite field - e.g. redirect is namespace+title, where namespace is long and title is string.
  • geo_point - obviously INDEX_TYPE_GEOPOINT.
  • We do not seem to have anything like INDEX_TYPE_MULTILINGUAL and it may be hard to do as analyzers would be different for different languages I imagine. We do have ability to have different subfields in ElasticSearch, but not sure it'd be OK with 800 subfields.

We also have a bunch of fields with custom configurations for ElasticSearch, such as analyzers, options, etc. Very frequently used are index options: https://www.elastic.co/guide/en/elasticsearch/reference/2.3/index-options.html

Also very frequently used are subfields with different analyzers than the main field. Many of these definitions are similar, but defined in ElasticSearch-specific way, so we need some way to define engine-specific options.
The base for it may be functions buildKeywordField(), buildLowercaseKeywordField(), buildLongField(), buildStringField() in MappingConfigBuilder.php.
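To make the correspondence above concrete, here is a rough sketch of translating the RFC's index types into Elasticsearch 2.x field mappings. It is not the MappingConfigBuilder code: the function name `buildElasticMapping()` and the string values of the constants are assumptions, and real CirrusSearch fields carry many more options (analyzers, index options, subfields).

```php
<?php
// Index type names come from the RFC; their string values are assumptions.
const INDEX_TYPE_TEXT = 'text';
const INDEX_TYPE_IDENTIFIER = 'identifier';
const INDEX_TYPE_QUANTITY = 'quantity';
const INDEX_TYPE_GEOPOINT = 'geopoint';
const INDEX_TYPE_DATETIME = 'datetime';

/**
 * Hypothetical translation of an index type into an Elasticsearch 2.x
 * field mapping. Unsupported types fall back to plain analyzed text,
 * as the RFC allows.
 *
 * @param string $indexType
 * @return array
 */
function buildElasticMapping( $indexType ) {
    switch ( $indexType ) {
        case INDEX_TYPE_TEXT:
            return [ 'type' => 'string' ];
        case INDEX_TYPE_IDENTIFIER:
            // Keyword string: stored as-is, no analysis.
            return [ 'type' => 'string', 'index' => 'not_analyzed' ];
        case INDEX_TYPE_QUANTITY:
            return [ 'type' => 'double' ];
        case INDEX_TYPE_GEOPOINT:
            return [ 'type' => 'geo_point' ];
        case INDEX_TYPE_DATETIME:
            return [ 'type' => 'date' ];
        default:
            return [ 'type' => 'string' ];
    }
}

$properties = [
    'title_key' => buildElasticMapping( INDEX_TYPE_IDENTIFIER ),
    'timestamp' => buildElasticMapping( INDEX_TYPE_DATETIME ),
    'mystery' => buildElasticMapping( 'some-future-type' ),
];
```

The field names in `$properties` are made up for the example. A separate integer type, as suggested above, would map to `long` in the same way.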

We also must ensure namespacing - extensions should not create fields with the same name as existing fields.

getWeight() may be too simplistic - at least for ElasticSearch, there are more tweaks to determine relevancy, and index-time boosting is officially called "bad idea" in the manual: https://www.elastic.co/guide/en/elasticsearch/reference/2.3/index-boost.html
I would consider weighting to be part of the query definition and drop it from the index field.

If we reverse it and ask for field data for a specific engine, this implies we have to have a lot of knowledge about the engine inside the specific extension. It looks like, for most cases, this knowledge is not strictly required: while a particular field definition can be very complex, fields do group into a number of large buckets with similar tweaks to them.

I think we need generic interfaces, such as FieldDefinitions and Field, and then maybe it's okay to have more specific implementations for specific search engines (e.g. ElasticSearch)

for multilingual, ElasticSearch supports nested fields (as done with coordinates) but those can be problematic, especially with large nestings.

sub 'fields' tends to be for indexing the same content with multiple analyzers (e.g. asciifolding, etc. and also with not analyzed or different analyzer).

For Wikibase, I am experimenting with unnested fields for multilingual content, though then we have hundreds of unnested fields and not sure there is some point where it's too many?

ElasticSearch also supports dynamic templates (though disabled for Cirrus). I think multilingual content might be a good use case for dynamic templates, so have 'label_*' and then a generic mapping for labels that applies to fields that match that pattern. If we want to use some of the language analyzers, then we could specifically define mapping for some of the languages and then use dynamic template for the rest.
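As a rough illustration of that idea, the sketch below builds an Elasticsearch 2.x type mapping as a PHP array: a dynamic template catches any `label_*` field, while a few languages get explicit mappings with a language analyzer. The function name `buildLabelMapping()` and the choice of the `standard` analyzer for the generic case are assumptions; `english` and `french` are built-in Elasticsearch analyzers.

```php
<?php
/**
 * Hypothetical builder for a label mapping.
 *
 * @param string[] $analyzersByLanguage language code => analyzer name
 * @return array mapping fragment to place under a document type
 */
function buildLabelMapping( array $analyzersByLanguage ) {
    $properties = [];
    foreach ( $analyzersByLanguage as $lang => $analyzer ) {
        // Explicitly mapped languages get their own analyzer.
        $properties["label_$lang"] = [
            'type' => 'string',
            'analyzer' => $analyzer,
        ];
    }

    return [
        // Any other label_* string field seen at index time gets this mapping.
        'dynamic_templates' => [
            [
                'labels' => [
                    'match' => 'label_*',
                    'match_mapping_type' => 'string',
                    'mapping' => [
                        'type' => 'string',
                        'analyzer' => 'standard',
                    ],
                ],
            ],
        ],
        'properties' => $properties,
    ];
}

$mapping = buildLabelMapping( [ 'en' => 'english', 'fr' => 'french' ] );
$json = json_encode( $mapping, JSON_PRETTY_PRINT );
```

Dynamic templates only apply to fields that are not already in the mapping, so the explicit `label_en` and `label_fr` definitions are used for those languages.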

agree about dropping getWeight

> for multilingual, ElasticSearch supports nested fields (as done with coordinates) but those can be problematic, especially with large nestings.

Exactly. I'm not sure hundred-item nested field would do well. What is the use case for multilingual fields? Maybe it's not as common and can be done via other fields.

> I think we need generic interfaces, such as FieldDefinitions and Field, and then maybe it's okay to have more specific implementations for specific search engines

Well, somebody will have to implement it. So if we need field for ElasticSearch, the knowledge about this should be either in CirrusSearch or in specific extension, which would then have to know about all search engines. I'd like to avoid that, at least as much as possible - i.e. if I have to give specific Cirrus-related hints in extension, ok - but at least I don't have to copy whole index mapping creation logic. If I can just give it a flag and have Cirrus translate a flag into full-blown config - that is preferable.

> For Wikibase, I am experimenting with unnested fields for multilingual content, though then we have hundreds of unnested fields and not sure there is some point where it's too many?

Since Elastic says nested field is separate document anyway (https://www.elastic.co/guide/en/elasticsearch/reference/2.3/nested.html towards the end) the question is what is better - having a lot of fields in the same doc or separate one. We'll need to check that.
Maybe @dcausse can help with it?

> ElasticSearch also supports dynamic templates (though disabled for Cirrus).

Well, I'm not sure internally it makes any difference if it happens on index creation stage - we create index once, so if definition is huge it's not a problem. Runtime performance is what is the concern here.

>> For Wikibase, I am experimenting with unnested fields for multilingual content, though then we have hundreds of unnested fields and not sure there is some point where it's too many?
>
> Since Elastic says nested field is separate document anyway (https://www.elastic.co/guide/en/elasticsearch/reference/2.3/nested.html towards the end) the question is what is better - having a lot of fields in the same doc or separate one. We'll need to check that.
> Maybe @dcausse can help with it?

The drawback of nested fields is that, as you said, it'll create one subdocument per nested field. Then, for performance reasons, a bitset will be loaded into RAM to join parent and child.
Note that nested fields won't allow you to set a specific analyzer for a specific language. The only advantage of nested fields in this case is that they allow you to manage the list of supported languages without any mapping change. You'll be able to add a new language and still query like this: "label.text:HIQaH AND label.lang:KLINGON".
Word frequencies will still be mixed and you'll have to analyze all the languages with the same analyzer.

This mapping question is really tough, ideally we'd like to:

  1. have proper scoring for a specific language: if I'm French I want to properly weight French words against French content, and I don't want to decrease the weight of a term because it's popular in another language. If I search for Thé in French, I don't want to decrease its weight because it's a common word in English.
  2. have proper analysis for a specific language: I don't want a word stemmed by a French analyzer to collide with a stem from another language.
  3. I want to boost French content if I'm French, and possibly additional languages in the same query.
  4. I'd like to change language boost values at query time (if we rely on index time boosting techniques we will never be able to tune the system)
  5. I want to also boost a particular field over another: I'd like to have documents that match labels first and then those that match description. In other words: lang boost needs to be combined with field type boost.
  6. I want to add query independent factors in the scoring formula (number of statements/label/any particular metadata)

Nested field will probably break 1 and 2. 3 will be OK but sub-optimal. Considering that it also has some perf drawbacks I don't consider it as a good fit here (unless adding new languages in real-time is a blocker).

At this stage I'd continue to investigate with the solution proposed by @aude (unnested subfields). Too many subfields is generally not a good idea but I don't see another option that fits the requirements above.

One optimization that could work is to use a kind of "allfield" that will be used only for fast filtering, the language subfields will then be used for scoring. With a filter like that the query will be slightly simpler since you'll be able to use simple disjunctions. I haven't thought about all the implementation details but I suppose something like https://github.com/yakaz/elasticsearch-analysis-combo could help to build such field (as long as it's used only for filtering).
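A query along these lines might look like the sketch below, written as a PHP array in Elasticsearch 2.x query DSL. It is only an illustration of the idea: the function `buildMultilingualQuery()`, the combined field name `all`, and the `label_fr`-style field naming are assumptions, and the boost values are arbitrary. The combined field filters without scoring, while per-language fields score with a query-time boost that multiplies the language boost by the field boost.

```php
<?php
/**
 * Hypothetical query builder: filter on one combined field, score on
 * per-language fields.
 *
 * @param string $term the user's search term
 * @param float[] $langBoosts language code => boost
 * @param float[] $fieldBoosts field name prefix => boost
 * @param string $allField name of the combined field (assumed)
 * @return array Elasticsearch bool query
 */
function buildMultilingualQuery( $term, array $langBoosts, array $fieldBoosts, $allField = 'all' ) {
    $should = [];
    foreach ( $fieldBoosts as $field => $fieldBoost ) {
        foreach ( $langBoosts as $lang => $langBoost ) {
            // Language boost combined with field boost, set at query time.
            $should[] = [
                'match' => [
                    "{$field}_{$lang}" => [
                        'query' => $term,
                        'boost' => $fieldBoost * $langBoost,
                    ],
                ],
            ];
        }
    }

    return [
        'bool' => [
            // Filter clauses restrict the result set but do not affect scores.
            'filter' => [
                [ 'match' => [ $allField => $term ] ],
            ],
            'should' => $should,
        ],
    ];
}

$query = buildMultilingualQuery(
    'thé',
    [ 'fr' => 2.0, 'en' => 1.0 ],
    [ 'label' => 3.0, 'description' => 1.0 ]
);
```

Because every boost is supplied when the query is built, the values can be tuned without reindexing, which is what requirements 4 and 5 ask for.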

Change 288567 had a related patch set uploaded (by Smalyshev):
[WIP] Use structured fields API to build mapping

https://gerrit.wikimedia.org/r/288567

Change 289021 had a related patch set uploaded (by Smalyshev):
[WIP] Make content handlers assemble content for search

https://gerrit.wikimedia.org/r/289021

Change 289115 had a related patch set uploaded (by Smalyshev):
[WIP] Make content handler produce field data

https://gerrit.wikimedia.org/r/289115

Change 288559 had a related patch set uploaded (by Smalyshev):
[WIP] [DNM] Create API to allow content handlers to handle structured data definitions

https://gerrit.wikimedia.org/r/288559

Current implementation slightly modifies the model above, namely:

  • There is an interface SearchIndexField which has public function getMapping( SearchEngine $engine ). This is the main way the search engine mapper gets the concrete mapping.
  • The interface also allows setting flags on the field, but not much more.
  • The concrete objects for field mappings are produced by the search engine via SearchEngine::makeSearchFieldMapping( $name, $type ) if they are of standard types listed in or can be generated by extension or content handler from local classes that may do whatever they like as long as they implement SearchIndexField.
  • Note that content handler knows for which engine it generates field, so it can make per-engine decisions.
  • There's standard class SearchIndexFieldDefinition that provides some tools for defining concrete search fields. Search engine extends this class to provide specific mappings.
  • There's NullIndexField class which any implementation can use to produce a field that will be ignored in final mapping. This is introduced to make code easier.
  • Significant part of text extraction for search indexing is moved from CirrusSearch to WikitextStructure.
  • Content for indexing comes from ContentHandler::getDataForSearchIndex( WikiPage $page ).
  • Hook SearchDataForIndex allows extensions to supply fields data in addition to getDataForSearchIndex.
  • Support for nested fields TBD, but should not be hard.
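The shape described in this list can be sketched as follows. These are simplified stand-ins that reuse the names mentioned above, not the merged MediaWiki classes: the method bodies, the `KeywordField` and `DemoElasticEngine` classes, the `'keyword'` type string, the `buildMapping()` helper, and the convention that NullIndexField returns null are all assumptions made for the example.

```php
<?php
// Simplified stand-ins for the classes named above; not the merged code.
interface SearchIndexField {
    /** @return array|null engine-specific mapping, or null to skip the field */
    public function getMapping( SearchEngine $engine );
}

// A field that is ignored in the final mapping (assumed to return null here).
class NullIndexField implements SearchIndexField {
    public function getMapping( SearchEngine $engine ) {
        return null;
    }
}

abstract class SearchEngine {
    // Default: an engine that knows no field types ignores every field.
    public function makeSearchFieldMapping( $name, $type ) {
        return new NullIndexField();
    }
}

// Hypothetical engine-specific field.
class KeywordField implements SearchIndexField {
    private $name;

    public function __construct( $name ) {
        $this->name = $name;
    }

    public function getMapping( SearchEngine $engine ) {
        return [ 'type' => 'string', 'index' => 'not_analyzed' ];
    }
}

// Hypothetical engine that supports only the 'keyword' type.
class DemoElasticEngine extends SearchEngine {
    public function makeSearchFieldMapping( $name, $type ) {
        return $type === 'keyword' ? new KeywordField( $name ) : new NullIndexField();
    }
}

/**
 * Hypothetical mapper: ask the engine for each field, drop ignored ones.
 *
 * @param SearchEngine $engine
 * @param string[] $fieldTypes field name => type
 * @return array field name => mapping
 */
function buildMapping( SearchEngine $engine, array $fieldTypes ) {
    $mapping = [];
    foreach ( $fieldTypes as $name => $type ) {
        $config = $engine->makeSearchFieldMapping( $name, $type )->getMapping( $engine );
        if ( $config !== null ) {
            $mapping[$name] = $config;
        }
    }
    return $mapping;
}

$mapping = buildMapping(
    new DemoElasticEngine(),
    [ 'language' => 'keyword', 'mystery' => 'unsupported' ]
);
```

Returning a null-object field instead of throwing keeps the mapper free of special cases: callers treat every field the same way and unsupported ones simply drop out.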

Open questions:

  • Which hooks we will need and where?
  • No support for multilingual fields yet.
  • No support for INDEX_TYPE_GEOPOINT as I'm not sure if we need built-in type for that or just a regular nested field would do. GeoData is definitely more complex than simple geopoint so maybe having this as built-in type would not help much.

Change 294403 had a related patch set uploaded (by Smalyshev):
Cleanup code that has been moved.

https://gerrit.wikimedia.org/r/294403

Change 294381 had a related patch set uploaded (by Smalyshev):
Add search-ignored-headings string, copied from cirrus-search-ignored-headings.

https://gerrit.wikimedia.org/r/294381

Change 292490 had a related patch set uploaded (by Smalyshev):
Add nested field support

https://gerrit.wikimedia.org/r/292490

Change 294411 had a related patch set uploaded (by Smalyshev):
Add nested field support

https://gerrit.wikimedia.org/r/294411

Change 292490 abandoned by Smalyshev:
Add nested field support

Reason:
oops, somehow duplicated. The right one now is I82a82526e2e254edc1fa7d861d7ac23d9cf07d1c

https://gerrit.wikimedia.org/r/292490

Change 288559 merged by jenkins-bot:
Create API to allow content handlers to handle structured data definitions

https://gerrit.wikimedia.org/r/288559

Change 294411 abandoned by Smalyshev:
Add nested field support

https://gerrit.wikimedia.org/r/294411

Change 288567 merged by jenkins-bot:
Use structured fields API to build mapping

https://gerrit.wikimedia.org/r/288567

Change 289021 merged by jenkins-bot:
Make content handlers assemble content for search

https://gerrit.wikimedia.org/r/289021

Change 289115 merged by jenkins-bot:
Make content handler produce field data

https://gerrit.wikimedia.org/r/289115

Change 294403 merged by jenkins-bot:
Cleanup code that has been moved.

https://gerrit.wikimedia.org/r/294403

I think this is all done now.