Index certain statements for Wikidata items
Closed, Resolved (Public)

Description

To achieve better relevancy, especially in item suggesters, it would be nice to index certain statements for select properties, such as P31 (instance of).
This would allow boosting or de-boosting certain classes (such as disambiguation pages or templates) when searching for items, and so yield more relevant results.

Current plan:

Add configuration that specifies which properties to index (by P-id)
The index mapping creates a keyword field for each of these
The value is indexed as a single string: for entities, the Q-id or P-id; for quantities, the main value. TBD: what to do with complex types like coordinates.
Qualifiers, references, ranks, etc. will be ignored for now
- Possible exception: excluding deprecated-rank statements in the next iteration?
Develop a way to boost/de-boost certain things using this information (will be in a separate task)
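
The value-extraction step of the plan could look roughly like the following. This is only a sketch in Python (the actual patches are PHP in Wikibase), assuming the standard Wikibase entity JSON layout with `claims`, `mainsnak` and `datavalue`; the function name `extract_statement_values` is hypothetical and not taken from the patch.

```python
from typing import Any


def extract_statement_values(
    entity: dict[str, Any], property_ids: list[str]
) -> dict[str, list[str]]:
    """Collect one keyword string per statement for the configured properties.

    Hypothetical helper. Qualifiers, references and ranks are ignored,
    as in the plan above.
    """
    result: dict[str, list[str]] = {}
    claims = entity.get("claims", {})
    for pid in property_ids:
        values: list[str] = []
        for statement in claims.get(pid, []):
            snak = statement.get("mainsnak", {})
            # "novalue" / "somevalue" snaks carry no datavalue to index.
            if snak.get("snaktype") != "value":
                continue
            datavalue = snak.get("datavalue", {})
            kind = datavalue.get("type")
            value = datavalue.get("value")
            if kind == "wikibase-entityid":
                # Assumes the serialization includes the full "id" (Q-id or P-id).
                values.append(value["id"])
            elif kind == "quantity":
                # Main value only; unit and bounds are dropped.
                values.append(value["amount"])
            # Complex types such as coordinates are skipped (still TBD).
        if values:
            result[pid] = values
    return result
```

With a configuration of `["P31", "P279"]`, an item that only has an instance-of statement would produce a single `P31` entry.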

Patch-For-Review:

Initial config indexes P31 and P279. More can be added on request (requires full reindex, so can take time).

Details

Subject	Repo	Branch	Lines +/-
Bind against FieldDefinitions interface instead of implementation	mediawiki/extensions/WikibaseLexeme	master	+8 -8
Bind against FieldDefinitions interface instead of implementation	mediawiki/extensions/WikibaseMediaInfo	master	+4 -4
Add configuration for statement indexing for Wikidata	operations/mediawiki-config	master	+4 -0
Make Item… and PropertyFieldDefinitions accept arrays	mediawiki/extensions/Wikibase	master	+35 -57
Optimize StatementsField for performance and readability	mediawiki/extensions/Wikibase	master	+137 -158
Add script to search entities from command line	mediawiki/extensions/Wikibase	master	+183 -0
Enable indexing statements on items	mediawiki/extensions/Wikibase	master	+393 -11

Related Objects

Status	Assigned	Task
Resolved	Wikidata-bugs	T77898 query taking 10s: TermSqlIndex::getMatchingIDs
Open	None	T46529 Wikidata search problems (tracking)
Resolved	Smalyshev	T148411 Item search for statements ranks disambiguation items too highly
Resolved	Smalyshev	T78157 [Story] Use ElasticSearch for entity search on wikidata.org
Resolved	Smalyshev	T175199 Index certain statements for Wikidata items

Event Timeline

Smalyshev created this task. Sep 6 2017, 6:37 PM

Restricted Application added a subscriber: Aklapper. Sep 6 2017, 6:37 PM

Smalyshev added parent tasks: T148411: Item search for statements ranks disambiguation items too highly, T78157: [Story] Use ElasticSearch for entity search on wikidata.org. Sep 6 2017, 6:41 PM

One worry i have is about over-creating fields. If we are talking about 5 relationships then maybe it's no big deal, but if we want to capture many different relationships, both in wikidata and eventually in structured data on commons, i wonder if we could rather have some sort of relationship (name tbd) keyword field that encodes both parts. Perhaps we could encode the direct properties of an entity with a full description, say Q229331 (Muse) could have a relationships array populated with:

P31:Q4167410
P1889:Q16877643

Then filters for disambiguation pages would put a query on relationship: P31:Q4167410. This of course cannot possibly encode all the possible relationships, especially qualifiers, but it seems a plausible step to more generalized direct-relationship (non-graph) filtering?
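
A minimal sketch of that encoding, in Python for illustration only. The function names and the default field name `relationship` are assumptions (the field name was still TBD at this point in the thread); the `term` query is the ordinary Elasticsearch exact-match query for keyword fields.

```python
def encode_relationships(statements: list[tuple[str, str]]) -> list[str]:
    """Encode (property id, value id) pairs as "P-id:Q-id" keyword strings."""
    return [f"{prop}:{value}" for prop, value in statements]


def relationship_filter(prop: str, value: str, field: str = "relationship") -> dict:
    """Build a term filter that matches documents carrying the exact pair."""
    return {"term": {field: f"{prop}:{value}"}}


# The Q229331 example from the comment above.
muse = encode_relationships([("P31", "Q4167410"), ("P1889", "Q16877643")])
disambiguation = relationship_filter("P31", "Q4167410")
```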

i wonder if we could rather have some sort of relationship (name tbd) keyword field that encodes both parts

That would depend on whether we could use such things for boosting/de-boosting. If yes, this certainly could be a way to go. That, however, makes it harder to do queries like "has P31" but maybe it's ok.

cannot possibly encode all the possible relationships, especially qualifiers

I intend to ignore qualifiers for now. I planned to add this to task desc and forgot, thanks for reminding!

Smalyshev updated the task description. Sep 6 2017, 7:00 PM

In T175199#3585954, @Smalyshev wrote:

i wonder if we could rather have some sort of relationship (name tbd) keyword field that encodes both parts

That would depend on whether we could use such things for boosting/de-boosting. If yes, this certainly could be a way to go. That, however, makes it harder to do queries like "has P31" but maybe it's ok.

I think we can come up with an analysis chain that will split on the : such that we can query a separate field (relationship.pieces? i dunno) for P31 or Q4167410 if we don't care about what the exact relationship is, just that it exists. We could certainly use this sort of thing for boosting/deboosting, it would probably be another constant score query with an appropriate filter set to provide the boost/deboost when the relationship exists.

cannot possibly encode all the possible relationships, especially qualifiers

I intend to ignore qualifiers for now. I planned to add this to task desc and forgot, thanks for reminding!

Also might need to check with @dcausse about plausibility, i imagine the cardinality here will be much higher than a normal field which could potentially cause issues, but might also be "not a problem". I'm not sure.

I wonder also, is it possible to do the (de)boosting at the rescore stage? The reason is that we can select different rescore profiles from the URL (which means different widgets can use different boosts), while getting stuff added to the search query itself is more complicated. Of course, we can add more query params or query syntax, but tuning through profiles seems like it may be easier to do?

deboosting can happen in the rescore stage, since we use a weighted sum we can either apply a negative weight when relationship:P31:Q4167410 or a positive value when NOT relationship:P31:Q4167410.
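
To illustrate why the two options are interchangeable in a weighted sum, here is a small arithmetic sketch in plain Python. It is not CirrusSearch code; the scores, the weight and the labels are made up.

```python
def rescore(base: float, matches: bool, weight: float, penalize_match: bool) -> float:
    """Combine a base score with a relationship signal as a weighted sum.

    penalize_match=True  -> negative weight applied when the filter matches
    penalize_match=False -> positive weight applied when it does NOT match
    """
    if penalize_match:
        return base - weight * (1.0 if matches else 0.0)
    return base + weight * (0.0 if matches else 1.0)


def rank(candidates: list[tuple[str, float, bool]], weight: float, penalize_match: bool) -> list[str]:
    """Return labels ordered by rescored value, best first."""
    scored = [(rescore(b, m, weight, penalize_match), label) for label, b, m in candidates]
    return [label for _, label in sorted(scored, reverse=True)]


# (label, base score, has relationship P31:Q4167410); illustrative numbers only.
candidates = [("disambiguation page", 2.0, True), ("regular item", 1.8, False)]
by_penalty = rank(candidates, weight=0.5, penalize_match=True)
by_reward = rank(candidates, weight=0.5, penalize_match=False)
```

Both variants shift every score by the same constant relative to each other, so they produce the same ordering; only the absolute values differ.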
Will we add all properties or just a set of selected properties?
Concerning cardinality of this new field it's hard to judge but I'm in favor of not over-indexing, in this case I'd be for a simple mapping like:

"relationship": {
   "type": "keyword",
   "fields": {
       "type": {
               "type": "text",
               "analyzer": "split(':')[0]",
               "search_analyzer": "keyword"
       }
   }
}

In other words, for P31:Q4167410 I'd keep only P31:Q4167410 and P31 as indexed terms; imo it does not make sense to index Q4167410 separately.

One possibility to avoid reindexing from mysql every time we want to add a new property would be to create a custom analyzer where we provide a whitelist of properties to index.
All properties would be present in the source doc but just a few selected ones would be indexed. Adding a new property would just require updating the analysis chain and performing an in-place re-index.
We then need to carefully monitor disk and terms in mem usage when whitelisting new props. Having all relationships in the source can make experimenting with relforge a bit easier, you'll just have to prepare the analysis chain on relforge and send a remote reindex api call.
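
For reference, a remote reindex request body has roughly the following shape. This is a sketch of the generic Elasticsearch `_reindex` API with a `remote` source; the host and index names are placeholders, not the actual production or relforge names.

```python
import json


def remote_reindex_body(remote_host: str, source_index: str, dest_index: str) -> dict:
    """Build the JSON body for POST _reindex that pulls documents from another cluster."""
    return {
        "source": {
            "remote": {"host": remote_host},
            "index": source_index,
        },
        "dest": {"index": dest_index},
    }


# Placeholder names, for illustration only.
body = remote_reindex_body(
    "https://source-cluster.example:9243",
    "wikidata_content",
    "wikidata_content_experiment",
)
payload = json.dumps(body, indent=2)
```

The destination index would be created beforehand with the experimental analysis chain, so the same source documents get re-analyzed on arrival.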

This will help out with SDC General as well.

Smalyshev claimed this task. Sep 7 2017, 6:48 PM

Smalyshev added a project: User-Smalyshev.

Smalyshev moved this task from Backlog to Doing on the User-Smalyshev board.

Change 376645 had a related patch set uploaded (by Smalyshev; owner: Smalyshev):
[mediawiki/extensions/Wikibase@master] [WIP] Index statements on items

https://gerrit.wikimedia.org/r/376645

gerritbot added a project: Patch-For-Review. Sep 7 2017, 11:13 PM

@dcausse Could you explain a bit more how to set up the analyzer? I tried to figure how to do it but I'm not sure whether I did it right.

I think the analyzer was just pseudo code, to actually make it happen you need something like this: https://phabricator.wikimedia.org/P5975

That script outputs at the end

{
  "relationships": [
    "P1:Q1234",
    "P31:Q54321",
    "P31:Q7654"
  ],
  "relationships.properties": [
    "P1",
    "P31"
  ]
}

@EBernhardson yes, this looks like what I've done in the patch, I just wondered if it's correct. Looks like it is then :)

I suppose if we want to send all the properties to elasticsearch, but only have it index specific ones we can apply the keep words token filter to relationships.properties, i'm not seeing anything obvious for relationships itself. I thought pattern match might be able to, but i'm not able to convince it in a small test case.
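
One way the keep-words idea for relationships.properties might be configured is sketched below as a Python dict of index analysis settings. It uses the stock Elasticsearch `pattern_capture` and `keep` token filters with the `keyword` tokenizer; the filter and analyzer names are hypothetical, and this is not a copy of the P5975 paste.

```python
import json
import re

PROPERTY_PATTERN = r"^(P\d+):"  # captures the P-id in front of the colon


def properties_analysis(keep_properties: list[str]) -> dict:
    """Analysis settings that reduce "P31:Q5" to "P31" and drop unlisted P-ids."""
    return {
        "analysis": {
            "filter": {
                # Emit only the captured P-id instead of the whole token.
                "extract_property": {
                    "type": "pattern_capture",
                    "preserve_original": False,
                    "patterns": [PROPERTY_PATTERN],
                },
                # Keep only tokens that appear in the configured list.
                "keep_properties": {
                    "type": "keep",
                    "keep_words": list(keep_properties),
                },
            },
            "analyzer": {
                "relationship_properties": {
                    "type": "custom",
                    "tokenizer": "keyword",
                    "filter": ["extract_property", "keep_properties"],
                }
            },
        }
    }


settings = properties_analysis(["P31", "P42"])
settings_json = json.dumps(settings)
```

This only covers the properties subfield; as noted above, there is no equally direct stock filter for trimming the full relationships values.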

maybe custom analysis components in the extra plugin would make this easier?
Unless we have some objections to making wikibase dependent on the wmf elastic plugins?

It's possible to hack something together by using pattern capture filter to either capture the letter P, or capture the full line if the P-id is one we accept. Then add a stop words filter to strip out the P tokens. TBH that's pretty messy though: P5976

Provided the relationships ["P31:Q54321", "P1:Q1234", "P31:Q7654", "P42:Q4444"] and a keep for P31 and P42 this returns:

{
  "relationships.properties": [
    "P31",
    "P42"
  ],
  "relationships": [
    "P31:Q54321",
    "P31:Q7654",
    "P42:Q4444"
  ]
}
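
The effect of that filter chain can be emulated in a few lines of Python, which may help when checking expectations before touching the analysis config. This is only an emulation of the result shown above, not the Elasticsearch analysis chain itself; the function name is hypothetical.

```python
def filter_relationships(relationships: list[str], keep: set[str]) -> dict[str, list[str]]:
    """Keep only "P-id:value" entries whose P-id is whitelisted.

    Returns the values that would end up in the two fields.
    """
    kept = [rel for rel in relationships if rel.partition(":")[0] in keep]
    # Distinct P-ids among the kept entries, in numeric order.
    properties = sorted({rel.partition(":")[0] for rel in kept}, key=lambda p: int(p[1:]))
    return {"relationships.properties": properties, "relationships": kept}


result = filter_relationships(
    ["P31:Q54321", "P1:Q1234", "P31:Q7654", "P42:Q4444"],
    keep={"P31", "P42"},
)
```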

I'm not sure we should really go as far as indexing all statements right now. Most of them would not be very useful for search purposes for now, and are already served by the Query Service. The most useful ones would be those that legitimately limit searches to relevant items, which I would imagine are mostly P31/P279. In fact, right now I don't even have much of a use case for anything but those two, but maybe we'd have one in the future. I think maybe it'd be ok for now to just index those explicitly mentioned. The idea of using analyzers/filters may still be workable in the future, but I'd postpone it for now.

In the patch, there was an option raised to index all statements of certain type, instead of just named properties (e.g. for something like T99899). I am not sure yet whether it is a good idea or not, need some thought. Probably not in the initial iteration, but possibly later.

Smalyshev added a subscriber: aude. Sep 12 2017, 5:24 PM

I like the idea to bind the elastic property to the type of the statement.
For now writing a mapping with default elastic tools allows to index nothing or everything, filtering must be done on the php side like you did in the current patch.
Moving the filtering to the mapping (which I'll find more flexible in the future) will require some custom mapper/analyzer.
I guess the question is: do we care about filtering? Couldn't we just index all statements of a given type? I think this deserves some evaluation first; I'm not too keen on indexing bazillions of terms when only 0.1% of them would be useful.
Maybe for now it's ok to start with filtering few properties on the php side, we can reconsider how we want to approach this problem a bit later.
But for me the most important now is to make clear that the elastic field we index is typed, e.g. do not add a new field like "wb_property", I'd prefer something like "wb_relationships".

Moving the filtering to the mapping (which I'll find more flexible in the future) will require some custom mapper/analyzer.

Right. That's why I prefer to postpone it for now. It's not required for immediate use cases and we can always add it later.

But for me the most important now is to make clear that the elastic field we index is typed, e.g. do not add a new field like "wb_property", I'd prefer something like "wb_relationships".

Right now the field name is statements. I'm not sure whether we should add wb there (everything in that index is "wb", since it's on wikidata). What do you mean by "typed" though?

Right now the field name is statements. I'm not sure whether we should add wb there (everything in that index is "wb", since it's on wikidata). What do you mean by "typed" though?

I mean a name that bears the data types it stores, for me "statements" seems too generic, if for now you index "instance of" you'll have values of data types "item", now if you decide to add P1559 (monolingual text) we should not index it in the "statements" elastic field they'll require totally different analyzers (one is an identifier, the other is written language).
It's why I'd prefer to name elastic fields based on the data type the property is using so why not item_statements instead of statements?

now if you decide to add P1559 (monolingual text) we should not index it in the "statements" elastic field they'll require totally different analyzers (one is an identifier, the other is written language)

I don't currently plan to analyze values in any way, so for statements field they are indexed as keyword. That would be ok for some strings too (e.g. URLs, identifiers, and such) but of course not appropriate for full-text search if we ever want one. But I currently don't plan it yet.

why not item_statements instead of statements?

I see your point, but item_statements doesn't seem much better - first, it's not clear whether it is statements on items or statements having items as values, second, even now values can be any entity ID, not only item ID, and also may accept some strings too. So I agree maybe statements is not great, will think about which one would be better.

We may also want to store some values as non-indexed data, e.g. see T140131

Smalyshev moved this task from Up Next to Current work on the Discovery-Search board. Sep 14 2017, 11:15 PM

Smalyshev edited projects, added Discovery-Search (Current work); removed Discovery-Search.

Smalyshev moved this task from Incoming to Needs review on the Discovery-Search (Current work) board.

Smalyshev moved this task from Doing to Waiting/Blocked on the User-Smalyshev board.

I've renamed it to statement_keywords. Hopefully it's better.

Change 339575 had a related patch set uploaded (by Daniel Kinzler; owner: Smalyshev):
[mediawiki/extensions/Wikibase@master] Add script to search entities from command line

https://gerrit.wikimedia.org/r/339575

daniel added a project: Wikidata-Former-Sprint-Board. Sep 20 2017, 10:30 AM

daniel moved this task from Proposed to Review on the Wikidata-Former-Sprint-Board board.

Change 382725 had a related patch set uploaded (by Thiemo Mättig (WMDE); owner: Thiemo Mättig (WMDE)):
[mediawiki/extensions/Wikibase@master] Optimize StatementsField for performance and readability

https://gerrit.wikimedia.org/r/382725

thiemowmde updated the task description. Oct 9 2017, 3:26 PM

Change 376645 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] Enable indexing statements on items

https://gerrit.wikimedia.org/r/376645

ReleaseTaggerBot added a project: MW-1.31-release-notes (WMF-deploy-2017-10-10 (1.31.0-wmf.3)). Oct 9 2017, 4:01 PM

Change 383364 had a related patch set uploaded (by Thiemo Mättig (WMDE); owner: Thiemo Mättig (WMDE)):
[mediawiki/extensions/WikibaseLexeme@master] Bind against FieldDefinitions interface instead of implementation

https://gerrit.wikimedia.org/r/383364

thiemowmde mentioned this in rEWLE5acaa85425b9: Bind against FieldDefinitions interface instead of implementation. Oct 10 2017, 2:15 PM

thiemowmde updated the task description. Oct 10 2017, 5:06 PM

Smalyshev moved this task from Waiting/Blocked to Next on the User-Smalyshev board. Oct 10 2017, 5:27 PM

Smalyshev moved this task from Next to Waiting/Blocked on the User-Smalyshev board. Oct 10 2017, 9:12 PM

Change 383464 had a related patch set uploaded (by Smalyshev; owner: Smalyshev):
[operations/mediawiki-config@master] Add configuration for statement indexing for Wikidata

https://gerrit.wikimedia.org/r/383464

thiemowmde mentioned this in rEWLEcb57bd4da623: Bind against FieldDefinitions interface instead of implementation. Oct 12 2017, 7:27 AM

thiemowmde updated the task description. Oct 12 2017, 10:17 AM

Change 339575 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] Add script to search entities from command line

https://gerrit.wikimedia.org/r/339575

ReleaseTaggerBot edited projects, added MW-1.31-release-notes (WMF-deploy-2017-10-17 (1.31.0-wmf.4)); removed MW-1.31-release-notes (WMF-deploy-2017-10-10 (1.31.0-wmf.3)). Oct 12 2017, 1:00 PM

Smalyshev updated the task description. Oct 12 2017, 5:40 PM

Change 384047 had a related patch set uploaded (by Thiemo Mättig (WMDE); owner: Thiemo Mättig (WMDE)):
[mediawiki/extensions/WikibaseMediaInfo@master] Bind against FieldDefinitions interface instead of implementation

https://gerrit.wikimedia.org/r/384047

thiemowmde mentioned this in rEWBIaee65419db5f: Bind against FieldDefinitions interface instead of implementation. Oct 13 2017, 1:36 PM

Smalyshev moved this task from Waiting/Blocked to Next on the User-Smalyshev board. Oct 13 2017, 10:24 PM

Change 384516 had a related patch set uploaded (by Thiemo Mättig (WMDE); owner: Thiemo Mättig (WMDE)):
[mediawiki/extensions/Wikibase@master] Make Item… and PropertyFieldDefinitions accept arrays

https://gerrit.wikimedia.org/r/384516

Change 382725 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] Optimize StatementsField for performance and readability

https://gerrit.wikimedia.org/r/382725

Change 384516 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] Make Item… and PropertyFieldDefinitions accept arrays

https://gerrit.wikimedia.org/r/384516

Change 383464 merged by jenkins-bot:
[operations/mediawiki-config@master] Add configuration for statement indexing for Wikidata

https://gerrit.wikimedia.org/r/383464

Mentioned in SAL (#wikimedia-operations) [2017-10-16T18:18:34Z] <thcipriani@tin> Synchronized wmf-config/Wikibase.php: SWAT: [[gerrit:383464|Add configuration for statement indexing for Wikidata]] T175199 (duration: 00m 47s)

Smalyshev updated the task description. Oct 16 2017, 7:43 PM

Smalyshev updated the task description.

WMDE-leszek closed this task as Resolved. Oct 17 2017, 10:54 AM

WMDE-leszek moved this task from Review to Done on the Wikidata-Former-Sprint-Board board.

Smalyshev mentioned this in T147505: [tracking] CirrusSearch: what is updated during re-indexing. Oct 17 2017, 5:29 PM

This is merged and the config is enabled, but not reindexed yet, probably will take several days until it's done, the wikidata index is huge.

Change 384047 merged by jenkins-bot:
[mediawiki/extensions/WikibaseMediaInfo@master] Bind against FieldDefinitions interface instead of implementation

https://gerrit.wikimedia.org/r/384047

thiemowmde mentioned this in rEWLEd7959ae43cbb: Bind against FieldDefinitions interface instead of implementation. Dec 19 2017, 3:08 PM

thiemowmde mentioned this in rEWLEac22eec5fc34: Bind against FieldDefinitions interface instead of implementation. Jan 10 2018, 5:24 PM

Change 383364 merged by jenkins-bot:
[mediawiki/extensions/WikibaseLexeme@master] Bind against FieldDefinitions interface instead of implementation

https://gerrit.wikimedia.org/r/383364

EBernhardson moved this task from Needs review to Needs Reporting on the Discovery-Search (Current work) board. May 6 2019, 3:58 PM
