Index Wikidata strings in statements in the search engine
Open, Normal, Public

Description

Currently the search engine of Wikidata indexes all strings in labels, descriptions and aliases, but not in statements. Take for example https://www.wikidata.org/wiki/Q219831 / https://www.wikidata.org/w/index.php?title=Q219831&action=cirrusdump . Searching for "SK-C-5" (the inventory number) or "177124540" (the VIAF ID) won't return the item.

As a user I want to be able to search for a string that is somewhere on the item and find the item.

This task is about putting those strings somewhere in the indexed output. A first step could be to append all the strings to the text output, ignoring the language (as is done right now for the labels, descriptions and aliases), and improve on it later.

Restricted Application added a project: Discovery-Search. Apr 23 2017, 5:00 PM
Restricted Application added a subscriber: Aklapper.
debt added a subscriber: debt.

This looks to be more of a Wikidata task than a Discovery search task at this time.

"The Discovery Department of Wikimedia Engineering has the mission to make the wealth of knowledge and content in the Wikimedia projects easily discoverable." This task is about search, so it seems a bit weird to me that you just kicked it off this board. Isn't this department for all projects?

Yes, this team is for all projects, but I think that we'll need input from the Wikidata team first before we can implement it on our end.

aude moved this task from Backlog to Code review on the User-aude board.
aude moved this task from Code review to Backlog on the User-aude board. May 19 2017, 2:14 PM

Since it's about Elastic indexing, I think it should be on our radar, though @debt is right about getting input on it from Wikidata team.

On my side, I think indexing all strings is excessive and may be infeasible. But indexing specific properties - or even all external IDs? - may be OK. External IDs tend to be distinct enough from other text to be a decent search key. We can already index specific properties, but we'd need to specify the property to search by them (i.e. we'd need to say we're looking up a VIAF ID) and we would need a special keyword for that.

We could also add the content of external ID fields to the "all" field, which would make it part of the search match without specifying which ID it is. We need to see if it's really a good idea.

Thus, waiting for Wikidata team to weigh in about how to proceed.

External identifiers are definitely interesting. We have T99899, which might be best solved with Elastic? That would require indexing the external identifiers with the property. If that's not feasible it should also be doable with the SPARQL endpoint though.

That would require indexing the external identifiers with the property

I think it should be possible. The main question that remains is - do we want to search per-property (e.g. P214:1234 for VIAF ID 1234 specifically) or just something like externalid:1234 which would look through all external IDs?

I think it'd need to be per property. So something like "P345:tt0133093" would give you The Matrix and only The Matrix. There is probably very little overlap between identifiers that are basically random numbers and letters. But when it comes to account names on social networks, for example, you want to be sure that you get the right item and not another item that happens to have the same value for a different social network.
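To make the trade-off between the two proposals concrete, here is a minimal sketch of both index shapes as toy in-memory dictionaries. It is an illustration only, not how CirrusSearch stores anything: the item names, the account handle, and the `build_indexes` helper are invented, and P2002/P2003 are used purely as two distinct property IDs.

```python
from collections import defaultdict

# Invented sample data: (item, property, value). Nothing here is real Wikidata content
# except the IMDb-style value quoted in the discussion above.
STATEMENTS = [
    ("item-A", "P2002", "example_user"),  # a handle on one network
    ("item-B", "P2003", "example_user"),  # the same handle on another network, different item
    ("item-C", "P345", "tt0133093"),
]


def build_indexes(statements):
    """Return (per-property index, property-agnostic index) as dicts of sets."""
    by_property = defaultdict(set)   # key looks like "P345=tt0133093"
    any_property = defaultdict(set)  # key is the bare value
    for item, prop, value in statements:
        by_property[f"{prop}={value}"].add(item)
        any_property[value].add(item)
    return by_property, any_property


by_property, any_property = build_indexes(STATEMENTS)

# A bare-value lookup cannot tell the two accounts apart ...
print(sorted(any_property["example_user"]))
# ... while a per-property lookup returns exactly one item.
print(sorted(by_property["P2002=example_user"]))
```

The bare-value index is more convenient to query but returns both items for the shared handle, which is the ambiguity described above.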

@Smalyshev coming back to the strings. It's just like Commons. I don't use the local search. I use Google. I noticed https://www.wikidata.org/w/index.php?title=Q45962939&action=history and I'm pretty sure it's a duplicate.
The item has an image with the link to the source and the source has an inventory number (ГЭ-3836). I have to use Google ( https://www.google.nl/search?q=%D0%93%D0%AD-3836+site%3Awikidata.org ) to find the existing item so I can merge this. It's quite sad that I have to use an external search engine to find something on our site.

@Multichill just to be sure, if you could search for P217:ГЭ-3836, with this syntax, it would be fine? We may need to do some infrastructure work before this works properly, but it seems not too hard to implement.

Either "P217:ГЭ-3836" or just "ГЭ-3836" would be great.

Smalyshev moved this task from Backlog to Next on the User-Smalyshev board. Mar 1 2018, 11:08 PM
Smalyshev moved this task from Next to Backlog on the User-Smalyshev board. Mar 16 2018, 5:02 PM

The VIAF part is probably covered by T99899.

OK, so outside of external IDs covered by T99899: [Story] Looking up entities by external identifiers, which string properties do we want to add to the index? I am still concerned that all of them might be too much, but I'm ready to hear other opinions.

Don't have a good answer but https://www.wikidata.org/w/index.php?title=Special:ListProperties/string&limit=500&offset=0 has a list of all of the current string properties.

Hmm 228 is not that bad... Let me see if I can get some usage stats.

Also just had a thought - this does not cover qualifiers and references of course. Do we want anything there, or is that already WDQS domain?

I'd say for now let's leave them out.

OK, looking at current usage, there are only 21 string properties with more than 100K values. Looking at them in particular, the interesting ones are:

HomoloGene ID (P593) - probably should be external ID. There are more like this, with less usage.

Over a million usages:

  • page(s) (P304) - 15332300 items.
  • volume (P478) - 15288265 items.
  • issue (P433) - 13757879 items

These are mostly used for scientific articles and IMO useless for search. We may want to exclude them (not sure about volume/issue but if we want to do bibliographical searches we probably need to have more robust model anyway).

  • taxon name (P225) - 2480324
  • Commons category (P373) - 2122490

These might be actually useful for searches.
The rest have much lower usage, and even though some of them may also be useless for searches, adding those won't be that big of a deal.

Also, I am a bit concerned about properties like Wikidata SPARQL query equivalent (P3921) - should we have size limits on property values? I don't want to have 2K of text in the index there, not because it would hurt the index (probably not) but because it's useless - nobody is going to search for such a value.

Over a million usages:

  • page(s) (P304) - 15332300 items.
  • volume (P478) - 15288265 items.
  • issue (P433) - 13757879 items

    These are mostly used for scientific articles and IMO useless for search. We may want to exclude them (not sure about volume/issue but if we want to do bibliographical searches we probably need to have more robust model anyway).

I'd agree.

  • taxon name (P225) - 2480324
  • Commons category (P373) - 2122490

    These might be actually useful for searches.

Yep, those sound like ones people will want to find in search.

The rest have much lesser usage, and even though some of them may also be useless for searches, adding those won't be that big of a deal.

Agreed.

Also, I am a bit concerned about properties like Wikidata SPARQL query equivalent (P3921) - should we have size limits on property value? I don't want to have 2K of text in the index there, not because it would hurt the index (probably not) but because it's useless - nobody is going to search for such value.

There is a size limit on string values already. I don't remember the exact limit right now. Or are you looking for something else?

I was thinking about a shorter limit - not sure it makes sense to look up something by a whole SPARQL query... but maybe we should just exclude properties like this altogether.

Yeah let's leave them out for now.
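The selection rule agreed on above (index string statements, but skip a few high-volume or oversized properties) could be sketched roughly as follows. This is not the code or configuration from the Wikibase patch; `EXCLUDED_PROPERTIES`, `MAX_VALUE_LENGTH`, and `indexable_statements` are hypothetical names, and the length cap is an arbitrary illustrative number since the thread settled on exclusion rather than a shorter limit.

```python
# Properties named in the discussion as poor search keys:
# page(s), volume, issue, and Wikidata SPARQL query equivalent.
EXCLUDED_PROPERTIES = {"P304", "P478", "P433", "P3921"}

# Hypothetical cap on value length; the thread did not agree on a number.
MAX_VALUE_LENGTH = 100


def indexable_statements(statements, excluded=EXCLUDED_PROPERTIES,
                         max_length=MAX_VALUE_LENGTH):
    """Turn (property, value) pairs into 'P123=value' tokens, skipping excluded ones."""
    return [
        f"{prop}={value}"
        for prop, value in statements
        if prop not in excluded and len(value) <= max_length
    ]


sample = [
    ("P217", "SK-C-5"),                                 # inventory number: keep
    ("P304", "12-34"),                                  # page(s): excluded
    ("P3921", "SELECT ?item WHERE { ?item ?p ?o }"),    # SPARQL equivalent: excluded
    ("P225", "x" * 500),                                # over the illustrative cap
]
print(indexable_statements(sample))
```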

Change 430277 had a related patch set uploaded (by Smalyshev; owner: Smalyshev):
[mediawiki/extensions/Wikibase@master] Add capability to exclude properties from by-type index

https://gerrit.wikimedia.org/r/430277

Smalyshev moved this task from Next to In review on the User-Smalyshev board. May 2 2018, 9:58 PM
Smalyshev moved this task from Later to This Quarter on the Discovery-Search board.
Smalyshev moved this task from Backlog to Needs review on the Discovery-Search (Current work) board.

Change 430277 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] Add capability to exclude properties from by-type index

https://gerrit.wikimedia.org/r/430277

Hey @Smalyshev, when is this going to be live?

Smalyshev moved this task from In review to Doing on the User-Smalyshev board. May 8 2018, 8:20 PM

@Lea_Lacroix_WMDE we need to make configs that enable indexing (that will be done next) and then we need to actually reindex. Reindexing takes several days, so I planned to do it immediately after the Hackathon, unless you need it sooner.

Also, right now we can only locate items by haswbstatement:P123=SK-C-5. If we want to index data without attached property IDs, we need to add a different field and analyzer to do that. Should we do it?
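For readers who want to try the keyword, here is a small sketch that builds a Wikidata search URL for a `haswbstatement:` query. It only constructs the URL and makes no network request; the helper name `haswbstatement_search_url` is made up for this example, and P217 is the inventory number property mentioned earlier in the thread.

```python
from urllib.parse import urlencode


def haswbstatement_search_url(property_id, value,
                              base="https://www.wikidata.org/w/index.php"):
    """Build a search URL for an exact 'property=value' statement match."""
    query = f"haswbstatement:{property_id}={value}"
    # urlencode percent-encodes ':' and '=' inside the query value.
    return f"{base}?{urlencode({'search': query})}"


url = haswbstatement_search_url("P217", "SK-C-5")
print(url)
```

Because the property ID is part of the indexed token, the query must name the property; a bare search for the value alone does not use this field.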

Change 431994 had a related patch set uploaded (by Smalyshev; owner: Smalyshev):
[operations/mediawiki-config@master] Add string and external-id types to indexing

https://gerrit.wikimedia.org/r/431994

Change 431994 merged by jenkins-bot:
[operations/mediawiki-config@master] Add string and external-id types to Wikibase indexing

https://gerrit.wikimedia.org/r/431994

@Lea_Lacroix_WMDE Also, for newly edited items it should be working as soon as wmf.3 is deployed. But older items will need a reindex.

Mentioned in SAL (#wikimedia-operations) [2018-05-08T23:22:45Z] <thcipriani@tin> Synchronized wmf-config: SWAT: [[gerrit:431994|Add string and external-id types to Wikibase indexing]] T163642 T99899 (duration: 01m 26s)

I have no deadline in mind, I was just wondering when to announce it, and if you or me should do it :)

I'll note here when the reindex is done, and then I guess you can announce :) In the meantime I can check that everything works smoothly with edited entries.

Good to see all this progress! Will a purge of an item or an edit to an item trigger a reindex of that item? Will https://www.wikidata.org/w/index.php?title=Q219831&action=cirrusdump show the "P123=SK-C-5" somewhere?

Yes, an edit should make it show up.

Smalyshev added a comment. Edited May 10 2018, 6:03 AM

@Multichill check out https://www.wikidata.org/w/index.php?title=Q219831&action=cirrusdump - it has the data now.

Purge won't work though - you need to edit.

Mentioned in SAL (#wikimedia-operations) [2018-05-23T16:31:01Z] <SMalyshev> starting wikidata full reindex for T163642

Smalyshev closed this task as Resolved. Fri, May 25, 5:39 AM

@Smalyshev https://www.wikidata.org/w/index.php?search=%22SK-C-5%22 doesn't work yet, but this task has been closed. Can you explain? This is listed in the task description as something that should work.

@Multichill I think the point of this task was to index the statements, which is done. For searching, you can use haswbstatement for now. I am not sure whether it makes sense to copy the statement value into the "all" field, where it would then be searchable by plain search too - it may be useful for distinctive IDs but I am not sure how many of them are distinctive... I think it's better to make a separate task for this.

Hmm, I'm not sure this is all that useful, at least as it stands. Most external IDs can be found just as easily now via the Wikidata Resolver tool - https://tools.wmflabs.org/wikidata-todo/resolver.php - However, what I would find useful is a way to locate, for example, partial street addresses - this (P969) is often entered as a qualifier on headquarters location (P159). Searching for 'haswbstatement:P969=Main' now finds something, but only because an item oddly has just 'Main' as the value for P969, and making the string lowercase ("main") finds nothing, which is definitely not what I would expect... I don't think treating string values as if they were identifiers is the right approach; the usefulness of a search engine is in normalizing string values so you can find them without having the exact matching string. And qualifiers should be folded in somehow!
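The difference between identifier-style matching and normalized text matching can be shown with a toy example. This is only a sketch over an in-memory set with invented sample values; it does not reproduce the actual CirrusSearch field or analyzer configuration, and the function names are hypothetical.

```python
# Toy index of exact keyword tokens, shaped like "haswbstatement:P969=Main".
KEYWORD_INDEX = {"P969=Main", "P969=12 Main Street"}  # invented sample values


def keyword_match(index, prop, query):
    """Exact, case-sensitive match on the whole 'property=value' token."""
    return f"{prop}={query}" in index


def normalize(text):
    """Case-fold the text and split it on whitespace into word tokens."""
    return text.casefold().split()


def analyzed_match(index, prop, query):
    """Match if every normalized query word occurs in some value of that property."""
    wanted = set(normalize(query))
    for token in index:
        token_prop, _, value = token.partition("=")
        if token_prop == prop and wanted <= set(normalize(value)):
            return True
    return False


print(keyword_match(KEYWORD_INDEX, "P969", "Main"))          # exact token exists
print(keyword_match(KEYWORD_INDEX, "P969", "main"))          # case differs, no match
print(analyzed_match(KEYWORD_INDEX, "P969", "main street"))  # normalized partial match
```

Under the exact-token model a lowercase or partial query misses, which is the behavior reported above; a normalized, tokenized field would accept both.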

Multichill reopened this task as Open. Wed, May 30, 10:15 AM

@Multichill I think the point of this task was to index the statements, which is done. For searching, you can use haswbstatement for now. I am not sure whether it makes sense to copy the statement value into the "all" field, where it would then be searchable by plain search too - it may be useful for distinctive IDs but I am not sure how many of them are distinctive... I think it's better to make a separate task for this.

That was not my point with this task. https://www.wikidata.org/w/index.php?search=%22SK-C-5%22 should return https://www.wikidata.org/wiki/Q219831 . In my view the haswbstatement step is an intermediate one. Sorry for not being clear enough.

Taking a step back for a bigger overview. As a user I expect all the text I see on https://www.wikidata.org/wiki/Q219831 to be searchable as plain text. Currently we only index the labels, aliases and descriptions in the text (https://www.wikidata.org/w/index.php?title=Q219831&action=cirrusdump). All the statements should be added as well. That includes the string statements, but also the labels of the items used as values. Looking at https://www.wikidata.org/wiki/Special:EntityData/Q219831.rdf I realize this would be quite an increase in the search data. It would probably make sense to have localized plain text fields like "text_nl" which only hold the data in that language. Why do this? Trying to use our search to find something is hard. I was in the Thyssen-Bornemisza Museum trying to use search to find the Van Gogh paintings on Wikidata. That's currently impossible. Do we already have a task for this or should I create a new one for this part?
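One way to picture the localized plain-text fields suggested above is the following sketch. Only the "text_nl" naming comes from the comment; the entity layout, the `build_text_fields` helper, and all labels are invented for illustration, and the choice to copy language-independent string values into every language field is an assumption, not an agreed design.

```python
def build_text_fields(entity, item_labels):
    """Build per-language text fields such as 'text_nl' from a toy entity.

    entity: {"labels": {lang: label}, "statements": [(property, kind, value)]}
    item_labels: {item_id: {lang: label}} for items used as statement values.
    """
    fields = {f"text_{lang}": [label] for lang, label in entity["labels"].items()}
    for _prop, kind, value in entity["statements"]:
        for lang in entity["labels"]:
            if kind == "string":
                # Strings have no language, so they go into every field (assumption).
                fields[f"text_{lang}"].append(value)
            elif kind == "item":
                label = item_labels.get(value, {}).get(lang)
                if label is not None:
                    fields[f"text_{lang}"].append(label)
    return fields


# Invented sample entity; IDs and labels are placeholders, not real Wikidata data.
entity = {
    "labels": {"en": "Example painting", "nl": "Voorbeeldschilderij"},
    "statements": [
        ("P217", "string", "SK-C-5"),
        ("P170", "item", "creator-1"),
    ],
}
item_labels = {"creator-1": {"en": "Example painter", "nl": "Voorbeeldschilder"}}

fields = build_text_fields(entity, item_labels)
print(fields["text_nl"])
```

Each language field grows with every item-valued statement, which gives a feel for the index size increase mentioned above.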

I don't see clear disadvantages of doing the indexing Multichill suggests.

I don't see any mentioned here either, besides not indexing some specific ones (e.g. page numbers).

Compared to PubMed article titles, it seems at least as useful.

Smalyshev removed Smalyshev as the assignee of this task. Wed, May 30, 10:15 PM
debt moved this task from Done to Backlog on the Discovery-Search (Current work) board.

Moving this to the backlog for now, as the conversation is still ongoing but there is no clear owner of the work to be done.