Index Wikidata strings in statements for fulltext search
Closed, Resolved (Public)

Description

While string properties are currently indexed for the haswbstatement keyword, they cannot be found with regular search. For example, searching for "SK-C-5" (the inventory number) or "177124540" (the VIAF ID) won't return the item https://www.wikidata.org/wiki/Q219831.

As a user I want to be able to search for a string that is somewhere on the item and find the item.

We need to figure out the best place to index this string and how to make it match the item in generic search.

Event Timeline

OK, looking at current usage, there are only 21 string properties with more than 100K values. Looking at them in particular, the interesting ones are:

HomoloGene ID (P593) - probably should be external ID. There are more like this, with less usage.

Over a million usages:

  • page(s) (P304) - 15332300 items.
  • volume (P478) - 15288265 items.
  • issue (P433) - 13757879 items

These are mostly used for scientific articles and IMO useless for search. We may want to exclude them (not sure about volume/issue, but if we want to do bibliographical searches we probably need a more robust model anyway).

I'd agree.

  • taxon name (P225) - 2480324
  • Commons category (P373) - 2122490

These might be actually useful for searches.

Yep, those sound like ones people will want to find in search.

The rest have much lower usage, and even though some of them may also be useless for searches, adding those won't be that big of a deal.

Agreed.

Also, I am a bit concerned about properties like Wikidata SPARQL query equivalent (P3921) - should we have size limits on property values? I don't want to have 2K of text in the index there, not because it would hurt the index (probably not) but because it's useless - nobody is going to search for such a value.

There is a size limit on string values already. I don't remember the exact limit right now. Or are you looking for something else?

I was thinking about a shorter limit - not sure it makes sense to look up something by a whole SPARQL query... but maybe we should just exclude properties like this altogether.
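
For illustration, the idea of indexing statements by datatype while skipping an exclusion list could be sketched roughly as below. This is only a sketch under assumptions, not the Wikibase implementation: the function name, the tuple layout, and the sample statements are hypothetical, the exclusion list simply collects the properties named in the comments above, and the `P123=value` output shape follows the format used elsewhere in this thread.

```python
# Hypothetical sketch: pick statements to index by datatype, minus an exclusion list.
# Datatypes taken from this task; the exclusion list mirrors the properties
# discussed above (page(s), volume, issue, SPARQL query equivalent).
INDEXED_TYPES = {"string", "external-id"}
EXCLUDED_PROPERTIES = {"P304", "P478", "P433", "P3921"}


def statement_keywords(statements):
    """Return 'Pxxx=value' keywords for statements that should be indexed.

    `statements` is an iterable of (property_id, datatype, value) tuples;
    this layout is an assumption made for the example.
    """
    keywords = []
    for property_id, datatype, value in statements:
        if datatype not in INDEXED_TYPES:
            continue  # only the configured datatypes are indexed
        if property_id in EXCLUDED_PROPERTIES:
            continue  # explicitly excluded, e.g. page numbers
        keywords.append(f"{property_id}={value}")
    return keywords


# Illustrative input only; not real item data.
sample = [
    ("P217", "string", "SK-C-5"),
    ("P304", "string", "12-34"),
    ("P214", "external-id", "177124540"),
    ("P31", "wikibase-item", "Q3305213"),
]
print(statement_keywords(sample))
```

An explicit exclusion list keeps the by-type rule simple while still letting high-volume, low-value properties be dropped one by one.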

Change 430277 had a related patch set uploaded (by Smalyshev; owner: Smalyshev):
[mediawiki/extensions/Wikibase@master] Add capability to exclude properties from by-type index

https://gerrit.wikimedia.org/r/430277

Change 430277 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] Add capability to exclude properties from by-type index

https://gerrit.wikimedia.org/r/430277

@Lea_Lacroix_WMDE we need to make configs that enable indexing (will be done next thing) and then we need to actually reindex. Reindexing takes several days, so I planned to do it immediately after the Hackathon, unless you need it sooner.

Also, right now we can only locate by haswbstatement:P123=SK-C-5. If we want to index data without attached property IDs, we need to add a different field and analyzer to do that. Should we do it?
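
As a rough illustration of the keyword lookup mentioned above, a search URL for a `haswbstatement:` query could be built like this. The helper name is hypothetical, `P123` is the placeholder property ID used in this thread, and the base URL is the `index.php?search=` endpoint that appears in later comments.

```python
from urllib.parse import urlencode


def haswbstatement_search_url(property_id, value,
                              base="https://www.wikidata.org/w/index.php"):
    """Build a search URL for an exact statement match (hypothetical helper)."""
    query = f"haswbstatement:{property_id}={value}"
    # urlencode percent-encodes ':' and '=' inside the query value
    return f"{base}?{urlencode({'search': query})}"


print(haswbstatement_search_url("P123", "SK-C-5"))
```

Note that this form requires knowing the property ID up front, which is exactly the limitation raised here.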

Change 431994 had a related patch set uploaded (by Smalyshev; owner: Smalyshev):
[operations/mediawiki-config@master] Add string and external-id types to indexing

https://gerrit.wikimedia.org/r/431994

Change 431994 merged by jenkins-bot:
[operations/mediawiki-config@master] Add string and external-id types to Wikibase indexing

https://gerrit.wikimedia.org/r/431994

@Lea_Lacroix_WMDE Also, for newly edited items it should be working as soon as wmf.3 is deployed. But older items will need a reindex.

Mentioned in SAL (#wikimedia-operations) [2018-05-08T23:22:45Z] <thcipriani@tin> Synchronized wmf-config: SWAT: [[gerrit:431994|Add string and external-id types to Wikibase indexing]] T163642 T99899 (duration: 01m 26s)

I have no deadline in mind, I was just wondering when to announce it, and whether you or I should do it :)

I'll note here when the reindex is done, and then I guess you can announce :) In the meantime I can check that everything works smoothly with edited entries.

Good to see all this progress! Will a purge of an item or an edit to an item trigger a reindex of that item? Will https://www.wikidata.org/w/index.php?title=Q219831&action=cirrusdump show the "P123=SK-C-5" somewhere?

@Multichill check out https://www.wikidata.org/w/index.php?title=Q219831&action=cirrusdump - it has the data now.

Purge won't work though - you need to edit.

Mentioned in SAL (#wikimedia-operations) [2018-05-23T16:31:01Z] <SMalyshev> starting wikidata full reindex for T163642

@Smalyshev https://www.wikidata.org/w/index.php?search=%22SK-C-5%22 doesn't work yet, but this task has been closed. Can you explain? This is listed in the task description as something that should work.

@Multichill I think the point of this task was to index the statements, which is done. For searching, you can use haswbstatement for now. I am not sure whether it makes sense to copy the statement value into the all field, where it would then be searchable by plain search too - it may be useful for distinctive IDs but I am not sure how many of them are distinctive... I think it's better to make a separate task for this.

Hmm, I'm not sure this is all that useful, at least as it stands. Most external IDs can be found just as easily now via the Wikidata Resolver tool - https://tools.wmflabs.org/wikidata-todo/resolver.php - However, what I would find useful would be a way to locate, for example, partial street addresses - this (P969) is often entered as a qualifier on headquarters location (P159). Searching for 'haswbstatement:P969=Main' now finds something, but only because that item oddly has just 'Main' as the value for P969, and making the string lowercase ("main") finds nothing, which is definitely not what I would expect here... I don't think treating string values as if they were identifiers is the right approach; the usefulness of a search engine is in normalizing string values so you can find them without having the exact matching string. And qualifiers should be folded in somehow!
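
The difference between identifier-style matching and normalized matching described above can be illustrated with a small sketch. It is not how the search backend is implemented; both function names and the tiny in-memory "index" are hypothetical and only model the behaviour reported for `P969=Main` versus `P969=main`.

```python
# Hypothetical in-memory stand-in for the indexed statement keywords.
indexed_keywords = {"P969=Main"}


def exact_match(index, query):
    """Identifier-style lookup: the query must equal the stored keyword."""
    return query in index


def folded_match(index, query):
    """Normalized lookup: compare case-folded forms instead."""
    folded = query.casefold()
    return any(folded == keyword.casefold() for keyword in index)


print(exact_match(indexed_keywords, "P969=Main"))   # exact string is found
print(exact_match(indexed_keywords, "P969=main"))   # lowercase variant is not
print(folded_match(indexed_keywords, "P969=main"))  # found once case is folded
```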

That was not my point with this task. https://www.wikidata.org/w/index.php?search=%22SK-C-5%22 should return https://www.wikidata.org/wiki/Q219831. In my view the haswbstatement step is an intermediate one. Sorry for not being clear enough.

Taking a step back for a bigger overview. As a user I expect all the text I see on https://www.wikidata.org/wiki/Q21983 to be searchable as plain text. Currently we only index the labels, aliases and descriptions in the text (https://www.wikidata.org/w/index.php?title=Q219831&action=cirrusdump). All the statements should be added as well. That includes the string statements, but also the labels of the used items. Looking at https://www.wikidata.org/wiki/Special:EntityData/Q219831.rdf I realize this would be quite an increase in the search data. It would probably make sense to have localized plain text fields like "text_nl" which only have the data in that language. Why do this? Using our search to find something is hard. I was in the Thyssen Bornemisza museum, trying to use search to find the Van Gogh paintings on Wikidata. That's currently impossible. Do we already have a task for this or should I create a new one for this part?

I don't see clear disadvantages of doing the indexing Multichill suggests.

I don't see any mentioned here either, besides not indexing some specific ones (e.g. page number).

Compared to pubmed article titles, it seems at least as useful.

debt moved this task from Needs Reporting to Incoming on the Discovery-Search (Current work) board.

Moving this to the backlog for now, as the conversation is still ongoing but there is no clear owner of the work to be done.

@Smalyshev / @debt: I think this is one of those tasks where we have a bit of a misunderstanding about scope (see https://lists.wikimedia.org/pipermail/wikidata/2018-August/012282.html ). Close this one as resolved and make clearly scoped follow-up tasks to untangle this? :-)

Smalyshev renamed this task from Index Wikidata strings in statements in the search engine to Index Wikidata strings in statements for generic search. Aug 10 2018, 8:16 PM
Smalyshev renamed this task from Index Wikidata strings in statements for generic search to Index Wikidata strings in statements for fulltext search.
Smalyshev updated the task description.

@Multichill I think with the new description it is clearer what this is about.

If this ticket is about matching (without using any kind of search keyword) an entity referencing a string regardless of its usage (label/alias/statements), then why not simply put all the statement strings into the text field or auxiliary_text (it's not used currently)?
To save some space we could also stop populating the source_text field for entities; I doubt that insource is helpful on Wikidata.

Putting it into auxiliary_text might be a bit tricky, because right now we extract statement values etc. as separate fields, and they are in the format of P123=String or P123=Q456 (and we don't know which it would be). Putting Q-ids into auxiliary_text is probably pointless (correct me if I am missing something here), so I presume we'd need some mechanism for fields to also extract the parts that need to go into auxiliary_text? Will look into it a bit more to see what can be done there.

I tried copying the statement_keywords field into the all field (for a random 1M items) and the results don't seem too encouraging - the all field e.g. tokenizes tt0041008 as two tokens, tt and 0041008. When searching, it does produce Q18636386, as it should, but also for example Q507445 (which has TT in its name) with a higher score. So I don't think copying it into text fields would work.
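
To make the tokenization problem above concrete, here is a minimal sketch. It is an approximation written for this example, not the actual analyzer of the all field: it simply splits text into runs of letters and runs of digits, which reproduces the tt / 0041008 split described in the comment. The function names are hypothetical and `P123` is the placeholder property ID used in this thread.

```python
import re


def split_alnum_tokens(text):
    """Lowercase the text and split it into runs of letters and runs of digits."""
    return re.findall(r"[a-z]+|[0-9]+", text.lower())


def shares_token(query, document_text):
    """True if the query and the document have at least one token in common."""
    return bool(set(split_alnum_tokens(query)) & set(split_alnum_tokens(document_text)))


print(split_alnum_tokens("tt0041008"))
# A document holding the full identifier matches, as intended ...
print(shares_token("tt0041008", "P123=tt0041008"))
# ... but so does any document that merely contains "TT" in its name.
print(shares_token("tt0041008", "TT"))
```

Because each fragment matches independently, a short and common token like "tt" can pull in unrelated items, which is the effect seen with Q507445.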

@dcausse, if you want to check it, it's in the stas_wikidata_test index on relforge.

Change 456026 had a related patch set uploaded (by Smalyshev; owner: Smalyshev):
[mediawiki/extensions/Wikibase@master] Add capability to search for statement values directly

https://gerrit.wikimedia.org/r/456026

Change 458074 had a related patch set uploaded (by Smalyshev; owner: Smalyshev):
[mediawiki/extensions/Wikibase@master] Add statement fields to "all" field.

https://gerrit.wikimedia.org/r/458074

Change 458293 had a related patch set uploaded (by Smalyshev; owner: Smalyshev):
[mediawiki/extensions/Wikibase@master] Add phrase rescoring to queries

https://gerrit.wikimedia.org/r/458293

Change 458074 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] Add statement fields to "all" field.

https://gerrit.wikimedia.org/r/458074

Sorry, kind of missed this comment. I think this is clearer. Thanks for picking this up :-)

Change 458293 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] Add phrase rescoring to queries

https://gerrit.wikimedia.org/r/458293
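
For readers unfamiliar with phrase rescoring, the general shape of such a request in the Elasticsearch query DSL is sketched below. This is not the content of the patch above, which is not shown in this thread: the function name, window size, and weights are assumptions for the example, and only the `rescore` / `rescore_query` / `match_phrase` structure is standard Elasticsearch DSL.

```python
def phrase_rescore_query(user_query, field="all", window_size=512, rescore_weight=10.0):
    """Build a search body that re-ranks the top hits by phrase match.

    Hypothetical helper; window size and weights are illustrative values.
    """
    return {
        # First pass: ordinary token-based match on the field.
        "query": {"match": {field: {"query": user_query}}},
        # Second pass: boost documents among the top hits where the
        # query tokens appear adjacent and in order.
        "rescore": {
            "window_size": window_size,
            "query": {
                "rescore_query": {"match_phrase": {field: {"query": user_query}}},
                "query_weight": 1.0,
                "rescore_query_weight": rescore_weight,
            },
        },
    }


body = phrase_rescore_query("tt0041008")
print(body["rescore"]["query"]["rescore_query"])
```

The intent of a phrase pass is that a document containing the tokens together, in order, outranks one that only contains a single fragment.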

Change 462347 had a related patch set uploaded (by Smalyshev; owner: Smalyshev):
[operations/mediawiki-config@master] Add phrase rescoring to config

https://gerrit.wikimedia.org/r/462347

Change 462349 had a related patch set uploaded (by Smalyshev; owner: Smalyshev):
[mediawiki/extensions/Wikibase@master] Give phrase profile own name

https://gerrit.wikimedia.org/r/462349

Change 462351 had a related patch set uploaded (by Smalyshev; owner: Smalyshev):
[operations/mediawiki-config@master] Enable phrase search config

https://gerrit.wikimedia.org/r/462351

Change 462349 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] Give phrase profile own name

https://gerrit.wikimedia.org/r/462349

Smalyshev moved this task from Done to In review on the User-Smalyshev board.

Change 462347 merged by jenkins-bot:
[operations/mediawiki-config@master] Add phrase rescoring to config

https://gerrit.wikimedia.org/r/462347

Change 462351 merged by jenkins-bot:
[operations/mediawiki-config@master] Enable phrase search config

https://gerrit.wikimedia.org/r/462351

Mentioned in SAL (#wikimedia-operations) [2018-09-27T23:29:14Z] <thcipriani@deploy1001> Synchronized wmf-config/WikibaseSearchSettings.php: SWAT: [[gerrit:462351|Enable phrase search config]] T163642 (duration: 00m 56s)

Change 456026 abandoned by Smalyshev:
Add capability to search for statement values directly

Reason:
We went with another option

https://gerrit.wikimedia.org/r/456026