[Story] Looking up entities by external identifiers
Closed, Resolved, Public

Description

It would be very useful, especially for integration with 3rd party services, to be able to look up Entities based on the external identifiers associated with them.

Event Timeline

Bene triaged this task as Medium priority. Jul 29 2015, 1:15 PM
JanZerebecki renamed this task from "Allow looking up Entities by external identifiers." to "[Story] Allow looking up Entities by external identifiers." Sep 10 2015, 9:18 PM
JanZerebecki moved this task from "incoming" to "needs discussion or investigation" on the Wikidata board.

One option to implement this would be via integration with Cirrus. This was roughly outlined during a discussion with Stas, Jan, and David. Here is an excerpt from my notes from that session:

Property Search and SiteLink Search
Goal: allow searches for property values, e.g. property:P212:978-2-07-027437-6. Similarly, allow searches for sitelinks, e.g. sitelink:enwiki:Foo.

Outline:

  • one field for property, one field for sitelink
  • populated with key value pairs as concatenated strings, e.g. P212:978-2-07-027437-6 or enwiki:Foo.
  • only consider exact matches

Needs:

  • an extension point for parsing the special syntax from the search box input
  • an extension point for defining the relevant fields
  • an extension point for feeding data to the relevant fields upon save / re-index
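
To make the outline above concrete, here is a minimal sketch of the concatenated key:value scheme and of parsing the special syntax from the search box. The function names and the regular expression are hypothetical illustrations, not part of Cirrus or Wikibase; only the `property:` and `sitelink:` prefixes and the example values come from the notes.

```python
import re

# Assumed keyword syntax from the notes:
#   property:P212:978-2-07-027437-6   and   sitelink:enwiki:Foo
KEYWORD_RE = re.compile(r"^(property|sitelink):([^:]+):(.+)$")


def index_term(key: str, value: str) -> str:
    """Concatenate key and value into the single string stored in the field."""
    return f"{key}:{value}"


def parse_keyword(query: str):
    """Return (field, exact-match term) for a keyword query, or None for plain text."""
    match = KEYWORD_RE.match(query.strip())
    if match is None:
        return None
    field, key, value = match.groups()
    return field, index_term(key, value)


print(parse_keyword("property:P212:978-2-07-027437-6"))
print(parse_keyword("sitelink:enwiki:Foo"))
print(parse_keyword("Douglas Adams"))
```

Because only exact matches are considered, the term produced at query time has to be byte-for-byte identical to the term written at index time, which is why both paths go through the same `index_term` helper in this sketch.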

Can we also hardcode some standard, quasi-standard and widely used prefixes, e.g. searching for "isbn:1234567890123"?

Update: resolver now uses SPARQL instead of WDQ. Also, quick prefix system added (resolving property from English name fragment via SPARQL).

https://tools.wmflabs.org/wikidata-todo/resolver.php

I just merged T149108 with this task. There is a minor difference between the two tasks: T149108 expands the lookup of Entities from external identifier properties to other properties with a "distinct singular value" that link to Commons pages (1:1 link).

If we expand the lookup to sitelinks, then we can also merge this task with T74815.

Thanks for the merge, @Jarekt. I believe we can do this.

Our current plan is to create an inverse index from all "external-id" values to the ID of the entity they appear in. Note that even "external-id" values are not technically required to be unique, so this index must handle duplicates.

Considering this, I believe we can add all "string" properties to the same index and not run into any problems (as long as all queries on this index always include a specific property ID). This would solve the "Commons category" task.
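
A minimal in-memory sketch of such an inverse index, assuming a plain dictionary keyed by (property ID, value); the class and method names are hypothetical and this is not the actual storage design. Each key maps to a set of entity IDs so that duplicates are kept, and every lookup must name a property ID, which is what keeps "string" and "external-id" values from colliding.

```python
from collections import defaultdict


class StatementIndex:
    """Inverse index: (property ID, value) -> set of entity IDs."""

    def __init__(self):
        self._entities = defaultdict(set)

    def add(self, entity_id: str, property_id: str, value: str) -> None:
        # A set per key, because values are not guaranteed to be unique.
        self._entities[(property_id, value)].add(entity_id)

    def lookup(self, property_id: str, value: str) -> list:
        # The property ID is always part of the key, never optional.
        return sorted(self._entities.get((property_id, value), ()))


index = StatementIndex()
index.add("Q1", "P212", "978-2-07-027437-6")
index.add("Q2", "P212", "978-2-07-027437-6")  # duplicate external ID
index.add("Q3", "P373", "978-2-07-027437-6")  # same string, different property
print(index.lookup("P212", "978-2-07-027437-6"))
```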

Querying by sitelink is a different task and should be discussed separately.

Lydia_Pintscher renamed this task from "[Story] Allow looking up Entities by external identifiers." to "[Story] Looking up entities by external identifiers". Apr 7 2018, 11:33 AM
Lydia_Pintscher updated the task description.
Lydia_Pintscher subscribed.

@Smalyshev In your opinion, is going via Elastic the right way to do this? How far are we from making this a reality, especially with your work on indexing specific statements?

Yes, we can make it work if we index external IDs in statements. I think we're not far from it: we should already have the infrastructure for the indexing. The only thing left to decide is whether we index all external ID properties (may require some coding, though it shouldn't be complicated by now) or specific properties (this would be just a config change).
Once we have built the index, we need to create a search keyword like property: (that's some code but should be rather easy by now), and then it should be done.

I would turn it on for all the properties that have "distinct values constraint" constraints, ensuring that each identifier points to a single item (except for constraint violations). Also, per task T149108 (merged with this one), I hope this task will include Properties like Commons Creator page (P1472) (type string).

I would turn it on for all the properties that have "distinct values constraint" constraints

This is not possible out of the box (the configuration doesn't have an option for this kind of criterion). The options are either all properties of a specific type, or a defined list of properties. Of course, we can generate this list using the criteria above, but then the list would have to be updated regularly, with an accompanying reindex. Is that what we want?

I hope this task will include Properties like Commons Creator page (P1472)

This is a String, not External ID, property. I am not sure we want to index all String properties, so here we probably need a list.

Currently we have 2614 properties of type ExternalId. Of these, 2260 properties have both the single and the distinct constraint ( https://query.wikidata.org/sparql?query=SELECT%20%3Fproperty%20%7B%20%3Fproperty%20wikibase%3ApropertyType%20wikibase%3AExternalId%20.%20%3Fproperty%20wdt%3AP2302%20wd%3AQ19474404%20.%20%3Fproperty%20wdt%3AP2302%20wd%3AQ21502410%20%7D ). You could make a list, but that sounds like a maintenance burden.
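
The query in that link is hard to read in its percent-encoded form. A small sketch using only the Python standard library recovers the SPARQL text from the URL as posted; nothing here contacts the query service.

```python
from urllib.parse import urlsplit, parse_qs

# The URL exactly as posted in the comment above.
URL = (
    "https://query.wikidata.org/sparql?query=SELECT%20%3Fproperty%20%7B%20"
    "%3Fproperty%20wikibase%3ApropertyType%20wikibase%3AExternalId%20.%20"
    "%3Fproperty%20wdt%3AP2302%20wd%3AQ19474404%20.%20"
    "%3Fproperty%20wdt%3AP2302%20wd%3AQ21502410%20%7D"
)

# parse_qs percent-decodes the value of the "query" parameter.
sparql = parse_qs(urlsplit(URL).query)["query"][0]
print(sparql)
```

The decoded query selects every property whose type is wikibase:ExternalId and that has two P2302 (property constraint) values, Q19474404 and Q21502410, which are the two constraints the comment refers to.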

I would just index all ExternalId's and make the index assume the entries are distinct and unique, so when it encounters a second Wikidata item with the same external ID, it just overwrites it. No clue if this is possible in Elastic.

You could make a list, but that sounds like a maintenance burden.

Precisely.

I would just index all ExternalId's

I also think this is the best way.

make the index assume the entries are distinct and unique, so when it encounters a second Wikidata item with the same external ID, it just overwrites it

The index can't do that, unfortunately. I don't even think there's such a thing as a unique field in Elasticsearch; the only field that is unique is the document ID.

So if you have two items with the same external ID, the search will find them both. Now, if you build some service on top of it (like a special page), it can interpret the search results and resolve the collision. But I see no way to avoid duplicates in the search index if they are in the data.

I would do a really simple and stupid resolving approach: take the top result, and maybe sort the results by something like popularity_score.
It might be good to avoid making yet another special page and just use the API ( https://www.mediawiki.org/wiki/Special:MyLanguage/API:Search / https://www.wikidata.org/w/api.php?action=help&modules=query%2Bsearch ). That way you probably only need a bit of JavaScript to glue everything together. Right now the JavaScript just hits https://www.wikidata.org/w/api.php?action=help&modules=wbsearchentities ; it could be changed to hit the search instead if the input starts with P<some integer>:something . Or you could expand wbsearchentities to do the dirty work for you, but I think that would mix up the code a bit.
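
The routing idea in that comment can be sketched as follows. The comment talks about JavaScript glue; Python is used here only for illustration, and the function name, the regular expression, and the "statement"/"label" tags are all hypothetical. Input that looks like P<some integer>:something is sent to the statement search, and everything else stays with the normal label lookup.

```python
import re

# Assumed input shape from the comment: "P<some integer>:something".
STATEMENT_INPUT_RE = re.compile(r"^(P[1-9][0-9]*):(.+)$")


def route_input(text: str):
    """Decide whether the input is a statement lookup or a plain label search."""
    text = text.strip()
    match = STATEMENT_INPUT_RE.match(text)
    if match:
        # (kind, property ID, value) for the statement search.
        return ("statement", match.group(1), match.group(2))
    # Anything else goes to the existing label search unchanged.
    return ("label", text)


print(route_input("P650:77379"))
print(route_input("Douglas Adams"))
```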

Probably best to split this task up in two parts:

  1. Get the ExternalId's indexed
  2. Figure out a way for the user to access it

First part is probably clear now (just index all ExternalId's), second part probably needs a bit more thought.

Why is uniqueness even an issue? Just provide a way to search items by the value associated with a property. The result will be a ranked list, potentially incomplete if there are many matches (as opposed to a query, which would yield a complete list, but no ranking). If the client just wants the "best" match, it should just use the top match.

Yeah, agreed. Let's just go with returning all matches if there is more than one. And yes, it makes sense to index all IDs unless there is a technical reason not to.

As for accessing it: a search keyword on Special:Search should be fine imho.

Change 427836 had a related patch set uploaded (by Smalyshev; owner: Smalyshev):
[mediawiki/extensions/Wikibase@master] Add capability to index any property by type

https://gerrit.wikimedia.org/r/427836

Change 427836 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] Add capability to index any property by type

https://gerrit.wikimedia.org/r/427836

Change 431994 had a related patch set uploaded (by Smalyshev; owner: Smalyshev):
[operations/mediawiki-config@master] Add string and external-id types to indexing

https://gerrit.wikimedia.org/r/431994

Change 431994 merged by jenkins-bot:
[operations/mediawiki-config@master] Add string and external-id types to Wikibase indexing

https://gerrit.wikimedia.org/r/431994

Mentioned in SAL (#wikimedia-operations) [2018-05-08T23:22:45Z] <thcipriani@tin> Synchronized wmf-config: SWAT: [[gerrit:431994|Add string and external-id types to Wikibase indexing]] T163642 T99899 (duration: 01m 26s)

New edits will have external-id indexed as soon as wmf.3 is deployed, old data needs full reindex.

In terms of functionalities, what's the outcome of this?

Is it just haswbstatement:P1234=ABCD in https://www.wikidata.org/w/index.php?search= ?

So how do we use this new feature?

So how do we use it? Should I create tickets for each way one might want to interface with it?

P1472 is not of datatype external identifier but string. (You can see this on the property page.)

Here is an example for one that works: https://www.wikidata.org/w/index.php?sort=relevance&search=haswbstatement%3AP650%3D77379&title=Special%3ASearch&profile=advanced&fulltext=1&advancedSearch-current=%7B%7D&ns0=1&ns120=1 Hope that helps.
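
For API access, the same keyword can be passed to the MediaWiki search API (action=query, list=search) mentioned earlier in this thread. This is a minimal sketch that only builds the request URL; the helper names are hypothetical, and whether a given property returns results depends on its datatype being indexed.

```python
from urllib.parse import urlencode

API_ENDPOINT = "https://www.wikidata.org/w/api.php"


def statement_search_term(property_id: str, value: str) -> str:
    """Build the haswbstatement keyword used in the example link above."""
    return f"haswbstatement:{property_id}={value}"


def api_search_url(property_id: str, value: str, limit: int = 10) -> str:
    """Return a list=search URL for the given property/value pair."""
    params = {
        "action": "query",
        "list": "search",
        "srsearch": statement_search_term(property_id, value),
        "srlimit": limit,
        "format": "json",
    }
    # urlencode percent-escapes ":" and "=" inside the search term.
    return API_ENDPOINT + "?" + urlencode(params)


print(api_search_url("P650", "77379"))
```

The response is a ranked list, so a client that only wants the "best" match can take the first hit, while one that cares about duplicates can inspect all of them.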

Lydia, thanks for the explanation. I guess I had the wrong idea about this task. Three years ago I created T149108, which was identical to this ticket except that it would be used for looking up entities by the values of string properties. I was especially interested in access from Lua to properties that store links to pages on Commons; since we do not have a special datatype for interwiki links, they are stored as strings. Since the two tickets were so similar, after talking it over with others I merged them and explained above my hope that the solution to this ticket would be broad enough to also cover T149108. The response I got was that this should not be much of a problem. But I guess this did not happen, so we are back to square one with T149108. The main motivation for T149108 is to allow a page that is linked to (through a property) to be aware of which entity links to it, the same way it works for sitelinks.