Add wikibase client support for searching wikidata items
Closed, Invalid · Public

Description

Right now, Wikibase Client provides access to searching Wikidata items, e.g. via newTermSearchInteractor(). It is used in ArticlePlaceholder, and Lua clients probably use some version of it too. It may make sense to make ElasticSearch searching available via this API too, possibly by implementing a TermSearchInteractor that uses ElasticSearch.

Event Timeline

The problematic part here seems to be that WikibaseClient has a separate configuration from WikibaseRepo, so we cannot access the search profiles, and without those we can't run the search. Not sure what the right way to handle it is.

Smalyshev changed the task status from Open to Stalled. · Jan 13 2018, 1:23 AM

One possible way may be to use cirrus-config-dump API (or make Cirrus somehow do it for us). But it's kinda heavyweight... OTOH, cross-wiki search uses it so maybe it's ok.
@dcausse - what do you think - would it be possible to make profile management somehow have "remote wiki" mode that would load configs from other wiki for cases like this? We already do it in sister search but probably not in a way that is reusable?

This is in theory possible but the problem is that some profiles refer to some class implementations that are maybe not available on the host wiki.
So yes, we could use the sister search logic with some adaptation, but the main blocker will be that the builder implementation won't be available if the Wikibase extension is not loaded on the host wiki.
We will have similar problems with SDoC search sooner or later since search on commons is available from all wikis. If SDoC search provides some custom implementations the code will have to be available on the host wiki.
In short, if the WikibaseClient imports the builder classes then it's probably fine, if not I think it'll be hard to do it.

EDIT: another problem is that cirrus-config-dump exports the Cirrus config, but not all of the profile information is stored in wgCirrusSearch* vars. We would have to adapt cirrus-config-dump so that it can export a state of the search profiles that can be reloaded on the host wiki.

@Smalyshev What's the status here? Say we want to get rid of the wb_terms table…

@hoo I am still not sure what would be a good way to get search configs to the client... Maybe an extension to the cirrus-config-dump API is needed. It is also complicated by the fact that Cirrus and Wikibase use different configs (which normally feed from the same globals, but they do not cooperate in any way AFAIK), which makes it kind of hard to inject stuff. How urgent is this? If it's important in the near term, I can allocate specific time to work on it closely and find a solution; otherwise I'll think about it and get to it a bit later.

So I thought about it a bit more, and it looks like we don't really need to bring search configs over from the repo. We can have a fixed set of configs that is enough for a simple, straightforward match on the client, bake them into the client, and use those instead of the repo ones.
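To make the "baked-in fixed config" idea a bit more concrete, here is a rough sketch in Python (not Wikibase code). It only shows the shape of the idea: a constant profile shipped with the client and a small builder that turns it into an Elasticsearch request body. The profile name, the field names (`labels.{lang}.near_match`, `labels_all.near_match`) and the boosts are assumptions made up for illustration, not the repo's actual search profiles.

```python
from typing import Any, Dict

# Hypothetical fixed profile shipped with the client. The field names and
# boosts are illustrative assumptions, not the repo's real search profiles.
FIXED_LABEL_MATCH_PROFILE: Dict[str, Any] = {
    "fields": [
        {"template": "labels.{lang}.near_match", "boost": 2.0},
        {"template": "labels_all.near_match", "boost": 1.0},
    ],
    "default_limit": 10,
}


def build_label_query(
    text: str,
    lang: str,
    limit: int = 0,
    profile: Dict[str, Any] = FIXED_LABEL_MATCH_PROFILE,
) -> Dict[str, Any]:
    """Build an Elasticsearch request body for a simple label match.

    Nothing is loaded from the repo wiki: everything comes from the
    constant profile above.
    """
    clauses = []
    for field in profile["fields"]:
        # Fill in the language for per-language fields; fields without a
        # {lang} placeholder are left unchanged by format().
        name = field["template"].format(lang=lang)
        clauses.append({"match": {name: {"query": text, "boost": field["boost"]}}})
    return {
        "size": limit or profile["default_limit"],
        "query": {"bool": {"should": clauses, "minimum_should_match": 1}},
    }
```

Calling `build_label_query("Berlin", "de", limit=5)` yields a `bool` query with one `should` clause per configured field. Whether such a fixed profile is good enough for the client's use cases is exactly the open question in this task.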

Does it mean that we would make WikibaseClient dependent on CirrusSearch and create all necessary query builders into this client?
Have we considered the possibility to run an actual API call to wbsearchentities@wikidata.org?
I have no clue if the current API output would allow to rebuild TermSearchResult nor if there are perf considerations that make this solution impossible.
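
For reference, the API route could look roughly like the Python sketch below. The request parameters (`action=wbsearchentities`, `search`, `language`, `type`, `limit`, `format`) are the documented ones, and the response keys read here (`search`, `id`, `label`, `description`, `match`) follow the documented output shape. The `SearchHit` class is a hypothetical stand-in for whatever TermSearchResult needs, and the sketch makes no network call, so it says nothing about the performance question.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional
from urllib.parse import urlencode

API_ENDPOINT = "https://www.wikidata.org/w/api.php"


@dataclass
class SearchHit:
    """Hypothetical stand-in for the data a TermSearchResult would carry."""
    entity_id: str
    matched_term: Optional[str]
    matched_term_type: Optional[str]
    matched_language: Optional[str]
    display_label: Optional[str]
    display_description: Optional[str]


def build_search_url(text: str, language: str,
                     entity_type: str = "item", limit: int = 7) -> str:
    """Build the wbsearchentities request URL (no request is sent here)."""
    params = {
        "action": "wbsearchentities",
        "search": text,
        "language": language,
        "type": entity_type,
        "limit": limit,
        "format": "json",
    }
    return API_ENDPOINT + "?" + urlencode(params)


def parse_search_response(payload: Dict[str, Any]) -> List[SearchHit]:
    """Map each entry of the decoded JSON response to a SearchHit."""
    hits = []
    for entry in payload.get("search", []):
        match = entry.get("match", {})
        hits.append(SearchHit(
            entity_id=entry["id"],
            matched_term=match.get("text"),
            matched_term_type=match.get("type"),
            matched_language=match.get("language"),
            display_label=entry.get("label"),
            display_description=entry.get("description"),
        ))
    return hits
```

A caller would fetch `build_search_url("Douglas Adams", "en")`, decode the JSON, and pass it to `parse_search_response`. Every such lookup is a full HTTP round trip.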

Does it mean that we would make WikibaseClient dependent on CirrusSearch

Well, ideally, after T190022 (Separate the CirrusSearch/Elastic-specific code from Wikibase code base), it will all be in the WikibaseCirrusSearch extension, I presume.

and create all necessary query builders into this client?

Yes, that's the idea.

Have we considered the possibility to run an actual API call to wbsearchentities@wikidata.org

I thought about it, but it looks like a rather serious performance hit (going back through all the caching infrastructure, getting all the request init overhead again, and then parsing the results). And I understand the main motivation here is performance. If we have a page with Lua that requests 20 lookups, having 20 sub-requests may be a bit too much.

It also feels a bit wrong to go the whole round trip when we have most classes and configs sitting right here.

I have no clue if the current API output would allow to rebuild TermSearchResult

Probably, but I am not convinced we should do it. Right now I am leaning towards not doing it.

This ticket conflates two very different things, which makes it difficult to discuss tradeoffs:
#1 looking up properties by label (PropertyIdResolver)
#2 interactively searching for items based on some search input

For #1 performance is an issue, and API calls are a no-go, since they would have to happen during parsing, and we may be doing dozens or even hundreds of them per page.
For #2, API calls would be fine, we have much more time, and only ever one search per request.

The two use cases also need very different search profiles. I suggest discussing them in separate tickets.

I am not sure how looking up properties by label is different from looking up items by label. Am I missing something here? Are only properties but not items allowed to be looked up by label? I feel like I am missing some context here.

OK, so #1 is basically T194143: Make PropertyLabelResolver that uses ElasticSearch. So I think it should be discussed here.

Which leaves us with #2, which is implementing TermSearchInteractor that can do ElasticSearch. For this, we need to identify the use cases for it. I'll look for them and update the task description accordingly.

@Smalyshev this is not the priority right now. We'll try some other approach for ArticlePlaceholder later in the fall. For Lua we'd also be thinking about options.
So this remains stalled. We'll get back to you guys if we decide to pursue ElasticSearch. But this is certainly not going to happen in the next few weeks.

Aklapper changed the task status from Stalled to Open. · Nov 3 2020, 3:29 PM

The previous comments don't explain who or what (task?) exactly this task is stalled on ("If a report is waiting for further input (e.g. from its reporter or a third party) and can currently not be acted on"). Hence resetting task status, as tasks should not be stalled (and then potentially forgotten) for years for unclear reasons.

(Smallprint, as general orientation for task management:
If you wanted to express that nobody is currently working on this task, then the assignee should be removed and/or priority could be lowered instead.
If work on this task is blocked by another task, then that other task should be added via Edit Related Tasks...Edit Subtasks.
If this task is stalled on an upstream project, then the Upstream tag should be added.
If this task requires info from the task reporter, then there should be instructions which info is needed.
If this task needs retesting, then the TestMe tag should be added.
If this task is out of scope and nobody should ever work on this, or nobody else managed to reproduce the situation described here, then it should have the "Declined" status.
If the task is valid but should not appear on some team's workboard, then the team project tag should be removed while the task has another active project tag.)

Closing as invalid, as I struggle to see the context and figure out what/why now.
It should also be noted that many things in this area have changed since 2017.