[ES-M3]: Investigate how search could work by label and aliases on the EntitySchema expert
Closed, ResolvedPublic

Description

Once the EntitySchema expert is created in T362004, we would like to make it easier for users to search for EntitySchema by the label and aliases.

Before we do this, we must investigate how search could work by label and aliases on the EntitySchema expert.

Acceptance Criteria

  • A technical direction is documented on how we can enable search by label and aliases on the EntitySchema expert

Notes
This should be timeboxed before we begin the investigation.

Event Timeline

Arian_Bozorg renamed this task from [ES-M2]: Investigate how search could work by label and aliases on the EntitySchema expert to [ES-M3]: Investigate how search could work by label and aliases on the EntitySchema expert. May 29 2024, 11:17 AM

Change #1056116 had a related patch set uploaded (by Hoo man; author: Hoo man):

[mediawiki/extensions/EntitySchema@master] [PoC] Use Wikibase Lib's term store in ES

https://gerrit.wikimedia.org/r/1056116

Hm, I’ll admit this isn’t what I expected from this task 😅 I’ll leave some comments here rather than on Gerrit, since they’re pretty general:

  • I assumed the main subject of this investigation would be ElasticSearch integration – after all, we’re not using the term store for searching Items and Properties in production. Did you look into ElasticSearch, or other alternative approaches to the term store? Are there known reasons why we should use the term store instead? (The main downside of the term store that I’m aware of is that any search based on it is case sensitive, at least with the current implementation. I think there are ways to make term store search case insensitive, and they would actually be useful to some third-party Wikibases, but they haven’t been a product priority so far.)
  • The attached patch goes a lot further than I would expect from an investigation task – if we can use it, that’s really cool!
  • I’m quite happy to see the term store being used in an extension, because in my view this paves the way towards using the term store for showing links to other pages – which is something that, at least for Lexemes, I assume we want to do sooner or later (rather than the current approach of actually loading and parsing the full Lexeme / EntitySchema). The biggest TODO left in the PoC change is probably the FindUnusedTermTrait::findActuallyUnusedTermInLangIds() hook (or similar) that you already mention in the commit message.

  • I assumed the main subject of this investigation would be ElasticSearch integration – after all, we’re not using the term store for searching Items and Properties in production. Did you look into ElasticSearch, or other alternative approaches to the term store? Are there known reasons why we should use the term store instead? (The main downside of the term store that I’m aware of is that any search based on it is case sensitive, at least with the current implementation. I think there are ways to make term store search case insensitive, and they would actually be useful to some third-party Wikibases, but they haven’t been a product priority so far.)

I briefly looked into this and I thought making the WikibaseCirrusSearch extension work with pseudo-entities would be too much work. Maybe I'm missing something or the code is not as tightly bound to Wikibase (mostly the data model) as it seemed to me, but forking the extension for ES might have been the only path forward here, and I don't think that's worthwhile.

For the number of entity schemas we will have for the foreseeable future I think this should work well enough.

  • The attached patch goes a lot further than I would expect from an investigation task – if we can use it, that’s really cool!

I tried to make sure everything (except pruning, see the todos) can be made to work.

  • I’m quite happy to see the term store being used in an extension, because in my view this paves the way towards using the term store for showing links to other pages – which is something that, at least for Lexemes, I assume we want to do sooner or later (rather than the current approach of actually loading and parsing the full Lexeme / EntitySchema). The biggest TODO left in the PoC change is probably the FindUnusedTermTrait::findActuallyUnusedTermInLangIds() hook (or similar) that you already mention in the commit message.

Indeed, but I think this shouldn't be particularly hard… we probably only need a hook to provide us with the tables+columns to check.
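A minimal sketch of that idea, in Python with SQLite purely for illustration: each consumer of the term store registers the (table, column) pairs that reference term IDs, and the pruning step only treats an ID as unused if none of those columns still points at it. The function name, the table names and the column names below are all hypothetical and are not taken from Wikibase or from the PoC.

```python
import sqlite3


def find_unused_ids(conn, candidate_ids, reference_columns):
    """Return the candidate term IDs that no registered column references.

    reference_columns is a list of (table, column) pairs, i.e. what a
    hypothetical hook would collect from Wikibase and from extensions.
    The names are interpolated into the SQL, so they must come from
    trusted code, never from user input.
    """
    unused = set(candidate_ids)
    for table, column in reference_columns:
        if not unused:
            break  # every candidate is already known to be in use
        placeholders = ",".join("?" * len(unused))
        rows = conn.execute(
            f"SELECT DISTINCT {column} FROM {table} "
            f"WHERE {column} IN ({placeholders})",
            tuple(unused),
        )
        # Anything still referenced by this table is not safe to prune.
        unused -= {row[0] for row in rows}
    return unused
```

With a design like this, the existing pruning logic would not need to know about EntitySchema at all; the extension would only contribute one more (table, column) pair to the list.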

I briefly looked into this and I thought making the WikibaseCirrusSearch extension work with pseudo-entities would be too much work. Maybe I'm missing something or the code is not as tightly bound to Wikibase (mostly the data model) as it seemed to me, but forking the extension for ES might have been the only path forward here, and I don't think that's worthwhile.

I think forking it is more or less what I had in mind, though we could include the code directly in EntitySchema without making it a separate extension (WikibaseMediaInfo also does it this way, I believe).

If we go with the term store – is it okay for product that the EntitySchema search is case-sensitive? Or should we try to add a case-insensitive search based on the term store? (I think it would be relatively doable in the current SQL schema by introducing normalized labels/aliases as a separate term type, like how wb_terms used to have term_search_key in addition to term_text, pointing to the same text_in_lang/text tables. But it would take some more work, of course.)
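To illustrate the "normalized term type" idea, here is a rough sketch in Python with SQLite. It uses a single flattened terms table rather than the real normalized term store schema, and the `_normalized` term types, the `normalize` helper and all other names are assumptions made up for this example.

```python
import sqlite3


def normalize(text):
    """Case-fold a term; a real implementation might normalize further."""
    return text.casefold()


def build_store(rows):
    """Create an in-memory store from (entity_id, term_type, lang, text) rows.

    Every term is stored twice: once verbatim and once under a
    hypothetical "<type>_normalized" term type holding the folded text.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE terms (entity_id TEXT, term_type TEXT, lang TEXT, text TEXT)"
    )
    for entity_id, term_type, lang, text in rows:
        conn.execute(
            "INSERT INTO terms VALUES (?, ?, ?, ?)",
            (entity_id, term_type, lang, text),
        )
        conn.execute(
            "INSERT INTO terms VALUES (?, ?, ?, ?)",
            (entity_id, term_type + "_normalized", lang, normalize(text)),
        )
    return conn


def search_prefix(conn, lang, prefix, case_insensitive):
    """Prefix search over labels and aliases in one language."""
    if case_insensitive:
        types = ("label_normalized", "alias_normalized")
        needle = normalize(prefix)
    else:
        types = ("label", "alias")
        needle = prefix
    # substr() with "=" compares exactly, so the only case folding that
    # happens is the one applied when the normalized rows were written.
    cursor = conn.execute(
        "SELECT DISTINCT entity_id FROM terms "
        "WHERE lang = ? AND term_type IN (?, ?) AND substr(text, 1, ?) = ? "
        "ORDER BY entity_id",
        (lang, types[0], types[1], len(needle), needle),
    )
    return [row[0] for row in cursor]
```

The existing case-sensitive lookups keep using the original term types unchanged, while search reads only the normalized ones, at the cost of extra rows per term.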

If we go with the term store – is it okay for product that the EntitySchema search is case-sensitive? Or should we try to add a case-insensitive search based on the term store? (I think it would be relatively doable in the current SQL schema by introducing normalized labels/aliases as a separate term type, like how wb_terms used to have term_search_key in addition to term_text, pointing to the same text_in_lang/text tables. But it would take some more work, of course.)

I fear it would be too confusing if case-insensitive search works when making some statements but not others. It would probably lead people to assume that an EntitySchema doesn't exist and, in the worst case, to create a duplicate one.

This ticket is marked as ready for peer review, but peer review seems already to have started. @hoo will you move this back into development based on the feedback?

hoo removed hoo as the assignee of this task. Aug 21 2024, 7:32 PM
hoo subscribed.

Due to problems with my local development setup (I had huge trouble setting up WikibaseCirrusSearch… while not a strict requirement for this task, I figured I should really get that to work), I didn't manage to make significant progress here.

I can go back to this next week or someone else can pick it up at will.

Change #1071938 had a related patch set uploaded (by Hoo man; author: Hoo man):

[mediawiki/extensions/EntitySchema@master] [PoC] Use CirrusSearch/ Elastic for label search

https://gerrit.wikimedia.org/r/1071938

I have created an initial PoC which uses CirrusSearch for labels only, similar to WikibaseCirrusSearch (but, unlike e.g. WikibaseMediaInfo, we can't reuse any of its infrastructure). This is mostly code copied over from WikibaseCirrusSearch and modified both to not interfere with its search fields (we can't reuse them, because WikibaseCirrusSearch expects actual entities) and to allow working with pseudo-entities such as EntitySchemas.

The code is wired up via EntitySchemaContentHandler in a manner similar to Wikibase's EntityHandler (but incomplete for now, see TODOs).

This is how we could tackle this:

  1. Create (by copying and removing Wikibase dependencies from) the relevant field definitions for labels (for other entity types these are the labels and labels_all fields; labels_all is used to support language fallback)
    1. This is https://gerrit.wikimedia.org/r/c/mediawiki/extensions/EntitySchema/+/1071938/1/src/LabelsField.php and https://gerrit.wikimedia.org/r/c/mediawiki/extensions/EntitySchema/+/1071938/1/src/AllLabelsField.php in the PoC
  2. Wire these up in EntitySchemaContentHandler. With that done, our labels will be indexed and ready to be queried.
    1. This is https://gerrit.wikimedia.org/r/c/mediawiki/extensions/EntitySchema/+/1071938/1/src/MediaWiki/Content/EntitySchemaContentHandler.php in the PoC
  3. Create a new EntitySearchHelper implementation that can be switched in via configuration, similar to EntitySearchElastic in the WikibaseCirrusSearch extension.
    1. In the PoC this is https://gerrit.wikimedia.org/r/c/mediawiki/extensions/EntitySchema/+/1071938/1/src/Wikibase/Search/EntitySchemaSearchHelper.php (it works, but is still fairly incomplete).
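
The three steps above can be modelled roughly as follows. This is a toy, in-memory Python sketch and not the PoC code: it does not talk to Elasticsearch, and the function names, the shape of the indexed document and the ranking are assumptions made for illustration only.

```python
def build_label_fields(labels):
    """Steps 1 and 2: derive the indexed fields from a schema's labels.

    labels maps language code to label text. "labels" keeps them per
    language; "labels_all" collects every label regardless of language.
    """
    return {
        "labels": {lang: [text] for lang, text in labels.items()},
        "labels_all": sorted(set(labels.values())),
    }


def search(index, text, lang, fallback_langs=()):
    """Step 3: a toy search helper doing case-insensitive prefix matching.

    Matches in the requested language rank first, then matches in the
    fallback languages, and finally matches found only via labels_all.
    Returns (schema_id, matched_language) pairs; the language is None
    when the match came from labels_all.
    """
    needle = text.casefold()
    results = []
    for schema_id, fields in index.items():
        for rank, candidate in enumerate((lang, *fallback_langs)):
            labels = fields["labels"].get(candidate, [])
            if any(label.casefold().startswith(needle) for label in labels):
                results.append((rank, schema_id, candidate))
                break
        else:
            # No match in the requested or fallback languages.
            if any(l.casefold().startswith(needle) for l in fields["labels_all"]):
                results.append((len(fallback_langs) + 1, schema_id, None))
    results.sort(key=lambda r: (r[0], r[1]))
    return [(schema_id, matched) for _, schema_id, matched in results]
```

For example, with an index built from `{"E1": build_label_fields({"en": "human"})}`, calling `search(index, "Hum", "en")` finds E1 through its English label.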

I think this is not going to affect Special:Search at all, as we are forced to use different indexes, so haslabel:… won't work for EntitySchemas (but that is beyond the scope of this ticket).

hoo claimed this task.

See T375641 for implementing this, based on my findings from creating the WikibaseCirrusSearch-based proof of concept.