
Allow limiting lexeme searches by language
Closed, Resolved | Public | 5 Estimated Story Points

Description

Problem:
Special:Search currently only allows searching across all languages in the Lexeme namespace. It would be useful to be able to restrict the search to a specific language, to make finding the right Lexeme easier.

To do this we should introduce new CirrusSearch keywords. These could be haslemma:en and inlanguage:Q1860.

Example:
When searching for "a" to find the English indefinite article, the article is currently the 17th result.

BDD
GIVEN a Lexeme search
AND a keyword "inlanguage:Q1860" or "inlanguage:en"
THEN the results only contain Lexemes with English as the Lexeme language

GIVEN a Lexeme search
AND a keyword "haslemma:en"
THEN the results only contain Lexemes with English as one Lemma's spelling variant
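
The two scenarios can be modeled on in-memory data to make the intended semantics concrete. The following is a minimal sketch, not the CirrusSearch implementation: the `Lexeme` class, the `LANGUAGE_CODE_TO_ITEM` mapping and the sample lexemes are hypothetical, and only the pairing of `en` with `Q1860` comes from the description above.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical mapping; only the en -> Q1860 pairing comes from the task description.
LANGUAGE_CODE_TO_ITEM: Dict[str, str] = {"en": "Q1860"}


@dataclass
class Lexeme:
    """Simplified stand-in for a lexeme (not the real Wikibase data model)."""
    id: str
    language_item: str  # item ID of the Lexeme language, e.g. "Q1860"
    lemmas: Dict[str, str] = field(default_factory=dict)  # spelling variant -> lemma


def matches_inlanguage(lexeme: Lexeme, value: str) -> bool:
    """inlanguage:<value> accepts either an item ID or a language code."""
    return lexeme.language_item == LANGUAGE_CODE_TO_ITEM.get(value, value)


def matches_haslemma(lexeme: Lexeme, variant: str) -> bool:
    """haslemma:<variant> matches if any lemma uses that spelling variant."""
    return variant in lexeme.lemmas


# Made-up sample data with placeholder IDs.
LEXEMES: List[Lexeme] = [
    Lexeme("L-one", "Q1860", {"en": "a"}),
    Lexeme("L-two", "Q-other", {"xx": "a"}),
    Lexeme("L-three", "Q-other", {"xx": "a", "en": "a"}),
]

by_item = [lx.id for lx in LEXEMES if matches_inlanguage(lx, "Q1860")]
by_code = [lx.id for lx in LEXEMES if matches_inlanguage(lx, "en")]
with_en_lemma = [lx.id for lx in LEXEMES if matches_haslemma(lx, "en")]
```

In this model the two filters are independent: the placeholder lexeme `L-three` has an `en` lemma without English being its Lexeme language, so it matches `haslemma:en` but not `inlanguage:en`.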

Acceptance criteria:

  • Results on Special:Search can be restricted by language via 2 new keywords

Notes:

Event Timeline

There are a very large number of changes, so older changes are hidden.

@Jdforrester-WMF ack, thanks.

Moving back to needs triage on our board to raise visibility. @DMartin-WMF, could you help us understand the importance of this feature on your side? Is it mainly to simplify the codebase, or are there performance considerations on your side too?

Hi @dcausse, there are definitely performance considerations. Wikifunctions needs a new capability, which I'm starting to work on, to "get all IDs for lexemes, in a particular language, that are related to a given item". Currently this requires a two-step process: (1) perform the search (examples above) to get the IDs of all lexemes related to the given item; then (2) fetch all of the lexemes identified by the search (of which there can be many), in order to determine which of them belong to the given language.

With this new feature, step (2) would be completely unnecessary.

So there is a clear performance benefit. At present we cannot say whether the two-step approach will cause timeouts, noticeable delay, or other performance pain. But we do have the general objective, and have been strongly encouraged, to minimize the number of fetches that we make from Wikidata.
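
The difference between the current approach and the requested one can be sketched with stand-in callables. This is only an illustration under stated assumptions: `stub_search` and `stub_fetch` are hypothetical stubs rather than Wikifunctions or Wikidata APIs, `ITEM_QUERY` is a placeholder for the item-related search, and the in-memory data is made up.

```python
from typing import Callable, Dict, List

# Made-up data standing in for the search index and the entity store.
LEXEME_LANGUAGE: Dict[str, str] = {"L1": "Q1860", "L2": "Q1860", "L3": "Q-other"}
fetch_log: List[str] = []  # records every simulated lexeme fetch


def stub_search(query: str) -> List[str]:
    """Hypothetical search: honours an inlanguage:<item> token, ignores the rest."""
    ids = sorted(LEXEME_LANGUAGE)
    for token in query.split():
        if token.startswith("inlanguage:"):
            wanted = token.split(":", 1)[1]
            ids = [i for i in ids if LEXEME_LANGUAGE[i] == wanted]
    return ids


def stub_fetch(lexeme_id: str) -> Dict[str, str]:
    """Hypothetical fetch of one lexeme; each call is logged."""
    fetch_log.append(lexeme_id)
    return {"id": lexeme_id, "language": LEXEME_LANGUAGE[lexeme_id]}


def two_step(search: Callable[[str], List[str]],
             fetch: Callable[[str], Dict[str, str]],
             query: str, language: str) -> List[str]:
    """Step 1: search. Step 2: fetch every hit to check its language."""
    return [i for i in search(query) if fetch(i)["language"] == language]


def one_step(search: Callable[[str], List[str]],
             query: str, language: str) -> List[str]:
    """Let the search itself filter by language; no follow-up fetches."""
    return search(f"{query} inlanguage:{language}")


two_step_ids = two_step(stub_search, stub_fetch, "ITEM_QUERY", "Q1860")
fetches_two_step = len(fetch_log)
one_step_ids = one_step(stub_search, "ITEM_QUERY", "Q1860")
fetches_one_step = len(fetch_log) - fetches_two_step
```

Both functions return the same IDs on this data, but the two-step variant fetches every search hit, including those that are later discarded.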

Gehel set the point value for this task to 5. Jan 20 2025, 4:45 PM

Hi @Gehel - Thanks for moving this forward. My team leadership has asked me to check in regarding status. It would be helpful to know the current expectation for when the implementation work will be completed. After that, it's our understanding that the needed indexing could happen by the end of March; is that accurate? Thanks!

Change #1121666 had a related patch set uploaded (by DCausse; author: DCausse):

[operations/mediawiki-config@master] cirrus: configure wgCirrusSearchLanguageKeywordExtraFields

https://gerrit.wikimedia.org/r/1121666

Change #1121667 had a related patch set uploaded (by DCausse; author: DCausse):

[mediawiki/extensions/CirrusSearch@master] Add CirrusSearchLanguageKeywordExtraFields

https://gerrit.wikimedia.org/r/1121667

@DMartin-WMF this ticket is about adding two separate features:

  • tweaking the existing inlanguage keyword to filter lexemes by language code and/or entity
  • a new keyword for filtering by lemma spelling variants.

I think you're interested in the former which can be available relatively soon (once the two patches above are reviewed & deployed) and won't require a re-indexing since the data is already indexed.

Change #1121667 merged by jenkins-bot:

[mediawiki/extensions/CirrusSearch@master] Add CirrusSearchLanguageKeywordExtraFields

https://gerrit.wikimedia.org/r/1121667

Thanks, @dcausse ! Yes, that makes sense and that is what we are interested in.

Change #1121666 merged by jenkins-bot:

[operations/mediawiki-config@master] cirrus: configure wgCirrusSearchLanguageKeywordExtraFields

https://gerrit.wikimedia.org/r/1121666

Mentioned in SAL (#wikimedia-operations) [2025-03-06T08:44:30Z] <dcausse@deploy2002> Started scap sync-world: Backport for [[gerrit:1121666|cirrus: configure wgCirrusSearchLanguageKeywordExtraFields (T271776)]]

Mentioned in SAL (#wikimedia-operations) [2025-03-06T08:47:41Z] <dcausse@deploy2002> dcausse: Backport for [[gerrit:1121666|cirrus: configure wgCirrusSearchLanguageKeywordExtraFields (T271776)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2025-03-06T08:56:23Z] <dcausse@deploy2002> Finished scap sync-world: Backport for [[gerrit:1121666|cirrus: configure wgCirrusSearchLanguageKeywordExtraFields (T271776)]] (duration: 11m 53s)

@DMartin-WMF this should be live now: inlanguage:en should now properly filter English lexemes.

Moving to Blocked/Waiting: the next part of this task is a bit more involved and requires adding more fields and changing some schemas, which I'd prefer to do once T375821 is done.

My thanks also, @dcausse ! I'm finding the new parameter is also working with my action=query requests with srsearch=haswbstatement. Here's an update to the example I mentioned in my comment of January 12, above.
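
A request of this kind can be sketched as follows. This is a hedged illustration, not the example referred to above: the statement `P123=Q456` is a placeholder, the namespace ID 146 for Lexemes on wikidata.org is an assumption to verify, and the helper names are hypothetical. Only `action=query`, `srsearch`, `haswbstatement` and `inlanguage` are taken from this thread.

```python
from typing import Dict, Union
from urllib.parse import urlencode

API_ENDPOINT = "https://www.wikidata.org/w/api.php"
LEXEME_NAMESPACE = 146  # assumption: Lexeme namespace ID on wikidata.org


def build_search_params(statement: str, language: str,
                        limit: int = 50) -> Dict[str, Union[str, int]]:
    """Combine haswbstatement with inlanguage in a single search request."""
    return {
        "action": "query",
        "list": "search",
        "srsearch": f"haswbstatement:{statement} inlanguage:{language}",
        "srnamespace": LEXEME_NAMESPACE,
        "srlimit": limit,
        "format": "json",
    }


def build_search_url(statement: str, language: str, limit: int = 50) -> str:
    """Return the full request URL; nothing is sent over the network here."""
    return f"{API_ENDPOINT}?{urlencode(build_search_params(statement, language, limit))}"


# Placeholder statement; substitute the property/item pair you actually need.
params = build_search_params("P123=Q456", "en")
url = build_search_url("P123=Q456", "en")
```

Because the language filter is part of `srsearch`, the response should already be limited to lexemes in the requested language, so no follow-up fetch is needed just to check the language.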

Change #1126111 had a related patch set uploaded (by DCausse; author: DCausse):

[mediawiki/extensions/WikibaseLexemeCirrusSearch@master] Add a lemma_spelling_variants field

https://gerrit.wikimedia.org/r/1126111

Change #1126553 had a related patch set uploaded (by DCausse; author: DCausse):

[mediawiki/extensions/WikibaseLexemeCirrusSearch@master] Add lemmaspellingvariant keyword

https://gerrit.wikimedia.org/r/1126553

In terms of order of operations, how does this all need to be deployed? I'm guessing something like:

  1. Update schema to allow transport of new fields
  2. Update SUP to use new schema, to accept and pass on new fields
  3. Update Cirrus to start providing new fields and create appropriate indices
  4. Reindex wikis to make new fields queryable
  5. Update Cirrus to query new fields (likely via config provided by earlier patch)

@EBernhardson correct, this is currently blocked by T375821, which is going to enable the v1 schema; https://gitlab.wikimedia.org/repos/data-engineering/schemas-event-primary/-/merge_requests/12 is a patch on top of v1.
I'm a bit torn about cirrus fields being part of the schema: on one hand I like that the fields are now properly documented, but on the other hand the schema for the fields is not strictly required and the pipeline could in theory work schemaless for cirrus fields.

Change #1126111 merged by jenkins-bot:

[mediawiki/extensions/WikibaseLexemeCirrusSearch@master] Add a lemma_spelling_variants field

https://gerrit.wikimedia.org/r/1126111

@Nikki (or anyone else interested in filtering on lemma spelling variants): while working on this we realized that some clarifications might be needed.
The new search keyword we will add is currently named lemmaspellingvariant. It's not ideal because it is quite long, but I found that haslemma was too ambiguous (please let us know if you have objections/suggestions).
This keyword will be used like other keywords and is quite independent from the rest of the search query, for instance: aluminium lemmaspellingvariant:en-us will find https://www.wikidata.org/wiki/Lexeme:L18179. From the ticket description I think this is what is expected, but if not please let us know. Allowing a particular lemma string to be matched against its specific language variant will require some thinking on our side and is not entirely trivial.

> @Nikki (or anyone else interested in filtering on lemma spelling variants) while working on this we realized that some clarifications might be needed.
> The new search keyword we will add is currently named lemmaspellingvariant, it's not ideal because quite long but I found that haslemma was too ambiguous (please let us know if you have objections/suggestions).

Why was it too ambiguous? The idea was to match the existing haslabel, hasdescription and hascaption keywords (https://www.mediawiki.org/wiki/Help:Extension:WikibaseCirrusSearch#haslabel/hascaption) - lemmas are effectively labels for lexemes, so it makes sense for the lemma keywords to be similar to the label keywords.

I'd really like to avoid overly long keywords. "haswbstatement" is already problematic. It's tedious to type on a normal keyboard, it's even harder to enter on a phone and it takes up nearly a third of the visible input space (even on a big monitor). "lemmaspellingvariant" would be even longer, so even worse.

> The use of this keyword will be like other keywords and quite independent from the rest of the search query, for instance: aluminium lemmaspellingvariant:en-us will find https://www.wikidata.org/wiki/Lexeme:L18179. From the ticket description I think this is what is expected but if not please let us know.

Yes, it's intended to behave the same as haslabel.

> Allowing to match a particular lemma string against its specific language variant will require some thinking on our side and is not entirely trivial.

I would expect that to be called inlemma and behave the same as the existing inlabel, indescription and incaption keywords (https://www.mediawiki.org/wiki/Help:Extension:WikibaseCirrusSearch#inlabel/incaption).

> Why was it too ambiguous? The idea was to match the existing haslabel, hasdescription and hascaption keywords (https://www.mediawiki.org/wiki/Help:Extension:WikibaseCirrusSearch#haslabel/hascaption) - lemmas are effectively labels for lexemes, so it makes sense for the lemma keywords to be similar to the label keywords.

Thanks, this makes sense. It hadn't occurred to me to compare this with the existing labels keywords, haslemma would indeed be coherent with what we provide with haslabel. I'll adjust the patch to name it haslemma.
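
The point that the keyword is independent from the rest of the query can be illustrated with a toy tokenizer. This is a minimal sketch and not how CirrusSearch parses queries: `split_query` and `KNOWN_KEYWORDS` are hypothetical names, and the example reuses the aluminium query from this thread with the haslemma name agreed on above.

```python
from typing import Dict, List, Tuple

# Hypothetical keyword registry for this sketch only.
KNOWN_KEYWORDS = {"inlanguage", "haslemma"}


def split_query(query: str) -> Tuple[List[str], Dict[str, str]]:
    """Separate free-text terms from keyword:value filters."""
    terms: List[str] = []
    filters: Dict[str, str] = {}
    for token in query.split():
        name, sep, value = token.partition(":")
        if sep and value and name in KNOWN_KEYWORDS:
            filters[name] = value  # keyword filter, applied independently of the terms
        else:
            terms.append(token)  # ordinary search term
    return terms, filters


terms, filters = split_query("aluminium haslemma:en-us")
reordered = split_query("haslemma:en-us aluminium")
```

In this model the filter does not depend on where it appears in the query, and the free-text term `aluminium` is not tied to the `en-us` variant: the filter only requires that some lemma with that spelling variant exists.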

We're waiting on a reindex for this to become usable.

Change #1126553 merged by jenkins-bot:

[mediawiki/extensions/WikibaseLexemeCirrusSearch@master] Add haslemme keyword

https://gerrit.wikimedia.org/r/1126553

Change #1180132 had a related patch set uploaded (by DCausse; author: DCausse):

[mediawiki/extensions/WikibaseLexemeCirrusSearch@master] Register the haslemma keyword

https://gerrit.wikimedia.org/r/1180132

Change #1180132 merged by jenkins-bot:

[mediawiki/extensions/WikibaseLexemeCirrusSearch@master] Register the haslemma keyword

https://gerrit.wikimedia.org/r/1180132

Mentioned in SAL (#wikimedia-operations) [2025-08-28T07:27:01Z] <dcausse> T271776: reindexing all lexemes in testwikidatawiki

Mentioned in SAL (#wikimedia-operations) [2025-08-28T07:27:14Z] <dcausse> T271776: reindexing all lexemes in wikidatawiki