Allow limiting lexeme searches by language
Open, Medium | Public | 5 Estimated Story Points

Description

Problem:
Special:Search currently only searches across all languages in the Lexeme namespace. It would be useful to be able to restrict the search to a specific language, to make finding the right Lexeme easier.

To do this, we should introduce new CirrusSearch keywords. These could be haslemma:en and inlanguage:Q1860.

Example:
A search for "a" should find the English indefinite article, but that Lexeme is currently only the 17th result.

BDD
GIVEN a Lexeme search
AND a keyword "inlanguage:Q1860" or "inlanguage:en"
THEN the results only contain Lexemes with English as the Lexeme language

GIVEN a Lexeme search
AND a keyword "haslemma:en"
THEN the results only contain Lexemes with English as one Lemma's spelling variant
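
The two scenarios above can be illustrated with a toy in-memory model. This is only a sketch of the intended filtering semantics, not how CirrusSearch implements keywords; the `Lexeme` class, the helper names, the Lexeme IDs, and the second language item are all hypothetical, and exact matching on the spelling variant code is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class Lexeme:
    """Toy stand-in for a Lexeme: a language item plus lemmas keyed by spelling variant."""
    id: str
    language: str                               # item ID of the Lexeme language, e.g. "Q1860"
    lemmas: dict = field(default_factory=dict)  # spelling variant code -> lemma text


def in_language(lexemes, language_id):
    """Sketch of inlanguage:<item>: keep Lexemes whose Lexeme language is that item."""
    return [lx for lx in lexemes if lx.language == language_id]


def has_lemma(lexemes, variant):
    """Sketch of haslemma:<code>: keep Lexemes with a lemma in that spelling variant.

    Assumption: the code must match exactly, so "en" does not match "en-gb".
    """
    return [lx for lx in lexemes if variant in lx.lemmas]


# Hypothetical sample data; the IDs are placeholders, not real Lexemes.
lexemes = [
    Lexeme("L1", "Q1860", {"en": "a"}),
    Lexeme("L2", "Q1860", {"en-gb": "colour", "en-us": "color"}),
    Lexeme("L3", "Q150", {"fr": "a"}),  # a non-English Lexeme, item ID used for illustration
]

english = in_language(lexemes, "Q1860")
with_en_lemma = has_lemma(lexemes, "en")
```

In this model `in_language` keeps both English Lexemes, while `has_lemma` keeps only the one that has a lemma tagged with exactly the "en" variant.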

Acceptance criteria:

  • Results on Special:Search can be restricted by language via 2 new keywords

Notes:

Details

Title | Reference | Author | Source Branch | Dest Branch
Update test resources to match latest schema | repos/search-platform/cirrus-streaming-updater!183 | dcausse | T271776-update-test-resources-to-match-latest-schema | main
search: add lemme_spelling_variants to cirrus/index | repos/data-engineering/airflow-dags!1115 | dcausse | T271776-search-cirrus-index-add-lemma-spelling-variants | main
cirrus: lemma_spelling_variants | repos/data-engineering/schemas-event-primary!12 | dcausse | cirrus-add-lemma-spelling-variants | master

Event Timeline

CBogen triaged this task as Medium priority. Jan 25 2021, 4:16 PM
CBogen moved this task from needs triage to Wikibase Search on the Discovery-Search board.

In order to do this we should introduce new cirrus search keywords. These could be haslemma:en and haslang:Q1860.

FWIW, haslang: can be approximated today with linksto:. A search for "a linksto:Q1860" returns the English indefinite article as the first result, at least. (linksto: can also approximate the lexical category or grammatical features. It is not specific to any of these fields, but since there is limited overlap between languages, lexical categories, and grammatical features, linksto: can still be useful until this task is properly implemented.)

dcausse subscribed.

the data seems to be indexed so it might be trivial to implement these keywords, moving to needs triage to raise visibility.

I think I spoke too soon. I believe that we have only some of this data indexed: the lexeme language, but not the lemmas' language variants.
If I understood correctly, we could implement haslang but sadly not haslemma without ingesting more data into the search index.
Regarding haslang: CirrusSearch already has an inlanguage keyword, so I'm afraid that adding a new haslang keyword might be ambiguous. Would it work if we re-used the existing keyword?

  • inlanguage:en or inlanguage:Q1860 could find lexemes whose language is Q1860 (English). Would this be OK?

Thanks for the info above, @dcausse. Wikifunctions is starting to use action=query requests with srsearch=haswbstatement (examples below), to determine which lexemes are related to a given item, for an important use case. Additional indexing support was added recently for this purpose, by T378097.

If the above-suggested parameter taking the ISO code (like inlanguage:en) is made available, that would give a big win for our use case, allowing for a smaller number of fetches from Wikidata, and for our code to be simpler and more efficient.

Examples of searches that we are starting to use, where we would love to be able to add inlanguage:

If support is added for inlanguage for filtering lexemes, would that apply to these sorts of searches?

@DMartin-WMF regarding inlanguage:en also matching Q1860: I'm not sure about the implementation details, and CirrusSearch might have to do some lookups too (or have a map defined in its config) if it provided such a feature. First I wanted to know whether re-using inlanguage would be OK, since the description explicitly asked for a new keyword, haslang.
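
As a rough illustration of the "map defined in its config" option mentioned above, the sketch below resolves an inlanguage: value to an item ID. The map, the function name, and the fallback behaviour are all hypothetical and do not describe CirrusSearch's actual configuration; only the en to Q1860 pairing comes from this task.

```python
import re

# Hypothetical config map from language code to language item ID.
# Only the "en" -> "Q1860" pairing is taken from this task.
LANGUAGE_CODE_TO_ITEM = {"en": "Q1860"}

# An item ID is "Q" followed by a positive integer.
ITEM_ID = re.compile(r"Q[1-9]\d*")


def resolve_language(value):
    """Return the item ID for an inlanguage: value, or None if it cannot be resolved.

    Item IDs pass through unchanged (case-normalized); anything else is
    treated as a language code and looked up in the map.
    """
    value = value.strip()
    if ITEM_ID.fullmatch(value.upper()):
        return value.upper()
    return LANGUAGE_CODE_TO_ITEM.get(value.lower())
```

With a map like this, inlanguage:en and inlanguage:Q1860 would resolve to the same filter value, and an unknown code would resolve to nothing rather than silently matching everything.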

Re-using the inlanguage parameter should be fine, yes.

@Jdforrester-WMF ack, thanks.

Moving back to needs triage on our board to raise visibility. @DMartin-WMF could you help us understand the importance of this feature on your side? Is it mainly to simplify the codebase, or are there performance considerations on your side too?

Hi @dcausse, there are definitely performance considerations. Wikifunctions needs a new capability, which I'm starting to work on, to "get all IDs for lexemes, in a particular language, that are related to a given item". Currently this requires a two-step process: (1) perform the search (examples above) to get the IDs of all lexemes related to the given item; then (2) fetch all of the lexemes identified by the search (of which there can be many), in order to determine which of them belong to the given language.

With this new feature, step (2) would be completely unnecessary.
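
A minimal sketch of the single-request approach, assuming the standard MediaWiki `action=query` / `list=search` API. The function name and the statement value in the usage lines are hypothetical placeholders, and the Lexeme namespace number (146) is an assumption to verify against the target wiki. The code only builds the request URL; it does not contact Wikidata.

```python
from urllib.parse import urlencode

API_URL = "https://www.wikidata.org/w/api.php"
LEXEME_NAMESPACE = 146  # assumed namespace ID for Lexeme; verify on the target wiki


def build_lexeme_search_params(statement, language=None, limit=50):
    """Build list=search parameters for lexemes matching a haswbstatement filter.

    `statement` is the part after "haswbstatement:", e.g. "P123=Q456".
    If `language` is given, an inlanguage: filter is appended so that the
    search itself restricts results to that language.
    """
    terms = [f"haswbstatement:{statement}"]
    if language is not None:
        terms.append(f"inlanguage:{language}")
    return {
        "action": "query",
        "list": "search",
        "srsearch": " ".join(terms),
        "srnamespace": LEXEME_NAMESPACE,
        "srlimit": limit,
        "format": "json",
    }


# "P123=Q456" is a placeholder statement, not a real property/item pair.
two_step_params = build_lexeme_search_params("P123=Q456")
one_step_params = build_lexeme_search_params("P123=Q456", language="en")
one_step_url = f"{API_URL}?{urlencode(one_step_params)}"
```

Without the language argument, the caller still has to fetch every returned lexeme to check its language, which is step (2) above; with it, the search results are already restricted to the requested language.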

So there is a clear performance benefit. At present we cannot say whether the 2-step approach will cause timeouts or noticeable delay or other performance pain; this isn't known yet. But we do have the general objective, and have been strongly encouraged, to minimize the number of fetches that we make from Wikidata.

Gehel set the point value for this task to 5. Jan 20 2025, 4:45 PM

Hi @Gehel - Thanks for moving this forward. My team leadership has asked me to check in regarding status. It would be helpful to know the current expectation for when the implementation work will be completed. After that, it's our understanding that the needed indexing could happen by the end of March; is that accurate? Thanks!

Change #1121666 had a related patch set uploaded (by DCausse; author: DCausse):

[operations/mediawiki-config@master] cirrus: configure wgCirrusSearchLanguageKeywordExtraFields

https://gerrit.wikimedia.org/r/1121666

Change #1121667 had a related patch set uploaded (by DCausse; author: DCausse):

[mediawiki/extensions/CirrusSearch@master] Add CirrusSearchLanguageKeywordExtraFields

https://gerrit.wikimedia.org/r/1121667

@DMartin-WMF this ticket is about adding two separate features:

  • tweaking the existing inlanguage keyword to filter lexeme by language code and/or entity
  • a new keyword for filtering by lemma spelling variants.

I think you're interested in the former which can be available relatively soon (once the two patches above are reviewed & deployed) and won't require a re-indexing since the data is already indexed.

Change #1121667 merged by jenkins-bot:

[mediawiki/extensions/CirrusSearch@master] Add CirrusSearchLanguageKeywordExtraFields

https://gerrit.wikimedia.org/r/1121667

Thanks, @dcausse ! Yes, that makes sense and that is what we are interested in.

Change #1121666 merged by jenkins-bot:

[operations/mediawiki-config@master] cirrus: configure wgCirrusSearchLanguageKeywordExtraFields

https://gerrit.wikimedia.org/r/1121666

Mentioned in SAL (#wikimedia-operations) [2025-03-06T08:44:30Z] <dcausse@deploy2002> Started scap sync-world: Backport for [[gerrit:1121666|cirrus: configure wgCirrusSearchLanguageKeywordExtraFields (T271776)]]

Mentioned in SAL (#wikimedia-operations) [2025-03-06T08:47:41Z] <dcausse@deploy2002> dcausse: Backport for [[gerrit:1121666|cirrus: configure wgCirrusSearchLanguageKeywordExtraFields (T271776)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2025-03-06T08:56:23Z] <dcausse@deploy2002> Finished scap sync-world: Backport for [[gerrit:1121666|cirrus: configure wgCirrusSearchLanguageKeywordExtraFields (T271776)]] (duration: 11m 53s)

@DMartin-WMF this should be live now: inlanguage:en should now properly filter English lexemes.

Moving to Blocked/Waiting. The next part of this task is a bit more involved and requires adding more fields and changing some schemas, which I'd prefer to do once T375821 is done.

My thanks also, @dcausse ! I'm finding the new parameter is also working with my action=query requests with srsearch=haswbstatement. Here's an update to the example I mentioned in my comment of January 12, above.

Change #1126111 had a related patch set uploaded (by DCausse; author: DCausse):

[mediawiki/extensions/WikibaseLexemeCirrusSearch@master] Add a lemma_spelling_variants field

https://gerrit.wikimedia.org/r/1126111

Change #1126553 had a related patch set uploaded (by DCausse; author: DCausse):

[mediawiki/extensions/WikibaseLexemeCirrusSearch@master] Add lemmaspellingvariant keyword

https://gerrit.wikimedia.org/r/1126553

In terms of order of operations, how does this all need to be deployed? I'm guessing something like:

  1. Update schema to allow transport of new fields
  2. Update SUP to use new schema, to accept and pass on new fields
  3. Update Cirrus to start providing new fields and create appropriate indices
  4. Reindex wikis to make new fields queryable
  5. Update Cirrus to query new fields (likely via config provided by earlier patch)

@EBernhardson correct. This is currently blocked by T375821, which is going to enable the v1 schema; https://gitlab.wikimedia.org/repos/data-engineering/schemas-event-primary/-/merge_requests/12 is a patch on top of v1.
I'm a bit torn by the fact that cirrus fields are part of the schema. On one hand, I like that the fields are now properly documented. On the other hand, the schema for the fields is not strictly required, and the pipeline could in theory work schemaless for cirrus fields.