Add keyword for filtering based on captions in specific language
Closed, Resolved, Public

Description

Per a discussion in today's SDC-Search checkin meeting:

Use Case:

A user wants to search for the word "gift", but specifically the German word, not the English one (regardless of what their primary language may be).

Event Timeline

Restricted Application added a subscriber: Aklapper.
Ramsey-WMF moved this task from Untriaged to Tracking on the Multimedia board.
Ramsey-WMF added a subscriber: Cparle.
Ramsey-WMF added a subscriber: Abit.

@Ramsey-WMF Could we possibly get a bit more structured use cases?

Are those documented somewhere besides this ticket so we can see how this use case fits on the big picture? Is there any UI that goes with this case?

@Smalyshev @dcausse @Cparle

Proposed syntax as follows. Note that we can have an incaption alias of inlabel, but this will be implemented in WikibaseCirrusSearch, where these are considered labels, so the code in Wikibase that filters by them should probably reference label. One potential sticking point is the syntax for specifying one or more languages. I'm not entirely convinced this is the best syntax, but I'm not sure we have anything today to draw from as an example. The pipe usage here is slightly different from how we use it in other places. We could potentially replace the pipe with a comma; I'm not sure if that's better or worse.

Expected usage:

  • inlabel:gift inlabel:|gift inlabel:*|gift
    • Pages that have the word gift in labels_all
  • inlabel:de|gift
    • Pages that have the word gift in labels.de
  • inlabel:"en|gift wrap"
    • Pages that have the phrase "gift wrap" in labels.en
  • inlabel:"pt-br,pt|colaborativa"
    • Pages that have the word colaborativa in pt-br or pt
  • inlabel:pt-br*|colaborativa
    • Pages that have the word colaborativa in pt-br or in any language in its fallback chain.
  • inlabel:en,unk|gift
    • Pages that have the word gift in labels.en. Additionally a warning that unk is an unknown language
  • inlabel:gift|wrap
    • zero results. Warning that gift is an unknown language
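
As a rough illustration of the usage listed above, here is a minimal Python sketch of the prefix parsing. It is not the WikibaseCirrusSearch implementation: the function name, the KNOWN_LANGUAGES set and the FALLBACKS table are all assumptions made up for the example, and the real language list and fallback chains come from MediaWiki.

```python
from typing import List, NamedTuple

# Hypothetical data for illustration only.
KNOWN_LANGUAGES = {"en", "de", "pt", "pt-br"}
FALLBACKS = {"pt-br": ["pt", "en"], "de": ["en"]}


class LabelQuery(NamedTuple):
    fields: List[str]    # index fields to search, e.g. ["labels.de"]
    text: str            # text handed on to the search backend
    warnings: List[str]  # messages about unknown languages


def parse_inlabel(value: str) -> LabelQuery:
    """Split 'langs|text' on the first pipe and resolve the language list."""
    if "|" not in value:
        return LabelQuery(["labels_all"], value, [])
    langs, text = value.split("|", 1)
    if langs in ("", "*"):
        return LabelQuery(["labels_all"], text, [])
    fields: List[str] = []
    warnings: List[str] = []
    for spec in langs.split(","):
        with_fallbacks = spec.endswith("*")
        code = spec.rstrip("*")
        if code not in KNOWN_LANGUAGES:
            warnings.append(f"unknown language: {code}")
            continue
        chain = [code] + (FALLBACKS.get(code, []) if with_fallbacks else [])
        for lang in chain:
            field = f"labels.{lang}"
            if field not in fields:
                fields.append(field)
    return LabelQuery(fields, text, warnings)
```

With this sketch, parse_inlabel("gift|wrap") returns no fields at all plus a warning about gift, which corresponds to the "zero results" case in the list.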

Expected edge cases

Comma separated values:

  • inlabel:gift,wrap
    • Pages that have the words gift and wrap in labels_all. This is because elastic will consider the comma a generic token separator. This will work with a few different separator characters.
  • inlabel:en|gift|wrap
    • Pages that have the words gift and wrap in labels.en. Same as above, tokenized on |. The language handling will only consume up to the first |, passing the rest to elastic.

spaces have to be quoted:

  • inlabel:en|gift wrap
    • Pages that have the word gift in labels.en and the word wrap in any part of the page
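
The quoting rule can be illustrated with a simplified stand-in for keyword extraction. This is only a sketch under the assumption that a keyword value is either a double-quoted string or a run of non-whitespace characters; the regular expression and the split_keyword helper are hypothetical and not taken from CirrusSearch.

```python
import re
from typing import Tuple

# Assumed rule: the value is a double-quoted string or a run of
# non-whitespace characters directly after "inlabel:".
KEYWORD_RE = re.compile(r'inlabel:(?:"([^"]*)"|(\S+))')


def split_keyword(query: str) -> Tuple[str, str]:
    """Return (inlabel value, remaining free text) for one query string."""
    match = KEYWORD_RE.search(query)
    if match is None:
        return "", query.strip()
    value = match.group(1) if match.group(1) is not None else match.group(2)
    rest = (query[:match.start()] + query[match.end():]).strip()
    return value, rest


print(split_keyword('inlabel:en|gift wrap'))    # unquoted: wrap is left over
print(split_keyword('inlabel:"en|gift wrap"'))  # quoted: whole phrase is the value
```

In the unquoted case the leftover word wrap is no longer part of the keyword, so it would be searched in the rest of the page as described above.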

Other current problems:
The mapping on commons.wikimedia.org doesn't currently include the labels, meaning they are stored but not searchable. commonswiki on beta has the appropriate mapping, which means we need to reindex commonswiki. We've had other needs to run the reindex on commonswiki anyway. This will have to wait a few weeks though: we are expecting to roll out software upgrades next week and the following week that will prevent a reindex. Once those two upgrades are done we can do the reindex on prod commonswiki. This won't prevent the beta cluster from working in any way.

I like the structure of the syntax but would probably bikeshed the exact delimiters a bit if possible (later). Also, are we following fallback chains or only seeking exact language match? If we match exactly we may want to also think about allowing fallbacks.

inlabel:"pt-br,pt,colaborativa"

Did you mean inlabel:"pt-br,pt|colaborativa" ?

inlabel:gift,wrap
Pages that have the words wrap and gift in labels.*, along with a warning that gift is an unknown language.

Why? Didn't you define | as language marker? This query has no | - why talk about the language?

inlabel:en,gift,wrap
Pages that have the words wrap and gift in labels.en.

Not sure I like this. One consistent syntax FTW. Note that most people would be interested in "these words in my language" anyway.

Speaking of which - we need to support this mode I think. Maybe label.* should be "use my interface language first, if not - use fallbacks, if not - use any language you like".

inlabel:en,gift wrap

Can we make inlabel:gift wrap greedy? Not sure yet it's the right thing to do, just asking.

I like the structure of the syntax but would probably bikeshed the exact delimiters a bit if possible (later). Also, are we following fallback chains or only seeking exact language match? If we match exactly we may want to also think about allowing fallbacks.

We probably should handle fallback chains, but I'm not entirely sure how. If we want to be explicit, how about requiring a *? So inlabel:de|gift will search only de, and inlabel:de*|gift will search the fallback chain of de.

inlabel:"pt-br,pt,colaborativa"

Did you mean inlabel:"pt-br,pt|colaborativa" ?

Yes, I'll go back and edit. I started with , then used | and then forgot to switch them all.

Actually I hadn't thought about inlabel:pt-br,pt|colaborativa; that might be better than what I had with successive | characters. Successive pipes can be ambiguous, but taking everything before the first pipe is very easy to reason about.

I like the structure of the syntax but would probably bikeshed the exact delimiters a bit if possible (later). Also, are we following fallback chains or only seeking exact language match? If we match exactly we may want to also think about allowing fallbacks.

We should certainly allow fallbacks; what about with the * char? Then inlabel:de|gift will search labels.de, but inlabel:de*|gift will search the de fallback chain. I could see an argument as well for language fallbacks without the final English fallback, but I'm not sure how to represent that.

inlabel:"pt-br,pt,colaborativa"

Did you mean inlabel:"pt-br,pt|colaborativa" ?

inlabel:gift,wrap
Pages that have the words wrap and gift in labels.*, along with a warning that gift is an unknown language.

Why? Didn't you define | as language marker? This query has no | - why talk about the language?

This was supposed to be inlabel:gift|wrap. I'm not sure if we should interpret gift|wrap as the query when gift isn't a language, or always treat the part before a pipe as the language regardless.

inlabel:en,gift,wrap
Pages that have the words wrap and gift in labels.en.

Not sure I like this. One consistent syntax FTW. Note that most people would be interested in "these words in my language" anyway.

Speaking of which - we need to support this mode I think. Maybe label.* should be "use my interface language first, if not - use fallbacks, if not - use any language you like".

Was supposed to be inlabel:en|gift|wrap, separating multiple languages with continued | chars. I wasn't much of a fan of this, the current update to require inlabel:a,b|wrap to choose languages a,b makes more sense.

inlabel:en,gift wrap

Can we make inlabel:gift wrap greedy? Not sure yet it's the right thing to do, just asking.

We could, but the plan has been to try to kill all the greedy things. Currently I think we have non-greedy alternatives for everything that's greedy but no way to shift clients.

I'm not sure if it's great, but I see three possible solutions to en being the final fallback language for almost everything:

  • Strip en as the final fallback and require it to be explicitly provided. inlabel:pt-br*|colaborativa would query pt and pt-br; getting en requires inlabel:pt-br*,en|colaborativa
  • Let any query with fallbacks query en as well
  • Offer a syntax to remove languages, such as inlabel:pt-br*,-en|colaborativa. This also allows something like inlabel:zh*,-zh-hans,-zh-hant|foo

I'm leaning towards the first option, stripping en from fallbacks, but the third option seems plausible as well and might allow more flexibility.
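
To make the first and third options concrete, here is a small Python sketch of how a language list with * and - markers might be expanded. The FALLBACKS table, the function name and the strip_final_en flag are assumptions invented for this example; the chains shown are illustrative and not copied from MediaWiki.

```python
# Hypothetical fallback chains, for illustration only.
FALLBACKS = {
    "pt-br": ["pt", "en"],
    "zh": ["zh-hans", "zh-hant", "en"],
}


def resolve_languages(specs, strip_final_en=True):
    """Expand 'code', 'code*' and '-code' specs into an ordered language list."""
    included, excluded = [], set()
    for spec in specs.split(","):
        if spec.startswith("-"):
            # Third option: explicit removal of a language.
            excluded.add(spec[1:])
            continue
        code = spec.rstrip("*")
        chain = [code]
        if spec.endswith("*"):
            fallbacks = list(FALLBACKS.get(code, []))
            # First option: drop en when it is only the final fallback.
            if strip_final_en and fallbacks and fallbacks[-1] == "en":
                fallbacks.pop()
            chain += fallbacks
        for lang in chain:
            if lang not in included:
                included.append(lang)
    return [lang for lang in included if lang not in excluded]


print(resolve_languages("pt-br*"))                 # first option, en stripped
print(resolve_languages("pt-br*,en"))              # en requested explicitly
print(resolve_languages("zh*,-zh-hans,-zh-hant"))  # third option, removals
```

Passing strip_final_en=False gives the second option, where any query with fallbacks also queries en.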

Why not put the languages as a suffix?

  • inlabel:word@en: word in English
  • inlabel:word@fr*: word in fr and all its fallbacks
  • inlabel:word@{pt,fr*,-fr-ca}: word in pt or fr and all its fallbacks except fr-ca
  • inlabel:foo|bar@{pt,fr*,-fr-ca}: foo or bar in pt or fr and all its fallbacks except fr-ca
  • inlabel:foo@{fr*,-fr-ca}|bar@{pt}: foo in fr and all its fallbacks except fr-ca, or bar in pt

In short, I'm just suggesting that we clarify the role of |: in the previous comments it is used as a separator for the languages, but it will also be our only way to allow OR for this keyword. This is only meaningful if we believe that ORing is important, for example in a use case like: I want captions with escaliers in French or stairs in English.

As for fallbacks I don't have strong opinions, but I like the third option. We should perhaps have 3 suffixes:

  • fr?: strictly fr
  • fr+: all French-related variants
  • fr*: all French-related variants + English

Then decide which of these is the default when only fr is used.
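
A minimal Python sketch of the suffix form, assuming the @ and {…} syntax and the ?, + and * markers exactly as proposed in this comment. The regular expression, the mode names and the parse_suffixed_term helper are hypothetical. Since the default for a bare code such as fr is still undecided, the sketch only labels it "default".

```python
import re

# Mode names are invented for the example.
SUFFIX_MODES = {"?": "strict", "+": "variants", "*": "variants+en"}

# text, then the last "@", then either {a,b,c} or a single language spec.
TERM_RE = re.compile(
    r"^(?P<text>.+)@(?:\{(?P<group>[^{}]+)\}|(?P<single>[^@{}]+))$"
)


def parse_suffixed_term(term):
    """Split 'text@langs' into the text and a list of (code, mode, negated)."""
    match = TERM_RE.match(term)
    if match is None:
        return term, []
    raw = match.group("group") or match.group("single")
    specs = []
    for item in raw.split(","):
        negated = item.startswith("-")
        if negated:
            item = item[1:]
        marker = item[-1:]
        mode = SUFFIX_MODES.get(marker, "default")
        code = item[:-1] if marker in SUFFIX_MODES else item
        specs.append((code, mode, negated))
    return match.group("text"), specs


print(parse_suffixed_term("word@{pt,fr*,-fr-ca}"))
```

The | form for ORing two terms is deliberately left out of the sketch, since its role is the open question here.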

I like the suffix with @. The use case for | as OR sounds pretty good, but allowing the two forms of | that have different language handling seems more ambiguous to parse, especially since we are parsing mostly with explode or sometimes regexes. I wonder, how far are we from being able to special-case | between two keywords as a real OR (as in: inlabel:foo@en | inlabel:bar@fr)? We don't necessarily have to implement that now, but that seems a better way forward than baking the second form (inlabel:foo@en|bar@fr) into inlabel. The first form still makes sense to bake in, reusing the language selections.

I have a feeling we're overdesigning it a little. I think it should be simple and cover 80% of cases, and if you need more complex things you'd probably be better off using generic boolean syntax like OR/AND.

inlabel:foo@en may be ok, though obviously we'd have trouble with labels that contain @.

As for fallbacks etc. I am not even sure whether we need to give options there. I'd propose to start with a sane default (e.g. using current language with fallback if language is not given, and specific language if it's given) and see what requests we'd get on it. I think in general it's better to start with something basic and see what people ask for in addition than design a whole DSL here only to learn that's not what people actually need.

In general ORing a keyword is only meaningful for keywords that match a code; it's rare that people ask us why they can't do intitle:foo OR intitle:bar. But here since we have multiple languages I feel that one may ask for a word in two different languages and prefer having an OR between the two. I think you are right that we should not push too much on the first iteration.
We are not that far from supporting OR between keywords: the parser is ready, we just need to write the query building code. We are perhaps close enough that it is preferable to stick with the simple syntax.

As for searching for @ in labels, one way would be to escape it as \@, but we can also be lenient and treat any @ that is not followed by a valid language suffix as part of the search text.

To sum up (if @ is not appropriate because it is too widely used, we can change it):

  • inlabel:foo@fr: search foo in fr and all its fallbacks
  • inlabel:foo: produces a warning
  • inlabel:foo@bar@fr: search foo@bar in fr and all its fallbacks
  • inlabel:foo\@bar@fr: search foo@bar in fr and all its fallbacks
  • inlabel:"foo bar@fr": search the phrase foo bar in fr and all its fallbacks (it feels a bit weird to have the suffix inside the double quotes)
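
The lenient handling of @ summarized in this list could look roughly like the following Python sketch. The KNOWN_LANGUAGES set, the function name and the warning text are assumptions for illustration; the value passed in is assumed to be the keyword value after any surrounding quotes have been removed.

```python
KNOWN_LANGUAGES = {"en", "fr", "de"}  # hypothetical subset


def split_language_suffix(value):
    """Return (text, language, warning) for a value such as 'foo@fr'."""
    idx = value.rfind("@")
    is_suffix = (
        idx > 0
        and value[idx - 1] != "\\"              # an escaped \@ is never a suffix
        and value[idx + 1:] in KNOWN_LANGUAGES  # only valid languages count
    )
    if is_suffix:
        text, lang = value[:idx], value[idx + 1:]
        return text.replace("\\@", "@"), lang, None
    # No usable suffix: keep every @ as text and warn, as for inlabel:foo.
    return value.replace("\\@", "@"), None, "no language given"


for example in ["foo@fr", "foo", "foo@bar@fr", "foo\\@bar@fr", "foo bar@fr"]:
    print(example, "->", split_language_suffix(example))
```

Only the last @ is considered, so foo@bar@fr and foo\@bar@fr both end up searching for foo@bar in fr.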

I prefer to force the language because using the current language will produce queries that may be hard to share.

Impossible use cases (for the moment):

  • search foo in fr OR en
    • will be supported midterm at the parser level: inlabel:foo@fr OR inlabel:foo@en
  • search foo OR bar in fr
    • will be supported midterm at the parser level: inlabel:foo@fr OR inlabel:bar@fr
  • search only in strict fr excluding a variant/fallback
    • will have to be implemented at the keyword level using a hint in the language suffix
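
For the two OR cases, the query building step might end up producing something like an Elasticsearch bool query with should clauses. The Python sketch below only builds such a structure as a plain dictionary; the helper names are hypothetical, the labels.<lang> field naming follows the earlier comments in this task, and the query that the extension actually generates may differ.

```python
import json


def label_match(text, lang):
    """One clause: text must match in the label field of a single language."""
    return {"match": {f"labels.{lang}": text}}


def any_of(*clauses):
    """OR semantics: at least one of the clauses has to match."""
    return {"bool": {"should": list(clauses), "minimum_should_match": 1}}


# inlabel:foo@fr OR inlabel:foo@en
foo_in_fr_or_en = any_of(label_match("foo", "fr"), label_match("foo", "en"))

# inlabel:foo@fr OR inlabel:bar@fr
foo_or_bar_in_fr = any_of(label_match("foo", "fr"), label_match("bar", "fr"))

print(json.dumps(foo_in_fr_or_en, indent=2))
```

Fallback handling is ignored here; each clause targets exactly one language field.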

But here since we have multiple languages I feel that one may ask for a word in two different languages and prefer having an OR between the two

Yes, this is possible, but I wonder how frequent it would be that somebody needs a label in Russian or Portuguese? I can invent a scenario, but it would not be a typical use case.

We are not that far from supporting OR between keywords: the parser is ready, we just need to write the query building code

Maybe then we should wait for that instead of inventing a mini-language for each keyword? Keep it simple and let the query builder do the complex stuff.

I prefer to force the language because using the current language will produce queries that may be hard to share.

That's a good point for shareable queries, but many people - just as in regular search - are just interested in finding stuff for themselves, not sharing it. For them, forcing them to specify the language feels wrong. I think we should DWIM here and use the current language environment if the language is not specified.

Change 491580 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[mediawiki/extensions/WikibaseCirrusSearch@master] implement inlabel search keyword

https://gerrit.wikimedia.org/r/491580

Note that this was implemented in the new WikibaseCirrusSearch extension; getting that deployed to the beta cluster is T215684.

Change 491580 merged by jenkins-bot:
[mediawiki/extensions/WikibaseCirrusSearch@master] implement inlabel search keyword

https://gerrit.wikimedia.org/r/491580

Hello @EBernhardson @Smalyshev, is this something that has an impact on Wikidata as well? Would it be worth an announcement?

@Lea_Lacroix_WMDE yes, but it will be deployed when the WikibaseCirrusSearch extension is fully deployed, so we might want to wait with the announcement until then.

@Smalyshev Alright! Please ping me if I can help with anything :)