
Allow to search pages in a specific language, e.g. without translations
Closed, Resolved (Public)

Description

For sites that install the Translate extension, searching translation-subpage content is problematic:

  • regex crawl through subpages
  • hastemplate usage-counts are skewed (if they worked, see T125926)
  • namespace counts are skewed. For example, MediaWiki.org does not really have 5100 uniquely defined templates in the Template: namespace; translation subpages inflate the count
  • intitle results are drowned out
  • prefix results are drowned out
  • where translations are in-progress, word and phrase searches get Search noise, and whatLinksHere gets Template:TNTN noise


Event Timeline

Cpiral raised the priority of this task from to Needs Triage.
Cpiral updated the task description.
Cpiral added a subscriber: Cpiral.
Restricted Application added subscribers: StudiesWorld, Aklapper.

@Amire80 you tagged this as Translate but I do not see any change requests for Translate either.

@Amire80 you tagged this as Translate but I do not see any change requests for Translate either.

Do you want some? :)

Can it be just tagged as “related to Translate”?

My annoyance is that this shows up as untriaged for Translate.

We think this ticket/story needs a bit more discussion in order to triage it properly.

It sounds like intitle and prefix results could perhaps be weighted differently so that they show up higher in the results list, but removing things from search might defeat the purpose of the search.

@debt, to solve all the problems listed, present a search domain option based on language, just as we have already done with a search domain based on namespaces. (i.e. a user preference)

With that, counts and lists become just as pure as they are on wikis without translated versions. Without it, maintenance techs cannot list or count template usage, HTML, and markup as cleanly as CirrusSearch promises; the results contain a lot of noise instead.

T118278 "Improve Language Identification for use in CirrusSearch" describes switching CirrusSearch tracks toward searching another language. Here it's the same lever but in the opposite direction away from searching another language.

If it helps, I can add a search keyword inlanguage:en that only returns results in English (or French, or whichever language(s) are chosen). Would that cover the needs here?

FYI. Translate's Special:SearchTranslations calls this keyword language:.

Nemo_bis added a subscriber: Nemo_bis.

This report seems a duplicate of T56832. The first priority is to restore the functioning of T68829#1237871.

So the easy solution is undesirable? Then this will have to go back into the waiting pile. The short of it is that an explicitly enabled yes/no filter is very easy, while tuning weighting in a search algorithm requires a good bit of analysis and testing to get right.

So the easy solution is undesirable?

A search operator "language:abc" would be ok and consistent with MediaWiki practices, as Nikerabbit indicated. It's ok to morph this task into a task to enable said operator if you want, while leaving the general problem of ranking at T56832/T68829#1237871.

An inlanguage: parameter would solve all the problems listed, yes.
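As a rough illustration of how the proposed filter might be combined with an existing one, here is a minimal sketch that builds a search request URL. It assumes the keyword ends up named inlanguage: as proposed above. The action=query, list=search, srsearch and srnamespace parameters belong to the standard MediaWiki search API; the helper name build_search_url and the template name Example are hypothetical, and nothing here performs a network request.

```python
from urllib.parse import urlencode

def build_search_url(terms, language, namespace=None,
                     api="https://www.mediawiki.org/w/api.php"):
    """Build a MediaWiki search API URL restricted to one page language.

    Hypothetical helper: assumes the proposed inlanguage: keyword exists.
    """
    # Append the language filter to whatever terms or keywords were given.
    srsearch = f"{terms} inlanguage:{language}"
    params = {
        "action": "query",
        "list": "search",
        "srsearch": srsearch,
        "format": "json",
    }
    if namespace is not None:
        # Namespace 10 is Template: on a default MediaWiki install.
        params["srnamespace"] = str(namespace)
    return api + "?" + urlencode(params)

# Count-style query: pages using a (hypothetical) template, English only.
url = build_search_url("hastemplate:Example", "en")

# Template: namespace only, without translation subpages in other languages.
template_url = build_search_url("intitle:Example", "en", namespace=10)
```

The point of the sketch is only that the language restriction composes with the other filters instead of replacing them, which is what the counting and listing use cases in the description need.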

Nemo_bis renamed this task from Translation subpages should not be searched to Allow to search pages in a specific language, e.g. without translations. Sep 1 2016, 10:37 PM

We are going to work on this ticket rather than https://phabricator.wikimedia.org/T121826 right now.

To be consistent with other CirrusSearch keywords, inlanguage makes more sense. For keywords that filter the result set, we currently have:

  • nearcoord (geo range filter)
  • neartitle (geo range filter)
  • hastemplate
  • incategory
  • intitle
  • linksto
  • insource

Adding inlanguage to this seems to be the most consistent way, although it's a shame that it differs from Special:SearchTranslations.

Change 312061 had a related patch set uploaded (by EBernhardson):
Add a language based keyword filter

https://gerrit.wikimedia.org/r/312061

Adding inlanguage to this seems to be the most consistent way

I don't see how it's more consistent: the inX:Y operators mean "search Y inside X". The "inlanguage:abc" operator would not literally search "abc" inside anything, from a user perspective.

Following your logic, "haslanguage" would be more consistent, but a single other operator doesn't make a pattern.

There is a pattern to the keywords, in that they are all closed compound words containing both a context and a qualifier. Constructing keywords like this reduces the opportunity for them to be misinterpreted. To take a random query from our logs that a bare language: keyword would misinterpret:

The question of language: Issue 6 of Congress political and economic studies
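The ambiguity can be sketched with a deliberately naive keyword extractor. This is not CirrusSearch's actual parser; extract_filters is a hypothetical helper, and it assumes a keyword is any recognized name followed by a colon, with the next token (after optional whitespace) taken as its value.

```python
import re

def extract_filters(query, keywords):
    """Return (filters, free_text) for a query, given the recognized keyword names."""
    # A keyword must start a token and be followed by a colon; the value is the
    # next whitespace-delimited token (whitespace after the colon is tolerated).
    names = "|".join(re.escape(k) for k in keywords)
    pattern = re.compile(r"(?<!\S)(" + names + r"):\s*(\S+)")
    filters = [(m.group(1), m.group(2)) for m in pattern.finditer(query)]
    # Drop the matched keyword/value pairs and normalize the leftover whitespace.
    free_text = " ".join(pattern.sub(" ", query).split())
    return filters, free_text

query = "The question of language: Issue 6 of Congress political and economic studies"

# With a bare "language" keyword, part of the title is consumed as a filter.
bare = extract_filters(query, ["language"])

# With a compound "inlanguage" keyword, the same query stays plain text.
compound = extract_filters(query, ["inlanguage"])
```

Under these assumptions the bare keyword turns "Issue" into a language filter and removes it from the free text, while the compound keyword leaves the query untouched, because ordinary prose rarely contains the token inlanguage: by accident.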

The in and has keywords work the same way; the difference is only in how odd they would be to say. intemplate doesn't really make sense, and conveys something slightly different, so it was given the has qualifier instead. linksto could have been called inlinks; why it wasn't is likely lost to history.

In isn't exactly "search Y inside X", as category is a full keyword match filter and not a match to a part of the content like title and source are.

Change 312061 merged by jenkins-bot:
Add a language based keyword filter

https://gerrit.wikimedia.org/r/312061

This was released on the train the week of Oct 4 2016 (after the week of no production pushes).