
Allow to search pages in a specific language, e.g. without translations
Closed, ResolvedPublic

Description

For sites that install the Translate extension, searching translation-subpage content is problematic:

  • regex crawl through subpages
  • hastemplate usage-counts are skewed (if they worked, see T125926)
  • namespace counts are skewed. For example, MediaWiki.org does not really have 5100 uniquely defined templates (Template: namespace)
  • intitle results are drowned out
  • prefix results are drowned out
  • where translations are in-progress, word and phrase searches get Search noise, and whatLinksHere gets Template:TNTN noise

See also:

Event Timeline

Cpiral created this task.Feb 5 2016, 5:49 AM
Cpiral raised the priority of this task from to Needs Triage.
Cpiral updated the task description. (Show Details)
Cpiral added a subscriber: Cpiral.
Restricted Application added a project: Discovery. · View Herald TranscriptFeb 5 2016, 5:49 AM
Restricted Application added subscribers: StudiesWorld, Aklapper. · View Herald Transcript
Amire80 set Security to None.
Quiddity updated the task description. (Show Details)Feb 10 2016, 10:59 PM
Quiddity awarded a token.
EBernhardson moved this task from Needs triage to Search on the Discovery board.Feb 11 2016, 11:19 PM

@Amire80 you tagged this as Translate but I do not see any change requests for Translate either.

@Amire80 you tagged this as Translate but I do not see any change requests for Translate either.

Do you want some? :)

Can it be just tagged as “related to Translate”?

My annoyance is that this shows up as untriaged for Translate.

Restricted Application added a project: Discovery-Search. · View Herald TranscriptJun 14 2016, 2:00 PM
debt added a subscriber: debt.Jun 14 2016, 10:17 PM

We think this ticket/story needs a bit more discussion in order to triage it properly.

debt added a comment.Jul 21 2016, 10:16 PM

It sounds like maybe intitle and prefix results could be weighted differently to show up higher in the results list, but removing things from search might defeat the purpose of the search...

Cpiral added a comment.EditedJul 26 2016, 9:34 PM

@debt, to solve all the problems listed, present a search domain option based on language, just as we have already done with a search domain based on namespaces. (i.e. a user preference)

With that, counts and lists become just as clean as they are on wikis without translated versions. Without it, maintenance techs cannot list or count template usage, HTML, and markup as cleanly as CirrusSearch promises. Instead, the results contain a lot of noise.

T118278 "Improve Language Identification for use in CirrusSearch" describes switching CirrusSearch tracks toward searching another language. Here it's the same lever, but pulled in the opposite direction: away from searching another language.

If it helps, I can add a search keyword inlanguage:en that only returns results in English (or French, or whichever language(s) are chosen). Would that cover the needs here?

FYI. Translate's Special:SearchTranslations calls this keyword language:.
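A minimal sketch of what an explicit language filter does to a result list, assuming each indexed page carries a language code. The `Page` class, the `filter_by_language` helper, and the page titles are hypothetical and only illustrate the idea; this is not CirrusSearch code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Page:
    title: str
    language: str  # language code of the page content, e.g. "en" or "fr"


def filter_by_language(pages, languages):
    """Keep only pages whose language code is in the requested set."""
    wanted = {code.lower() for code in languages}
    return [page for page in pages if page.language.lower() in wanted]


# Hypothetical result set: one source page, two translation subpages,
# and one untranslated template.
pages = [
    Page("Help:Templates", "en"),
    Page("Help:Templates/fr", "fr"),
    Page("Help:Templates/de", "de"),
    Page("Template:Example", "en"),
]

# Roughly what a query with inlanguage:en would narrow the results to.
english_only = filter_by_language(pages, ["en"])
```

Because this is a plain yes/no filter, it changes which pages are counted and listed without touching how the remaining results are ranked.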

Nemo_bis added a subscriber: Nemo_bis.

This report seems a duplicate of T56832. The first priority is to restore the functioning of T68829#1237871.

EBernhardson added a comment.EditedJul 29 2016, 3:44 PM

so the easy solution is undesirable? Then this will have to go back into the waiting pile. The short of it is that an explicitly enabled yes/no filter is very easy, while tuning weighting in a search algorithm requires a good bit of analysis and testing to get right.

so the easy solution is undesirable?

A search operator "language:abc" would be ok and consistent with MediaWiki practices, as Nikerabbit indicated. It's ok to morph this task into a task to enable said operator if you want, while leaving the general problem of ranking at T56832/T68829#1237871.

An inlanguage: parameter would solve all the problems listed, yes.

debt triaged this task as Normal priority.Sep 1 2016, 10:21 PM
debt moved this task from needs triage to This Quarter on the Discovery-Search board.
Nemo_bis renamed this task from Translation subpages should not be searched to Allow to search pages in a specific language, e.g. without translations.Sep 1 2016, 10:37 PM
debt moved this task from This Quarter to Up Next on the Discovery-Search board.Sep 12 2016, 4:11 PM

We are going to work on this ticket rather than https://phabricator.wikimedia.org/T121826 right now.

debt moved this task from Up Next to Current work on the Discovery-Search board.Sep 20 2016, 5:41 PM
debt edited projects, added Discovery-Search (Current work); removed Discovery-Search.

To be consistent with other CirrusSearch keywords, inlanguage makes more sense. For keywords that filter the result set, we currently have:

  • nearcoord (geo range filter)
  • neartitle (geo range filter)
  • hastemplate
  • incategory
  • intitle
  • linksto
  • insource

Adding inlanguage to this seems to be the most consistent way, although it's a shame that it differs from Special:SearchTranslations.
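As a rough illustration of how filter keywords like these can be separated from the free-text part of a query, here is a small sketch. The keyword names come from the list above plus the proposed inlanguage; the regular expression and the `split_query` helper are assumptions made for this example and do not reflect how CirrusSearch actually parses queries.

```python
import re

# Filter keywords listed above, plus the proposed inlanguage.
FILTER_KEYWORDS = (
    "nearcoord", "neartitle", "hastemplate", "incategory",
    "intitle", "linksto", "insource", "inlanguage",
)

# keyword:"quoted value" or keyword:bareword, at the start of a token.
KEYWORD_RE = re.compile(
    r'(?<!\S)(' + "|".join(FILTER_KEYWORDS) + r'):(?:"([^"]*)"|(\S+))'
)


def split_query(query):
    """Return ([(keyword, value), ...], remaining free-text terms)."""
    filters = []

    def collect(match):
        quoted, bare = match.group(2), match.group(3)
        filters.append((match.group(1), quoted if quoted is not None else bare))
        return " "

    remainder = KEYWORD_RE.sub(collect, query)
    return filters, " ".join(remainder.split())


filters, text = split_query('hastemplate:"Infobox person" inlanguage:en search help')
```

Here `filters` holds the two keyword/value pairs and `text` holds the remaining terms `search help`, which would go through normal full-text matching.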

Change 312061 had a related patch set uploaded (by EBernhardson):
Add a language based keyword filter

https://gerrit.wikimedia.org/r/312061

Nemo_bis added a comment.EditedSep 22 2016, 5:44 PM

Adding inlanguage to this seems to be the most consistent way

I don't see how it's more consistent: the inX:Y operators mean "search Y inside X". The "inlanguage:abc" operator would not literally search "abc" inside anything, from a user perspective.

Following your logic, "haslanguage" would be more consistent, but a single other operator doesn't make a pattern.

There is a pattern to the keywords: they are all closed compound words containing both a context and a qualifier. Constructing keywords like this reduces the opportunity for them to be misinterpreted. To take a random query from our logs that would be misinterpreted if the keyword were a bare language:

The question of language: Issue 6 of Congress political and economic studies

The in and has keywords work the same way; the difference is only in how odd they would be to say. intemplate doesn't really make sense and conveys something slightly different, so it was given the has qualifier instead. linksto could have been called inlinks; why it wasn't is likely lost to history.

The in prefix isn't exactly "search Y inside X", since incategory is a full keyword-match filter rather than a match against part of the content, as intitle and insource are.
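The misinterpretation risk can be shown with a small sketch. It assumes a deliberately naive parser that accepts whitespace after the colon; the `extract_filter` helper is hypothetical and is not the CirrusSearch parser.

```python
import re


def extract_filter(query, keyword):
    """Return the value following `keyword:` in the query, or None."""
    pattern = re.compile(
        r'(?<!\S)' + re.escape(keyword) + r':\s*(\S+)', re.IGNORECASE
    )
    match = pattern.search(query)
    return match.group(1) if match else None


# The query from the logs quoted above.
query = "The question of language: Issue 6 of Congress political and economic studies"

# A bare "language" keyword collides with ordinary prose in the query.
bare = extract_filter(query, "language")

# The compound "inlanguage" keyword does not occur in natural text here.
compound = extract_filter(query, "inlanguage")
```

With the bare keyword, the parser picks up `Issue` as if it were a language code; with the compound keyword, nothing is extracted and the whole query stays free text.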

Change 312061 merged by jenkins-bot:
Add a language based keyword filter

https://gerrit.wikimedia.org/r/312061

debt closed this task as Resolved.Oct 7 2016, 9:06 PM

This was released on the train the week of Oct 4 2016 (after the week of no production pushes).