
How to exclude or lower the priority of subpages in search results?
Closed, Resolved, Public

Description

We found this to be relevant when searching for templates. Templates usually include their documentation from /doc subpages. As of now, CirrusSearch finds both the template page and the subpage, because they contain the same text.

We tried -intitle:"/", but this doesn't work (probably because the slash is not indexed as a separate term, or not indexed at all).

Something like -intitle:"doc" works, but doesn't scale:

  • It will also exclude templates that actually contain this word.
  • The naming convention for such subpages is different per wiki, e.g. /Doku in dewiki.
  • Templates often contain many more subpages, e.g. forks, tests, and whatnot. We want them all lower in priority.

How to solve this in a way that works for all wikis?

Event Timeline


A different use case I have experienced on non-language wikis (metawiki, mediawikiwiki) was the noise in the results created by listed translation (sub)pages.

A possible partial workaround is to use the intitle: regex feature to exclude titles with a slash in them.

So, Template:PD Help Page gets 184 results on MediaWiki, but Template:PD Help Page -intitle:/\/./ only gets 4.

For some reason, -intitle:/\//, which should mean "doesn't contain a slash", generates a syntax error. Adding the extra period, which matches any character, makes the syntax acceptable.
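For anyone who wants to run this from a script, here is a minimal sketch that builds the corresponding request URL for the MediaWiki Action API (list=search). The endpoint constant and the function name are assumptions for illustration; nothing here performs a network request.

```python
from urllib.parse import urlencode

# Assumption: the standard Action API endpoint of mediawiki.org.
API_URL = "https://www.mediawiki.org/w/api.php"


def build_search_url(terms: str, exclude_subpages: bool = True,
                     api_url: str = API_URL) -> str:
    """Return a list=search URL, optionally excluding titles with a slash."""
    query = terms
    if exclude_subpages:
        # A slash followed by any character. The bare /\// form is
        # rejected with a syntax error, hence the extra period.
        query += r" -intitle:/\/./"
    params = {
        "action": "query",
        "list": "search",
        "srsearch": query,
        "format": "json",
    }
    return f"{api_url}?{urlencode(params)}"


print(build_search_url("Template:PD Help Page"))
```

urlencode takes care of escaping the backslash and the slashes, so the regex does not have to be escaped by hand.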

This approach isn't perfect, and it can fail in at least two ways:

  • Depending on the wiki and the size of the index, the regex portion could time out and return incomplete results. As with any regex search, the way to go is to include non-regex terms that greatly reduce the amount of text the regex has to scan. intitle: regexes are less likely to time out than insource: regexes, and regexes on MediaWiki are less likely to time out than regexes on, say, enwiki.
  • This won't give the desired results if the page has a redirect with a slash in its title, which is rare but possible. For example, the page Extension:FormelApplet has a redirect, Extension:FormelApplet/1.0n, and so the query Extension:FormelApplet -intitle:/\/./ returns 0 results.

Despite these limitations, and depending on your use case, this approach may be of immediate help.
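If the redirect problem matters for your use case, one alternative is to run the plain query and drop subpages on the client, looking only at each hit's own title. The sketch below assumes the usual list=search JSON shape (query → search → entries with a "title" key); the sample response is made up for illustration.

```python
def drop_subpages(response: dict) -> list[str]:
    """Return the titles of all hits whose own title contains no slash."""
    hits = response.get("query", {}).get("search", [])
    # Only the hit's own title is checked, so a redirect such as
    # Extension:FormelApplet/1.0n does not hide Extension:FormelApplet.
    return [hit["title"] for hit in hits if "/" not in hit["title"]]


# Hypothetical response, trimmed to the fields used above.
sample = {
    "query": {
        "search": [
            {"ns": 102, "title": "Extension:FormelApplet"},
            {"ns": 10, "title": "Template:PD Help Page"},
            {"ns": 10, "title": "Template:PD Help Page/doc"},
        ]
    }
}

print(drop_subpages(sample))
```

Note that this filters after ranking and paging, so a page of results can come back shorter than requested, and it still drops subpages that are meant to be used as templates.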

MPhamWMF triaged this task as Medium priority. Mar 22 2021, 3:08 PM
MPhamWMF moved this task from needs triage to elastic / cirrus on the Discovery-Search board.

The naming convention for such subpages is different per wiki, e.g. /Doku in dewiki.

TemplateData already has the message MediaWiki:Templatedata-doc-subpage, which is customisable per wiki, so if you wish to exclude doc pages with intitle, that is one solution (although it is still not ideal, since in some projects template names can probably contain the generic name of the doc subpage).
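A rough sketch of how that could look from a script: ask the wiki for the message via meta=allmessages, then append the value as an intitle exclusion. The helper names, the fallback value "doc", and the handling of both the "*" and "content" keys in the response are assumptions, not something confirmed in this task.

```python
DOC_MESSAGE = "templatedata-doc-subpage"


def doc_subpage_params() -> dict:
    """Action API parameters that fetch the wiki's doc subpage name."""
    return {
        "action": "query",
        "meta": "allmessages",
        "ammessages": DOC_MESSAGE,
        "format": "json",
    }


def doc_subpage_name(response: dict, default: str = "doc") -> str:
    """Extract the message text; fall back to an assumed default."""
    for entry in response.get("query", {}).get("allmessages", []):
        if entry.get("name") == DOC_MESSAGE:
            # The key holding the text depends on the API format version,
            # so both candidates are tried here (an assumption).
            return entry.get("content") or entry.get("*") or default
    return default


def exclude_doc_subpages(terms: str, doc_name: str) -> str:
    """Append an intitle exclusion for the wiki's doc subpage name."""
    return f'{terms} -intitle:"{doc_name}"'


# Hypothetical dewiki-style response.
sample = {"query": {"allmessages": [{"name": DOC_MESSAGE, "*": "Doku"}]}}
print(exclude_doc_subpages("Infobox", doc_subpage_name(sample)))
```

As said above, this still excludes templates whose own name contains that word, so it only addresses the per-wiki naming part of the problem.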

Also, some projects might have widely used templates in subpages, although that is not the best naming strategy. Therefore I’d actually say that TemplateData, more generally, lacks a way to say which templates are for use in main space and which are not. But that is out of scope for this task.

thiemowmde claimed this task.

Summarizing what we found so far:

  • VE already uses templatedata-doc-subpage to exclude doc subpages. But only these.
  • -intitle:/\/./ works but, among other issues, excludes too much, e.g. when a subpage is meant to be a template.
  • gsrsort=incoming_links_desc (or gsrqiprofile=popular_inclinks, which appears to behave the same) ranks subpages last because they are almost unused compared to actual templates. Problem: This disables all other useful ranking criteria, like considering where on the page the search term appears. It lists many more pages that don't contain the user's input in the template name, which is not what we want.
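For reference, the ranking variants from the last point can be compared with a small parameter builder like the one below. This is only a sketch: the function name and the ranking argument are made up for illustration, and the search is restricted to the Template namespace (10).

```python
def build_generator_params(terms: str, ranking: str = "default") -> dict:
    """Build generator=search parameters for one of the ranking variants.

    ranking: "default", "sort" (gsrsort) or "profile" (gsrqiprofile).
    """
    params = {
        "action": "query",
        "generator": "search",
        "gsrsearch": terms,
        "gsrnamespace": "10",  # Template namespace
        "format": "json",
    }
    if ranking == "sort":
        # Orders purely by incoming links; relevance is ignored.
        params["gsrsort"] = "incoming_links_desc"
    elif ranking == "profile":
        params["gsrqiprofile"] = "popular_inclinks"
    elif ranking != "default":
        raise ValueError(f"unknown ranking: {ranking}")
    return params


print(build_generator_params("cite", ranking="sort"))
```

Sending the same terms with each variant makes it easy to see how far the link-based orderings drift from the default relevance ranking.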

Additionally, we got some more information from the WMF search team in April:

[…] the search platform team [is] in a process to document how to interact with our Weighted Tags feature, that allows adding an external flag to any document to make it filterable (which would help with marking the proper template pages): T277275: Document how to publish new weighted tags to CirrusSearch. This feature also allows assigning an arbitrary "weight" to the document, to help with potential ranking, which could be useful for you, since you have the popularity data for the templates. And useful in general, since I verified that our standard scoring mechanism will not work properly for templates.

We might experiment with different rankings as part of T274903, but will probably stick with the default.