
Investigation: improved template search
Closed, Resolved, Public

Description

Investigate options for implementing T271802: Better search - "wildcard word" rather than "title prefix" search, for template search dialogs (Template findability). Currently, templates are searched by title prefix. We want to turn this into a fuzzy, full-text search.

The outcome of the investigation should be a set of subtasks of T271802 that can then be estimated and implemented.

Event Timeline

Tobi_WMDE_SW set the point value for this task to 5. (Jan 20 2021, 11:16 AM)
thiemowmde added a subscriber: thiemowmde.

When searching for a template in VisualEditor, each keystroke triggers two API queries:

MediaWiki-extensions-TemplateWizard does effectively the same, but combines the two queries. Example: https://en.wikipedia.org/w/api.php?action=templatedata&format=json&generator=prefixsearch&gpssearch=dogs&gpsnamespace=10&redirects=true&includeMissingTitles=true&lang=en
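
For illustration, the combined request above can be assembled from its parameters. This is only a sketch: the function name `build_template_prefix_query` is made up here, and the parameter values are copied from the example URL, not taken from the TemplateWizard code.

```python
from urllib.parse import urlencode

API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def build_template_prefix_query(term: str, lang: str = "en") -> str:
    """Return the URL of a combined templatedata + prefixsearch request.

    Hypothetical helper; the parameters mirror the example URL above.
    """
    params = {
        "action": "templatedata",
        "format": "json",
        "generator": "prefixsearch",  # the search feeds titles into templatedata
        "gpssearch": term,
        "gpsnamespace": 10,  # 10 is the Template namespace
        "redirects": "true",
        "includeMissingTitles": "true",
        "lang": lang,
    }
    return API_ENDPOINT + "?" + urlencode(params)


url = build_template_prefix_query("dogs")
print(url)
```

Because the search runs as a generator, the matching titles and their TemplateData arrive in a single response.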

Findings:

  • The prefixsearch documentation mentions "profiles", but the more interesting ones (e.g. "fuzzy-subphrases") don't work on enwiki, for example.
  • According to https://www.mediawiki.org/wiki/API:All_search_modules, the only alternative appears to be the "search" API.
  • Compare the "prefixsearch" example above to this one: https://en.wikipedia.org/w/api.php?action=query&formatversion=2&generator=search&gsrsearch=intitle:dogs&gsrnamespace=10&gsrlimit=10. Looks good.
  • One issue I see is that this finds subpages as well. I don't see an obvious way to exclude them, or at least lower their priority. However, this was the same before (example), so we don't make anything worse here.
  • A possible hack to improve the subpage situation is to ask for 20 instead of 10 results, reorder them (move subpages to the end), and only display the first 10 results.
  • It's possible to limit the search space to titles only with the intitle:… keyword. However, this isn't necessarily better. For example, searching for "pinscher" without intitle:… gives quite nice results, much better than if the search were limited to the title. The internal ranking algorithm usually prefers matches in the title anyway and moves them towards the top.
  • It's possible to ask for a different sort order by e.g. adding …&gsrsort=incoming_links_desc to the URL. But this doesn't necessarily improve the result.
  • There is no way to specifically search the <templatedata> description field. The only way I can think of is something like insource:/"description":\s*"[^"]*dogs/. But regex searches like this are slow and therefore heavily capped, so this can't be used in production.
  • Good news: It doesn't matter if the <templatedata> blob is on a subpage. CirrusSearch indexes the rendered HTML, not the wikitext. Example.
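
The reordering hack mentioned in the findings (ask for 20 results, move subpages to the end, display the first 10) could look roughly like this. It is a minimal sketch: the function names are invented, the sample titles are made up, and treating every title that contains a "/" as a subpage is a heuristic assumption.

```python
def is_subpage(title: str) -> bool:
    """Heuristic: any title containing "/" counts as a subpage."""
    return "/" in title


def demote_subpages(titles: list[str], display_limit: int = 10) -> list[str]:
    """Move subpages behind all other results, then cut to the display limit."""
    # sorted() is stable, so the original search ranking is preserved
    # within the non-subpage group and within the subpage group.
    reordered = sorted(titles, key=is_subpage)
    return reordered[:display_limit]


# Made-up result list, in the order the search API might return it.
results = [
    "Template:Dogs/doc",
    "Template:Dogs",
    "Template:Dogs/sandbox",
    "Template:Dog breeds",
]
print(demote_subpages(results, display_limit=3))
# ['Template:Dogs', 'Template:Dog breeds', 'Template:Dogs/doc']
```

Subpages are only pushed down, not dropped, so they still show up when there are too few other matches.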

Open questions:

  • How to exclude subpages? Can we introduce a new CirrusSearch feature to do this?
  • Can we lower the priority of subpages? Is it worth tweaking CirrusSearch to do this by default?
  • Can we teach CirrusSearch to prioritize the "description" from the TemplateData blob more, similar to headings and leading paragraphs?
  • Is it possible to force CirrusSearch to list matches in titles first?
Lena_WMDE changed the point value for this task from 5 to 1. (Feb 3 2021, 9:46 AM)
Lena_WMDE removed the point value for this task. (Feb 17 2021, 9:33 AM)
thiemowmde claimed this task.

For reference, we got some additional information from the WMF search team in April:

[…] the search platform team […] might be able to add a per-token edge n-gram analysis for template search. In layman terms, that means that users could have suggestions based on the beginning of each word of the name (assuming standard tokenization, i.e. based on whitespaces). That would solve some users' cases, like searching for a second word in the string, but also would allow matching for beginning of all words (e.g. allowing users not to type full words and still find the templates). […] this change is not minimal in our system (at the very least, it would require adding a new type of search). […]
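
To make the idea of per-token edge n-grams more concrete, here is a toy sketch in plain Python. It is not how CirrusSearch or its search backend implements this; the function names and the sample title are invented, and tokenization is assumed to be a simple whitespace split as described in the quote.

```python
def edge_ngrams(token: str, min_len: int = 1) -> set[str]:
    """All prefixes of a token, e.g. "dog" -> {"d", "do", "dog"}."""
    return {token[:i] for i in range(min_len, len(token) + 1)}


def index_title(title: str) -> set[str]:
    """Edge n-grams for every whitespace-separated word of a title."""
    grams: set[str] = set()
    for token in title.lower().split():
        grams |= edge_ngrams(token)
    return grams


def matches(query: str, title: str) -> bool:
    """True if every query word is the beginning of some word in the title."""
    grams = index_title(title)
    return all(word in grams for word in query.lower().split())


print(matches("bre", "Infobox dog breed"))    # True: beginning of the third word
print(matches("do br", "Infobox dog breed"))  # True: incomplete words still match
print(matches("box", "Infobox dog breed"))    # False: "box" is not at a word start
```

The last case shows the limit of this approach: matches inside a word are still not found, only matches at the beginning of each word.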

For now we solved this with a * at the end, which turns it into a prefix search.
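
A rough sketch of what this looks like on the API side, assuming the asterisk is simply appended to the user's input before it is sent as gsrsearch. The function name is hypothetical and the remaining parameters are copied from the "search" example URL further up, so the actual implementation may differ.

```python
from urllib.parse import urlencode

API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def build_wildcard_search_query(term: str, limit: int = 10) -> str:
    """Return a generator=search URL with "*" appended to the search term."""
    term = term.strip()
    # Assumption: add the wildcard only once and never to an empty input.
    if term and not term.endswith("*"):
        term += "*"
    params = {
        "action": "query",
        "formatversion": 2,
        "generator": "search",
        "gsrsearch": term,
        "gsrnamespace": 10,  # 10 is the Template namespace
        "gsrlimit": limit,
    }
    return API_ENDPOINT + "?" + urlencode(params)


print(build_wildcard_search_query("dogs"))
```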