Investigation: improved template search
Closed, ResolvedPublic

Description

Investigate options for implementing T271802: Better search - "wildcard word" rather than "title prefix" search, for template search dialogs (Template findability). Currently templates are searched by prefix. We want to make this a fuzzy, full-text search.

The outcome of the investigation should be a set of subtasks of T271802 that can then be estimated and implemented.

Related Objects

Status	Subtype	Assigned	Task
Resolved		thiemowmde	T296471 Remove updated feature message about search
Resolved		WMDE-Fisch	T303802 Deploy template search improvements to enwiki
Open	Feature	None	T257738 Allow TemplateWizard to search template descriptions
Resolved		None	T302857 Deploy first template focus-area improvements to enwiki
Resolved		WMDE-Fisch	T286990 Deploy template search improvements, back button+warning message, and delete button to all wikis (except enwiki)
Resolved		WMDE-Fisch	T284553 Deploy template search improvements, back button+warning message, and delete button to small set of wikis
Resolved		Lena_WMDE	T271802 Better search - "wildcard word" rather than "title prefix" search, for template search dialogs (Template findability)
Resolved		Lena_WMDE	T274903 Change template search in VisualEditor to use standard search API
Resolved		thiemowmde	T272457 Investigation: improved template search

Event Timeline

Lena_WMDE created this task. Jan 20 2021, 9:53 AM

Lena_WMDE moved this task from Backlog to Ready for pickup on the WMDE-Templates-FocusArea board.

Lena_WMDE moved this task from Ready for pickup to In sprint on the WMDE-Templates-FocusArea board.

Tobi_WMDE_SW set the point value for this task to 5. Jan 20 2021, 11:16 AM

Esanders moved this task from To Triage to Triaged on the VisualEditor board. Jan 20 2021, 1:06 PM

matmarex added a project: Editing-team (Tracking). Jan 28 2021, 4:35 PM

When searching for a template in VisualEditor, each keystroke triggers two API queries:

It uses the "prefixsearch" API as a generator. This returns only the titles of the pages, not the description. Example: https://en.wikipedia.org/w/api.php?action=query&format=json&formatversion=2&prop=info%7Cpageprops&generator=prefixsearch&gpssearch=dogs&gpsnamespace=10&gpslimit=10&ppprop=disambiguation&redirects=true
To get the description as well, the "templatedata" API is used. Example: https://en.wikipedia.org/w/api.php?action=templatedata&format=json&formatversion=2&titles=Template%3ADogs%7CTemplate%3ADogs%20series%7CTemplate%3ADogs%20on%20Acid&includeMissingTitles=1&lang=en

MediaWiki-extensions-TemplateWizard does effectively the same, but combines the two queries. Example: https://en.wikipedia.org/w/api.php?action=templatedata&format=json&generator=prefixsearch&gpssearch=dogs&gpsnamespace=10&redirects=true&includeMissingTitles=true&lang=en
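The two request shapes can be reproduced outside the editor for comparison. The following is only a sketch: the parameter names and values are copied from the example URLs above, while the helper names `prefixsearch_url` and `templatedata_url` are made up for illustration. It builds the URLs and does not send any request.

```python
from urllib.parse import urlencode

API = "https://en.wikipedia.org/w/api.php"


def prefixsearch_url(term, limit=10):
    """Build the title-only query that uses prefixsearch as a generator."""
    params = {
        "action": "query",
        "format": "json",
        "formatversion": 2,
        "prop": "info|pageprops",
        "generator": "prefixsearch",
        "gpssearch": term,
        "gpsnamespace": 10,  # Template namespace
        "gpslimit": limit,
        "ppprop": "disambiguation",
        "redirects": "true",
    }
    return API + "?" + urlencode(params)


def templatedata_url(titles, lang="en"):
    """Build the follow-up query that fetches TemplateData for the titles."""
    params = {
        "action": "templatedata",
        "format": "json",
        "formatversion": 2,
        "titles": "|".join(titles),
        "includeMissingTitles": 1,
        "lang": lang,
    }
    return API + "?" + urlencode(params)


if __name__ == "__main__":
    print(prefixsearch_url("dogs"))
    print(templatedata_url(["Template:Dogs", "Template:Dogs series"]))
```

Note that `urlencode` encodes spaces as `+` rather than `%20`; both forms are equivalent in a query string.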

Findings:

The prefixsearch documentation mentions "profiles", but the more interesting ones (e.g. "fuzzy-subphrases") don't work on enwiki, for example.
According to https://www.mediawiki.org/wiki/API:All_search_modules, the only alternative appears to be the "search" API.
Compare the "prefixsearch" example above to this one: https://en.wikipedia.org/w/api.php?action=query&formatversion=2&generator=search&gsrsearch=intitle:dogs&gsrnamespace=10&gsrlimit=10. Looks good.
One issue I see is that this finds subpages as well. I don't see an obvious way to exclude them, or at least lower their priority. However, this was the same before (example), so we don't make anything worse here.
A possible hack to improve the subpage situation is to ask for 20 instead of 10 results, reorder them (move subpages to the end), and only display the first 10 results.
It's possible to limit the search space to titles only with the intitle:… keyword. However, this isn't necessarily better. For example, searching for "pinscher" without intitle:… gives quite nice results, much better than if the search were limited to the title. Usually, the internal ranking algorithm prefers matches in the title and ranks them closer to the top.
It's possible to ask for a different sort order by e.g. adding …&gsrsort=incoming_links_desc to the URL. But this doesn't necessarily improve the result.
There is no way to specifically search for the <templatedata> description field. The only way I can think of is something like insource:/"description":\s*"[^"]*dogs/. But regex searches like this are slow and therefore heavily capped, so this can't be used in production.
Good news: It doesn't matter if the <templatedata> blob is on a subpage. CirrusSearch indexes the rendered HTML, not the wikitext. Example.
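The "ask for 20, show 10" hack from the findings could look roughly like this. This is a sketch under assumptions, not the implemented solution: it treats any title containing "/" as a subpage (which misclassifies templates that legitimately have a slash in their name), and the function names `search_url` and `demote_subpages` are hypothetical. The `gsr*` parameters are the ones from the example URL above.

```python
from urllib.parse import urlencode

API = "https://en.wikipedia.org/w/api.php"


def search_url(term, fetch_limit=20):
    """Build a generator=search query that over-fetches results."""
    params = {
        "action": "query",
        "formatversion": 2,
        "generator": "search",
        "gsrsearch": term,
        "gsrnamespace": 10,  # Template namespace
        "gsrlimit": fetch_limit,
    }
    return API + "?" + urlencode(params)


def demote_subpages(titles, display_limit=10):
    """Move subpages to the end, keeping the original order within each
    group, then cut the list down to the number of results to display.

    Assumption: a "/" anywhere in the title marks a subpage.
    """
    top_level = [t for t in titles if "/" not in t]
    subpages = [t for t in titles if "/" in t]
    return (top_level + subpages)[:display_limit]


if __name__ == "__main__":
    ranked = [
        "Template:Dogs/doc",
        "Template:Dogs",
        "Template:Dogs/sandbox",
        "Template:Dog breeds",
    ]
    print(search_url("dogs"))
    print(demote_subpages(ranked, display_limit=3))
```

Because the reordering is a stable partition, the relative ranking returned by the search backend is preserved within the top-level pages and within the subpages.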

Open questions:

How to exclude subpages? Can we introduce a new CirrusSearch feature to do this?
Can we lower the priority of subpages? Is it worth tweaking CirrusSearch to do this by default?
Can we teach CirrusSearch to prioritize the "description" from the TemplateData blob more, similar to headings and leading paragraphs?
Is it possible to force CirrusSearch to list matches in titles first?
…

awight updated the task description. Feb 1 2021, 9:29 AM

Lena_WMDE edited projects, added WMDE-TechWish (Sprint-2021-02-03); removed WMDE-TechWish (Sprint-2021-01-20). Feb 3 2021, 9:41 AM

Lena_WMDE moved this task from Sprint Backlog to Doing on the WMDE-TechWish (Sprint-2021-02-03) board.

Lena_WMDE changed the point value for this task from 5 to 1. Feb 3 2021, 9:46 AM

thiemowmde mentioned this in T274903: Change template search in VisualEditor to use standard search API. Feb 16 2021, 4:30 PM

thiemowmde added a parent task: T274903: Change template search in VisualEditor to use standard search API.

thiemowmde mentioned this in T274906: Prioritize TemplateData description when indexing/searching. Feb 16 2021, 4:41 PM

thiemowmde moved this task from Doing to Demo on the WMDE-TechWish (Sprint-2021-02-03) board. Feb 17 2021, 9:13 AM

thiemowmde added a project: WMDE-TechWish-Sprint-2021-02-17.

thiemowmde moved this task from Sprint Backlog to Demo on the WMDE-TechWish-Sprint-2021-02-17 board.

Lena_WMDE removed the point value for this task. Feb 17 2021, 9:33 AM

For reference, we got some additional information from the WMF search team in April:

[…] the search platform team […] might be able to add a per-token edge n-gram analysis for template search. In layman terms, that means that users could have suggestions based on the beginning of each word of the name (assuming standard tokenization, i.e. based on whitespace). That would solve some users' cases, like searching for a second word in the string, but also would allow matching for beginning of all words (e.g. allowing users not to type full words and still find the templates). […] this change is not minimal in our system (at the very least, it would require adding a new type of search). […]

For now we solved this with a * at the end, which turns it into a prefix search.
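As a rough illustration of the trailing-asterisk approach, the sketch below appends a `*` to whatever the user typed before it is handed to the search API. The function name `with_trailing_wildcard` and the handling of empty or already-wildcarded input are assumptions made for this example; the deployed code may treat these cases differently.

```python
def with_trailing_wildcard(term):
    """Append "*" so the last typed word is matched as a prefix.

    Assumptions: empty input is passed through unchanged, and a term that
    already ends in "*" is not given a second one.
    """
    term = term.strip()
    if not term or term.endswith("*"):
        return term
    return term + "*"


if __name__ == "__main__":
    for typed in ["dog", "dogs on ac", "dog*", "   "]:
        print(repr(typed), "->", repr(with_trailing_wildcard(typed)))
```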
