Change template search in VisualEditor to use standard search API
Closed, Resolved | Public | 3 Estimated Story Points

Description

Implement the results found in T272457: Investigation: improved template search.

Notes:

  • Don't use any fancy options. No "intitle", no other sort order. The default behavior is usually better, as well as what the users are used to and expect.

Possible follow-up tasks (not part of this ticket):

  • Experiment with and possibly implement a workaround to lower the priority of subpages (e.g. as suggested in T272457).
  • Possibly show a snippet together with the title, especially when the search term doesn't appear in the title.
  • Apply the same changes to MediaWiki-extensions-TemplateWizard → T274907.
  • Ask the CirrusSearch devs how to exclude or lower the priority of subpages → T274908.
  • Ask the CirrusSearch devs how to raise the priority of a template's TemplateData description → T274906.

Event Timeline

I tested the current version of the patch by @thiemowmde and it already improves a lot. Still there's some minor regression and probably also some details we might want to discuss. See my examples below.

Assume the following templates are present: Foo Bar, Infobox Football, InfoboxFootball, FootballInfobox.


The original algorithm starts providing results as soon as one letter is in the search field.

  • typing F shows Foo Bar and FootballInfobox

The patch's algorithm starts providing results as soon as three or four (?) letters are in the search field.

  • typing Foo only shows Foo Bar
  • typing Foot shows Infobox Football and FootballInfobox

This might make sense for performance reasons, but it is a regression and might be a bit confusing.


The original algorithm only matches the names from the start.

  • typing Foot will not show Infobox Football and InfoboxFootball

The patch's algorithm matches words from the start.

  • typing Foot will not show InfoboxFootball

It might make sense to use more wildcards in the algorithm so that the keywords can be found wherever they appear in the template name. I'm not sure how easy it would be to also support results after the first letter.
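To make the three matching behaviors discussed above easier to compare, here is a minimal sketch run against the four example templates. It is a simplified model for illustration only: the function names are made up, and it does not reproduce CirrusSearch's real tokenization, ranking, or the minimum query length observed with the patch.

```python
# Simplified model of three matching strategies (not the real search backends).
TEMPLATES = ["Foo Bar", "Infobox Football", "InfoboxFootball", "FootballInfobox"]


def title_prefix(query, titles):
    """Match titles that start with the query (roughly the original behavior)."""
    q = query.lower()
    return [t for t in titles if t.lower().startswith(q)]


def word_prefix(query, titles):
    """Match titles in which any space-separated word starts with the query."""
    q = query.lower()
    return [t for t in titles if any(w.startswith(q) for w in t.lower().split())]


def substring(query, titles):
    """Match titles that contain the query anywhere (the 'more wildcards' idea)."""
    q = query.lower()
    return [t for t in titles if q in t.lower()]


print(title_prefix("F", TEMPLATES))
print(word_prefix("Foot", TEMPLATES))
print(substring("Foot", TEMPLATES))
```

Only the substring variant finds InfoboxFootball when typing Foot, which is the gap described above.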

@Zbyszko gave us some helpful tips this morning, which can already be applied to this task. The Android app does a hybrid prefix- and full-text- search, apparently falling back if no results are returned: https://github.com/wikimedia/apps-android-wikipedia/blob/88fda5f8c07f0b57e51eb06dfd19cc39433f8dde/app/src/main/java/org/wikipedia/search/SearchResultsFragment.java#L279

We might consider something similar to avoid regressions like the one reported by @WMDE-Fisch

Moving this ticket back to doing to see if we can get the search starting from the first letter with combining prefix- and full-text-search or some other settings.

The current solution of the ticket changes the query from prefixsearch to search. This allows a search in the whole title, but also only displays results for whole word matches (minimum of 4 characters). The original solution displayed matches from the first letter but only from the beginning of the title. I tried to combine both approaches by using the prefixsearch for short queries (smaller than or equal to 4) and the default search for larger strings. See the result in the screencast.

Peek 2021-03-11 18-12.gif (568×746 px, 88 KB)

As you can see this also creates some strange behavior for three character words, but it is the cheapest solution. What do you think, is this sufficient or should we go for another solution? We could try to check for the results of both search methods and display accordingly.
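For reference, the length-based switch described above could look roughly like the following sketch. The threshold of 4 is taken from the comment; the constant and function names are hypothetical, and only the `generator`, `gpssearch` and `gsrsearch` parameter names come from the API calls quoted in this task.

```python
# Hypothetical threshold: queries up to this length use prefixsearch.
PREFIX_SEARCH_MAX_LENGTH = 4


def generator_params(query: str) -> dict:
    """Pick prefixsearch for short queries and the standard search otherwise."""
    if len(query) <= PREFIX_SEARCH_MAX_LENGTH:
        return {"generator": "prefixsearch", "gpssearch": query}
    return {"generator": "search", "gsrsearch": query}


print(generator_params("Foot"))
print(generator_params("Footb"))
```

Counting characters with `len` is exactly the kind of fixed limit that may not suit every script or language, so this is only the cheapest variant.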

Some approaches we can experiment with:

  1. Depending on the length of the input, do either a prefixsearch or a standard search. While this appears to work in many situations, I have concerns:
    • What exactly is the limit, and why? What if it changes?
    • What if the limit is context-sensitive, e.g. depending on what Unicode characters are used? Think of Chinese, for example. A single Chinese character can be an entire word, and should probably use the standard search, not prefixsearch.
  2. Always do a standard search first. Only if there are no results, do a prefixsearch. Issues:
    • The search result dramatically changes at this point. This is surprising.
    • It's more expensive and slower because 2 consecutive API requests are done in many situations.
  3. Always do both, and combine the results somehow (still limited to 10 in total).
    1. Always ask for e.g. 5 prefixsearch results (or 3, or 2, or 1), and 10 standard search results, and list the first 10 in this order.
      • Advantage: We don't need to care what the priority and order of the 2 different search result sets is. We can just list them in the incoming order.
      • However, we must eliminate duplicates. This should be possible based on the page titles, which act as identifiers in MediaWiki. Even if all prefixsearch results are duplicates, the deduplicated result is still guaranteed to contain 10.(*)
      • Advantage: Prefix matches are always listed first. These don't change dramatically while typing.
      • Disadvantages: Expensive and slow because we always do 2 queries.
    2. Ask for 10 standard search results, and if it's not exactly 10, ask for 10 additional prefixsearch results. Remove duplicates. List the first 10 in their incoming order (possibly prioritizing prefixsearch results).
      • Advantage: It's always 10 results.(*)
      • Advantage: It's often only 1 query.
      • Disadvantage: There is still a point where the search result dramatically changes.
      • Disadvantage: There are many situations where not a single prefixsearch result will be shown. However, this is not worse than using the standard search only.
    3. Asking for 10 prefixsearch results and listing them first won't work. It means there is a chance we will never see e.g. "Infobox building" when searching for "building". Also: I think it's not possible to exclude subpages from prefixsearch (see T274908).
  4. Add intitle:… into the mix (documentation). You can think of this as a mixture between the other 2 approaches. While it is limited to the title only – just as prefixsearch – it is not limited to the start of the title. However, I think this will not help. Standard search already favors page titles. And we want it to search the page content.
  5. … what else?

(*) Assuming there are enough pages to find.
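Approach 3.1 from the list above (a few prefixsearch results first, then standard search results, deduplicated by title and cut to 10) can be sketched as follows. This is only an illustration under the assumption that both result sets are already available as ordered lists of page titles; the function and parameter names are made up.

```python
def merge_results(prefix_titles, search_titles, prefix_quota=5, limit=10):
    """Combine two ordered title lists, prefix matches first, without duplicates."""
    merged = []
    seen = set()
    # Take at most `prefix_quota` prefix matches, then all standard results.
    for title in list(prefix_titles)[:prefix_quota] + list(search_titles):
        if title not in seen:  # page titles act as identifiers
            seen.add(title)
            merged.append(title)
    return merged[:limit]


prefix = ["Building", "Building infobox"]
standard = ["Infobox building", "Building", "Architecture"]
print(merge_results(prefix, standard))
```

Even if every prefix match is a duplicate, the merged list still holds up to 10 entries as long as the standard search returned 10.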

After some more discussion of the different approaches we (@thiemowmde and me) came up with the following plan to combine the advantages of both search types:

We would like to keep using search instead of prefixsearch (with * in CirrusSearch, results show up from the first letter). And to keep the search results mostly the same we adapt search to display the results ranked by usages, similar to prefixsearch (example for api call).

Unfortunately the ranking by usages does not guarantee the display of the exact match (somewhere at the top or at all). To solve this we would like to add a simple query for an exact title match and add the result to the top of the result list (example api call). We obviously make sure it never shows up as a duplicate.
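A minimal sketch of the "exact match on top, never duplicated" idea, assuming the exact-title lookup has already been done and returned either a title or nothing. The function name and the limit parameter are illustrative and not taken from the patch.

```python
def add_exact_match(results, exact_title, limit=10):
    """Put the exact title match first and drop any duplicate further down."""
    if exact_title is None:
        return list(results)[:limit]
    rest = [title for title in results if title != exact_title]
    return ([exact_title] + rest)[:limit]


print(add_exact_match(["A", "B", "City", "D"], "City"))
```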


Wow appreciate the detailed breakdown! But also where you landed in the end sounds great to me. It seems like it will combine the important part of each to hopefully show a much improved result list.

We would like to deploy the current solution (with search and *). It is safely hidden behind a feature flag. This way we can test on the beta server if the results are already ok. As a fallback solution for installations that do not use CirrusSearch we could, at a later point, keep the feature flag or add a CirrusSearch check at the same place.

I know that this ticket is on pause at the moment, but wanted to add an idea here before I forget. I was looking through the different betas and found https://simple.wikipedia.beta.wmflabs.org/wiki/Main_Page, where the full Simple English Wikipedia was imported. Because it's more of a 'complete' Wikipedia, this might be a good place to test the search when we get there.

@Lena_WMDE: https://docs.google.com/document/d/1SCtl31pMchiUkJV1iPdPUhUNyKzDcZoVyH37pSldSTA#heading=h.acnh93ezs5as already contains a comparison matrix. I tried to come up with a more complex matrix that incorporates everything written and said so far.

Approaches:

  1. Prefixsearch (status quo)
  2. CirrusSearch only with intitle:…
  3. CirrusSearch with no modifications
  4. CirrusSearch with a * always added to the end
  5. CirrusSearch, ordered by number of backlinks (Warning, this does have a major disadvantage, described in T274908#7103304.)
  6. CirrusSearch plus a single guaranteed exact match at the top
  7. Combine both prefixsearch + CirrusSearch

| Approach | 1. | 2. | 3. | 4. | 5. | 6. | 7. |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Exact match guaranteed to be found | yes | unclear (a) | unclear (a) | unclear (a) | NO | yes | NO |
| Works with special characters, e.g. Template:!! | yes | NO | yes | NO (b) | when combined with 3. | when combined with 3. | sometimes |
| Works with underscores instead of spaces | yes | yes | yes | NO | depends | yes | yes |
| Search in template description/docs | NO | NO | yes | yes | yes | yes | sometimes |
| Stable, predictable order | yes | kind of (c) | kind of (c) | kind of (c) | yes | only for the exact match (c) | NO |
| Prefix search (e.g. searching for "colla" will find "collaboration") | yes | NO | NO | yes | when combined with 4. | when combined with 4. | when combined with 4. |
| Infix search (e.g. searching for "breed" will find "dogbreed") | NO | sometimes (d) | sometimes (d) | sometimes (d) | sometimes (d) | sometimes (d) | sometimes (d) |
| Stemming (e.g. searching for "collaborating" will find "collaboratively") | NO | yes | yes | yes | yes | yes | sometimes |
| Subpages are ignored or listed very low | NO | NO | NO | NO | yes | NO | NO |

Remarks:

  • Key advantages.
  • (a) Should typically be found, but we can't guarantee this. E.g. searching for "city" might find 10 infobox templates that contain the word "city" and are considered highly relevant by CirrusSearch, but the template that's literally called Template:City is found at position 11 and doesn't show up because of this. Approach 6. fixes this.
  • (b) I think we can fix this when we add the * only when the search ends with a character. We should look into the CirrusSearch code to understand the current behavior.
  • (c) CirrusSearch results are stable, but can change dramatically while typing. Much more dramatically than the prefixsearch results do.
  • (d) When searching for "breed", CirrusSearch will find "DogBreed", but not "dogbreed".
Tobi_WMDE_SW added a subscriber: Tobi_WMDE_SW.

Removing Unplanned-Sprint-Work, as we agreed during planning to continue work here.

We're taking next steps to make the most incremental change: the autocomplete will use the full-text search API instead of prefix search, with no special edge case handling or fallbacks.

Change 667641 merged by jenkins-bot:

[mediawiki/extensions/VisualEditor@master] Use standard search API when searching for templates

https://gerrit.wikimedia.org/r/667641

Bug found during demo:

  • If you write a search term, then type a space after, you get a totally different list of results

Known issue:

  • Typing "!" shows no results, need to add exact match to see this

Problem to discuss:

  • Redirects work very differently than in prefix search and won't necessarily show at the top of the list (and therefore in the list of 10 results). We would like to see if we can improve how this works to show redirects in the list in a follow-up ticket. @Lena_WMDE, it probably makes sense for you to review this. We talked about it for a while.

Change 689070 had a related patch set uploaded (by Thiemo Kreuz (WMDE); author: Thiemo Kreuz (WMDE)):

[mediawiki/extensions/VisualEditor@master] Add star to template search term only when it's possible

https://gerrit.wikimedia.org/r/689070

thiemowmde set the point value for this task to 3. (May 12 2021, 12:22 PM)

I suggest these next steps:

  • Investigate the actual behavior of redirects, and compare with the behavior before. It might be a no-op. Just make sure it works as expected.
  • Document the behavior with a space at the end, and list possible solutions (I see at least 2).
  • Start implementing the "exact match" approach (number 3 in T274903#6964804). It looks like this is the only solution for the problem that we currently can't find templates like {{!!}}.

I had a closer look at #1: redirect resolution. Here is a comparison of the two search APIs when the search term is the name of a redirect.

Prefixsearch

…&generator=prefixsearch&gpssearch=NHLE

  • Prefixsearch returns 2 data structures: A list of pages where redirects are already resolved. And a separate data structure with the involved redirects.
  • The dropdown in VE shows both the redirect target and source as separate entries. The entries are sorted so that the redirect target and source are always together. This is done by using the "index" provided by the API.
  • The code for this is not in VE but in the mw.widgets.TitleWidget class in core.
  • It's possible to pick either the actual search result or the redirect from the list. When picking the redirect, this one will appear in the wikitext.
  • There is no concept of "snippets" in prefixsearch. The highlighting of the search term is done in the OO.ui.LabelElement mixin.

Screenshot from 2021-05-18 16-04-31.png (266×358 px, 21 KB)

CirrusSearch

…&generator=search&gsrsearch=NHLE*

  • The "redirects" flag doesn't have any effect on CirrusSearch. It acts as if the flag is always on.
  • CirrusSearch acts as if the redirect is an alias for the actual page. It just finds the actual page.
  • The relevant data structure with the actual search results is compatible across the two APIs. That's why we can swap them out so easily.
  • For what's relevant for this investigation, both APIs behave identically when searching for redirects.
  • The main difference is the way redirects appear in the API response. Because of this redirects aren't shown as separate entries in the VE dropdown. We can add code that either supports the "redirecttitle" property or converts it to match the prefixsearch data structure.
  • There is also an "index", just as in prefixsearch, but this one is very confusing. The numbers are all over the place. What is the relevant order according to the CirrusSearch ranking? The index or the natural order in the JSON?
  • There are also prepared snippets with the search term already highlighted, called "redirectsnippet". However, we don't need to use this.
  • There is also "titlesnippet", but it doesn't work for some reason.

TL;DR:

  • Decide if we want redirects to show up as separate entries. This has several consequences, like forcing users to resolve redirects.
  • Decide if we do a cheap conversion to the prefixsearch data structure, or implement full support for the CirrusSearch API.
  • Make sure the sorting order makes sense.

Otherwise it's a no-op.
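The "cheap conversion" option mentioned above could look roughly like this sketch. It assumes that pages found via a redirect carry a "redirecttitle" property (as requested with gsrprop=redirecttitle) and that the target shape is a list of from/to/index entries like the "redirects" structure prefixsearch returns. The function name and the sample titles are hypothetical.

```python
def redirects_from_pages(pages):
    """Build prefixsearch-style redirect entries from search result pages."""
    redirects = []
    for page in pages:
        redirect_title = page.get("redirecttitle")
        if redirect_title:  # only pages that were matched via a redirect
            redirects.append({
                "from": redirect_title,
                "to": page["title"],
                "index": page.get("index"),
            })
    return redirects


pages = [
    {"title": "Template:Example target", "index": 1,
     "redirecttitle": "Template:Example redirect"},
    {"title": "Template:Other", "index": 2},
]
print(redirects_from_pages(pages))
```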

Let's have a closer look at #2 as well: when your search term ends with a space.

With the current prefixsearch API, whitespace is not trimmed. You can try this for yourself: type slowly, e.g. sweden and then sweden with a space. Compare the search results. They are different.

With CirrusSearch, we can:

  1. Just trim all whitespace before adding the * at the end. With this, "sweden" and "sweden " (with a trailing space) will show the same results. However, I feel like this is a missed opportunity. The space is most probably there for a reason, either because the user is in the middle of typing a longer template name, or because it's a copy-pasted snippet from somewhere. In both cases the space does have a meaning: We know "sweden" is not some cut-off prefix but a full word.
  2. Don't add the * if the search term ends with whitespace. With this, the search results for "sweden" and "sweden " might be quite different because we search for a prefix in one case but for a full word in the other. While this might surprise users, it's very likely helpful because the latter search result is more meaningful (when we assume the space does have a meaning). Additionally, when the user continues to type, the search results for "sweden " and e.g. "sweden fi" are much closer together. In both cases "sweden" is considered a full word.
  3. Just always add the *, resulting in e.g. "sweden *". For CirrusSearch this means "find pages that contain the word sweden and any other word". This is rather meaningless. The result is the same as if the * were ignored. However, we can make it easier for CirrusSearch when we don't add the * when we know it's pointless.

https://gerrit.wikimedia.org/r/689070 implements the 2nd approach.
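As an illustration of the 2nd approach, the rule can be sketched like this. It is not the code from the patch: the function name is made up, and it covers only the trailing-whitespace rule, not other cases where a * might be pointless.

```python
def build_search_term(user_input: str) -> str:
    """Append '*' unless the input ends with whitespace (a finished word)."""
    if not user_input.strip():
        return ""  # nothing to search for
    if user_input[-1].isspace():
        # Trailing space: treat the last word as complete, no wildcard.
        return user_input.strip()
    return user_input.strip() + "*"


for text in ["sweden", "sweden ", "sweden fi"]:
    print(repr(text), "->", repr(build_search_term(text)))
```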

Change 692930 had a related patch set uploaded (by Thiemo Kreuz (WMDE); author: Thiemo Kreuz (WMDE)):

[mediawiki/extensions/VisualEditor@master] [WIP] Show redirects as part of description in template search

https://gerrit.wikimedia.org/r/692930

Outcome of today's story time:

Screenshot from 2021-05-20 11-03-25.png (479×606 px, 63 KB)

Thanks for summary @thiemowmde ! Would you mind posting a screenshot of how the redirect looks with the description at the moment?

See above. This is all default OOUI style, except for the extra line break. One detail that feels a bit off is how close the 2 gray lines are together. I'm not sure if we should change this. The main benefit I can see is that the list items are not that different in size, no matter if they contain 1, 2, or 3 lines.

Thanks for the screenshot - I think it's looking really good! And I think being able to show the redirects will be very helpful.

Would it be possible to style the redirect text in italics and leave the rest as is? I agree there is a benefit to the lines being close together, but if there is italics then it will be clearer at a glance which is the description.

Thanks - looks good! I think it's a definite improvement. I also like that you took out the "Template:" namespace prefix

Change 689070 merged by jenkins-bot:

[mediawiki/extensions/VisualEditor@master] Add star to template search term only when it's possible

https://gerrit.wikimedia.org/r/689070

Screenshot from 2021-05-20 18-01-43.png (471×597 px, 63 KB)

Somehow for me this does not work like that with the current version (#4) of the patch. Let me elaborate:

I've got a Template:Test and a Template:RedirectTest that redirects to it. So an example API query when using the editor could look like this:

api.php?action=query&format=json&formatversion=2&prop=info|pageprops&generator=search&ppprop=disambiguation&redirects=true&gsrsearch=RedirectTes*&gsrnamespace=10&gsrlimit=10&gsrprop=redirecttitle

The result I get from this query looks like this:

{
  "batchcomplete": true,
  "query": {
    "redirects": [
      {
        "ns": 10,
        "title": "Template:RedirectTest",
        "pageid": 2199,
        "index": 1,
        "from": "Template:RedirectTest",
        "to": "Template:Test"
      }
    ],
    "pages": [
      {
        "pageid": 2034,
        "ns": 10,
        "title": "Template:Test",
        "index": 1,
        "contentmodel": "wikitext",
        "pagelanguage": "en",
        "pagelanguagehtmlcode": "en",
        "pagelanguagedir": "ltr",
        "touched": "2021-04-22T11:00:20Z",
        "lastrevid": 12615,
        "length": 980
      }
    ]
  }
}

Note that there's no redirecttitle in the result, as expected by the code of the patch...

Seems like this has to do with configuration. On the beta cluster the results look as expected.

https://en.wikipedia.beta.wmflabs.org/w/api.php?action=query&format=json&formatversion=2&prop=info|pageprops&generator=search&ppprop=disambiguation&gsrsearch=RedirectTest*&gsrnamespace=10&gsrlimit=10&gsrprop=redirecttitle

Seems my CirrusSearch is not set up correctly somehow. Sorry for the spam here...
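For anyone who wants to reproduce the query from the comments above on their own wiki, it can be assembled like this. The parameter names and values are copied from the api.php call quoted earlier; the function names and the example.org base URL are placeholders.

```python
from urllib.parse import urlencode


def template_search_params(term, limit=10):
    """Query parameters for the template search, as quoted in this task."""
    return {
        "action": "query",
        "format": "json",
        "formatversion": "2",
        "prop": "info|pageprops",
        "generator": "search",
        "ppprop": "disambiguation",
        "redirects": "true",
        "gsrsearch": term,
        "gsrnamespace": "10",  # Template namespace
        "gsrlimit": str(limit),
        "gsrprop": "redirecttitle",
    }


def template_search_url(api_base, term):
    """Full request URL; api_base is the wiki's api.php endpoint."""
    return api_base + "?" + urlencode(template_search_params(term))


# Placeholder endpoint, replace with a real wiki's api.php.
print(template_search_url("https://example.org/w/api.php", "RedirectTes*"))
```

If the response contains no redirecttitle for a known redirect, that points at the local CirrusSearch setup rather than at the query itself.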

The new features documented in T274903#7098893 and anything else that's been added should all be copied to the task description for easier reference.

Change 693379 had a related patch set uploaded (by Svantje Lilienthal; author: Thiemo Kreuz (WMDE)):

[mediawiki/extensions/VisualEditor@master] Move exact matches to the top in template search

https://gerrit.wikimedia.org/r/693379

Change 693157 had a related patch set uploaded (by Svantje Lilienthal; author: Thiemo Kreuz (WMDE)):

[mediawiki/extensions/VisualEditor@master] Guarantee exact match when searching for a template

https://gerrit.wikimedia.org/r/693157

Change 692930 merged by jenkins-bot:

[mediawiki/extensions/VisualEditor@master] Show redirects as part of description in template search

https://gerrit.wikimedia.org/r/692930

The first patch with redirects is live on beta. I could briefly confirm two things working fine:

  • Redirects are working as shown in the screenshot above. I've got a template on English beta, RedirectTest, that redirects to 3DForm.
  • Search/autocomplete is working starting from the 1st letter.

Completing this work might resolve T53822: VisualEditor: Transclusion editor template search should additionally find in-string matches (the example given there is: If I type "book", template "Cite book" is not shown – and it looks like Template:Cite book is the 2nd result when searching for "book" in Template namespace on English Wikipedia).

FYI I could also confirm on beta that this issue would be fixed with changes already done for this ticket. Cite book template exists on English beta, typing book into the template search gives Cite book as result.

Change 693379 merged by jenkins-bot:

[mediawiki/extensions/VisualEditor@master] Move exact matches to the top in template search

https://gerrit.wikimedia.org/r/693379

Change 693157 merged by jenkins-bot:

[mediawiki/extensions/VisualEditor@master] Guarantee exact match when searching for a template

https://gerrit.wikimedia.org/r/693157

Lena_WMDE claimed this task.
Lena_WMDE moved this task from Demo to Done on the WMDE-TechWish-Sprint-2021-05-26 board.

@Lena_WMDE: I think we can consider this done now. The only tweak from the matrix at T274903#6964804 we ended up not implementing is to strictly order the search results by number of backlinks. This appears to have major disadvantages, see T274908#7103304. In case we still want to experiment with this idea, let's please create a new ticket.

Change 701089 had a related patch set uploaded (by Thiemo Kreuz (WMDE); author: Thiemo Kreuz (WMDE)):

[mediawiki/extensions/VisualEditor@master] Extract MWTemplateTitleInputWidget.addExactMatch into a method

https://gerrit.wikimedia.org/r/701089

Change 701089 merged by jenkins-bot:

[mediawiki/extensions/VisualEditor@master] Extract MWTemplateTitleInputWidget.addExactMatch into a method

https://gerrit.wikimedia.org/r/701089

Change 701526 had a related patch set uploaded (by Thiemo Kreuz (WMDE); author: Thiemo Kreuz (WMDE)):

[mediawiki/extensions/VisualEditor@master] Add missing search result limitation to template search

https://gerrit.wikimedia.org/r/701526

Change 701526 merged by jenkins-bot:

[mediawiki/extensions/VisualEditor@master] Add missing search result limitation to template search

https://gerrit.wikimedia.org/r/701526