[M] Add "did you mean" feature to Media Search
Closed, Resolved (Public)

Description

As a Media Search user, I want a "did you mean" feature to suggest alternative spellings, so that I can find what I'm looking for even if I spell it wrong.

We currently have the "did you mean" feature in the existing Commons search - this ticket is to add it to Media Search.

Design

Screen Shot 2020-08-31 at 2.54.12 PM.png (1×2 px, 2 MB)

The plan is to repurpose what's already in CirrusSearch directly; if that can't be done, we will reconsider this ticket.

Event Timeline

I think following the general existing design works for me but we can make "Did you mean:" in black and the recommended query in blue

Screen Shot 2020-08-31 at 2.54.12 PM.png (1×2 px, 2 MB)

Does the current implementation have any logic around switching to the new spelling automatically if it meets a certain threshold but still allowing you to switch back to the supposed incorrect spelling?

Something like this:

Showing results for: Bristlecone Trees
Search instead for: Bristlecone Treez

@TJones see @mwilliams' question above - do you know if this functionality exists? Thanks!

I think it's done in CirrusSearch, rather than deeper down in the stack, but I'm not sure, since it's been around forever and I've never messed with it. The logic is very simple: if there are 0 results and a suggestion, run the suggestion as a new query and show a message: "Showing results for <suggestion>. No results found for <query>."

For example: einstn gets zero results, so we search automatically for the suggestion, einstein and show those results.

We prevent recursive suggestions—sometimes suggestions get no results, and can generate suggestions on their own. (It's rare, and hard to concoct an example, but it does happen.) For example: searching for einstn rutbaga gets the suggestion einstein rutana, which gets no results. But if you click on einstein rutana, you get a fresh search with einstein rutland as the suggestion.

[Note that your suggestion may differ. For such weird queries the suggestions differ on different shards because the word-level statistics are different for these typos and rare combinations of words.]
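
A minimal sketch of the zero-results rewrite logic described above, for illustration only. This is not the CirrusSearch code: the backend callable, the SearchResponse type and the returned dictionary keys are all hypothetical names, and what the real UI displays when the rewritten query is also empty is not modelled here.

```python
from typing import Callable, Dict, List, NamedTuple, Optional


class SearchResponse(NamedTuple):
    results: List[str]
    suggestion: Optional[str]  # None when the backend has nothing to suggest


# Hypothetical backend: takes a query string and returns a SearchResponse.
Backend = Callable[[str], SearchResponse]


def search_with_rewrite(query: str, backend: Backend) -> Dict[str, object]:
    """Run `query`; if it has 0 results and a suggestion, run the suggestion once."""
    first = backend(query)
    if first.results or first.suggestion is None:
        # Either we have results, or there is nothing to rewrite to.
        return {"query_used": query, "results": first.results, "rewritten_from": None}

    # Zero results and a suggestion: search again with the suggestion.
    second = backend(first.suggestion)
    # No recursion: even if this second search is empty and carries its own
    # suggestion, that suggestion is never followed automatically.
    return {
        "query_used": first.suggestion,
        "results": second.results,
        "rewritten_from": query,
    }
```

With a backend that behaves like the examples above, "einstn" would end up showing the results for "einstein", while "einstn rutbaga" would stop after a single rewrite to "einstein rutana" even though that query is empty too.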

@TJones Thanks for the context! That is super helpful.

I could imagine a bunch of work related to this in the future but for now just getting the same "Did you mean" functionality that we have in the current Commons search into media search seems to be the next step.

CBogen renamed this task from Add "did you mean" feature to Media Search to [M] Add "did you mean" feature to Media Search. Sep 23 2020, 4:46 PM
CBogen updated the task description.

We're using the search API, which already exposes suggestions (&srinfo=suggestion). E.g. https://commons.wikimedia.org/w/api.php?action=query&list=search&srsearch=einstn&srnamespace=6&srinfo=suggestion&mediasearch=1
Sadly, it's not exposing any such data when used as a generator (which is how we use it), and changing things to add that kind of data likely isn't happening (see https://gerrit.wikimedia.org/r/c/mediawiki/core/+/394120/1#message-1e5569733143789c304edc5d8bc1c7fc17362e15)

Proposed workaround: if (and only if) we do not get search results back (with gsroffset=0), then fire a second API call (with list=search instead of generator=search and a minimal srlimit) to see if there are alternative suggestions; then display those.
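
Roughly, the workaround could look like the sketch below. The function names are made up for illustration; the parameters are the list=search ones from the example URL above (srlimit being the list=search counterpart of gsrlimit), and the response is assumed to carry the suggestion under query.searchinfo.suggestion.

```python
from typing import Any, Dict, Optional
from urllib.parse import urlencode


def needs_suggestion_lookup(result_count: int, offset: int) -> bool:
    """Only look for suggestions when the first page of results came back empty."""
    return result_count == 0 and offset == 0


def suggestion_params(term: str, namespace: int = 6) -> Dict[str, str]:
    """Parameters for the second, list=search call that only fetches a suggestion."""
    return {
        "action": "query",
        "format": "json",
        "list": "search",
        "srsearch": term,
        "srnamespace": str(namespace),
        "srinfo": "suggestion",
        "srlimit": "1",  # we only care about the suggestion, not the hits
    }


def extract_suggestion(response: Dict[str, Any]) -> Optional[str]:
    """Pull the suggestion out of a decoded JSON response, if there is one.

    Assumes the suggestion lives at query.searchinfo.suggestion.
    """
    return response.get("query", {}).get("searchinfo", {}).get("suggestion")


# Example: query string for the follow-up request (no network call is made here).
query_string = urlencode(suggestion_params("einstn"))
```

The extra request is only made when needs_suggestion_lookup() is true, so searches that return results cost the same as before.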

@TJones and @EBernhardson, would love your thoughts on the above!

I think Matthias has correctly identified the issue: the suggestion can be requested, but the current structure of the API throws it away when the search is used to collect more information (such as thumbnails). The right way forward is perhaps a pain, which is why I didn't do it at the time (I also didn't have a concrete use case, just an annoyance), but Anomie didn't say we can't do this. He said:

I think let's have an RFC on whether we should start allowing generators to return data instead of just generating titles.

Blocked by T270381 (which is actual implementation as a result of resolving T263841)

AnneT removed AnneT as the assignee of this task. Jan 26 2021, 11:13 PM
AnneT added a subscriber: AnneT.

I was reading back and realized this question was never fully answered:

Does the current implementation have any logic around switching to the new spelling automatically if it meets a certain threshold but still allowing you to switch back to the supposed incorrect spelling?

Trey described how it works, but not how to use it. Specifically this is the srenablerewrites option for the search api. This defaults to off.

If I wanted to see suggestions locally (in a Vagrant environment that is up to date with the patch for T270381), is there anything I need to do beyond adding gsrinfo=suggestion to the URL I'm using to request search results? Is there a job that needs to be run to generate these suggestions somehow, or any other configuration that needs to be enabled?

For reference, here's what the full URL looks like for a typical MediaSearch API request (for the term "seattle" and "bitmap" media type):

GET http://commons.wiki.local.wmftest.net:8080/w/api.php?action=query&format=json&uselang=en&generator=search&gsrsearch=filetype%3Abitmap|drawing%20seattle&gsrlimit=40&gsroffset=0&gsrinfo=totalhits&prop=info|imageinfo|entityterms&inprop=url&gsrnamespace=6&iiprop=url|size|mime&iiurlheight=180&wbetterms=label&mediasearch=true

IIRC, any expert syntax, such as filetype:bitmap|drawing, disables the suggestion generation. In part this is because the historical query parser extracts keywords from the search string with regexp, and we don't know how to take a suggestion from the search engine and add the expert syntax back on to return a valid suggestion. @dcausse started work towards removing these kinds of restrictions, but I'm pretty sure it didn't make it to a place that can work here.

I suppose we've always been trying to come up with a solution that is generally applicable. We might be able to narrow the use case and come up with something that can work here.

We hardcode various values for filetype as the user goes back and forth between the various tabs in the new search UI, so this may be a blocker for now.

Here's one potential solution: the MediaSearch UI stores the term separate from the other parameters like filetype, only combining them at the time an API request is made. It would be easy for us to simply add some additional parameter to the request which just contained the bare search term, so that suggestions could be provided. I don't know if baking this kind of a workaround into the API itself is desirable or practical, though.
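
To make the idea concrete, here is a small sketch of keeping the term separate from the filters and only combining them when the request is built. Everything in it is illustrative: SearchState and request_params are made-up names, and bare_term_param stands in for a parameter that does not exist in the API today, so it is left out unless the caller names one.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class SearchState:
    """What the UI tracks: the bare term and the filters, stored separately."""
    term: str
    filetypes: Tuple[str, ...] = ()

    def combined_query(self) -> str:
        """Build the gsrsearch string, e.g. 'filetype:bitmap|drawing seattle'."""
        if not self.filetypes:
            return self.term
        return "filetype:" + "|".join(self.filetypes) + " " + self.term


def request_params(state: SearchState,
                   bare_term_param: Optional[str] = None) -> Dict[str, str]:
    """Combine term and filters at request time.

    bare_term_param is HYPOTHETICAL: no such API parameter exists. If a name is
    given, the bare term is sent under it so suggestions could be based on it.
    """
    params = {
        "action": "query",
        "format": "json",
        "generator": "search",
        "gsrsearch": state.combined_query(),
        "gsrinfo": "totalhits|suggestion",
    }
    if bare_term_param is not None:
        params[bare_term_param] = state.term
    return params
```

For the "seattle" / bitmap example above, SearchState("seattle", ("bitmap", "drawing")).combined_query() gives the same gsrsearch value as in the URL, while the bare term stays available on its own.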

If extracting term from expert syntax is a general problem, then such a workaround (as an optional param) may be helpful for other use-cases beyond just MediaSearch.

I did some further testing and digging around in CirrusSearch, and I seem to be mistaken about when we have suggestions available. My apologies, I don't work in this codebase nearly as much as I used to. Perhaps we did resolve this case at some point.

We can get general proof that this is possible from regular fulltext search: the search for filetype:jpg medai suggests filetype:jpg media: https://commons.wikimedia.org/w/index.php?search=filetype%3Ajpg+medai

We can get the same suggestion from the api: https://commons.wikimedia.org/wiki/Special:ApiSandbox#action=query&format=json&list=search&srsearch=filetype%3Ajpg%20medai&srinfo=suggestion

Enabling srenablerewrites does not seem to enable the same functionality as the web ui, although it is supposed to. This should return the suggested query results (but does not): https://commons.wikimedia.org/wiki/Special:ApiSandbox#action=query&format=json&list=search&srsearch=filetype%3Ajpg%20medai&srinfo=suggestion&srenablerewrites=1

Using the generator mode is also not returning the suggestion: https://commons.wikimedia.org/wiki/Special:ApiSandbox#action=query&format=json&generator=search&gsrsearch=filetype%3Ajpg%20medai&gsrinfo=suggestion

But performing the exact same request against a different wiki gets suggestions:
https://test.wikipedia.org/wiki/Special:ApiSandbox#action=query&format=json&generator=search&gsrsearch=filetype%3Ajpg%20medai&gsrinfo=suggestion

testwiki is on wmf.28; commonswiki likely just has to roll forward from wmf.27.

You are testing against commonswiki, but on a local instance. I have to assume your local instance is up to date with at least wmf.28? Otherwise, more investigation is needed.

Interesting, thanks for digging into this. Looks like this will be ready for development once we figure out what's going on with Commons' configuration.

@EBernhardson and @egardner do we need a separate ticket to figure out the Commons' configuration issue and unblock this?

Did some further testing, following up on before:

The problem with autorewrite is it seems to be disabled when expert syntax is used. When using only a word we get the rewrite:

https://commons.wikimedia.org/wiki/Special:ApiSandbox#action=query&format=json&list=search&srsearch=picturree&srnamespace=6&srinfo=suggestion&srenablerewrites=1

But adding the expert syntax turns off autorewrite:
https://commons.wikimedia.org/wiki/Special:ApiSandbox#action=query&format=json&list=search&srsearch=filetype%3Ajpg%20picturree&srnamespace=6&srinfo=suggestion&srenablerewrites=1

I don't think this limitation is necessary anymore. I've submitted a patch for review that removes it, will figure out if there are any blockers.

I tested generator mode with suggestions locally, and as far as I can tell this is all working. I think the only reason it isn't working on commons.wikimedia.org is that the train was rolled back to wmf.27, but the functionality was added in wmf.28. This should all work in Vagrant; it worked on testwiki because testwiki didn't get rolled back, and it will soon work on commonswiki.

http://commons.wiki.local.wmftest.net:8080/wiki/Special:ApiSandbox#action=query&format=json&list=search&srsearch=main%20paeg%20intitle%3Amain&srinfo=totalhits%7Csuggestion%7Crewrittenquery&srenablerewrites=1

Nothing special should be required to get these suggestions, just swap a couple characters in a word that you know is in the index and search should return a suggestion for it.

Change 661266 had a related patch set uploaded (by Ebernhardson; owner: Ebernhardson):
[mediawiki/extensions/CirrusSearch@master] Remove simple_bag_of_words restriction on suggestions

https://gerrit.wikimedia.org/r/661266

Thanks @EBernhardson for the update. I tried some more local API requests in Vagrant, but I'm still not able to see any suggestions locally – this is both with and without your patch applied. Our MediaSearch page limits itself to the File namespace for most situations, I wonder if that could have anything to do with this? Alternatively, maybe there is a problem with my search index locally and I need to rebuild it?

Here's a link to the local API Sandbox in Commonswiki that I've been using: http://commons.wiki.local.wmftest.net:8080/wiki/Special:ApiSandbox#action=query&format=json&uselang=en&prop=info%7Cimageinfo%7Centityterms&generator=search&inprop=url&iiprop=url%7Csize%7Cmime&iiurlheight=180&wbetterms=label&gsrsearch=filetype%3Abitmap%7Cdrawing%20seattl&gsrnamespace=6&gsrlimit=40&gsroffset=0&gsrinfo=totalhits%7Csuggestion%7Crewrittenquery&gsrenablerewrites=1

Locally I have a file with a name beginning with Seattle – I tried changing the search term to seattl but I have not been able to get any suggestions so far.

@egardner testing suggestions locally is a bit tricky as it is based on term frequencies and I doubt that a single title with Seattle is enough. I suggest creating at least 10 pages with Seattle in their title.

Change 661266 merged by jenkins-bot:
[mediawiki/extensions/CirrusSearch@master] Remove simple_bag_of_words restriction on suggestions

https://gerrit.wikimedia.org/r/661266

Thanks @dcausse, that makes sense. I think that I'll just wait until wmf.29 goes out onto Commons (next week hopefully?) to resume development on this. Then I can hit the production API and get the expected response even if my local data is sparse.

Just writing here to confirm that the API on production Commons provides suggestions as expected now that wmf.29 has gone out. Here's an example using the search term "Seattl" that correctly returns the suggestion "Seattle":

https://commons.wikimedia.org/wiki/Special:ApiSandbox#action=query&format=json&uselang=en&prop=info%7Cimageinfo%7Centityterms&generator=search&inprop=url&iiprop=url%7Csize%7Cmime&iiurlheight=180&wbetterms=label&gsrsearch=filetype%3Abitmap%7Cdrawing%20seattl&gsrnamespace=6&gsrlimit=40&gsroffset=0&gsrinfo=totalhits%7Csuggestion%7Crewrittenquery&gsrenablerewrites=1

I can use the production API in local development, so I'm going to pick this task back up. Thanks to @dcausse and @EBernhardson for clarifying why this wasn't working earlier.

Change 663063 had a related patch set uploaded (by Eric Gardner; owner: Eric Gardner):
[mediawiki/extensions/WikibaseMediaInfo@master] Add "did you mean" feature to Media Search

https://gerrit.wikimedia.org/r/663063

Just writing here to confirm that the API on production Commons provides suggestions as expected now that wmf.29 has gone out.

Sadly I spoke too soon, and we're back on wmf.27. I was able to use the updated production Commons API for a few hours today to write a patch, but it got rolled back before I could finish testing. I think this will work correctly once the train goes out again, but anyone reviewing should be aware that if you try to test the patch using $wgMediaInfoLocalDev = true;, you won't see suggestions. We should probably refrain from merging until we can test things fully, but in the meantime I welcome review comments.

@mwilliams heads up, in addition to the ability to provide an alternative, "did you mean:" search term, the search API also has the ability to rewrite a query that gets zero results into a corrected query.

For example, this search for video results matching the term "hous" can be automatically corrected to show the results for "house" instead.

I was going to try to just include both behaviors into the patch I'm working on, but I think it is a little more complicated than that:

Rewritten queries only happen if a search has zero results. The MediaSearch UI presents a single page to the user, but really each tab is a completely separate query (with different parameters for file type). If we were to enable query rewrites, here's what I think would happen pretty regularly: one tab (say Video or "other") might not have any matches for a given query. Another tab (say, Images) might have a small number. So what happens then? We don't want different tabs to show the results for entirely different search terms – that contradicts the way the UI works in all other circumstances. To make matters worse, the tab with a few results would probably include a suggestion while the tab with rewritten results would already be showing the results for that suggestion. Sounds like a recipe for confusion.

For now, I think we should just show suggestions but not do any rewrites. If we can come up with a good way to handle both, I am happy to add a follow-up patch here.

For now, I think we should just show suggestions but not do any rewrites. If we can come up with a good way to handle both, I am happy to add a follow-up patch here.

+1

@egardner This makes sense, getting rewrites 100% right would obviously be a good thing but that sounds very tricky as you've explained. I also don't think rewrites are a blocker for us making Media Search the default, more of a "really nice to have".

Is there a way for us to know that all the tabs returned 0 results and only do the rewrite then?

Not in 1 take. I think the entire routing would have to look something like this:

  • do the original search query (only for a specific tab)
  • if it comes up empty, do one across all tabs (we can't do that straight off the bat because 2 queries for every search term would be very expensive), which - if empty - rewrites
  • if that one does come up with a rewrite, then use the rewritten term to perform another query (narrowed down to the tab the user is looking at)
  • show those results (which again might be empty for that tab)

IMO, that's all pretty complex, probably quite slow, and still no guarantee for results in the tab you're looking at.
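
For illustration, the routing above could be sketched like this. It is not a proposal to implement it: the search callable, its (term, tab, allow_rewrite) signature and the Response type are hypothetical, with tab=None standing for "across all tabs".

```python
from typing import Callable, List, NamedTuple, Optional, Tuple


class Response(NamedTuple):
    results: List[str]
    rewritten_query: Optional[str] = None  # set only when the backend rewrote the query


# Hypothetical search function: (term, tab, allow_rewrite) -> Response.
Search = Callable[[str, Optional[str], bool], Response]


def search_tab_with_rewrite(term: str, tab: str, search: Search) -> Tuple[str, List[str]]:
    """Return (term actually shown, results for `tab`), using up to 3 queries."""
    # 1. The original query, only for the tab the user is looking at.
    first = search(term, tab, False)
    if first.results:
        return term, first.results

    # 2. Empty: query across all tabs, letting the backend rewrite if that is empty too.
    across = search(term, None, True)
    if across.rewritten_query is None:
        # Other tabs have matches (or there is no correction): keep the original term.
        return term, []

    # 3. A rewrite happened: query again with the rewritten term, narrowed to the tab.
    final = search(across.rewritten_query, tab, False)

    # 4. Show those results, which might still be empty for this tab.
    return across.rewritten_query, final.results
```

In the worst case that is three sequential requests for one search, and the last step can still return an empty list.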

Thanks for the explanation @matthiasmullie, sounds like we should avoid attempting to add in the rewrite feature at the moment.

Change 663063 merged by jenkins-bot:
[mediawiki/extensions/WikibaseMediaInfo@master] Add "did you mean" feature to Media Search

https://gerrit.wikimedia.org/r/663063

Etonkovidova added a subscriber: Etonkovidova.

Checked in betalabs with the search term teszing: the link in "Did you mean: testing" starts the search for testing.

Screen Shot 2021-02-12 at 5.11.15 PM.png (474×889 px, 53 KB)

To compare with Special:Search: interestingly, only tesz (not teszing) triggers the warning "Showing results for test. No results found for tesz".
Screen Shot 2021-02-12 at 4.31.08 PM.png (290×777 px, 39 KB)

Still waiting for wmf.30 deployment - commons is back to wmf.27.

Checked in commons wmf.31 - works as expected, Did you mean is displayed with the suggested correction for a query.

Moving to Design QA since there are minor issues that were missed during initial testing.

@mwilliams - please review the issues below and let me know whether some additional actions (i.e. filing phab tickets) are needed.
(1) Results are never displayed together with Did you mean, so it's not possible to see a page like the one in your screenshot. For example, the search for Bristlecone Treez produces an empty search page with a Did you mean correction. It looks logical to me, but since the behavior differs from the task's screenshot, your feedback is needed.

Screen Shot 2021-02-17 at 5.58.37 PM.png (480×848 px, 57 KB)

(2) As the above screenshot shows, Did you mean discards the capitalization of the user's input. Capitalizing the first letter of a word does not affect the search results (Bristlecone Trees = bristlecone trees), but a user might feel forced to use lowercase in their search queries.

Also there is an interesting case when Bristlecone TreEs has no suggestion, but Bristlecone TreEz does.

(3) Did you mean is displayed even when there would be no results if a user clicks on the link.
Bristlecone TreEz + Video -> Did you mean: bristlecone trees -> (click on the link "bristlecone trees") -> No results.

I think that this is unavoidable given the way MediaSearch works. The search backend can figure out if there is a similar term with some matches when the original has none, but it's not checking against the exact same parameters that were previously used (media type, filters, etc). Doing that would mean actually submitting an alternate query in advance, which (I assume) is expensive to do.

Thanks, @Matthias and @egardner - yes, it's interesting that some misspellings do not return any results. And re-checking if there are (with filter selections) any results will be expensive. Overall, all is working as expected.