Commons image: when pasting the exact title, get the correct file first in the suggester
Closed, ResolvedPublic

Description

When entering an image from Commons as a value for P18, and copy/pasting the exact name of a file, the correct file should be the only one (or at least the first one) appearing in the list of suggestions. That's not the case for some files when there are other files with a similar name.

Method:

  • I add a new statement with P18
  • I click on the value field and copy/paste the title of the file (without File:)
  • I observe what appears in the suggestions menu

Tested on the sandbox with different files that are all part of a numbered list:

  • Barcelonnette - Villa du Parc du Mercantour -984.jpg -> appears at the end of the list
  • Barcelonnette - Villa du Parc du Mercantour -985.jpg -> appears at the end of the list
  • Barcelonnette - Villa du Parc du Mercantour -991.jpg -> appears at the end of the list
  • Barcelonnette - Villa du Parc du Mercantour -984 -> the file doesn't appear at all in the list
  • Nature La Réunion, janvier 2018 68.jpg -> is the only one in the list
  • King Edward Rd, Douglas, Isle of Man - panoramio (1).jpg -> is the only one in the list

Screenshot from 2018-06-01 16-15-36.png (564×955 px, 135 KB)

Event Timeline

Vvjjkkii renamed this task from Commons image: when pasting the exact title, get the correct file first in the suggester to aubaaaaaaa. Jul 1 2018, 1:06 AM
Vvjjkkii triaged this task as High priority.
Vvjjkkii updated the task description. (Show Details)
Vvjjkkii removed a subscriber: Aklapper.
CommunityTechBot renamed this task from aubaaaaaaa to Commons image: when pasting the exact title, get the correct file first in the suggester. Jul 2 2018, 1:44 AM
CommunityTechBot raised the priority of this task from High to Needs Triage.
CommunityTechBot updated the task description. (Show Details)
CommunityTechBot added a subscriber: Aklapper.

This issue was mentioned again here, with the case that the correct image doesn't show up at all in the suggester's list. Could we look at it again?

Addshore subscribed.

Sounds like this could do with a little bit of tech investigation to figure out what is happening.

Addshore triaged this task as Medium priority. Feb 6 2020, 10:02 AM
Addshore moved this task from Needs Tech Work to Unconnected Stories on the Wikidata-Campsite board.

I figure this is prioritized on the product side, so moving it to the right place ready for pick up at some point.

The Property:P18 input box performs a background search via the Commons API, using the "query" action. That search seems to be rather fuzzy and returns a list of items ordered by relevance, based on incoming links, templates, language, etc. Apparently there is no easy configuration of request parameters (such as srqiprofile, srwhat, srsort, ...) that puts exact title matches at the top of the returned list while keeping the fuzzy search results. However, exact string search is supported, and the user can easily work around the issue by wrapping their search string in double quotes.

On a side note, the search box in the upper right corner of commons.wikimedia.org also performs a background search via the Commons API, using the "opensearch" action. That one only searches titles and correctly returns a single result when given an exact file name. However, searching titles only is probably not desirable for this use case.

Can we use another API endpoint in the frontend instead? Like "opensearch".

I'm not 100% sure how the P‍18 search is configured, but I think I see generally what's going on. The search is using basically the same search as you get on the Commons Special:Search page, possibly with restriction to the File: namespace, and maybe with an additional hack or two.

As such, it is interpreting the input string as search syntax. I searched for dog and then negated one word in the first result in each new result set until I got to this: dog -racing -chien -reflection -stick -unid -natural -drone -oak. Obviously that's unlikely to match any file names.

Thus, any partial file name like Barcelonnette - Villa du Parc du Mercantour -984 is not going to match very well, because the -984 is interpreted as "not 984", which excludes the desired file. I think there is an additional hack so that the full exact file name Barcelonnette - Villa du Parc du Mercantour -984.jpg matches and is tacked onto the end.
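To make the negation problem concrete, here is a minimal sketch of a check the UI could run on the pasted text before searching. It is only a simplified model of the rule described above (a token that starts with - or ! is read as "not this term"), not the actual search parser, and the function name findNegatedTerms is made up for illustration.

```typescript
// Simplified model, not the real parser: a whitespace-separated token that
// starts with "-" or "!" and has at least one more character is treated as
// a negated term. A lone "-" (as in "Barcelonnette - Villa") is ignored.
function findNegatedTerms(input: string): string[] {
  return input
    .split(/\s+/)
    .filter((token) => /^[-!]\S/.test(token))
    .map((token) => token.slice(1));
}

// The trailing "-984" is the only token flagged as a negation.
const flagged = findNegatedTerms("Barcelonnette - Villa du Parc du Mercantour -984");
console.log(flagged);
```

Hyphens inside a token, as in Notre-Dame-de-l'Assomption, are not flagged by this model, which matches the observation that those searches fail for a different reason.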

This issue was mentioned again here, with the case that the correct image doesn't show up at all in the suggester's list. Could we look at it again?

The discussion has since been moved to an archive page here. The search in question is Eglise Notre-Dame-de-l'Assomption.JPG.

Here I think the problem is that there are too many files with almost the exact same file name (and many more with some variation of "eglise notre dame de l'assomption" as part of the file name):

  • File:Eglise Notre Dame de l'Assomption.JPG
  • File:Église Notre-Dame-de-l'Assomption.jpg
  • File:Église Notre Dame de l'Assomption.jpg
  • File:Église Notre-Dame-de-l'assomption.JPG
  • File:Église Notre dame de l'assomption.jpg
  • File:Eglise Notre-Dame de l'Assomption.jpg

After parsing for search, these are all identical.

If you require every word in the title ( intitle:Eglise intitle:Notre intitle:Dame intitle:de intitle:l'Assomption intitle:jpg ) you get over 3500 results. That's more than you'd get if you search for "bobby" in the title (less than 2700)—and I would not be surprised if a search for "bobby" failed to return one specific desired result as the top hit—especially if we had files named Bobby.jpg, bobby.JPG, BOBBY.JPG, BoBbY.JpG, etc.

So, the question is what is the P‍18 search intended to do? Is it supposed to be a general search that can go poorly in unusual circumstances (unintended negation—which I wrote a blog post about a couple of years ago—or unexpectedly ambiguous searches) like the main search on the Special:Search page? Or is it supposed to be a file name/title–matching search like we have in the upper corner on Commons? Or is it trying to be both?

If it is a general search, then it is working more-or-less as intended, and these odd corner cases—particularly negation in the title/file name—are going to perform poorly.

If it is a file name/title–matching search, then it is using the wrong API, and should use the completion suggester API. @dcausse will be back next week (Feb 17) and he'd probably be the best one to ask about doing that the best way possible—maybe including prefixing searches with "File:" behind the scenes, though that may not be needed.

If it's supposed to be both, then the obvious options to me are to either live with the general search as is, or do something much more complicated like interleaving the general search and completion suggester results together. (My hypothesis is that something at least a little like that is already happening since the partly negated Barcelonnette - Villa du Parc du Mercantour -984.jpg search gets an exact match at the bottom of the list—maybe moving it to the top of the list of general search results would be sufficient.)

Brilliant answer, thanks for the insights and also for your lovely blog post.

So, the question is what is the P‍18 search intended to do? Is it supposed to be a general search that can go poorly in unusual circumstances (unintended negation—which I wrote a blog post about a couple of years ago—or unexpectedly ambiguous searches) like the main search on the Special:Search page? Or is it supposed to be a file name/title–matching search like we have in the upper corner on Commons? Or is it trying to be both?

I briefly discussed with @Lea_Lacroix_WMDE and indeed, ideally it should be both.

If it's supposed to be both, then the obvious options to me are to either live with the general search as is, or do something much more complicated like interleaving the general search and completion suggester results together. (My hypothesis is that something at least a little like that is already happening since the partly negated Barcelonnette - Villa du Parc du Mercantour -984.jpg search gets an exact match at the bottom of the list—maybe moving it to the top of the list of general search results would be sufficient.)

This! I agree that some kind of extra algorithm must be putting the exact file match at the bottom of the list; moving it to the top would be perfect. If the search platform team can do that, please let us know how and when it could happen. Alternatively, we would have to expand the P18 search box to do both file name and general search by sending out two different API requests, one to the "query" endpoint and one to the "opensearch" endpoint, and then combining the results on our side. (Option 1 would be much preferred, though.)
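For the second option (two requests, combined client-side), the combining step could look roughly like the sketch below. It assumes both requests have already been reduced to plain lists of page titles; the function name mergeSuggestions and the default limit of 10 are illustrative choices, not existing code.

```typescript
// Hypothetical helper: title matches come first, general search results
// follow, duplicates are dropped, and the list is cut off at `limit`.
function mergeSuggestions(
  titleMatches: string[],
  generalResults: string[],
  limit: number = 10
): string[] {
  const seen = new Set<string>();
  const merged: string[] = [];
  for (const title of titleMatches.concat(generalResults)) {
    if (merged.length >= limit) break;
    if (!seen.has(title)) {
      seen.add(title);
      merged.push(title);
    }
  }
  return merged;
}
```

Putting the title matches first is one possible policy; true interleaving of the two lists would need a ranking decision that this sketch does not attempt.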

Brilliant answer, thanks for the insights and also for your lovely blog post.

Glad to help. I'm always happy to try to figure out what's going on with unexpected search results.

I agree that some kind of extra algorithm must be putting the exact file match at the bottom of the list; moving it to the top would be perfect. If the search platform team can do that, please let us know how and when it could happen.

I don't think it's happening on the search side.

Looking at the Javascript behind the P‍18 search in my browser, I see calls to Special:ItemByTitle (search the code for ItemByTitle, the colon can be url-encoded). I'm not up to speed on modern Javascript and the code is very complex, but I think that might be the call that's fetching the exact title match and appending it to the list. Prepending it might do the trick, but someone much more proficient in Javascript and with better dev tools should take a look.

I don't think it's happening on the search side.

I am quite sure it is. Regardless of Wikidata's Javascript code behind the scenes, it is much more instructive to examine the outgoing requests to the commons API and their corresponding responses. The P‍18 search issues a single request to that API and you can easily reproduce the results in your browser, without any dev tools:

It can also be reproduced via the commons search page itself: Barcelonnette - Villa du Parc du Mercantour -984.jpg vs Barcelonnette - Villa du Parc du Mercantour -986.jpg

This has nothing to do with Wikidata's P‍18 search box or its Javascript code.

I think we both have incorrect models of what's happening! Neat!

I thought the apparent negation was removing the target file and something else on the Javascript side was putting it back in the wrong spot.

You seem to have thought the Search API was putting the exact match specifically in the last spot. (But maybe not.)

Turns out neither is correct. I'm still not 100% sure what's going on with the apparent negation. Without the .jpg at the end, the negation does in fact prevent the target from showing up. With it, it does show up, but not necessarily at the end. In fact, more often than not, the target is the first result!

I think what is happening is that the exact match is being injected into the list on the backend, but not at any particular spot on the list. It gets a score and the result is whatever it is. In the cases of the Barcelonnette - Villa du Parc du Mercantour... searches, they happen to end up 10th out of 10.

Others are first:

Others are neither first nor last:

So, the question now is whether we should try to modify search to push these results higher, overriding the current scoring method in some way, or whether the P‍18 search box should try to find any exact match, and either move it or insert it at the top of the list.

The easiest solution, from my point of view, would be to use the completion suggester API instead of the regular search API for search-as-you-type, since it was designed for that. It does the exact right thing for all of these cases—but it only works for title/name–matching.

Would it be possible to add a toggle to the P‍18 search box UI to allow the user to specify whether they are searching for a title match or doing a general search? (Just brainstorming.. but that might be the best of both worlds.)

Modifying the search results might take a fairly long time to get on our schedule (likely), might slow down search (not sure), and might be subject to failure at a later date or causing weird side effects for some other search (depending on implementation).

I'll put this on the agenda for our team meeting next Monday or Wednesday, and get an update back to you afterward.

We have in fact considered a toggle for the user to choose between title match and general search, but the result of such considerations was: let's have both at the same time, please.

I'll put this on the agenda for our team meeting next Monday or Wednesday, and get an update back to you afterward.

Ok, thanks. We will keep the possible approach of sending out two different requests and combining the results on our list of options. Or maybe we can simply re-order the result list ourselves, to put the exact match on top.

Change 572616 had a related patch set uploaded (by Silvan Heintze; owner: Silvan Heintze):
[data-values/value-view@master] Put exact matches on top of commons search results

https://gerrit.wikimedia.org/r/572616

Added a patch that puts an exact file name match at the top of the P18 suggester list, if there is one among the search results. This way, most of the cases described in the ticket should be solved without an additional API call to a second "file name matcher" endpoint. Note that an exact file name match will still not show up at the top if it is not among the top 10 search results served by the Commons 'query' API action.
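The idea behind the patch can be sketched as follows. This is not the code from the Gerrit change, just an illustration of the reordering step; the names normalizeTitle and promoteExactMatch are made up, and the normalization (dropping a File: prefix, treating underscores as spaces) is an assumption about what counts as "the same title".

```typescript
// Assumed normalization: drop a leading "File:" and treat "_" like a space.
function normalizeTitle(title: string): string {
  return title.replace(/^File:/i, "").replace(/_/g, " ").trim();
}

// Returns a new list with the exact match (if any) moved to the front.
// The comparison is case-sensitive, so "x.jpg" and "x.JPG" stay distinct.
function promoteExactMatch(results: string[], input: string): string[] {
  const wanted = normalizeTitle(input);
  const index = results.findIndex((title) => normalizeTitle(title) === wanted);
  if (index <= 0) {
    return results.slice(); // not found, or already first
  }
  return results
    .slice(index, index + 1)
    .concat(results.slice(0, index), results.slice(index + 1));
}
```

As noted above, this can only help when the wanted file is already among the results returned by the search.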

Change 572616 merged by jenkins-bot:
[data-values/value-view@master] Put exact matches on top of commons search results

https://gerrit.wikimedia.org/r/572616

hoo subscribed.

Moving back to doing, because this needs a submodule update in Wikibase.

Change 572679 had a related patch set uploaded (by Silvan Heintze; owner: Silvan Heintze):
[mediawiki/extensions/Wikibase@master] Update data-values/value-view submodule

https://gerrit.wikimedia.org/r/572679

I believe that, because the file name has many words, the score on the tokenized text fields is very high (since we sum all token scores). The exact match counts as only one word, and despite its high weight it is not enough to make up for the text matches that the target file loses because of the negation.

In general, I suggest using autocomplete APIs (opensearch/prefixsearch) for type-ahead searches: this is faster, and the list of results does not change unexpectedly as you type. What's done in the mobile app is a two-step search: first send a prefixsearch, then a fulltext search if no results are found.

When using the fulltext search (list=search), if the user is not aware that they are using the fulltext engine, the UI should escape the search syntax; otherwise some characters may trigger special syntax (negation in this case).

The proper way to fix this issue is imo to:

  • use a completion API + fulltext search fallback
  • escape the fulltext search syntax from the UI: AND, OR, NOT, ||, &&, -, !, ", :, \?, *

Change 572679 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] Update data-values/value-view submodule

https://gerrit.wikimedia.org/r/572679

So this is currently using action=query and list=search.

For example:

https://commons.wikimedia.org/w/api.php?callback=jQuery34107062500487253558_1582096853620&action=query&list=search&srsearch=Barcelonnette%20-%20Villa%20du%20Parc%20du%20Mercantour%20-985.jpg&srnamespace=6&srlimit=10&format=json&_=1582096853622
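For anyone who wants to reproduce such a request, here is a small sketch that builds the same kind of URL. The parameter names and values (action, list, srsearch, srnamespace=6, srlimit=10, format) are taken from the example above; the helper name buildCommonsSearchUrl and the `exact` flag are hypothetical, the latter applying the double-quote workaround mentioned earlier in this thread. The jQuery callback and cache-buster parameters are left out.

```typescript
// Hypothetical helper: builds a list=search request URL for Commons.
function buildCommonsSearchUrl(
  input: string,
  exact: boolean = false,
  namespace: number = 6,
  limit: number = 10
): string {
  // With `exact`, wrap the text in double quotes (dropping any quotes the
  // user typed) so that it is searched as one exact string.
  const term = exact ? '"' + input.replace(/"/g, "") + '"' : input;
  const params: Array<[string, string]> = [
    ["action", "query"],
    ["list", "search"],
    ["srsearch", term],
    ["srnamespace", String(namespace)],
    ["srlimit", String(limit)],
    ["format", "json"],
  ];
  const query = params
    .map(([key, value]) => key + "=" + encodeURIComponent(value))
    .join("&");
  return "https://commons.wikimedia.org/w/api.php?" + query;
}

console.log(buildCommonsSearchUrl("Barcelonnette - Villa du Parc du Mercantour -985.jpg"));
```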

In general, I suggest using autocomplete APIs (opensearch/prefixsearch) for type-ahead searches: this is faster, and the list of results does not change unexpectedly as you type. What's done in the mobile app is a two-step search: first send a prefixsearch, then a fulltext search if no results are found.

The mobile app seems to make queries such as:

?format=json&formatversion=2&errorformat=plaintext&action=query&redirects=&converttitles=&prop=description|pageimages|info&piprop=thumbnail&pilicense=any&generator=prefixsearch&gpsnamespace=0&list=search&srnamespace=0&inprop=varianttitles&srwhat=text&srinfo=suggestion&srprop=&sroffset=0&srlimit=1&pithumbsize=320&gpssearch=Barcelonnette&gpslimit=20&srsearch=Barcelonnette
?format=json&formatversion=2&errorformat=plaintext&action=query&converttitles=&prop=description|pageimages|pageprops|info&ppprop=mainpage|disambiguation&generator=search&gsrnamespace=0&gsrwhat=text&inprop=varianttitles&gsrinfo=&gsrprop=redirecttitle&piprop=thumbnail&pilicense=any&pithumbsize=320&gsrsearch=Barcelonnette&gsrlimit=20

For our search the examples would be:

https://commons.wikimedia.org/w/api.php?format=json&formatversion=2&errorformat=plaintext&action=query&redirects=&converttitles=&prop=description|pageimages|info&piprop=thumbnail&pilicense=any&generator=prefixsearch&gpsnamespace=6&list=search&srnamespace=6&inprop=varianttitles&srwhat=text&srinfo=suggestion&srprop=&sroffset=0&srlimit=1&gpssearch=Barcelonnette%20-%20Villa%20du%20Parc%20du%20Mercantour%20-985.jpg&gpslimit=20&srsearch=Barcelonnette%20-%20Villa%20du%20Parc%20du%20Mercantour%20-985.jpg

https://commons.wikimedia.org/w/api.php?format=json&formatversion=2&errorformat=plaintext&action=query&converttitles=&prop=description|pageimages|pageprops|info&ppprop=mainpage|disambiguation&generator=search&gsrnamespace=6&gsrwhat=text&inprop=varianttitles&gsrinfo=&gsrprop=redirecttitle&piprop=thumbnail&pilicense=any&pithumbsize=320&gsrsearch=Barcelonnette%20-%20Villa%20du%20Parc%20du%20Mercantour%20-985.jpg&gsrlimit=20

This might lead to a slight change in behaviour, so we should probably bounce any change like this back to product (I'll write a ticket). For now, I figure the current approach will do :)

We talked about this in our team meeting today, and @dcausse opened T245642 after our discussion. We also uncovered the difference between these two queries:

  • Barcelonnette - Villa du Parc du Mercantour -984 (12 results)
  • Barcelonnette - Villa du Parc du Mercantour -984.jpg (13 results, including the desired one)

We have another index of titles and the second query above (with .jpg) is an exact match (after parsing) of the title of the file, so it gets added into the mix. It doesn't score well enough to make it to the top of the list, but at least it is there.

The current boosting of the near_match is 2, which was set a long time ago. Many changes have been made since then. Increasing the boost to 10 should improve the ranking of most title matches. It may not cover every possible case, but it should raise many of these to the #1 spot, and it should raise many others into the top 10 so that the P18 patch above can find them and elevate them the rest of the way. (That patch will still be very helpful in cases like the Eglise Notre Dame de l'Assomption searches above, since all of those files will appear identical to the near_match scoring.)

I'll also copy over some examples from here as test cases for T245642.

Thanks for following up on this. T245589#5896906 also describes how to combine file name prefix search and general search in a single request.

@Ayack The issue should now be fixed: can you try it from your side and let us know if you encounter further issues?