
Rest Search API is not wikidata aware (only accepts queries beginning with Q)
Open, Needs Triage, Public

Description

The existing search API only works with queries containing "Q", and it returns results without the correct display title. For example:
https://wikidata.org/w/rest.php/v1/search/title?q=Q3&limit=10

This means that in future Wikidata will become useless with the rollout of the latest version of Vector, and it will stall further adoption efforts of the Wikimedia wvui library.

Event Timeline

Restricted Application added a subscriber: Aklapper.
Addshore added subscribers: Lydia_Pintscher, Addshore.

This means that in future Wikidata will become useless with the rollout of the latest version of Vector, and it will stall further adoption efforts of the Wikimedia wvui library.

We override the default search box in Wikibase for an entity selector.
It sounds like we will need to make that change in a slightly different way for the new Vector skin before rollout?
Though the behaviour of the search API is probably correct, i.e. the same as it was in the action API?

Yeah, we are moving away from the jQuery autocomplete implementation and will no longer be supporting it inside Vector (and at some point Minerva).

The new version of the search is built with Vue.js and uses a new API recently created by the platform team. While the old jQuery autocomplete does still work, it's not being actively tested and won't benefit from thumbnails and Wikidata descriptions like other wikis. I think Minerva is the more pressing challenge here, as we'd like to eventually port the mobile code to Vue.js and this would potentially block it (unless we're okay with not having JavaScript-powered search results on Wikidata.org).

While the new search UI works, as mentioned it's only searching across titles. Perhaps all that's needed is a hook for Wikidata to change where the search looks.

Tagging platform team who might be able to help provide some advice.

You should be able to try this out on Vector locally with just the following 3 lines of config (no need to install wikidata extensions):

$wgVectorSearchHost = 'www.wikidata.org';
$wgVectorDefaultSkinVersion = '2';
$wgVectorUseWvuiSearch = true;

Code:
https://github.com/wikimedia/mediawiki/blob/master/includes/Rest/Handler/SearchHandler.php

Looks like there are hooks for thumbnail and description.
A hook would likely be needed in https://github.com/wikimedia/mediawiki/blob/master/includes/Rest/Handler/SearchHandler.php#L102

While the new search UI works, as mentioned it's only searching across titles.

We will need to do more than title search for Wikidata / Wikibases.

Perhaps all that's needed is a hook for Wikidata to change where the search looks.

It sounds like that could work.
If we can hook in and then output something like this

Current behaviour is that we only override the search box in the top right, and other search (special:search and apis etc) remain the same.
@Lydia_Pintscher would need to say if it is okay for a larger behaviour change or if we want to continue only changing the search box.

Another problem here is going to be that, for instrumentation reasons, the URL of suggestions is https://www.wikidata.org/w/index.php?search=Universe+%28Q1%29

On Wikipedia such a query would redirect directly to the title, e.g. https://en.wikipedia.org/w/index.php?search=Universe
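
To illustrate where that URL shape comes from, here is a small Python sketch. It is not the wvui UrlGenerator code; the function name and the default base URL are assumptions made for illustration. It simply places a suggestion's display title into the `search` query parameter:

```python
from urllib.parse import urlencode


def suggestion_url(
    display_title: str,
    base: str = "https://www.wikidata.org/w/index.php",
) -> str:
    """Build a search URL for a suggestion (hypothetical helper, not wvui code)."""
    # urlencode escapes spaces as "+" and parentheses as %28 / %29.
    return f"{base}?{urlencode({'search': display_title})}"


print(suggestion_url("Universe (Q1)"))
print(suggestion_url("Universe", "https://en.wikipedia.org/w/index.php"))
```

Because the Wikidata display title carries the entity ID in parentheses, the resulting query string is a search for "Universe (Q1)" rather than for a page title, which is why it would not redirect the way the Wikipedia URL does.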

Code:

https://github.com/wikimedia/wvui/blob/master/src/components/typeahead-suggestion/UrlGenerator.ts#L45

Current behaviour is that we only override the search box in the top right, and other search (special:search and apis etc) remain the same.
@Lydia_Pintscher would need to say if it is okay for a larger behaviour change or if we want to continue only changing the search box.

Could we clarify what would change, and how? Currently it's not clear to me, so I can't assess if we can/should do this and with how much announcement or consultation.

From my 1:1 with @Lydia_Pintscher today

So right now the fact that the searchbox is only searching items is not good.
This makes it hard to find lexemes, project pages etc.
So the current state is not ideal for us.

We don't have any great ideas about ranking these different entity types together, or about making sense of what users would end up seeing (even alongside regular pages).
How do you find out if Tree the item should be ranked higher or lower than Tree the lexeme, or some other project page called Tree?
If we figure out the answers to these questions then we could probably move toward making use of the new search UI for Wikidata / Wikibase and ditching the old thing.
The search team might have some thoughts on this? CC: @MPhamWMF

There is currently quite a big difference in behaviour between what is served up through Special:Search and the entity suggester box for Wikidata.
This can be seen in the screenshot below:

Related would also be:

A quote from that last ticket

Mixing together mainspace items and pages from other namespaces could result in some confusion. For example, if a user is searching for "Template:Support", there would be two items with that label, along with the local template itself titled "Template:Support". If these are to be mixed, there needs to be some way to clearly indicate whether a page is a Wikidata item or not.

@Jdlrobson could you provide a screenshot of the look and feel of the new results view (or somewhere to see it)

TJones renamed this task from "Rest Search API is not wikidata aware (only accepts queries beginning with Q" to "Rest Search API is not wikidata aware (only accepts queries beginning with Q)". Mar 17 2021, 3:56 PM

@Jdlrobson could you provide a screenshot of the look and feel of the new results view (or somewhere to see it)

@Addshore you can test this out on https://wikidata.beta.wmflabs.org/?useskinversion=2

LGoto added a subscriber: LGoto.

Web team would like to discuss the status of this with the associated teams.

@Addshore, I talked with the Search Platform team about ranking results, i.e. the Tree example, and I think the consensus is unfortunately that there's not a single correct answer to this in the general case. Realistically we'd have to dig more into what Wikidata users find valuable and relevant when searching and adapt our relevance ranking from there.

My own hot take on Tree the Item vs Tree the Lexeme (besides sounding like an epic rap battle I want to listen to) is that if we think that one can be very relevant to one search at the exclusion of the other, and vice versa, then this could be motivation to split out the Lexeme graph.

It sounds like the "right" thing to do for now would be to return everything here, starting with items, then properties, then lexemes (as a starting point).
We could probably use our search code that powered wbsearchentities, and just loop through entity types in some order so that we prioritize items etc.
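
As a rough sketch of that looping idea (in Python rather than the PHP that Wikibase uses, and with every name here hypothetical), the backend is passed in as a callable so that nothing is assumed about how the wbsearchentities code is actually invoked:

```python
from typing import Callable, Dict, List

# Assumed priority order, taken from the comment above:
# items first, then properties, then lexemes.
ENTITY_TYPE_ORDER = ["item", "property", "lexeme"]


def search_all_entity_types(
    query: str,
    search_one_type: Callable[[str, str, int], List[Dict]],
    limit: int = 10,
) -> List[Dict]:
    """Query each entity type in priority order until `limit` results are collected."""
    results: List[Dict] = []
    for entity_type in ENTITY_TYPE_ORDER:
        remaining = limit - len(results)
        if remaining <= 0:
            break
        # Guard against a backend that returns more than it was asked for.
        results.extend(search_one_type(query, entity_type, remaining)[:remaining])
    return results
```

With this approach a lexeme can only appear once the items and properties have been exhausted, so it sidesteps the ranking question rather than answering it.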

It looks like we would also need a hook to override the title itself?
This would allow us to add both the label and description of items for example to the result.
Once we figure this bit out I think this would be ready for us to work on as long as @Lydia_Pintscher agrees with just trying this as a first attempt.

I like the look of this bottom option and I wonder if there might also be a way to hook in and add more options / a different option here?
This would allow us, for example, to more easily search for other types of entities for other use cases?

I'm unconvinced that hooking into SearchHandler is the Right Thing. The endpoint is /v1/search/title; making it do anything but title auto-completion would be confusing and would break the contract. I'd also argue that we will still want title auto-completion for some namespaces.

The desired behavior for the search box in Wikidata differs significantly from the expected behavior for vanilla MediaWiki. The behavior could even depend on the namespace the user is currently in, or offer results from different namespaces in sections or side by side. In my mind, it should be a different UI component, backed by a dedicated API. The way we hacked this into the skin in the past was rather nasty, perhaps a better mechanism can be found.

Alternatively, Wikibase can hook into search index generation to change what the "page title" is for items. But the multi-lingual nature of Wikibase labels makes that hard.

In my mind, it should be a different UI component, backed by a dedicated API. The way we hacked this into the skin in the past was rather nasty, perhaps a better mechanism can be found.

Please no more UI components.. that would be a maintenance disaster as Wikidata would need to do this for every skin (we plan to use this same Vue component inside the Minerva skin).

The API used by the existing UI component is configurable so theoretically, Wikidata could have its own API which returns data using the same spec with the right level of abstraction. I think this might be a better approach than rebuilding the UI and all the complexity that would go with it.

Please no more UI components.. that would be a maintenance disaster as Wikidata would need to do this for every skin (we plan to use this same Vue component inside the Minerva skin).

I don't have super strong feelings about that, I just remember that Wikibase search shows a lot more info than what the "normal" search popup shows. The data fields are not the same (URL, label, matched alias, description, matches in different languages potentially using different directionality), and it seems like additional structuring will be needed to accommodate Lexemes etc.

If I recall correctly, the main problem was that the custom search box was hacked into the skin in a horrible way. Perhaps that could be improved.

The API used by the existing UI component is configurable so theoretically, Wikidata could have its own API which returns data using the same spec with the right level of abstraction. I think this might be a better approach than rebuilding the UI and all the complexity that would go with it.

The problem is that Wikibase can't really use the same spec. It needs quite a bit of extra info from the search backend in order to do what it does. At least, that's what I recall from shoehorning this in many years ago.

Including extra info in the result isn't such a big issue. The bigger issue is that it's matching by different criteria, and the thing that is matched is not always the primary label.

The Wikibase folks will know the details better than I do. My concern is from the perspective of the core API: it has a specific contract, and it should not be used for things that do not match that contract. The contract is: "Searches wiki page titles, and returns pages with titles that begin with the provided search terms."

My concern is from the perspective of the core API: it has a specific contract, and it should not be used for things that do not match that contract. The contract is: "Searches wiki page titles, and returns pages with titles that begin with the provided search terms."

I understand and I'm saying that this could be implemented using an abstracted PHP interface which provides a contract for the format in the response, without having any knowledge of the implementation.

The problem is that Wikibase can't really use the same spec

When I say spec, I'm referring to the output API.

For example, https://en.wikipedia.org/w/rest.php/v1/search/title?q=Spongebob%20Squarepants%20&limit=10 responds with:

{
  "pages": [
    {
      "id": 2655089,
      "key": "SpongeBob_SquarePants",
      "title": "SpongeBob SquarePants",
      "excerpt": "SpongeBob SquarePants",
      "description": "American animated television series",
      "thumbnail": {
        "mimetype": "image/svg+xml",
        "size": 25839,
        "width": 200,
        "height": 107,
        "duration": null,
        "url": "//upload.wikimedia.org/wikipedia/en/thumb/2/22/SpongeBob_SquarePants_logo_by_Nickelodeon.svg/200px-SpongeBob_SquarePants_logo_by_Nickelodeon.svg.png"
      }
    }
..
  ]
}

If an API was created that returned data in the same format, the search UI would mostly function.

{
  "pages": [
    {
      "key": "Q935079",
      "title": "SpongeBob SquarePants (Q935079)",
      "excerpt": "main character of the animated television show SpongeBob SquarePants",
      "description": null,
      "thumbnail": {
        "mimetype": "image/svg+xml",
        "size": 25839,
        "width": 200,
        "height": 107,
        "duration": null,
        "url": "//upload.wikimedia.org/wikipedia/en/thumb/2/22/SpongeBob_SquarePants_logo_by_Nickelodeon.svg/200px-SpongeBob_SquarePants_logo_by_Nickelodeon.svg.png"
      }
    }
  ]
}

The implementation can be completely different, living in Wikidata if necessary. Right now, we allow configuration on the host level, but if this is the direction we want to take, we can make the path configurable on our side to support this.

I honestly don't see any other way to get this to work, without disabling the JavaScript search experience altogether and relying on a gadget.

I understand and I'm saying that this could be implemented using an abstracted PHP interface which provides a contract for the format in the response, without having any knowledge of the implementation.

That is possible, but I don't see the point. Why add another layer of indirection in order to make the same endpoint do two different things?

When I say spec, I'm referring to the output API.

The format of the output is one part of the contract. The other part is the relationship between the input and the output, which is defined as "title prefix". To accommodate the Wikibase use case, it would have to be softened to "some kind of match to an identifier of the page" (doesn't have to be the title, but it's not full text either).

I'd rather not weaken the contract of the existing endpoint. I'd prefer a separate endpoint, that has a compatible output format.

If an API was created that returned data in the same format, the search UI would mostly function.

Yes, mostly. The question is whether that's good enough. I recall that we invested quite a bit of work into getting additional information into the search popup.

For example, if I type "تهران" into the search box on wikidata.org, the API responds with entries like this:

{
   "id":"Q643031",
   "title":"Q643031",
   "pageid":605069,
   "repository":"wikidata",
   "url":"//www.wikidata.org/wiki/Q643031",
   "concepturi":"http://www.wikidata.org/entity/Q643031",
   "label":"Tehran County",
   "description":"county in Tehran, Iran",
   "match":{
      "type":"label",
      "language":"ps",
      "text":"\u062a\u0647\u0631\u0627\u0646 \u0648\u0644\u0633\u0648\u0627\u0644\u06cd"
   },
   "aliases":[
      "\u062a\u0647\u0631\u0627\u0646 \u0648\u0644\u0633\u0648\u0627\u0644\u06cd"
   ]
},

Note the "match" and "aliases" keys, and note the rendering of the matched alias in the popup, separate from the disambiguating description, with correct LTR orientation:

Extra info like this can be added to the search/title endpoint, the output is extensible. It could also be returned from a separate endpoint. But the UI would also need to use it, that's why I was suggesting a separate UI component. Anyway, assessing the importance of this is up to the Wikidata folks. I'm more concerned with the contract of the search/title endpoint.
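
To make the trade-off concrete, here is a hedged Python sketch of how an entity-search entry like the one above could be mapped into the shape of a search/title `pages` entry. The function name is hypothetical, and the choices of putting the matched text into `excerpt` and returning a null `thumbnail` are assumptions for illustration, not anything the existing APIs define:

```python
from typing import Any, Dict


def entity_to_page(entity: Dict[str, Any]) -> Dict[str, Any]:
    """Map one entity-search entry to a search/title-style page entry (sketch)."""
    entity_id = entity["id"]
    label = entity.get("label") or entity_id
    match = entity.get("match") or {}
    return {
        "id": entity.get("pageid"),
        "key": entity_id,
        "title": f"{label} ({entity_id})",
        # Assumption: surface the matched text where the title endpoint puts its excerpt.
        "excerpt": match.get("text", label),
        "description": entity.get("description"),
        # Assumption: no thumbnail is available from the entity-search entry.
        "thumbnail": None,
    }


example = {
    "id": "Q643031",
    "pageid": 605069,
    "label": "Tehran County",
    "description": "county in Tehran, Iran",
    "match": {"type": "label", "language": "ps", "text": "\u062a\u0647\u0631\u0627\u0646 \u0648\u0644\u0633\u0648\u0627\u0644\u06cd"},
}
print(entity_to_page(example)["title"])
```

Note what the mapping drops: the match type, the match language (and with it the text direction), and the alias list have no slot in the target format unless the output is extended and the UI learns to use the extra fields.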

The implementation can be completely different, living in Wikidata if necessary. Right now, we allow configuration on the host level, but if this is the direction we want to take, we can make the path configurable on our side to support this.

For the "same UI, different backend" solution, that would work. The big question is whether Wikidata is OK with "same UI", losing the extra features.

Another fun wrinkle to all this:

One long-standing issue with the search box on Commons is that namespace prefixes do not work. You can't type in "User:..." to search user pages. Since the search box always hits entitysearch, it won't find anything. To fix this, there has to be code somewhere that recognizes namespace prefixes, and based on that decides whether to do a title search or an entity search. Doing this on the client side would be more flexible (e.g. could show both results in separate sections).

One long standing issue with the search box on commons is that namespace prefixes do not work. You can't type in "User:..." to search user pages. Since the search box always hits entitysearch, it won't find anything. To fix this, there has to be code somewhere that recognizes namespace prefixes, and based on that decides whether to do a title search or an entity search. Doing this on the client side would be more flexible (e.g. could show both results in separate sections).

That's tracked in T277363.

One long standing issue with the search box on commons is that namespace prefixes do not work. You can't type in "User:..." to search user pages. Since the search box always hits entitysearch, it won't find anything. To fix this, there has to be code somewhere that recognizes namespace prefixes, and based on that decides whether to do a title search or an entity search. Doing this on the client side would be more flexible (e.g. could show both results in separate sections).

That's tracked in T277363.

Not exactly the same issue... or rather, another instance of the same issue. Wikidata has had this problem forever, since searchentities doesn't know about namespaces at all. For core, it's a bug. For Wikibase, it's a conceptual issue, since "User:Foo" can be a user page and also the label of an item, and the search should find both.

I'm bringing it up here because the solution to this ticket should somehow address the question of how title-based search in some namespaces might be combined with label-based search in other namespaces. Both in the UI and in the API.
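
One possible shape for that decision, as a minimal Python sketch: all names are hypothetical, and the namespace list is a made-up subset that a real wiki would supply itself. A recognized prefix adds a title search, while the entity search always runs because a label can itself look like "User:Foo":

```python
from typing import List, Optional, Tuple

# Hypothetical subset of namespace names; a real wiki would provide its own list.
KNOWN_NAMESPACES = {"user", "template", "help", "wikidata"}


def split_namespace(query: str) -> Tuple[Optional[str], str]:
    """Return (namespace, remainder) if the query starts with a known prefix."""
    prefix, sep, rest = query.partition(":")
    if sep and prefix.strip().lower() in KNOWN_NAMESPACES:
        return prefix.strip(), rest.strip()
    return None, query.strip()


def plan_searches(query: str) -> List[str]:
    """Decide which backends to query for a search-box input."""
    namespace, _ = split_namespace(query)
    # Labels can also look like "User:Foo", so the entity search always runs.
    searches = ["entity"]
    if namespace is not None:
        searches.insert(0, "title")
    return searches


print(plan_searches("User:Foo"))
print(plan_searches("Tree"))
```

Whether the two result sets are then merged, shown in separate sections, or ranked against each other is exactly the open UI and API question raised above; the sketch only covers the routing step.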