Page MenuHomePhabricator

pasting full URL into entity selector no longer works
Closed, ResolvedPublic

Description

It used to be possible to past the full URL of an item or property into the entity selector. It no longer gives a result.

Event Timeline

@Smalyshev can you have a look? @thiemowmde had some special handling for this it seems.

thiemowmde moved this task from incoming to ready to go on the Wikidata board.
thiemowmde added a subscriber: hoo.

The code that made this possible was in EntitySearchTermIndex::getEntityIdsMatchingSearchTerm, introduced via T117763. The idea was intentionally trivial:

  • The first step always tried to parse the users input as an entity ID.
  • If this fails, the regex /.*(\b\w{2,})/s tried to grep the last ASCII sequence from the users input, and parse that as an entity ID. This covers full URLs as well as copy-pastes like "(Q42)".
  • All this is cheap.
  • In addition the term table was queried in two steps: exact matches first, then prefix matches.

This code path is not executed any more with the switch to the new EntitySearchElastic implementation, introduced via T125500.

To bad we had no integration test for this feature.

Change 386602 had a related patch set uploaded (by Thiemo Mättig (WMDE); owner: Thiemo Mättig (WMDE)):
[mediawiki/extensions/Wikibase@master] Add integration tests for pasting full URLs into entity selectors

https://gerrit.wikimedia.org/r/386602

We could probably pre-process the input, yes. Though I am not sure we should encourage these things... while something like case-insensitive match is common search functionality, pasting URLs etc. with magic rules seems to be going a bit too far. But if old one supported it, fine. I wish there were some docs though or tests which extra syntaxes we process. So far I've got:

  • URL - should it match only wikidata URL? Would http://google.com/Q42 also look for Q42 (which is kinda weird)?
  • Parens removal - (Q42) is the same as Q42. Should it work for others too, e.g. (P42)? (L42)? (Douglas Adams)? Should (Q42 and Q42)))) and ())Q42() also work?

If this fails, the regex /.*(\b\w{2,})/s tried to grep the last ASCII sequence from the users input, and parse that as an entity ID

That's not how Elastic search currently works, in general, we don't have ID parsing etc. In fact, Elastic index has no idea what "ID" is. We match the title against Elastic index. We could add an extra step of ID-parsing, but this complicates things quite a bunch, since if we don't run the search we don't have the necessary data. We could parse ID and then extract it and run the search on it, but that looks like duplicating the work. Anyway, if we have pre-processing rules, we could probably do it.

I think we can keep it to Wikidata URLs.
Parenthesis removal should also work for the other entity types. I don't think we need it for labels etc.

The url can be in different formats:

Not sure how many variants we have. Not to hard to catch in a regex, might be a bit more clutter when you add these urls to preprocessing.

Yes this is a problem... I wonder though whether we should cover all of these. I can see why one would paste the first one into the selector, but the other ones one would have to take from RDF or specially construct... Is it a reasonable expectation that search would cover all of them?

I believe that even pasting a partial URL like "ata.org/wiki/Q42" should work. And it easily can, as I already tried to show above. Just try to apply two steps:

  1. Try to parse the users input as an entity ID. We do have WikibaseRepo::getEntityIdParser() which should be used for this.
  2. If this fails, fetch the last alphanumeric character sequence from the users input and try again. The regex /.*(\b\w{2,})/s I used for this can even be simplified if you like: /.*(\b\w+)/s.

EntityIdParser throws exception on parse error. Is there any API that allows to check whether something is a valid ID without throwing?

What about utilizing a try-catch? You can wrap it in a private passesEntityIdParsing function if you like.

Change 387025 had a related patch set uploaded (by Smalyshev; owner: Smalyshev):
[mediawiki/extensions/Wikibase@master] Add special case handling for some forms of IDs

https://gerrit.wikimedia.org/r/387025

Change 389698 had a related patch set uploaded (by Thiemo Mättig (WMDE); owner: Thiemo Mättig):
[mediawiki/extensions/Wikibase@master] Restore wgCirrusSearchRescoreFunctionScoreChains after test

https://gerrit.wikimedia.org/r/389698

thiemowmde moved this task from Review to Done on the Wikidata-Sprint-2017-11-07 board.
thiemowmde removed a project: Patch-For-Review.
thiemowmde moved this task from Review to Done on the Wikidata-Former-Sprint-Board board.

Change 387025 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] Add special case handling for some forms of IDs

https://gerrit.wikimedia.org/r/387025

Change 386602 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] Add integration tests for pasting full URLs into entity selectors

https://gerrit.wikimedia.org/r/386602

Change 389698 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] Restore wgCirrusSearchRescoreFunctionScoreChains after test

https://gerrit.wikimedia.org/r/389698