Expose cirrusIncLinkssW for weighting the sort of results returned by the nearcoord: search
Closed, Resolved · Public

Description

The iOS app has been using the cirrusIncLinkssW profile for weighting the search results of the nearcoord: Cirrus search in beta testing.

What are the next steps in making this public so it can be used in the production version of the app?

This is an example of how it's being used:

api.php?action=query&cirrusIncLinkssW =1000&colimit=50&format=json&generator=search&gsrlimit=50&gsrsearch=nearcoord%3A5421km%2C37.133%2C-95.786&pilimit=50&piprop=thumbnail&pithumbsize=240&prop=coordinates%7Cpageimages%7Cpageterms

Event Timeline

Restricted Application added a subscriber: Aklapper. · Feb 10 2017, 10:41 PM

Are you using varied values of weights, or is there a single weight (1000) that seems to work best in general? If a single value works well we can create a new named profile (easy), perhaps named 'popular', and your query would replace cirrusincLinksW=1000 with gsrqiprofile=popular.

Change 337191 had a related patch set uploaded (by EBernhardson):
[WIP] Expose a search profile for popular pages

https://gerrit.wikimedia.org/r/337191

@EBernhardson just a single weight of cirrusincLinksW=1000 was what worked best.

One thing that wasn't clear was what the default "sorting" was AND how that interplayed with the weighting.

@EBernhardson and popular is ok with us!

And thanks for putting the patch up so quickly

The default sorting isn't super obvious, although because your query doesn't query any text features it's a lot simpler than default search. You can get some insight into it with:

https://en.wikipedia.org/w/api.php?action=query&cirrusIncLinkssW%20=1000&colimit=50&format=json&generator=search&gsrlimit=50&gsrsearch=nearcoord%3A5421km%2C37.133%2C-95.786&pilimit=50&piprop=thumbnail&pithumbsize=240&prop=coordinates%7Cpageimages%7Cpageterms&cirrusDumpResult&cirrusExplain=pretty

For your query, because it only uses a filter (nearcoord) and no text, the score was:

(3 * (  popularity^0.8 / ( popularity^0.8 + 8e-6^0.8) ) ) + ( 10 * ( incoming_links ^ 0.7 / ( incoming_links ^ 0.7 + 30^0.7 ) ) )

When providing 1000 as the popularity weight, that equation becomes:

(1000 * (  popularity^0.8 / ( popularity^0.8 + 8e-6^0.8) ) ) + ( 10 * ( incoming_links ^ 0.7 / ( incoming_links ^ 0.7 + 30^0.7 ) ) )
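To make the effect of the weight concrete, here is a minimal Python sketch of the two equations above. The function names (`saturation`, `filter_only_score`) and the sample inputs are illustrative assumptions, not CirrusSearch code; only the constants (3, 10, 8e-6, 0.8, 30, 0.7) come from the formulas in this comment.

```python
def saturation(value: float, k: float, a: float) -> float:
    """Saturating curve value^a / (value^a + k^a); equals 0.5 when value == k."""
    if value <= 0:
        return 0.0
    powered = value ** a
    return powered / (powered + k ** a)


def filter_only_score(popularity: float, incoming_links: float,
                      popularity_weight: float = 3.0,
                      links_weight: float = 10.0) -> float:
    """Score for a filter-only query, following the equations in this comment.

    popularity_weight defaults to 3 (first equation); pass 1000 to get the
    second equation. Names and defaults are assumptions for illustration.
    """
    return (popularity_weight * saturation(popularity, 8e-6, 0.8)
            + links_weight * saturation(incoming_links, 30, 0.7))


# Hypothetical page sitting exactly at both midpoints (popularity 8e-6, 30 links).
default_score = filter_only_score(8e-6, 30)
boosted_score = filter_only_score(8e-6, 30, popularity_weight=1000)
print(default_score, boosted_score)
```

Each saturating term stays below 1, so with the default weights incoming links can contribute up to 10 and popularity up to 3; with a weight of 1000 the popularity term dominates the ordering.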
EBernhardson added a comment. Edited Feb 13 2017, 7:00 PM

I should also note that the sorting will only be completely deterministic if there are fewer than 8k results per shard. A query over a large area, like the example of 5421km against enwiki, has ~300k results split across 7 shards. Because only a filter is run in the first phase of the query, it will take the first 8192 documents it sees per shard and run the popularity rescore against only those. I suspect, though, that as this is a nearby feature it would be relatively rare to bump up against this problem in actual usage.
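As a rough illustration of the arithmetic above, the following Python sketch estimates how many matching documents actually reach the popularity rescore. It assumes hits are spread evenly across shards, which is a simplification; the function names are hypothetical, and only the 8192 window, the ~300k hit count, and the 7 shards come from this comment.

```python
import math

RESCORE_WINDOW = 8192  # documents rescored per shard, as stated above


def rescored_docs(total_hits: int, shards: int, window: int = RESCORE_WINDOW) -> int:
    """Upper bound on documents rescored, assuming an even split across shards."""
    per_shard = math.ceil(total_hits / shards)
    return min(per_shard, window) * shards


def fully_rescored(total_hits: int, shards: int, window: int = RESCORE_WINDOW) -> bool:
    """True when every shard's hits fit inside the rescore window (even split assumed)."""
    return math.ceil(total_hits / shards) <= window


# The 5421km example: ~300k hits over 7 shards exceeds the window on every shard.
print(rescored_docs(300_000, 7), fully_rescored(300_000, 7))
```

Under these assumptions, a query stays fully rescored as long as it matches no more than about 7 × 8192 documents in total, which a typical nearby radius should stay well under.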

@EBernhardson very useful explanation, thank you!

@JoeWalsh ^

Change 337191 merged by jenkins-bot:
Expose a search profile for popular pages

https://gerrit.wikimedia.org/r/337191

The profile should ship with the wmf.12 branch, hitting full-production this Thursday. The profile ended up being named popular_inclinks_pv. It still includes the incoming links (although at a very small weight compared to page views) to offer some level of discrimination between pages with very similar popularities.
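For reference, a minimal Python sketch of how the original request might look once it switches to the named profile. It assumes the profile is selected with `gsrqiprofile`, as suggested earlier in this task, and reuses the other parameters from the example query in the description; the helper name `build_nearby_url` is hypothetical.

```python
from urllib.parse import urlencode

API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def build_nearby_url(lat: float, lon: float, radius_km: int,
                     profile: str = "popular_inclinks_pv") -> str:
    """Build a nearcoord: search URL that uses a named profile instead of a raw weight."""
    params = {
        "action": "query",
        "format": "json",
        "generator": "search",
        "gsrsearch": f"nearcoord:{radius_km}km,{lat},{lon}",
        "gsrqiprofile": profile,  # assumption: replaces the weight parameter
        "gsrlimit": 50,
        "prop": "coordinates|pageimages|pageterms",
        "colimit": 50,
        "piprop": "thumbnail",
        "pithumbsize": 240,
        "pilimit": 50,
    }
    # urlencode percent-encodes ":", "," and "|" like the original example URL.
    return f"{API_ENDPOINT}?{urlencode(params)}"


url = build_nearby_url(37.133, -95.786, 5421)
print(url)
```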

@EBernhardson sounds great… thanks again!

EBernhardson triaged this task as Normal priority.
EBernhardson moved this task from in progress to Done on the Discovery-Search (Current work) board.
debt closed this task as Resolved. May 30 2017, 5:34 PM