
Evaluate scoring of query independent factors: page views, incoming links and size
Closed, Duplicate, Public

Description

As part of our Q3 goal to get page views to affect our search result rankings, we need to evaluate what weight the page views should have and how they should affect scoring.

Event Timeline

Deskana moved this task from Needs triage to On Sprint Board on the Discovery-ARCHIVED board.
Deskana added a subscriber: dcausse.

@dcausse: Erik recommended that you take a look at this one. Let me know if this is enough to go on, or if you need more information.

@Deskana sorry, I completely forgot to follow up on this task. It's already possible to play with pageviews on en-suggesty by appending &cirrusRescoreProfile=pageviews to the search URL, e.g.:

http://en-suggesty.wmflabs.org/w/index.php?title=Special%3ASearch&profile=default&search=kennedy&fulltext=Search&cirrusRescoreProfile=pageviews

It's just a config change...
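
A minimal sketch of how the rescore profile can be switched per request by adding the parameter to the search URL. The helper name `search_url` is hypothetical; the host, the query parameters and the `pageviews` profile name are taken from the link above.

```python
from urllib.parse import urlencode

# Host and path copied from the example link in this comment.
BASE = "http://en-suggesty.wmflabs.org/w/index.php"

def search_url(query, rescore_profile=None):
    """Build a Special:Search URL, optionally selecting a rescore profile."""
    params = {
        "title": "Special:Search",
        "profile": "default",
        "search": query,
        "fulltext": "Search",
    }
    if rescore_profile is not None:
        # Appending this parameter is what selects the alternative rescoring.
        params["cirrusRescoreProfile"] = rescore_profile
    return BASE + "?" + urlencode(params)

print(search_url("kennedy", "pageviews"))
```

Leaving `rescore_profile` unset gives the default ranking, so the two URLs can be compared side by side for the same query.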

@TJones: let's add this task to the list. I'll make a patch to make profile selection easier in runSearch.

Also, I realize that we did all the work to include pageviews in the completion suggester, but I think I forgot to enable it in production :(

Change 273877 had a related patch set uploaded (by DCausse):
[WIP] Allow to override factor in field_value_factor

https://gerrit.wikimedia.org/r/273877

I'm currently facing an issue: I can't really integrate pageviews into the current rescore window.
The fact that we use a product to rescore documents is problematic, and I failed to find a proper way to weight the impact of boostlinks and/or pageviews.
The scoring function is simScore * log(incLinks+2). The problem is that I can't really control the impact of incLinks on the ranking. Because of the product, what affects the rank is the ratio log(page1.incLinks) / log(page2.incLinks) and not the difference log(page1.incLinks) - log(page2.incLinks)...
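
To illustrate why a product leaves no weight to tune, here is a small sketch. The documents, their values and the `scale` knob are hypothetical, and the logarithm base is an assumption (it does not affect the ordering). Multiplying the link term by any positive constant rescales every score by the same factor, so the ranking never changes.

```python
import math

def product_score(sim_score, inc_links, scale=1.0):
    """simScore * log(incLinks + 2), with a hypothetical constant `scale` on the link term."""
    return sim_score * scale * math.log10(inc_links + 2)

def ranking(docs, scale=1.0):
    """Return page names ordered by descending product score."""
    ordered = sorted(docs, key=lambda d: product_score(d[1], d[2], scale), reverse=True)
    return [name for name, _sim, _links in ordered]

# Hypothetical (name, simScore, incLinks) triples, not real index data.
docs = [("page1", 0.9, 50), ("page2", 0.3, 20000), ("page3", 0.6, 800)]

baseline = ranking(docs, scale=1.0)
# Scaling the link term multiplies all scores by the same factor: same order.
assert ranking(docs, scale=0.01) == baseline
assert ranking(docs, scale=100.0) == baseline
print(baseline)
```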

After reading some papers, it appears that we should use a weighted sum instead of a multiplication:
similarity + weight1*component1 + weight2*component2, like we do for the phrase rescore.
This makes the tuning extremely hard, since we have to know the range of the similarity score. And... the current default similarity function we use is too poor to allow proper weight control in the ranking.
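
A minimal sketch of the weighted sum, assuming each component is already normalized to a comparable range. The documents, component values and weights below are hypothetical. Unlike the product, changing the weights here can change which page wins.

```python
def weighted_sum_score(similarity, components, weights):
    """similarity + weight1*component1 + weight2*component2 + ..."""
    return similarity + sum(w * c for w, c in zip(weights, components))

def top_page(docs, weights):
    """Name of the best-scoring document under the given weights."""
    best = max(docs, key=lambda d: weighted_sum_score(d["sim"], d["components"], weights))
    return best["name"]

# Hypothetical documents: components could stand for incLinks and pageviews signals.
docs = [
    {"name": "A", "sim": 2.0, "components": [0.1, 0.2]},  # better text match
    {"name": "B", "sim": 1.5, "components": [0.9, 0.8]},  # more popular
]

print(top_page(docs, [0.2, 0.2]))  # small weights: text match dominates
print(top_page(docs, [0.5, 0.5]))  # larger weights: popularity dominates
```

Whether a given weight is "small" or "large" depends on the range of the similarity score, which is why a stable similarity function matters for tuning.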
To experiment, I've enabled BM25 on suggesty and added a rescore function with incLinks and pageviews (I dropped boosttemplates for now).
I'm not saying that the results are better, but it allows fine-grained control of the ranking, something that was not possible with a product.
By increasing the pop score weight from 0.126 to 0.127 you can make JFK gain one position:
ranked #5 with 0.126 and #4 with 0.127, or raise it to 0.150 for JFK to be ranked #2.
This was not really possible with a product...
The drawback is that we need a really stable similarity score. The default from Lucene was nearly unusable, e.g. the score for the word 'kennedy' was 0.3 on the JFK page and 0.9 on some others... This is why I was curious about BM25, so I've enabled it on suggesty... and it actually worked: weight adjustments seem more reasonable, since you don't have to use insane weights to actually see an impact.
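
As a rough illustration of why the spread of similarity scores drives the weights, the sketch below computes the smallest weight at which a more popular page catches up under similarity + weight * popularity. The 0.9 and 0.3 similarity values echo the example above; the popularity values and the narrower-gap case are hypothetical.

```python
def min_weight_to_overtake(sim_a, pop_a, sim_b, pop_b):
    """Weight w at which B ties A under sim + w * pop, assuming B is more popular.

    Solves sim_a + w * pop_a == sim_b + w * pop_b for w.
    """
    if pop_b <= pop_a:
        raise ValueError("B must be more popular than A")
    return max(0.0, (sim_a - sim_b) / (pop_b - pop_a))

# Wide similarity gap (0.9 vs 0.3): a large weight is needed.
wide = min_weight_to_overtake(sim_a=0.9, pop_a=0.2, sim_b=0.3, pop_b=0.8)

# Narrow similarity gap (hypothetical 0.55 vs 0.5): a much smaller weight is enough.
narrow = min_weight_to_overtake(sim_a=0.55, pop_a=0.2, sim_b=0.5, pop_b=0.8)

print(wide, narrow)
```

The wider the similarity gap between comparable matches, the larger the weight required before popularity has any visible effect on the order.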

I've asked Trey to review my analysis. If we find that we cannot work with a product, then I'm afraid we can't really integrate pageviews and allow fine-grained control of their impact (which I think is not possible today with incLinks: how do I influence the impact of incomingLinks in the function today?).

dcausse renamed this task from Evaluate scoring of page view weights in fulltext search, so that it can be pushed to production to Evaluate scoring of query independent factors: page views, incoming links and size. Mar 25 2016, 10:43 AM
dcausse updated the task description.

Merged all tasks related to query independent factors into this one because the evaluation of all these factors is very similar.

Change 273877 merged by jenkins-bot:
Added various rescore functions

https://gerrit.wikimedia.org/r/273877

Change 280245 had a related patch set uploaded (by DCausse):
CirrusSearch: Add new rescore profiles

https://gerrit.wikimedia.org/r/280245

Change 280245 merged by jenkins-bot:
CirrusSearch: Add new rescore profiles

https://gerrit.wikimedia.org/r/280245

Change 281209 had a related patch set uploaded (by EBernhardson):
CirrusSearch: Add new rescore profiles

https://gerrit.wikimedia.org/r/281209

dcausse changed the task status from Open to Stalled. Apr 25 2016, 8:35 AM
dcausse moved this task from needs triage to search-icebox on the Discovery-Search board.

Change 281209 abandoned by EBernhardson:
CirrusSearch: Add new rescore profiles

Reason:
not necessary anymore

https://gerrit.wikimedia.org/r/281209