
Evaluate using SERP click throughs to build a search feedback loop
Closed, ResolvedPublic

Description

It might be interesting to evaluate what effect there would be for creating a field in the elasticsearch documents to contain queries that have previously resulted in clicks to the page, and then boosting queries that contain those words.

Event Timeline

Restricted Application added a subscriber: Aklapper.

Built a script to aggregate a list of page_ids and query strings from a week (10/10 - 10/16) of data in Hive. This was limited to queries that were issued at least 10 times during that week to keep the data size down, but it might be worthwhile to re-evaluate that limit.
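The aggregation step can be sketched roughly as below. This is a hypothetical reconstruction, not the actual Hive script; the row shape and the `aggregate_click_queries` name are assumptions, and the distinct-IP threshold mirrors the one described later in this task.

```python
from collections import defaultdict

def aggregate_click_queries(click_rows, min_query_count=10):
    """Aggregate (query, page_id, ip) SERP-click rows into page_id -> queries.

    Only queries issued by at least `min_query_count` distinct IPs are kept,
    approximating the threshold used for the week of data described above.
    """
    # First pass: count distinct IPs per query string.
    ips_per_query = defaultdict(set)
    for query, _page_id, ip in click_rows:
        ips_per_query[query].add(ip)
    frequent = {q for q, ips in ips_per_query.items()
                if len(ips) >= min_query_count}

    # Second pass: collect the frequent queries that led to each page.
    queries_per_page = defaultdict(set)
    for query, page_id, _ip in click_rows:
        if query in frequent:
            queries_per_page[page_id].add(query)
    return {page: sorted(qs) for page, qs in queries_per_page.items()}
```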

Using a process roughly equivalent to the current transferToES.py script, this data was then transformed into a set of updates for elasticsearch, adding a queries field to each page containing the queries that have previously led to clicks on that page. No additional weighting was done (although it perhaps should be considered) for a page that was clicked many times versus one that was clicked only once.
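The transform amounts to emitting one partial-document update per page. A minimal sketch, assuming the action/doc pair format of the elasticsearch bulk API (the `_type` value and function name are assumptions, not taken from transferToES.py):

```python
def build_query_field_updates(queries_per_page,
                              index="enwikibm25perfield_content_first"):
    """Yield bulk-API action/doc pairs that set a `queries` field per page.

    queries_per_page: mapping of page_id -> list of query strings that
    previously led to clicks on that page.
    """
    for page_id, queries in queries_per_page.items():
        # Action line: partial update of an existing document.
        yield {"update": {"_index": index, "_type": "page", "_id": page_id}}
        # Source line: merge the new field into the document.
        yield {"doc": {"queries": queries}}
```

Updates for pages that no longer exist (e.g. deleted or turned into redirects) would error out, which may explain the failures mentioned below.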

In all, this worked out to ~167.6k document updates to the enwikibm25perfield_content_first index. A few updates errored out (probably page ids that are redirects or something; I'm not sure, and didn't look too closely).

A mapping for the field was created, roughly copying the existing title field, as:

{
  "properties": {
    "queries": {
      "position_increment_gap": 10,
      "search_analyzer": "text_search",
      "analyzer": "text",
      "fields": {
        "plain": {
          "position_increment_gap": 10,
          "search_analyzer": "plain_search",
          "analyzer": "plain",
          "similarity": "title_plain",
          "type": "string"
        }
      },
      "similarity": "title",
      "type": "string"
    }
  }
}
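At query time, the new field can be weighted relative to the others using elasticsearch's `field^boost` syntax; the ^0.5, ^0.8, and ^2 variants in the results table correspond to different boosts on the queries field. A hypothetical sketch (the other field names and boosts here are illustrative, not the actual CirrusSearch query):

```python
def boosted_query(search_terms, queries_boost=0.5):
    """Build a multi_match query that includes the new `queries` field
    with a configurable boost. Field names other than `queries` are
    placeholders for the real per-field query."""
    return {
        "query": {
            "multi_match": {
                "query": search_terms,
                # `field^boost` is standard elasticsearch syntax.
                "fields": ["title^2", "text", "queries^%s" % queries_boost],
            }
        }
    }
```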

I then ran PaulScore, drawing queries from the same week the data was collected. This is rather unrealistic, but for test purposes it seems the best chance of the approach working. I will run against data from the following days at a later point.

The short of it is that there is a small positive difference, but it's not large. It might be worthwhile, though, to try again with a more carefully thought-out plan for how to analyze the query strings (the mapping), to weight pages that were clicked more times, and to consider a lower threshold (currently a query must be issued by 10 distinct IP addresses) for when a query is included in the set.

I could also write a quick script to see how many of the queries used for the popularity score roughly match a query in the data export that was loaded into elasticsearch.
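That overlap check could be as simple as the sketch below (function name and the lowercasing normalization are assumptions; real matching might want the same analysis chain the index uses):

```python
def query_overlap(popularity_queries, exported_queries):
    """Fraction of popularity-score queries also present in the exported
    query set loaded into elasticsearch, after trivial normalization."""
    popularity = {q.strip().lower() for q in popularity_queries}
    exported = {q.strip().lower() for q in exported_queries}
    if not popularity:
        return 0.0
    return len(popularity & exported) / len(popularity)
```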

| score | bm25_inclinks | bm25_inclinks_w_queries^0.5 | bm25_inclinks_w_queries^0.8 | bm25_inclinks_w_queries^2 |
|---|---|---|---|---|
| PaulScore@0.9 | 0.64 | 0.64 | 0.64 | 0.64 |
| PaulScore@0.7 | 0.56 | 0.57 | 0.57 | 0.57 |
| PaulScore@0.5 | 0.52 | 0.52 | 0.53 | 0.53 |
| PaulScore@0.1 | 0.47 | 0.48 | 0.48 | 0.47 |
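For context, PaulScore@F rewards clicks on highly ranked results, with larger F discounting lower-ranked clicks less steeply. A sketch of my understanding of the metric (hedged; the exact normalization used by the real implementation may differ):

```python
def paul_score(sessions, factor):
    """PaulScore@factor: for each query session, sum factor**position over
    the zero-based positions of clicked results, then average over sessions.

    sessions: list of lists, each inner list holding the result positions
    that were clicked in one session.
    """
    if not sessions:
        return 0.0
    per_session = [sum(factor ** pos for pos in clicks) for clicks in sessions]
    return sum(per_session) / len(per_session)
```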

bm25_inclinks histogram:

 0 (766): ****************************************
 1 (147): *******
 2 ( 68): ***
 3 ( 50): **
 4 ( 26): *
 5 ( 22): *
 6 ( 19): *
 7 ( 15): 
 8 ( 15): 
 9 (  9): 
10 ( 14): 
11 (  9): 
12 (  8): 
13 (  8): 
14 (  9): 
15 (  7): 
16 (  6): 
17 (  3): 
18 (  9):
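Histograms like the one above can be rendered with a few lines of Python; this sketch scales the largest bucket to 40 stars, matching the format used here (the function name is mine, not from any real script):

```python
def ascii_histogram(counts, width=40):
    """Render {bucket: count} as the star histograms shown in this task:
    the largest bucket gets `width` stars, others are scaled down."""
    peak = max(counts.values())
    lines = []
    for bucket in sorted(counts):
        n = counts[bucket]
        stars = "*" * (n * width // peak)
        lines.append("%2d (%3d): %s" % (bucket, n, stars))
    return "\n".join(lines)
```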

bm25_inclinks_w_queries^0.5

 0 (769): ****************************************
 1 (145): *******
 2 ( 70): ***
 3 ( 44): **
 4 ( 30): *
 5 ( 21): *
 6 ( 21): *
 7 ( 16): 
 8 ( 15): 
 9 (  7): 
10 ( 12): 
11 ( 10): 
12 (  8): 
13 ( 12): 
14 (  8): 
15 (  9): 
16 (  5): 
17 (  4): 
18 (  6):

bm25_inclinks_w_queries^0.8

 0 (769): ****************************************
 1 (148): *******
 2 ( 67): ***
 3 ( 44): **
 4 ( 31): *
 5 ( 23): *
 6 ( 20): *
 7 ( 17): 
 8 ( 14): 
 9 (  6): 
10 ( 12): 
11 ( 10): 
12 ( 12): 
13 ( 11): 
14 (  4): 
15 (  9): 
16 (  5): 
17 (  4): 
18 (  7):

bm25_inclinks_w_queries^2

 0 (767): ****************************************
 1 (151): *******
 2 ( 70): ***
 3 ( 42): **
 4 ( 31): *
 5 ( 23): *
 6 ( 20): *
 7 ( 17): 
 8 ( 10): 
 9 ( 10): 
10 ( 14): 
11 ( 14): 
12 (  8): 
13 (  9): 
14 (  4): 
15 (  8): 
16 (  6): 
17 (  5): 
18 (  5):

Based on Erik's notes in T147501#2738599, >75% of queries occur only once. I'm not sure what to do with that.

It would be nice to lower the threshold below 10 to get more data, though. Given that most people click on the top 3 (or maybe top 5) results, having at least that many searches gives people a chance to click on a few different results; but some people don't click, so double that seems reasonable, too... and we're back to 6-10.

For smaller wikis, though, requiring 10 IPs per query is going to make it hard to gather enough data.

And I guess it would be possible to test the effect of lowering the threshold in the lab, but it's kind of expensive, right?

I'm also wondering about how things get indexed. If the queries are randomly adjacent to each other in the queries field, it's possible to get phrase matches on strings that weren't in any original queries, right? That would've been a problem with the allfield, too. Any way to prevent that? Maybe the effect is small and it doesn't matter.
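For what it's worth, the mapping above sets position_increment_gap: 10, which is the standard elasticsearch mechanism for this: tokens of consecutive values in a multi-valued field are separated by a position gap, so an exact phrase (or any phrase with slop below the gap) can't match across two different queries. A rough simulation of how positions get assigned (simplified; the name and exact position arithmetic are illustrative, not Lucene's internals):

```python
def token_positions(queries, gap=10):
    """Approximate token positions for a multi-valued field analyzed with
    position_increment_gap=gap: tokens within one value are adjacent,
    but consecutive values are separated by the gap."""
    positions = {}
    pos = 0
    for query in queries:
        for token in query.split():
            positions.setdefault(token, []).append(pos)
            pos += 1
        pos += gap  # jump between array values so phrases can't span them
    return positions
```

With `["a b", "c"]`, the tokens `b` and `c` end up 11 positions apart, so a phrase query for "b c" (which needs adjacent positions) cannot match.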

And I'm not sure how to accomplish weighting of queries or terms, other than repeating them, which is something we dislike about the allfield.
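The repetition trick mentioned above would look roughly like this; it's shown only to illustrate the trade-off, since it shares the drawback we dislike in the allfield (the log scaling and cap are my own assumptions):

```python
import math

def weighted_queries(query_clicks, max_repeat=5):
    """Weight queries by click count the blunt way: repeat each query in
    the field a (log-scaled, capped) number of times. More repetitions
    raise term frequency and thus the query's influence on scoring."""
    out = []
    for query, clicks in query_clicks.items():
        repeats = min(max_repeat, 1 + int(math.log2(max(clicks, 1))))
        out.extend([query] * repeats)
    return out
```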

Overall, an interesting idea! It does seem that it could work, and I very much like the idea of incorporating user behavior into the scoring—though I worry a bit about Googlebombing-like behavior from users.

debt triaged this task as Low priority.Oct 25 2016, 5:49 PM
debt moved this task from needs triage to Up Next on the Discovery-Search board.
Deskana raised the priority of this task from Low to Medium.Dec 6 2016, 6:44 PM
Deskana subscribed.

Bumping up priority; this is a relatively experimental thing which is worth exploring more since we're making excellent progress on our goals.

Change 326168 had a related patch set uploaded (by EBernhardson):
Lucene Stemmer UDF

https://gerrit.wikimedia.org/r/326168

debt subscribed.

Moving to later for now...

EBernhardson claimed this task.

In general this has been completed with the implementation of mjolnir.