
Run Paulscore with BM25 on zh, ja, th
Closed, Resolved (Public)

Description

Based on the findings in T147008, let's run PaulScore with BM25 on the Chinese, Japanese and Thai wikis to see what we get from it.

Note: a second BM25 A/B test will be tracked here: T147495

Event Timeline

debt triaged this task as Medium priority. Oct 5 2016, 7:25 PM
debt created this task.
debt updated the task description.

Prod indices imported to relforge; currently importing bm25 copies of those.

tl;dr: Both values for jawiki are surprisingly bad. I wonder if there is something going wrong in relforge that mishandles the non-Latin queries. Will need to investigate.

Using:

query = ./sql/extract_query_and_click_logs.all_clicks.yaml
wiki = jawiki
date_start = 20161001000000
date_end = 20161018000000

Also using the default PaulScore factor of 0.7.
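For reference, a minimal sketch in Python of the general PaulScore idea (assuming 0-based click positions; the actual engineScore.py and the EventLogging query further down differ in details such as whether scores are aggregated per result page or per session):

def paulscore(searches, factor=0.7):
    # searches: one list of 0-based clicked result positions per result page
    if not searches:
        return 0.0
    # each click contributes factor**position, so a click on the top result
    # counts 1.0 and lower-ranked clicks count progressively less
    per_page = [sum(factor ** pos for pos in clicks) for clicks in searches]
    return sum(per_page) / len(per_page)

# example: clicks on positions 0 and 2, then position 1, then no click
print(round(paulscore([[0, 2], [1], []]), 2))  # 0.73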

jawiki-bm25:

Loaded 2009 sessions with 2932 clicks and 4311 unique queries
Engine Score: 0.08
Histogram:
 0 (205): *****************************************
 1 ( 50): **********
 2 ( 29): *****
 3 ( 17): ***
 4 ( 10): **
 5 ( 10): **
 6 (  6): *
 7 (  9): *
 8 (  1): 
 9 (  0): 
10 (  5): *
11 (  5): *
12 (  7): *
13 (  2): 
14 (  3): 
15 (  2): 
16 (  1): 
17 (  2): 
18 (  2):

jawiki-tfidf:

Loaded 2009 sessions with 2932 clicks and 4311 unique queries
Engine Score: 0.09
Histogram:
 0 (229): *********************************************
 1 ( 60): ************
 2 ( 23): ****
 3 ( 19): ***
 4 (  6): *
 5 (  8): *
 6 (  4): 
 7 (  6): *
 8 (  2): 
 9 (  3): 
10 (  2): 
11 (  4): 
12 (  4): 
13 (  1): 
14 (  4): 
15 (  0): 
16 (  1): 
17 (  2): 
18 (  1):

> tl;dr: Both values for jawiki are surprisingly bad. I wonder if there is something going wrong in relforge that mishandles the non-Latin queries. Will need to investigate.

Two things come to mind.

First, there are lots of layers in RelForge, so something could get lost in translation. Could this be the residue of non-Japanese queries if Japanese characters are getting mangled?

Second, your end date is today (or midnight last night). Is there any chance that something hasn't replicated up to the minute? If click data was missing, would it cause an error, or be reported as no clicks?

As a quick gut check I ran our historical PaulScore against the same time period, which resulted in:

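-- per-session PaulScore with factor 0.7: the discounted sum over clicks,
-- divided by the number of result pages shown, then averaged over sessions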
SELECT ROUND(SUM(pow_7)/COUNT(1), 2) as pow_7
  FROM ( SELECT SUM(IF(event_action = 'click',
                      POW(0.7, event_position),
                      0)) / SUM(IF(event_action = 'searchResultPage', 1, 0)) as pow_7
           FROM TestSearchSatisfaction2_15700292
          WHERE timestamp BETWEEN '20161001000000' AND '20161018000000'
            AND wiki = 'jawiki' AND event_source = 'fulltext'
            AND event_action IN ('searchResultPage', 'click')
          GROUP BY event_searchSessionId, event_source
       ) x
+-------+
| pow_7 |
+-------+
|  0.44 |
+-------+

So I'm more convinced that something is wrong, but still not sure what exactly. The non-Latin part shouldn't be particularly important for matching up result sets (we use article IDs to decide if a good article is in the result set), but perhaps we are mangling the query somewhere between it leaving the browser as an event log and it showing up at the elasticsearch servers, after flowing through mysql and then the relforge software...

> tl;dr: Both values for jawiki are surprisingly bad. I wonder if there is something going wrong in relforge that mishandles the non-Latin queries. Will need to investigate.

> Two things come to mind.

> First, there are lots of layers in RelForge, so something could get lost in translation. Could this be the residue of non-Japanese queries if Japanese characters are getting mangled?

This is my best guess right now: somewhere they are getting mangled, and since I don't particularly read Japanese it's not obvious. I'm going to spend some time looking carefully at a small sample of queries to see what's happening here.

> Second, your end date is today (or midnight last night). Is there any chance that something hasn't replicated up to the minute? If click data was missing, would it cause an error, or be reported as no clicks?

Shouldn't be a big deal, although it means there might be a couple of partial sessions at the end. I'll try backing it off a day just to make sure, but those should be such a small percentage that they wouldn't have this large an effect.

After a reasonable bit of hacking on engineScore.py to figure out Python unicode handling, I think I have something working now.
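The actual fix isn't shown here, but as an illustration of the kind of Python 2 pitfall involved (my assumption, not the real engineScore.py change): a UTF-8 byte string and the equivalent unicode string never compare equal when non-ASCII characters are involved, so CJK queries read as bytes from one layer would never match the same queries held as unicode in another.

# -*- coding: utf-8 -*-
# Python 2: byte strings with non-ASCII content never equal unicode strings,
# so any query matching keyed on raw bytes silently fails for CJK text.
query_from_mysql = '\xe6\x9d\xb1\xe4\xba\xac'       # UTF-8 bytes for "東京"
query_from_elastic = u'\u6771\u4eac'                # the same text as unicode

print(query_from_mysql == query_from_elastic)                   # False
print(query_from_mysql.decode('utf-8') == query_from_elastic)   # True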

Same as before:

query = ./sql/extract_query_and_click_logs.all_clicks.yaml
wiki = jawiki
date_start = 20161001000000
date_end = 20161018000000

jawiki-bm25:

Loaded 2009 sessions with 2932 clicks and 4302 unique queries
Engine Score: 0.52
Histogram:
 0 (1517): *****************************************
 1 ( 286): *******
 2 ( 162): ****
 3 ( 113): ***
 4 (  82): **
 5 (  56): *
 6 (  56): *
 7 (  53): *
 8 (  31): 
 9 (  42): *
10 (  23): 
11 (  27): 
12 (  21): 
13 (  12): 
14 (  17): 
15 (  16): 
16 (  14): 
17 (  12): 
18 (  14):

jawiki-tfidf:

Loaded 2009 sessions with 2932 clicks and 4302 unique queries
Engine Score: 0.66
Histogram:
 0 (1774): ****************************************
 1 ( 431): *********
 2 ( 217): ****
 3 ( 128): **
 4 (  83): *
 5 (  66): *
 6 (  53): *
 7 (  43): 
 8 (  33): 
 9 (  38): 
10 (  27): 
11 (  29): 
12 (  28): 
13 (  23): 
14 (  19): 
15 (  11): 
16 (  12): 
17 (  24): 
18 (  21):

Those numbers, while disappointing, look legit. If the A/B test verifies this, then we know that BM25 is not going to be great for jawiki, but the good news is that it would verify that David's intuition is excellent and PaulScore is predictive of real world performance.

zhwiki-bm25:

Loaded 1377 sessions with 1692 clicks and 2986 unique queries
Engine Score: 0.53
Histogram:
 0 (1013): ****************************************
 1 ( 265): **********
 2 ( 128): *****
 3 (  79): ***
 4 (  74): **
 5 (  66): **
 6 (  46): *
 7 (  44): *
 8 (  26): *
 9 (  20): 
10 (  21): 
11 (  12): 
12 (  13): 
13 (  17): 
14 (  15): 
15 (   9): 
16 (   5): 
17 (  12): 
18 (   6):

zhwiki-tfidf:

Loaded 1377 sessions with 1692 clicks and 2986 unique queries
Engine Score: 0.66
Histogram:
 0 (1165): ****************************************
 1 ( 327): ***********
 2 ( 151): *****
 3 ( 102): ***
 4 (  54): *
 5 (  60): **
 6 (  42): *
 7 (  28): 
 8 (  22): 
 9 (  18): 
10 (  11): 
11 (  17): 
12 (   8): 
13 (   9): 
14 (   4): 
15 (   7): 
16 (   7): 
17 (   6): 
18 (   5):

Well, that's surprisingly consistent.

Had to change the start date to make this pick up a few more sessions:

query = ./sql/extract_query_and_click_logs.all_clicks.yaml
wiki = thwiki
date_start = 20160601000000
date_end   = 20161018000000

thwiki-bm25:

Engine Score: 0.45
Histogram:
 0 (177): ********************************************
 1 ( 42): **********
 2 ( 24): ******
 3 ( 19): ****
 4 ( 13): ***
 5 (  5): *
 6 (  4): *
 7 (  3): 
 8 (  3): 
 9 (  3): 
10 (  0): 
11 (  2): 
12 (  4): *
13 (  2): 
14 (  2): 
15 (  2): 
16 (  1): 
17 (  3): 
18 (  2):

thwiki-tfidf:

Loaded 292 sessions with 414 clicks and 798 unique queries
Engine Score: 0.61
Histogram:
 0 (231): **********************************************
 1 ( 68): *************
 2 ( 35): *******
 3 ( 20): ****
 4 ( 15): ***
 5 (  6): *
 6 (  3): 
 7 (  6): *
 8 (  7): *
 9 (  6): *
10 (  5): *
11 (  6): *
12 (  0): 
13 (  4): 
14 (  4): 
15 (  4): 
16 (  2): 
17 (  4): 
18 (  0):

For comparison, using English:

enwiki-bm25:

Loaded 10000 sessions with 15040 clicks and 22265 unique queries
Engine Score: 0.54
Histogram:
 0 (8740): ****************************************
 1 (1820): ********
 2 ( 979): ****
 3 ( 674): ***
 4 ( 527): **
 5 ( 424): *
 6 ( 339): *
 7 ( 294): *
 8 ( 276): *
 9 ( 236): *
10 ( 245): *
11 ( 224): *
12 ( 204): 
13 ( 189): 
14 ( 156): 
15 ( 170): 
16 ( 149): 
17 ( 130): 
18 ( 136):

enwiki-tfidf:

Loaded 10000 sessions with 15040 clicks and 22265 unique queries
Engine Score: 0.57
Histogram:
 0 (8129): ****************************************
 1 (2221): **********
 2 (1263): ******
 3 ( 804): ***
 4 ( 567): **
 5 ( 461): **
 6 ( 367): *
 7 ( 327): *
 8 ( 255): *
 9 ( 240): *
10 ( 227): *
11 ( 202): 
12 ( 209): *
13 ( 176): 
14 ( 163): 
15 ( 125): 
16 ( 119): 
17 ( 144): 
18 ( 122):

Summary:

wiki     tfidf   bm25    diff
enwiki   0.57    0.54    -0.03
zhwiki   0.66    0.53    -0.13
thwiki   0.61    0.45    -0.16
jawiki   0.66    0.52    -0.14

The summary seems to be that PaulScore is not a great predictor of the improvement from tfidf -> bm25, likely due to its heavy preference for the current top results. The difference in the size of that change between English (where we know bm25 is better) and these languages is pretty large though, enough that this does seem to indicate these three languages will have a worse A/B test result.

Another possibility is that these languages are currently getting results that are so bad that PaulScore can't catch the improvements, because so many new awesome results are popping up in the top 3. Though probably not, since David's intuition is also that it will get worse. But I guess that's what the A/B test is for.

Since PaulScore didn't really tell us anything (or at least, nothing concrete), because English *also* showed a worse score, I spent some time digging through the literature for some other way we could evaluate this.

I turned up a highly cited paper on using click data as judgements, Accurately Interpreting Clickthrough Data as Implicit Feedback (https://www.cs.cornell.edu/people/tj/publications/joachims_etal_05a.pdf)

Assume the following result set, where * marks a result that was clicked: 1*, 2, 3*, 4, 5*, 6, 7.
The paper's premise is that we can't use this as an indication that results 1, 3, or 5 are relevant to the query, but we can use it as an indication of relative ordering between results. Namely, we can generate the following constraints, where a > b means result a should come before result b:

Click > Skip Above: 3 > 2, 5 > 2, 5 > 4
Click > Skip Next: 1 > 2, 3 > 4, 5 > 6

The author also presented a couple of other options for generating constraints, but I chose to use these two. The paper's author does a few tests to show that these constraints generate orderings that are within a few percentage points of orderings generated by human judges (aka discernatron).
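A small sketch of those two strategies in Python, assuming 1-based result positions as in the example above (my own illustration, not the Relevance Forge code):

def click_skip_above(clicked):
    # for each click, prefer it over every unclicked result ranked above it
    clicked = set(clicked)
    return [(c, s) for c in clicked for s in range(1, c) if s not in clicked]

def click_skip_next(clicked, n_results):
    # for each click, prefer it over the next result if that result was not clicked
    clicked = set(clicked)
    return [(c, c + 1) for c in clicked if c + 1 <= n_results and c + 1 not in clicked]

clicks, n = [1, 3, 5], 7
print(sorted(click_skip_above(clicks)))     # [(3, 2), (5, 2), (5, 4)]
print(sorted(click_skip_next(clicks, n)))   # [(1, 2), (3, 4), (5, 6)]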

I implemented this in Relevance Forge, along with some data collection in hadoop to pair together result sets from CirrusSearchRequestSet with clicks in the webrequest data. For extra fun I also built bootstrapped confidence intervals for these. (Side note / @todo: bearloga suggested a simpler, hopefully faster, method of calculating confidence intervals here, since the bootstraps take a while: "I'd maybe check out the median of the %'s and see what distribution that gives. Looking at the picture you shared, it actually looks pretty normal to me, which makes sense because averages converge to normal distributions by the central limit theorem. So you can actually avoid the bootstrapping and just use Normal(sample mean, sample sd). But a median's distribution cannot be figured out mathematically, so simulation via bootstrapping is needed there.")
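As a rough sketch of the two approaches side by side (a percentile bootstrap, and one reading of the Normal(sample mean, sample sd) suggestion applied to the mean of the per-query scores; illustrative only, not the relforge code):

import random
from math import sqrt

def mean(xs):
    return sum(xs) / float(len(xs))

def stdev(xs):
    m = mean(xs)
    return sqrt(sum((x - m) ** 2 for x in xs) / (len(xs) - 1))

def bootstrap_ci(scores, n_boot=2000, alpha=0.05):
    # percentile bootstrap: resample with replacement and keep the middle 95%
    # of the resampled means (slow, but makes no distributional assumption)
    means = sorted(mean([random.choice(scores) for _ in scores])
                   for _ in range(n_boot))
    return means[int(alpha / 2 * n_boot)], means[int((1 - alpha / 2) * n_boot) - 1]

def normal_ci(scores, z=1.96):
    # normal approximation for the mean, per the central limit theorem argument
    se = stdev(scores) / sqrt(len(scores))
    return mean(scores) - z * se, mean(scores) + z * se

# fake per-query scores, just to exercise both functions
scores = [random.betavariate(7, 3) for _ in range(1600)]
print(bootstrap_ci(scores))
print(normal_ci(scores))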

These aren't run 100% the same though, due to variance in how much data I was able to source (over a one-week period) for each wiki. enwiki was run with all clicks for 2k queries that were each issued by > 10 distinct IP addresses. zhwiki and jawiki were run with 1.6k queries issued by > 4 distinct IP addresses. thwiki was run with 1.4k queries, which was all the data I had (> 0 distinct IP addresses per query).

The score here is the average % of constraints satisfied per query. I don't think the numbers are particularly comparable between wikis, or even between different sets of input data on the same wiki, but as long as the same input data is run against two or more engines those should be comparable (as in tfidf vs bm25 on the same wiki from the same click data).
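In sketch form, with hypothetical helper names (not the actual implementation): look up each constraint's pages in the ranking produced by the engine under test, count the constraint as satisfied when a is ranked above b, and average the per-query fractions.

def satisfied_fraction(constraints, ranking):
    # constraints: (a, b) page id pairs meaning "a should rank above b"
    # ranking: page ids in the order the engine under test returned them
    pos = dict((page_id, i) for i, page_id in enumerate(ranking))
    if not constraints:
        return 0.0
    satisfied = 0
    for a, b in constraints:
        # pages missing from the result set count as a failed constraint
        # (see the note about this further down the thread)
        if a in pos and b in pos and pos[a] < pos[b]:
            satisfied += 1
    return satisfied / float(len(constraints))

def engine_score(per_query):
    # per_query: (constraints, ranking) pairs, one per query
    fractions = [satisfied_fraction(c, r) for c, r in per_query]
    return sum(fractions) / float(len(fractions))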

wiki     tfidf    95% ci                       bm25     95% ci
enwiki   0.72     0.710582 <= x <= 0.735702    0.74     0.728469 <= x <= 0.754066
zhwiki   0.7051   0.691954 <= x <= 0.718125    0.6492   0.634132 <= x <= 0.665284
jawiki   0.6880   0.674493 <= x <= 0.701109    0.6865   0.671432 <= x <= 0.701423
thwiki   0.7292   0.714575 <= x <= 0.743772    0.6164   0.597740 <= x <= 0.634675

Here we see that bm25 looks slightly better than tfidf on enwiki, although the confidence intervals slightly overlap. Basically bm25 is either the same or maybe slightly better than tfidf according to this metric.

For zhwiki, it goes the other way: tfidf clearly does better in this metric, and it is a significant result. Same for thwiki. Somehow, though, jawiki shows almost exactly the same scores for tfidf and bm25. I'm not sure if that's a real effect, or if there was something wrong when I ran the test. Will try re-running just this part on Monday after double checking all the related settings on relforge-search.eqiad.wmflabs.

For kicks I re-ran enwiki with 10k queries to see if it would narrow down the confidence intervals:

algo    score    95% ci
tfidf   0.7387   0.731428 <= x <= 0.745901
bm25    0.7480   0.740459 <= x <= 0.755545

The confidence intervals did narrow, from ±0.015 to ±0.0075. They still overlap though :S

Nice work, @EBernhardson!

I do wonder how fat the long tail is with the constraint of >n IPs issuing the same query. OTOH, if we can better satisfy the more common queries we're probably doing better overall, considering how the tail is not only long, but weird.

The long tail is *very* long. There are 971k distinct queries (against enwiki). There are 20k distinct queries issued by > 10 IP addresses. A more complete table of the number of distinct queries per group:

num_searches   count(distinct query)
10             2666
 9             3387
 8             4318
 7             5849
 6             8089
 5             12286
 4             20032
 3             38926
 2             103241
 1             752363

Thanks @EBernhardson, this is not exactly what I expected. I was expecting very bad results for ja/zh but almost unchanged results for th.

I'm really surprised by ja: it's the only one to use a custom CJK tokenizer for the text field, but it still uses the aggressive "standard tokenizer", which will tokenize on every ideogram. I'll run relcomp on a 1k sample; I expect the ZRR to be strangely low with the perfield builder. If not, I must be overlooking something...

Concerning th it's different: we have a custom Thai analyzer that we hope performs relatively well for the text field, and the standard analyzer does not split anything on Thai, thus not affecting precision.

I was also surprised by the Thai results, since there's a custom analyzer, which should deal with the spacelessness better than the default analyzer.

One possibility for the lower scores is that currently clicked results are being moved out of the result sets. In the case of a > b, if neither a nor b is in the result set my implementation considers that a failed constraint. Not sure how common that is though...

A note for future work: I've noticed that http://web.stanford.edu/class/cs276/handouts/lecture8-evaluation_2014-one-per-page.pdf suggests using a method involving Kendall's tau to evaluate pairwise preferences, vs. what I did here, which is to compare the % of pairwise preferences that were satisfied.
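For example (an assumption about how it might be applied, using scipy; not something implemented here): turn the click preferences into a reference ordering over the clicked-and-skipped documents and correlate it with the engine's ordering.

from scipy.stats import kendalltau

# hypothetical ranks for the same five documents: once as implied by the
# click-derived preferences, once as returned by the engine under test
preference_ranks = [1, 2, 3, 4, 5]
engine_ranks = [2, 1, 3, 5, 4]

tau, p_value = kendalltau(preference_ranks, engine_ranks)
print(tau, p_value)  # tau close to 1.0 means the two orderings largely agree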