
Run Paulscore with BM25 on zh, ja, th
Closed, Resolved (Public)

Description

Based on the findings in T147008, let's run PaulScore with BM25 on the Chinese, Japanese and Thai wikis to see what we get from it.

Note: a second BM25 A/B test will be tracked here: T147495

Event Timeline

debt triaged this task as Medium priority. Oct 5 2016, 7:25 PM
debt created this task.
debt updated the task description.

Prod indices imported to relforge; currently importing bm25 copies of those.

tl;dr: Both values for jawiki are surprisingly bad. I wonder if there is something going wrong in relforge that mishandles the non-Latin queries. Will need to investigate.

Using:

query = ./sql/extract_query_and_click_logs.all_clicks.yaml
wiki = jawiki
date_start = 20161001000000
date_end = 20161018000000

Also using the default PaulScore factor of 0.7.
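For reference, a minimal sketch in Python of the general PaulScore idea (assuming 0-based click positions; the actual engineScore.py and the EventLogging query further down differ in details such as whether scores are aggregated per result page or per session):

def paulscore(searches, factor=0.7):
    # searches: one list of 0-based clicked result positions per result page
    if not searches:
        return 0.0
    # each click contributes factor**position, so a click on the top result
    # counts 1.0 and lower-ranked clicks count progressively less
    per_page = [sum(factor ** pos for pos in clicks) for clicks in searches]
    return sum(per_page) / len(per_page)

# example: clicks on positions 0 and 2, then position 1, then no click
print(round(paulscore([[0, 2], [1], []]), 2))  # 0.73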

jawiki-bm25:

Loaded 2009 sessions with 2932 clicks and 4311 unique queries
Engine Score: 0.08
Histogram:
 0 (205): *****************************************
 1 ( 50): **********
 2 ( 29): *****
 3 ( 17): ***
 4 ( 10): **
 5 ( 10): **
 6 (  6): *
 7 (  9): *
 8 (  1): 
 9 (  0): 
10 (  5): *
11 (  5): *
12 (  7): *
13 (  2): 
14 (  3): 
15 (  2): 
16 (  1): 
17 (  2): 
18 (  2):

jawiki-tfidf:

Loaded 2009 sessions with 2932 clicks and 4311 unique queries
Engine Score: 0.09
Histogram:
 0 (229): *********************************************
 1 ( 60): ************
 2 ( 23): ****
 3 ( 19): ***
 4 (  6): *
 5 (  8): *
 6 (  4): 
 7 (  6): *
 8 (  2): 
 9 (  3): 
10 (  2): 
11 (  4): 
12 (  4): 
13 (  1): 
14 (  4): 
15 (  0): 
16 (  1): 
17 (  2): 
18 (  1):

> tl;dr: Both values for jawiki are surprisingly bad. I wonder if there is something going wrong in relforge that mishandles the non-Latin queries. Will need to investigate.

Two things come to mind.

First, there are lots of layers in RelForge, so something could get lost in translation. Could this be the residue of non-Japanese queries if Japanese characters are getting mangled?

Second, your end date is today (or midnight last night). Is there any chance that something hasn't replicated up to the minute? If click data was missing, would it cause an error, or be reported as no clicks?

As a quick gut check I ran our historical PaulScore against the same time period, which resulted in:

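-- per-session PaulScore with factor 0.7: the discounted sum over clicks,
-- divided by the number of result pages shown, then averaged over sessions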
SELECT ROUND(SUM(pow_7)/COUNT(1), 2) as pow_7
  FROM ( SELECT SUM(IF(event_action = 'click',
                      POW(0.7, event_position),
                      0)) / SUM(IF(event_action = 'searchResultPage', 1, 0)) as pow_7
           FROM TestSearchSatisfaction2_15700292
          WHERE timestamp BETWEEN '20161001000000' AND '20161018000000'
            AND wiki = 'jawiki' AND event_source = 'fulltext'
            AND event_action IN ('searchResultPage', 'click')
          GROUP BY event_searchSessionId, event_source
       ) x
+-------+
| pow_7 |
+-------+
|  0.44 |
+-------+

So I'm more convinced that something is wrong, but still not sure what exactly. The non-Latin part shouldn't be particularly important for matching up result sets (we use article IDs to decide if a good article is in the result set), but perhaps we are mangling the query somewhere between it leaving the browser as an event log and it showing up at the elasticsearch servers, after flowing through mysql and then the relforge software...

> tl;dr: Both values for jawiki are surprisingly bad. I wonder if there is something going wrong in relforge that mishandles the non-Latin queries. Will need to investigate.

> Two things come to mind.

> First, there are lots of layers in RelForge, so something could get lost in translation. Could this be the residue of non-Japanese queries if Japanese characters are getting mangled?

This is my best guess right now: somewhere they are getting mangled, and since I don't particularly read Japanese it's not obvious. I'm going to spend some time looking carefully at a small sample of queries to see what's happening here.

> Second, your end date is today (or midnight last night). Is there any chance that something hasn't replicated up to the minute? If click data was missing, would it cause an error, or be reported as no clicks?

Shouldn't be a big deal, although it means there might be a couple of partial sessions at the end. I'll try backing it off a day just to make sure, but those should be such a small percentage that they wouldn't have this large an effect.

After a reasonable bit of hacking on engineScore.py to figure out Python unicode handling, I think I have something working now.
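The actual fix isn't shown here, but as an illustration of the kind of Python 2 pitfall involved (my assumption, not the real engineScore.py change): a UTF-8 byte string and the equivalent unicode string never compare equal when non-ASCII characters are involved, so CJK queries read as bytes from one layer would never match the same queries held as unicode in another.

# -*- coding: utf-8 -*-
# Python 2: byte strings with non-ASCII content never equal unicode strings,
# so any query matching keyed on raw bytes silently fails for CJK text.
query_from_mysql = '\xe6\x9d\xb1\xe4\xba\xac'       # UTF-8 bytes for "東京"
query_from_elastic = u'\u6771\u4eac'                # the same text as unicode

print(query_from_mysql == query_from_elastic)                   # False
print(query_from_mysql.decode('utf-8') == query_from_elastic)   # True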

Same as before:

query = ./sql/extract_query_and_click_logs.all_clicks.yaml
wiki = jawiki
date_start = 20161001000000
date_end = 20161018000000

jawiki-bm25:

Loaded 2009 sessions with 2932 clicks and 4302 unique queries
Engine Score: 0.52
Histogram:
 0 (1517): *****************************************
 1 ( 286): *******
 2 ( 162): ****
 3 ( 113): ***
 4 (  82): **
 5 (  56): *
 6 (  56): *
 7 (  53): *
 8 (  31): 
 9 (  42): *
10 (  23): 
11 (  27): 
12 (  21): 
13 (  12): 
14 (  17): 
15 (  16): 
16 (  14): 
17 (  12): 
18 (  14):

jawiki-tfidf:

Loaded 2009 sessions with 2932 clicks and 4302 unique queries
Engine Score: 0.66
Histogram:
 0 (1774): ****************************************
 1 ( 431): *********
 2 ( 217): ****
 3 ( 128): **
 4 (  83): *
 5 (  66): *
 6 (  53): *
 7 (  43): 
 8 (  33): 
 9 (  38): 
10 (  27): 
11 (  29): 
12 (  28): 
13 (  23): 
14 (  19): 
15 (  11): 
16 (  12): 
17 (  24): 
18 (  21):

Those numbers, while disappointing, look legit. If the A/B test verifies this, then we know that BM25 is not going to be great for jawiki, but the good news is that it would verify that David's intuition is excellent and PaulScore is predictive of real world performance.

zhwiki-bm25:

Loaded 1377 sessions with 1692 clicks and 2986 unique queries
Engine Score: 0.53
Histogram:
 0 (1013): ****************************************
 1 ( 265): **********
 2 ( 128): *****
 3 (  79): ***
 4 (  74): **
 5 (  66): **
 6 (  46): *
 7 (  44): *
 8 (  26): *
 9 (  20): 
10 (  21): 
11 (  12): 
12 (  13): 
13 (  17): 
14 (  15): 
15 (   9): 
16 (   5): 
17 (  12): 
18 (   6):

zhwiki-tfidf:

Loaded 1377 sessions with 1692 clicks and 2986 unique queries
Engine Score: 0.66
Histogram:
 0 (1165): ****************************************
 1 ( 327): ***********
 2 ( 151): *****
 3 ( 102): ***
 4 (  54): *
 5 (  60): **
 6 (  42): *
 7 (  28): 
 8 (  22): 
 9 (  18): 
10 (  11): 
11 (  17): 
12 (   8): 
13 (   9): 
14 (   4): 
15 (   7): 
16 (   7): 
17 (   6): 
18 (   5):

Well, that's surprisingly consistent.

Had to change the start date to make this pick up a few more sessions:

query = ./sql/extract_query_and_click_logs.all_clicks.yaml
wiki = thwiki
date_start = 20160601000000
date_end   = 20161018000000

thwiki-bm25:

Engine Score: 0.45
Histogram:
 0 (177): ********************************************
 1 ( 42): **********
 2 ( 24): ******
 3 ( 19): ****
 4 ( 13): ***
 5 (  5): *
 6 (  4): *
 7 (  3): 
 8 (  3): 
 9 (  3): 
10 (  0): 
11 (  2): 
12 (  4): *
13 (  2): 
14 (  2): 
15 (  2): 
16 (  1): 
17 (  3): 
18 (  2):

thwiki-tfidf:

Loaded 292 sessions with 414 clicks and 798 unique queries
Engine Score: 0.61
Histogram:
 0 (231): **********************************************
 1 ( 68): *************
 2 ( 35): *******
 3 ( 20): ****
 4 ( 15): ***
 5 (  6): *
 6 (  3): 
 7 (  6): *
 8 (  7): *
 9 (  6): *
10 (  5): *
11 (  6): *
12 (  0): 
13 (  4): 
14 (  4): 
15 (  4): 
16 (  2): 
17 (  4): 
18 (  0):

For comparison, using English:

enwiki-bm25:

Loaded 10000 sessions with 15040 clicks and 22265 unique queries
Engine Score: 0.54
Histogram:
 0 (8740): ****************************************
 1 (1820): ********
 2 ( 979): ****
 3 ( 674): ***
 4 ( 527): **
 5 ( 424): *
 6 ( 339): *
 7 ( 294): *
 8 ( 276): *
 9 ( 236): *
10 ( 245): *
11 ( 224): *
12 ( 204): 
13 ( 189): 
14 ( 156): 
15 ( 170): 
16 ( 149): 
17 ( 130): 
18 ( 136):

enwiki-tfidf:

Loaded 10000 sessions with 15040 clicks and 22265 unique queries
Engine Score: 0.57
Histogram:
 0 (8129): ****************************************
 1 (2221): **********
 2 (1263): ******
 3 ( 804): ***
 4 ( 567): **
 5 ( 461): **
 6 ( 367): *
 7 ( 327): *
 8 ( 255): *
 9 ( 240): *
10 ( 227): *
11 ( 202): 
12 ( 209): *
13 ( 176): 
14 ( 163): 
15 ( 125): 
16 ( 119): 
17 ( 144): 
18 ( 122):

Summary:

wiki     tfidf   bm25    diff
enwiki   0.57    0.54    -0.03
zhwiki   0.66    0.53    -0.13
thwiki   0.61    0.45    -0.16
jawiki   0.66    0.52    -0.14

The summary seems to be that PaulScore is not a great predictor of the improvement from tfidf -> bm25, likely due to its heavy preference for the current top results. The difference in the size of that change between English (where we know bm25 is better) and these languages is pretty large though, enough that this does seem to indicate these three languages will have a worse A/B test result.

Another possibility is that these languages are currently getting results that are so bad that PaulScore can't catch the improvements, because so many new awesome results are popping up in the top 3. Though probably not, since David's intuition is also that it will get worse. But I guess that's what the A/B test is for.

Since PaulScore didn't really tell us anything (or at least, nothing concrete), because English *also* showed a worse score, I spent some time digging through the literature for some other way we could evaluate this.

I turned up a highly cited paper on using click data as judgements, Accurately Interpreting Clickthrough Data as Implicit Feedback (https://www.cs.cornell.edu/people/tj/publications/joachims_etal_05a.pdf)

Assume the following result set, where * marks a result that was clicked: 1*, 2, 3*, 4, 5*, 6, 7.
The paper's premise is that we can't use this as an indication that results 1, 3, or 5 are relevant to the query, but we can use it as an indication of relative ordering between results. Namely, we can generate the following constraints, where a > b means result a should come before result b:

Click > Skip Above: 3 > 2, 5 > 2, 5 > 4
Click > Skip Next: 1 > 2, 3 > 4, 5 > 6

The author also presented a couple of other options for generating constraints, but I chose to use these two. The paper's author does a few tests to show that these constraints generate orderings that are within a few percentage points of orderings generated by human judges (aka discernatron).
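A small sketch of those two strategies in Python, assuming 1-based result positions as in the example above (my own illustration, not the Relevance Forge code):

def click_skip_above(clicked):
    # for each click, prefer it over every unclicked result ranked above it
    clicked = set(clicked)
    return [(c, s) for c in clicked for s in range(1, c) if s not in clicked]

def click_skip_next(clicked, n_results):
    # for each click, prefer it over the next result if that result was not clicked
    clicked = set(clicked)
    return [(c, c + 1) for c in clicked if c + 1 <= n_results and c + 1 not in clicked]

clicks, n = [1, 3, 5], 7
print(sorted(click_skip_above(clicks)))     # [(3, 2), (5, 2), (5, 4)]
print(sorted(click_skip_next(clicks, n)))   # [(1, 2), (3, 4), (5, 6)]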

I implemented this in Relevance Forge, along with some data collection in hadoop to pair together result sets from CirrusSearchRequestSet with clicks in the webrequest data. For extra fun I also built bootstrapped confidence intervals for these. (Side note / @todo: bearloga suggested a simpler, hopefully faster, method of calculating confidence intervals here, since the bootstraps take a while: "I'd maybe check out the median of the %'s and see what distribution that gives. Looking at the picture you shared, it actually looks pretty normal to me, which makes sense because averages converge to normal distributions by the central limit theorem. So you can actually avoid the bootstrapping and just use Normal(sample mean, sample sd). But a median's distribution cannot be figured out mathematically, so simulation via bootstrapping is needed there.")
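As a rough sketch of the two approaches side by side (a percentile bootstrap, and one reading of the Normal(sample mean, sample sd) suggestion applied to the mean of the per-query scores; illustrative only, not the relforge code):

import random
from math import sqrt

def mean(xs):
    return sum(xs) / float(len(xs))

def stdev(xs):
    m = mean(xs)
    return sqrt(sum((x - m) ** 2 for x in xs) / (len(xs) - 1))

def bootstrap_ci(scores, n_boot=2000, alpha=0.05):
    # percentile bootstrap: resample with replacement and keep the middle 95%
    # of the resampled means (slow, but makes no distributional assumption)
    means = sorted(mean([random.choice(scores) for _ in scores])
                   for _ in range(n_boot))
    return means[int(alpha / 2 * n_boot)], means[int((1 - alpha / 2) * n_boot) - 1]

def normal_ci(scores, z=1.96):
    # normal approximation for the mean, per the central limit theorem argument
    se = stdev(scores) / sqrt(len(scores))
    return mean(scores) - z * se, mean(scores) + z * se

# fake per-query scores, just to exercise both functions
scores = [random.betavariate(7, 3) for _ in range(1600)]
print(bootstrap_ci(scores))
print(normal_ci(scores))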

These aren't run 100% the same though, due to variance in how much data I was able to source (over a one-week period) for each wiki. enwiki was run with all clicks for 2k queries that were each issued by > 10 distinct IP addresses. zhwiki and jawiki were run with 1.6k queries issued by > 4 distinct IP addresses. thwiki was run with 1.4k queries, which was all the data I had (> 0 distinct IP addresses per query).

The score here is the average % of constraints satisfied per query. I don't think the numbers are particularly comparable between wikis, or even between different sets of input data on the same wiki, but as long as the same input data is run against two or more engines those should be comparable (as in tfidf vs bm25 on the same wiki from the same click data).
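In sketch form, with hypothetical helper names (not the actual implementation): look up each constraint's pages in the ranking produced by the engine under test, count the constraint as satisfied when a is ranked above b, and average the per-query fractions.

def satisfied_fraction(constraints, ranking):
    # constraints: (a, b) page id pairs meaning "a should rank above b"
    # ranking: page ids in the order the engine under test returned them
    pos = dict((page_id, i) for i, page_id in enumerate(ranking))
    if not constraints:
        return 0.0
    satisfied = 0
    for a, b in constraints:
        # pages missing from the result set count as a failed constraint
        # (see the note about this further down the thread)
        if a in pos and b in pos and pos[a] < pos[b]:
            satisfied += 1
    return satisfied / float(len(constraints))

def engine_score(per_query):
    # per_query: (constraints, ranking) pairs, one per query
    fractions = [satisfied_fraction(c, r) for c, r in per_query]
    return sum(fractions) / float(len(fractions))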

wiki     tfidf    95% ci                       bm25     95% ci
enwiki   0.72     0.710582 <= x <= 0.735702    0.74     0.728469 <= x <= 0.754066
zhwiki   0.7051   0.691954 <= x <= 0.718125    0.6492   0.634132 <= x <= 0.665284
jawiki   0.6880   0.674493 <= x <= 0.701109    0.6865   0.671432 <= x <= 0.701423
thwiki   0.7292   0.714575 <= x <= 0.743772    0.6164   0.597740 <= x <= 0.634675

Here we see that bm25 looks slightly better than tfidf on enwiki, although the confidence intervals slightly overlap. Basically bm25 is either the same or maybe slightly better than tfidf according to this metric.

For zhwiki, it goes the other way: tfidf clearly does better in this metric, and it is a significant result. Same for thwiki. Somehow, though, jawiki shows almost exactly the same scores for tfidf and bm25. I'm not sure if that's a real effect, or if there was something wrong when I ran the test. Will try re-running just this part on Monday after double checking all the related settings on relforge-search.eqiad.wmflabs.

For kicks I re-ran enwiki with 10k queries to see if it would narrow down the confidence intervals:

algo    score    95% ci
tfidf   0.7387   0.731428 <= x <= 0.745901
bm25    0.7480   0.740459 <= x <= 0.755545

The confidence intervals did narrow, from ±0.015 to ±0.0075. They still overlap though :S

Nice work, @EBernhardson!

I do wonder how fat the long tail is with the constraint of >n IPs issuing the same query. OTOH, if we can better satisfy the more common queries we're probably doing better overall, considering how the tail is not only long, but weird.

The long tail is *very* long. There are 971k distinct queries (against enwiki). There are 20k distinct queries issued by > 10 IP addresses. A more complete table of the number of distinct queries per group:

num_searches   count(distinct query)
10             2666
 9             3387
 8             4318
 7             5849
 6             8089
 5             12286
 4             20032
 3             38926
 2             103241
 1             752363

Thanks @EBernhardson, this is not exactly what I expected. I was expecting very bad results for ja/zh but almost unchanged results for th.

I'm really surprised by ja: it's the only one to use a custom CJK tokenizer for the text field, but it still uses the aggressive "standard tokenizer", which will tokenize on every ideogram. I'll run relcomp on a 1k sample; I expect the ZRR to be strangely low with the perfield builder. If not, I must be overlooking something...

Concerning th it's different: we have a custom Thai analyzer that we hope performs relatively well for the text field, and the standard analyzer does not split anything on Thai, thus not affecting precision.

I was also surprised by the Thai results, since there's a custom analyzer, which should deal with the spacelessness better than the default analyzer.

One possibility for the lower scores is that currently clicked results are being moved out of the result sets. In the case of a > b, if neither a nor b is in the result set my implementation considers that a failed constraint. Not sure how common that is though...

A note for future work: I've noticed that http://web.stanford.edu/class/cs276/handouts/lecture8-evaluation_2014-one-per-page.pdf suggests using a method involving Kendall's tau to evaluate pairwise preferences, vs. what I did here, which is to compare the % of pairwise preferences that were satisfied.
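For example (an assumption about how it might be applied, using scipy; not something implemented here): turn the click preferences into a reference ordering over the clicked-and-skipped documents and correlate it with the engine's ordering.

from scipy.stats import kendalltau

# hypothetical ranks for the same five documents: once as implied by the
# click-derived preferences, once as returned by the engine under test
preference_ranks = [1, 2, 3, 4, 5]
engine_ranks = [2, 1, 3, 5, 4]

tau, p_value = kendalltau(preference_ranks, engine_ranks)
print(tau, p_value)  # tau close to 1.0 means the two orderings largely agree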