
Second BM25 A/B test (ja, zh, th)
Closed, Resolved · Public

Description

We've been hard at work preparing a new query scoring method: BM25.

Now, let's run another test with the user satisfaction schema to see if it has any effect on user behavior for the following wikis:

  • ja (Japanese)
  • zh (Chinese)
  • th (Thai)

Event Timeline

debt created this task. · Oct 5 2016, 7:12 PM
Restricted Application added a subscriber: Aklapper. · Oct 5 2016, 7:12 PM

Change 315250 had a related patch set uploaded (by DCausse):
[cirrus] switch cirrus BM25 A/B test config to ja, zh, th

https://gerrit.wikimedia.org/r/315250

Change 315250 merged by jenkins-bot:
[cirrus] switch cirrus BM25 A/B test config to ja, zh, th

https://gerrit.wikimedia.org/r/315250

Has it been deployed, or will it go out on the earliest train next week?

Number of recorded sessions and events for fulltext search at 1:200 sampling for Monday morning through Sunday night:

mysql:research@dbstore1002.eqiad.wmnet [log]> select wiki, count(1), count(distinct event_searchSessionId) from TestSearchSatisfaction2_15700292 where timestamp between '20161010000000' and '20161017000000' and wiki in ('zhwiki', 'jawiki', 'thwiki') and event_source = 'fulltext' group by wiki;
+--------+----------+---------------------------------------+
| wiki   | count(1) | count(distinct event_searchSessionId) |
+--------+----------+---------------------------------------+
| jawiki |    12570 |                                  1702 |
| thwiki |      225 |                                    82 |
| zhwiki |     7591 |                                  1330 |
+--------+----------+---------------------------------------+

Last time we ran the test, we increased sampling from 1:200 to 1:66. We roughly maintained the 1:200 sampling going to the unbucketed satisfaction schema (for dashboards and such), and sent the remaining 1:132 into 1 of 5 buckets (control, allfield, inclinks, inclinks_pv, inclinks_pv_rev).
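The dashboard-vs-bucket split described above can be sketched as deterministic hashing of the session id. This is a hypothetical illustration: the bucket names come from this task, but the hashing scheme, function names, and rates shown are assumptions, not the actual CirrusSearch implementation.

```python
import hashlib

BUCKETS = ["control", "allfield", "inclinks", "inclinks_pv", "inclinks_pv_rev"]

def assign(session_id, dashboard_rate, test_rate):
    """Map a session id to 'dashboard', a test bucket name, or None (unsampled).

    Hypothetical sketch: hash the session id to a stable uniform value in
    [0, 1), carve out a slice for the unbucketed dashboard stream, and split
    the test slice evenly across the buckets.
    """
    h = int(hashlib.sha256(session_id.encode()).hexdigest(), 16)
    u = (h % 10**8) / 10**8  # stable pseudo-uniform value in [0, 1)
    if u < dashboard_rate:
        return "dashboard"
    if u < dashboard_rate + test_rate:
        slot = (u - dashboard_rate) / test_rate  # rescale to [0, 1)
        return BUCKETS[int(slot * len(BUCKETS))]
    return None  # session not sampled at all

# e.g. ~1:200 of traffic to dashboards, ~1:132 spread across the five buckets
print(assign("example-session-id", 1 / 200, 1 / 132))
```

The same session id always lands in the same stream, which is what keeps a user's experience consistent for the duration of the test.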

I talked to @dcausse and we think two buckets, control and inclinks_pv, will be sufficient for this test: it needs to tell us whether or not the new per-field BM25 builder is significantly worse on languages that do not use spaces to separate words. Using the same split as before, i.e. without changing sampling, we should expect roughly the per-bucket event and session counts listed above. For comparison, the enwiki test collected approximately 7k sessions and 23k events per bucket.

So the question is, @mpopov, do we need a larger sample, closer to the size in the enwiki test, or what is your preference?

Guessing that we want around 7.25k sessions per bucket, as in the previous test:

1:16 in the test
1:13 (~1:208 overall) diverted for dashboards
6:13 (~1:35 overall) per bucket

that would give us something like:

+--------+--------------------+-----------------+
| wiki   | dashboard sessions | bucket sessions |
+--------+--------------------+-----------------+
| jawiki |               1635 |            9807 |
| thwiki |                 79 |             472 |
| zhwiki |               1279 |            7673 |
+--------+--------------------+-----------------+
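As a sanity check, this table can be approximately reproduced from the 1:200 weekly session counts quoted earlier in the task. This is my reconstruction of the arithmetic, not code from the task; small differences from the quoted numbers (e.g. for jawiki) are presumably rounding or a slightly different base count.

```python
def expected_sessions(weekly_at_1_200):
    """Estimated weekly (dashboard, per-bucket) sessions under the proposed
    split: 1:16 of sessions into the test, then 1:13 of those to dashboards
    (1:208 overall) and 6:13 to each bucket (~1:35 overall)."""
    population = weekly_at_1_200 * 200  # undo the 1:200 sampling
    return population / (16 * 13), population * 6 / (16 * 13)

# weekly fulltext sessions observed at 1:200 sampling (from the query above)
for wiki, sampled in {"jawiki": 1702, "thwiki": 82, "zhwiki": 1330}.items():
    dash, bucket = expected_sessions(sampled)
    print(f"{wiki}: ~{dash:.0f} dashboard, ~{bucket:.0f} per bucket")
```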

Seem reasonable? Does thwiki need custom settings, or should it just not be tested?

Change 316494 had a related patch set uploaded (by EBernhardson):
Turn on CirrusSearch bm25 A/B test for ja, zh and th

https://gerrit.wikimedia.org/r/316494

> Seem reasonable? Does thwiki need custom settings, or should it just not be tested?

I'd really like thwiki to be properly tested. All three wikis are using different analyzers, so it'd be nice to get three fairly different data points with which to try to extrapolate to other spaceless languages.

Seems like you'd have to make a sizable increase in the proportion of searches in the buckets to have any meaningful impact. Going up to 500 bucket sessions a week isn't really going to change anything.

On the other hand, having a large proportion of sessions in the test could be enough to affect the overall user experience on thwiki, and even show up on the dashboards if the effect is large enough (and I'm mostly worried about negative effects).

Ok, with all those caveats in place, what about upping the thwiki sample to 1:5? That'd be ~1500 bucket sessions a week (and ~250 dashboard sessions). We could let it run for at least two weeks instead of one, and that would get us up to 3K bucket sessions, which should be a reasonable number.

Though if Mikhail says ~450 is good, I'd go with that.

We would want to keep the dashboard portion somewhere close to 1:200, I think. thwiki has around 16.4k sessions/wk; if we push the test to two weeks on thwiki, that gives us 32.8k sessions to draw from. Keeping the dashboard portion around 1:200, our options look something like the following. I could of course find other splits, but keeping the dashboard split at 1:n makes these numbers easy to come up with.

+------------+----------------------+-----------------------+---------------------+
| test split | dashboards (overall) | per bucket (overall)  | sessions per bucket |
+------------+----------------------+-----------------------+---------------------+
| 1:2        | 1:99 (1:198)         | 49:99 (49:198, ~1:4)  |                8117 |
| 1:3        | 1:67 (1:201)         | 33:67 (33:201, ~1:6)  |                5385 |
| 1:4        | 1:49 (1:196)         | 24:49 (24:196, ~1:8)  |                4016 |
| 1:5        | 1:39 (1:195)         | 19:39 (19:195, ~1:10) |                3196 |
+------------+----------------------+-----------------------+---------------------+
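A sketch of how these options can be derived (my reconstruction of the arithmetic, not code from the task): for a 1:t test split with one dashboard part and two buckets of b parts each, choose b so that the overall dashboard rate, 1 in t*(2*b + 1), lands as close to 1:200 as possible.

```python
TWO_WEEK_SESSIONS = 32_800  # ~16.4k thwiki sessions/week, run for two weeks

def option(t, target=200):
    """Return (bucket parts b, overall dashboard denominator, sessions per
    bucket) for a 1:t test split with two buckets and one dashboard part."""
    lo = int((target / t - 1) / 2)
    # pick b so the overall dashboard rate is nearest 1:200 (smaller b on ties)
    b = min((lo, lo + 1), key=lambda x: abs(t * (2 * x + 1) - target))
    parts = 2 * b + 1
    return b, t * parts, round(TWO_WEEK_SESSIONS * b / (t * parts))

for t in (2, 3, 4, 5):
    print(f"1:{t} ->", option(t))
```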

Even 1:2 doesn't seem that bad, although it's not great: 3 out of 4 search sessions would still get the classical tf/idf scoring (control + dashboard + non-test). Going to 1:5 gets us to 9 out of 10 sessions receiving tf/idf. It all depends on how many sessions we think we need (@mpopov?). I'll adjust the code to handle a different value for thwiki; we just need to decide on the right one.

@EBernhardson: if we plan to use PaulScore as a measure of success, then having at least a hundred search sessions with a clickthrough would be nice. That is, I'm thinking our sampling rate should be informed by thwiki's clickthrough rate (CTR):

SELECT
  month, SUM(clickthrough)/COUNT(1) AS ctr, COUNT(1) AS sessions, SUM(clickthrough) AS clickthroughs
FROM (
  SELECT
    SUBSTR(timestamp, 5, 2) AS month,
    CONCAT(LEFT(timestamp, 8), event_mwSessionId) AS session_id,
    SUM(IF(event_action = 'click', 1, 0)) > 0 AS clickthrough
  FROM `TestSearchSatisfaction2_15700292`
  WHERE
    LEFT(timestamp, 6) > '201607'
    AND wiki = 'thwiki'
    AND event_source = 'fulltext'
    AND event_action IN('searchResultPage', 'click')
  GROUP BY month, session_id
) clickthroughs
GROUP BY month;
+-------+--------+----------+---------------+
| month | ctr    | sessions | clickthroughs |
+-------+--------+----------+---------------+
| 08    | 0.1674 |      448 |            75 |
| 09    | 0.2096 |      439 |            92 |
| 10    | 0.1863 |      263 |            45 |
+-------+--------+----------+---------------+

So if we want >100 sessions with clickthroughs a week (to yield more-or-less reliable PaulScores), we'd have to make the sampling rate...let's see.

Take ~440 sampled sessions a month at a rate of 1 in 200: that's a population of roughly N = 88K sessions/month. If we want at least ~400 sessions w/ clickthroughs a month (~100/wk), then 400 = rate × 88,000 × 0.15, which gives a sampling rate of 1 in 33. And that's for event logging in general, not per bucket. I guess what I'm trying to say is 1:5 is A-OK :)
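The back-of-the-envelope above, spelled out (0.15 is a deliberately conservative CTR, below the 0.17-0.21 observed in the table):

```python
monthly_sessions_at_1_200 = 440               # observed at 1:200 sampling
population = monthly_sessions_at_1_200 * 200  # N ~ 88,000 sessions/month
target = 400                                  # clickthrough sessions/month
ctr = 0.15                                    # conservative clickthrough rate
rate = target / (population * ctr)            # solve 400 = rate * N * ctr
print(f"needed sampling: 1 in {1 / rate:.0f}")  # -> 1 in 33
```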

Sounds like the current patch should be good to go, then. Once merged, I can ship this out to production and start the test.

Change 316494 merged by jenkins-bot:
Turn on CirrusSearch bm25 A/B test for ja, zh and th

https://gerrit.wikimedia.org/r/316494

Change 317861 had a related patch set uploaded (by EBernhardson):
Turn on CirrusSearch bm25 A/B test for ja, zh and th

https://gerrit.wikimedia.org/r/317861

Change 317861 merged by jenkins-bot:
Turn on CirrusSearch bm25 A/B test for ja, zh and th

https://gerrit.wikimedia.org/r/317861

Deskana closed this task as Resolved. · Dec 9 2016, 3:17 PM
Deskana claimed this task.