
Second BM25 A/B test (ja, zh, th)
Closed, Resolved · Public

Description

We've been hard at work preparing a new query scoring method: BM25.

Now, let's run another test with the user satisfaction schema to see if it has any effect on user behavior for the following wikis:

  • ja (Japanese)
  • zh (Chinese)
  • th (Thai)

Event Timeline

debt created this task. · Oct 5 2016, 7:12 PM
Restricted Application added a subscriber: Aklapper. · Oct 5 2016, 7:12 PM

Change 315250 had a related patch set uploaded (by DCausse):
[cirrus] switch cirrus BM25 A/B test config to ja, zh, th

https://gerrit.wikimedia.org/r/315250

Change 315250 merged by jenkins-bot:
[cirrus] switch cirrus BM25 A/B test config to ja, zh, th

https://gerrit.wikimedia.org/r/315250

Has it been deployed, or will it go out on the earliest train next week?

Number of recorded sessions and events for fulltext search at 1:200 sampling for Monday morning through Sunday night:

mysql:research@dbstore1002.eqiad.wmnet [log]> select wiki, count(1), count(distinct event_searchSessionId) from TestSearchSatisfaction2_15700292 where timestamp between '20161010000000' and '20161017000000' and wiki in ('zhwiki', 'jawiki', 'thwiki') and event_source = 'fulltext' group by wiki;
+--------+----------+---------------------------------------+
| wiki   | count(1) | count(distinct event_searchSessionId) |
+--------+----------+---------------------------------------+
| jawiki |    12570 |                                  1702 |
| thwiki |      225 |                                    82 |
| zhwiki |     7591 |                                  1330 |
+--------+----------+---------------------------------------+

Last time we ran the test, we increased sampling from 1:200 to 1:66. We roughly maintained the 1:200 sampling going to the unbucketed satisfaction schema (for dashboards and such), and sent the remaining 1:132 into 1 of 5 buckets (control, allfield, inclinks, inclinks_pv, inclinks_pv_rev).
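The dashboard-vs-bucket split described above can be sketched as deterministic hashing of the session id. This is a hypothetical illustration: the bucket names come from this task, but the hashing scheme, function names, and rates shown are assumptions, not the actual CirrusSearch implementation.

```python
import hashlib

BUCKETS = ["control", "allfield", "inclinks", "inclinks_pv", "inclinks_pv_rev"]

def assign(session_id, dashboard_rate, test_rate):
    """Map a session id to 'dashboard', a test bucket name, or None (unsampled).

    Hypothetical sketch: hash the session id to a stable uniform value in
    [0, 1), carve out a slice for the unbucketed dashboard stream, and split
    the test slice evenly across the buckets.
    """
    h = int(hashlib.sha256(session_id.encode()).hexdigest(), 16)
    u = (h % 10**8) / 10**8  # stable pseudo-uniform value in [0, 1)
    if u < dashboard_rate:
        return "dashboard"
    if u < dashboard_rate + test_rate:
        slot = (u - dashboard_rate) / test_rate  # rescale to [0, 1)
        return BUCKETS[int(slot * len(BUCKETS))]
    return None  # session not sampled at all

# e.g. ~1:200 of traffic to dashboards, ~1:132 spread across the five buckets
print(assign("example-session-id", 1 / 200, 1 / 132))
```

The same session id always lands in the same stream, which is what keeps a user's experience consistent for the duration of the test.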

I talked to @dcausse and we think two buckets, control and inclinks_pv, will be sufficient for this test: it needs to tell us whether or not the new per-field BM25 builder is significantly worse on languages that do not use spaces to separate words. Using the same split as before, i.e. without changing sampling, we should expect roughly the per-bucket event and session counts listed above. For comparison, the enwiki test collected approximately 7k sessions and 23k events per bucket.

So the question is, @mpopov, do we need a larger sample, closer to the size in the enwiki test, or what is your preference?

Guessing that we want around 7.25k sessions per bucket, as in the previous test:

1:16 in the test
1:13 (~1:208 overall) diverted for dashboards
6:13 (~1:35 overall) per bucket

that would give us something like:

+--------+--------------------+-----------------+
| wiki   | dashboard sessions | bucket sessions |
+--------+--------------------+-----------------+
| jawiki |               1635 |            9807 |
| thwiki |                 79 |             472 |
| zhwiki |               1279 |            7673 |
+--------+--------------------+-----------------+
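As a sanity check, this table can be approximately reproduced from the 1:200 weekly session counts quoted earlier in the task. This is my reconstruction of the arithmetic, not code from the task; small differences from the quoted numbers (e.g. for jawiki) are presumably rounding or a slightly different base count.

```python
def expected_sessions(weekly_at_1_200):
    """Estimated weekly (dashboard, per-bucket) sessions under the proposed
    split: 1:16 of sessions into the test, then 1:13 of those to dashboards
    (1:208 overall) and 6:13 to each bucket (~1:35 overall)."""
    population = weekly_at_1_200 * 200  # undo the 1:200 sampling
    return population / (16 * 13), population * 6 / (16 * 13)

# weekly fulltext sessions observed at 1:200 sampling (from the query above)
for wiki, sampled in {"jawiki": 1702, "thwiki": 82, "zhwiki": 1330}.items():
    dash, bucket = expected_sessions(sampled)
    print(f"{wiki}: ~{dash:.0f} dashboard, ~{bucket:.0f} per bucket")
```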

Seem reasonable? Does thwiki need custom settings, or should it just not be tested?

Change 316494 had a related patch set uploaded (by EBernhardson):
Turn on CirrusSearch bm25 A/B test for ja, zh and th

https://gerrit.wikimedia.org/r/316494

> Seem reasonable? Does thwiki need custom settings, or should it just not be tested?

I'd really like thwiki to be properly tested. All three wikis are using different analyzers, so it'd be nice to get three fairly different data points with which to try to extrapolate to other spaceless languages.

Seems like you'd have to make a sizable increase in the proportion of searches in the buckets to have any meaningful impact. Going up to 500 bucket sessions a week isn't really going to change anything.

On the other hand, having a large proportion of sessions in the test could be enough to affect the overall user experience on thwiki, and even show up on the dashboards if the effect is large enough (and I'm mostly worried about negative effects).

Ok, with all those caveats in place, what about upping the thwiki sample to 1:5? That'd be ~1500 bucket sessions a week (and ~250 dashboard sessions). We could let it run for at least two weeks instead of one, and that would get us up to 3K bucket sessions, which should be a reasonable number.

Though if Mikhail says ~450 is good, I'd go with that.

We would want to keep the dashboard portion somewhere close to 1:200, I think. thwiki has around 16.4k sessions/wk; if we push the test to two weeks on thwiki, that gives us 32.8k sessions to draw from. Keeping the dashboard portion around 1:200, our options look something like the following. I could of course find other splits, but keeping the dashboard split at 1:n makes these numbers easy to come up with.

+------------+----------------------+-----------------------+---------------------+
| test split | dashboards (overall) | per bucket (overall)  | sessions per bucket |
+------------+----------------------+-----------------------+---------------------+
| 1:2        | 1:99 (1:198)         | 49:99 (49:198, ~1:4)  |                8117 |
| 1:3        | 1:67 (1:201)         | 33:67 (33:201, ~1:6)  |                5385 |
| 1:4        | 1:49 (1:196)         | 24:49 (24:196, ~1:8)  |                4016 |
| 1:5        | 1:39 (1:195)         | 19:39 (19:195, ~1:10) |                3196 |
+------------+----------------------+-----------------------+---------------------+
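A sketch of how these options can be derived (my reconstruction of the arithmetic, not code from the task): for a 1:t test split with one dashboard part and two buckets of b parts each, choose b so that the overall dashboard rate, 1 in t*(2*b + 1), lands as close to 1:200 as possible.

```python
TWO_WEEK_SESSIONS = 32_800  # ~16.4k thwiki sessions/week, run for two weeks

def option(t, target=200):
    """Return (bucket parts b, overall dashboard denominator, sessions per
    bucket) for a 1:t test split with two buckets and one dashboard part."""
    lo = int((target / t - 1) / 2)
    # pick b so the overall dashboard rate is nearest 1:200 (smaller b on ties)
    b = min((lo, lo + 1), key=lambda x: abs(t * (2 * x + 1) - target))
    parts = 2 * b + 1
    return b, t * parts, round(TWO_WEEK_SESSIONS * b / (t * parts))

for t in (2, 3, 4, 5):
    print(f"1:{t} ->", option(t))
```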

Even 1:2 doesn't seem that bad, although it's not great: 3 out of 4 search sessions would still get the classical tf/idf scoring (control + dashboard + non-test). Going to 1:5 gets us to 9 out of 10 sessions receiving tf/idf. It all depends on how many sessions we think we need (@mpopov?). I'll adjust the code to handle a different value for thwiki; we just need to decide on the right one.

@EBernhardson: if we plan to use PaulScore as a measure of success, then having at least a hundred search sessions with a clickthrough would be nice. That is, I'm thinking our sampling rate should be informed by thwiki's clickthrough rate (CTR):

SELECT
  month, SUM(clickthrough)/COUNT(1) AS ctr, COUNT(1) AS sessions, SUM(clickthrough) AS clickthroughs
FROM (
  SELECT
    SUBSTR(timestamp, 5, 2) AS month,
    CONCAT(LEFT(timestamp, 8), event_mwSessionId) AS session_id,
    SUM(IF(event_action = 'click', 1, 0)) > 0 AS clickthrough
  FROM `TestSearchSatisfaction2_15700292`
  WHERE
    LEFT(timestamp, 6) > '201607'
    AND wiki = 'thwiki'
    AND event_source = 'fulltext'
    AND event_action IN('searchResultPage', 'click')
  GROUP BY month, session_id
) clickthroughs
GROUP BY month;
+-------+--------+----------+---------------+
| month | ctr    | sessions | clickthroughs |
+-------+--------+----------+---------------+
| 08    | 0.1674 |      448 |            75 |
| 09    | 0.2096 |      439 |            92 |
| 10    | 0.1863 |      263 |            45 |
+-------+--------+----------+---------------+

So if we want >100 sessions with clickthroughs a week (to yield more-or-less reliable PaulScores), we'd have to make the sampling rate...let's see.

Take ~440 sampled sessions a month at a rate of 1 in 200: that's a population of roughly N = 88K sessions/month. If we want at least ~400 sessions w/ clickthroughs a month (~100/wk), then 400 = rate × 88,000 × 0.15, which gives a sampling rate of 1 in 33. And that's for event logging in general, not per bucket. I guess what I'm trying to say is 1:5 is A-OK :)
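The back-of-the-envelope above, spelled out (0.15 is a deliberately conservative CTR, below the 0.17-0.21 observed in the table):

```python
monthly_sessions_at_1_200 = 440               # observed at 1:200 sampling
population = monthly_sessions_at_1_200 * 200  # N ~ 88,000 sessions/month
target = 400                                  # clickthrough sessions/month
ctr = 0.15                                    # conservative clickthrough rate
rate = target / (population * ctr)            # solve 400 = rate * N * ctr
print(f"needed sampling: 1 in {1 / rate:.0f}")  # -> 1 in 33
```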

Sounds like the current patch should be good to go, then. Once merged, I can ship this out to production and start the test.

Change 316494 merged by jenkins-bot:
Turn on CirrusSearch bm25 A/B test for ja, zh and th

https://gerrit.wikimedia.org/r/316494

Change 317861 had a related patch set uploaded (by EBernhardson):
Turn on CirrusSearch bm25 A/B test for ja, zh and th

https://gerrit.wikimedia.org/r/317861

Change 317861 merged by jenkins-bot:
Turn on CirrusSearch bm25 A/B test for ja, zh and th

https://gerrit.wikimedia.org/r/317861

Deskana closed this task as Resolved. · Dec 9 2016, 3:17 PM
Deskana claimed this task.