Outcome of BM25 A/B test - our next steps on using BM25
Closed, ResolvedPublic

Authored By

	debt
	Sep 29 2016, 5:45 PM

Description

Now that the analysis is done for the recent BM25 A/B test, there are a few different paths that we might want to go down. Let's use this ticket for documenting that forward path.

Related Objects

		Status	Subtype	Assigned	Task
		Resolved		• Deskana	T143585 Initial BM25 A/B Test
		Resolved		debt	T147008 Outcome of BM25 A/B test - our next steps on using BM25

Event Timeline

debt created this task. Sep 29 2016, 5:45 PM

I realized that our analysis chain is sub-optimal for scripts written without spaces (Japanese, Chinese, ...).
This directly affects the outcome of the BM25 A/B test.
For such languages we use an analysis chain that splits the text on every character for the plain field, which means that words are broken apart. The effect on the indexed tokens is disastrous: 灯笼 (lantern) is indexed as two tokens, 灯 (light) and 笼 (cage??).
In this A/B test we tested a new per-field query builder. This builder tries to get rid of the QueryString; unfortunately, QueryString comes with a feature that automatically converts non-space-separated tokens into a phrase query:
With QueryString, N.A.S.A is automatically transformed into a query like "n a s a", forcing a perfect phrase match. When this feature is applied to Chinese, a search for 灯笼 is transformed into the phrase query "灯笼".
Not using the QueryString could lead to very different results:

with QueryString (forced phrase match): https://zh.wikipedia.org/w/index.php?search=~%E7%81%AF%E7%AC%BC&title=Special:%E6%90%9C%E7%B4%A2&go=%E5%89%8D%E5%BE%80
with per-field query builder: https://zh.wikipedia.org/w/index.php?search=~%E7%81%AF%E7%AC%BC&title=Special:%E6%90%9C%E7%B4%A2&go=%E5%89%8D%E5%BE%80&cirrusFTQBProfile=browser_tests
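
To make the difference concrete, here is a minimal Python sketch of the two matching behaviours over per-character tokens. It is not CirrusSearch or Elasticsearch code: the function names and the two sample documents are hypothetical, and the tokenizer is only a stand-in for an analysis chain that emits one token per character.

```python
def unigram_tokens(text):
    """Hypothetical stand-in for a per-character analysis chain:
    emits one token per non-whitespace character."""
    return [ch for ch in text if not ch.isspace()]


def matches_all_terms(query, doc):
    """Bag-of-tokens match: every query token occurs somewhere in the doc,
    in any order and at any distance (no phrase constraint)."""
    doc_tokens = set(unigram_tokens(doc))
    return all(tok in doc_tokens for tok in unigram_tokens(query))


def matches_phrase(query, doc):
    """Phrase match: the query tokens occur adjacently and in order."""
    q = unigram_tokens(query)
    d = unigram_tokens(doc)
    n = len(q)
    return n > 0 and any(d[i:i + n] == q for i in range(len(d) - n + 1))


# Made-up sample documents, for illustration only.
docs = {
    "lantern": "灯笼是一种照明工具",   # contains the word 灯笼
    "scattered": "笼子旁边有一盏灯",   # contains 笼 and 灯, far apart
}

query = "灯笼"
results = {
    name: (matches_all_terms(query, text), matches_phrase(query, text))
    for name, text in docs.items()
}

for name, (any_order, phrase) in results.items():
    print(name, "all-terms:", any_order, "phrase:", phrase)
```

Under these assumptions, the document that only contains the two characters far apart still matches once the phrase constraint is dropped, which is the kind of noise the forced phrase match was hiding.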

None of these strategies is ideal, but I'd suggest being conservative here and not enabling the per-field query builder for such languages.
I would prefer to invest some time in properly reviewing our analysis config.

We'll chat more about this during the weekly meeting tomorrow.

We chatted this morning and have several tasks identified:

release BM25 on the top 10 "space happy" languages (T147508)
- a review of the top-N wikis whose languages are space happy revealed:
  - English, German, Spanish, Russian, Portuguese, French, Italian, Polish, Dutch, Arabic

evaluate the release of BM25 on the bigger languages and then create a way forward to roll out to all the other wikis (that are also "space happy")

run PaulScore on BM25 on the Chinese, Japanese and Thai wikis
- this will help us identify how BM25 will react with the "not space happy" languages (based on @dcausse's note)
run an A/B test (using the code that was used for T143585) for ja (Japanese), zh (Chinese), th (Thai) (T147495)

based on the testing above, figure out how to make BM25 work nicely with languages that do not use spaces between words, to be sure we're finding the search results that the user wanted. (T147512)
- already identified: ja (Japanese), zh (Chinese), th (Thai), and km (Khmer)
- identify additional languages

once we've gotten BM25 working properly on "not space happy" languages, roll that out into production

debt mentioned this in T147501: Run Paulscore with BM25 on zh, ja, th. Oct 5 2016, 7:25 PM

debt mentioned this in T147502: Remove custom analysis chains from vagrant. Oct 5 2016, 7:28 PM

Closing this ticket as the work that needed to be done is in my comment above.
