Now that the analysis is done for the recent BM25 A/B test, there are a few different paths that we might want to go down. Let's use this ticket for documenting that forward path.
Description
Status | Subtype | Assigned | Task
---|---|---|---
Resolved | | Deskana | T143585 Initial BM25 A/B Test
Resolved | | debt | T147008 Outcome of BM25 A/B test - our next steps on using BM25
Event Timeline
I realized that our analysis chain is sub-optimal for scripts that do not separate words with spaces (Japanese, Chinese, ...).
This directly affects the outcome of BM25 A/B test.
For such languages we use an analysis chain that, for the plain field, breaks the text on every character, so words are split apart. The effect on the indexed tokens is disastrous: 灯笼 (lantern) is indexed as two tokens, 灯 (light) and 笼 (cage?).
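To make the indexing problem concrete, here is a minimal Python sketch of a per-character analysis chain feeding an inverted index. It is an illustration only: `per_character_tokens` and `build_index` are hypothetical helpers, not CirrusSearch or Elasticsearch code, and the three example documents are made up for the demonstration.

```python
from collections import defaultdict

def per_character_tokens(text):
    # Assumption: every non-space character becomes its own token,
    # mimicking the behaviour described above for the plain field.
    return [ch for ch in text if not ch.isspace()]

def build_index(docs):
    # Maps each token to the set of document ids that contain it.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in per_character_tokens(text):
            index[token].add(doc_id)
    return index

# Hypothetical documents: lantern, electric light, birdcage.
docs = {"lantern": "灯笼", "electric_light": "电灯", "birdcage": "鸟笼"}
index = build_index(docs)

print(per_character_tokens("灯笼"))  # the word is split into two tokens
print(sorted(index["灯"]))           # documents sharing the token 灯
print("灯笼" in index)               # the whole word is never indexed
```

In this sketch the word 灯笼 never appears in the index as a unit; it is only reachable through 灯 and 笼, which it shares with unrelated words.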
In this A/B test we tested a new per-field query builder, which tries to get rid of the QueryString. Unfortunately, the QueryString comes with a feature that automatically converts tokens that are not separated by spaces into a phrase query:
With QueryString, N.A.S.A is automatically transformed into a query like "n a s a", forcing a perfect phrase match. When this feature is applied to Chinese, a search for 灯笼 is transformed into "灯 笼".
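The rewrite described above can be sketched roughly as follows. This is a simplified model, not the actual QueryString implementation: `analyze` and `rewrite_chunk` are hypothetical names, and the analyzer assumes that characters in the CJK Unified Ideographs block (U+4E00 to U+9FFF) each become one token while other alphanumeric runs are lowercased and split on punctuation.

```python
def analyze(chunk):
    # Hypothetical analyzer: one token per CJK ideograph, otherwise
    # lowercased alphanumeric runs split on any other character.
    tokens, run = [], ""
    for ch in chunk.lower():
        if "\u4e00" <= ch <= "\u9fff":
            if run:
                tokens.append(run)
                run = ""
            tokens.append(ch)
        elif ch.isalnum():
            run += ch
        else:
            if run:
                tokens.append(run)
                run = ""
    if run:
        tokens.append(run)
    return tokens

def rewrite_chunk(chunk):
    # A single space-delimited chunk that yields several tokens is
    # turned into a quoted phrase; a single token is left as a term.
    tokens = analyze(chunk)
    if len(tokens) > 1:
        return '"' + " ".join(tokens) + '"'
    return tokens[0] if tokens else ""

print(rewrite_chunk("N.A.S.A"))
print(rewrite_chunk("灯笼"))
print(rewrite_chunk("lantern"))
```

Under these assumptions, any query chunk that the analyzer splits is silently promoted to a phrase, which is why dropping the QueryString changes behaviour for Chinese even though the user typed no quotes.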
Not using the QueryString could lead to very different results:
- with QueryString (forced phrase match): https://zh.wikipedia.org/w/index.php?search=~%E7%81%AF%E7%AC%BC&title=Special:%E6%90%9C%E7%B4%A2&go=%E5%89%8D%E5%BE%80
- with perfield QueryBuilder : https://zh.wikipedia.org/w/index.php?search=~%E7%81%AF%E7%AC%BC&title=Special:%E6%90%9C%E7%B4%A2&go=%E5%89%8D%E5%BE%80&cirrusFTQBProfile=browser_tests
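A small sketch of why the two result sets can diverge, assuming per-character tokens. The function names and documents are hypothetical, and the second matcher assumes that, without the forced phrase, a document only needs to contain every token in any order and position; the actual per-field builder may combine and score tokens differently.

```python
def char_tokens(text):
    # Assumption: one token per non-space character.
    return [ch for ch in text if not ch.isspace()]

def phrase_match(query, doc):
    # Forced phrase: query tokens must appear adjacent and in order.
    q, d = char_tokens(query), char_tokens(doc)
    return any(d[i:i + len(q)] == q for i in range(len(d) - len(q) + 1))

def all_tokens_match(query, doc):
    # No phrase: every query token must appear somewhere in the document.
    return set(char_tokens(query)) <= set(char_tokens(doc))

red_lantern = "红灯笼"            # contains the word 灯笼
cage_and_lamp = "鸟笼和台灯"      # contains 笼 and 灯, but not the word 灯笼

for doc in (red_lantern, cage_and_lamp):
    print(doc, phrase_match("灯笼", doc), all_tokens_match("灯笼", doc))
```

The second document is the kind of result that the forced phrase match filters out and that can surface once the query is reduced to independent tokens.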
Neither of these strategies is ideal, but I'd suggest being conservative here and not enabling the per-field query builder for such languages.
I would prefer to invest some time and properly review our analysis config.
We chatted this morning and have several tasks identified:
- release BM25 on the top 10 languages that are "space happy" languages (T147508)
  - a review of languages of top-N wikis that are space happy revealed:
    - English, German, Spanish, Russian, Portuguese, French, Italian, Polish, Dutch, Arabic
- evaluate the release of BM25 on the bigger languages and then create a way forward to roll out to all the other wikis (that are also "space happy")
- run PaulScore on BM25 on the Chinese, Japanese and Thai wikis
- run an A/B test (using the code that was used for T143585) for ja (Japanese), zh (Chinese), th (Thai) (T147495)
- based on the testing above, figure out how to make BM25 work nicely with languages that do not use spaces between words, to be sure we're finding the search results that the user wants (T147512)
  - already identified: ja (Japanese), zh (Chinese), th (Thai), and km (Khmer)
  - identify additional languages
- once we've gotten BM25 working properly on 'not space happy' languages, roll that out into production