
Outcome of BM25 A/B test - our next steps on using BM25
Closed, Resolved (Public)

Description

Now that the analysis is done for the recent BM25 A/B test, there are a few different paths that we might want to go down. Let's use this ticket for documenting that forward path.

Event Timeline

I realized that our analysis chain is sub-optimal for scripts written without spaces (Japanese, Chinese, ...).
This directly affects the outcome of the BM25 A/B test.
For such languages, the analysis chain we use for the plain field splits text on every character, which means that words are broken apart. The effect on the indexed tokens is disastrous: 灯笼 (lantern) is indexed as two tokens, 灯 (light) and 笼 (cage??).
In this A/B test we tested a new per-field query builder, which tries to get rid of QueryString. Unfortunately, QueryString comes with a feature that automatically converts tokens that are not separated by spaces into a phrase query:
With QueryString, N.A.S.A is automatically transformed into a query like "n a s a", forcing a perfect phrase match. When this feature is applied to Chinese, a search for 灯笼 is transformed into "灯 笼".
Not using the QueryString could lead to very different results:
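
As a rough illustration of why the two approaches diverge, here is a toy Python sketch. It is not the actual analysis chain or query builder: `char_tokenize`, `phrase_match` and `all_terms_match` are hypothetical helpers, and the two sample documents are made up for the example. It only assumes what is described above, namely that text is split into one token per character and that QueryString turns the query into a phrase.

```python
def char_tokenize(text):
    """Toy stand-in for a per-character analyzer: one token per non-space character."""
    return [ch for ch in text if not ch.isspace()]


def phrase_match(query_tokens, doc_tokens):
    """True if the query tokens appear in the document adjacent and in order."""
    n = len(query_tokens)
    return any(
        doc_tokens[i:i + n] == query_tokens
        for i in range(len(doc_tokens) - n + 1)
    )


def all_terms_match(query_tokens, doc_tokens):
    """True if every query token appears somewhere in the document, in any order."""
    return set(query_tokens) <= set(doc_tokens)


# Made-up documents: one contains the word 灯笼, the other only its two characters.
docs = {
    "contains 灯笼": "红灯笼",
    "contains 灯 and 笼 separately": "灯在鸟笼旁",
}

query = char_tokenize("灯笼")  # ['灯', '笼']

for label, text in docs.items():
    tokens = char_tokenize(text)
    print(label, "| phrase:", phrase_match(query, tokens),
          "| all terms:", all_terms_match(query, tokens))
```

With the phrase behaviour only the first document matches; without it, both do, because the second document happens to contain the two characters in unrelated positions.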

None of these strategies is ideal, but I'd suggest being conservative here and not enabling the per-field query builder for such languages.
I would prefer to invest some time in properly reviewing our analysis config.

We'll chat more about this during the weekly meeting tomorrow.

We chatted this morning and identified several tasks:

  • release BM25 on the top 10 languages that are "space happy" languages (T147508)
    • a review of the languages of the top-N wikis that are space happy revealed:
      • English, German, Spanish, Russian, Portuguese, French, Italian, Polish, Dutch, Arabic
  • evaluate the release of BM25 on the bigger languages and then create a way forward to roll out to all the other wikis (that are also "space happy")
  • run PaulScore on BM25 on the Chinese, Japanese and Thai wikis
    • this will help us identify how BM25 behaves with the "not space happy" languages (based on @dcausse's note)
  • run an A/B test (using the code that was used for T143585) for ja (Japanese), zh (Chinese), th (Thai) (T147495)
  • based on the testing above, figure out how to make BM25 work nicely with languages that do not use spaces between their words, so we can be sure we're finding the search results that the user wanted. (T147512)
    • already identified: ja (Japanese), zh (Chinese), th (Thai), and km (Khmer)
    • identify additional languages
  • once we've gotten BM25 working properly on 'not space happy' languages, roll that out into production
debt claimed this task.
debt moved this task from Incoming to Needs Reporting on the Discovery-Search (Current work) board.

Closing this ticket, as the work that needs to be done is listed in my comment above.