
BM25: figure out how to utilize BM25 for languages that don't have spaces between words
Closed, Resolved (Public)

Description

Based on the testing that will be done in T147495 and T147501, let's figure out how best to deal with those languages that don't have spaces between words (using ICU tokenization).

  • We also need to identify all languages that don't have spaces between words; so far we've identified ja (Japanese), zh (Chinese), th (Thai), and km (Khmer).

Event Timeline

TJones renamed this task from BM25: figure out how to utilize for languages that have spaces in words to BM25: figure out how to utilize BM25 for languages that don't have spaces in words. Oct 6 2016, 5:48 PM
TJones updated the task description.
TJones renamed this task from BM25: figure out how to utilize BM25 for languages that don't have spaces in words to BM25: figure out how to utilize BM25 for languages that don't have spaces between words. Oct 6 2016, 5:53 PM
TJones updated the task description.
Deskana added a subscriber: Deskana.

We're currently expecting the A/B test (done in T147495) to be unsuccessful and show that we need to do this task. So, I've marked T147495 as a parent task.

Deskana changed the task status from Open to Stalled. Nov 15 2016, 6:10 PM

Stalled, waiting on the outcome of the A/B test analysis in T147500.

This is still on our radar, but we have to figure out a few things first, so we're keeping it at the bottom of the backlog for now.

debt claimed this task.

Closing this, as we have settled the question of which languages to work on and completed the analysis. We'll create new tickets as we start working on the languages.