[EPIC-ish][Milestone 1] Implement NLP Search Suggestion Method 1 for 10 languages
Open, NormalPublic


This ticket is to track work by @Julia.glen on implementing NLP Search Suggestion “Method 1” for 10 additional languages.

From the parent task:

  • Method 1: Mine search logs for common queries and create an efficient method for choosing candidates to make suggestions for incoming queries based on similarity to the original query, number of results, and query frequency. Only applicable for languages with relatively small writing systems (alphabets, abjads, syllabaries, etc.).
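To make the three signals in that description concrete, here is a minimal sketch of candidate selection. It is not the implementation tracked by this ticket: the function names, the `max_distance` cutoff, and the way the signals are multiplied together are all assumptions for illustration, and the linear scan over logged queries is not the "efficient method" the description asks for.

```python
import math
from collections import Counter


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, keeping only one previous row of the DP table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def suggest(query, query_counts, result_counts, max_distance=2):
    """Return the best-scoring logged query near `query`, or None.

    Hypothetical scoring: similarity * log(frequency) * log(result count).
    The ticket names these three signals but not how to combine them.
    """
    best, best_score = None, 0.0
    for candidate, freq in query_counts.items():
        if candidate == query:
            continue
        hits = result_counts.get(candidate, 0)
        if hits == 0:
            continue  # a suggestion that returns nothing is not useful
        dist = edit_distance(query, candidate)
        if dist > max_distance:
            continue
        similarity = 1.0 - dist / max(len(query), len(candidate))
        score = similarity * math.log1p(freq) * math.log1p(hits)
        if score > best_score:
            best, best_score = candidate, score
    return best


# Made-up log data, purely for illustration.
logged = Counter({"albert einstein": 500, "albert einstien": 3, "alberta": 40})
results = {"albert einstein": 1200, "albert einstien": 2, "alberta": 900}
print(suggest("albert einstien", logged, results))
```

Character-level edit distance is only a meaningful similarity measure when the inventory of characters is small, which is one way to read the restriction to alphabets, abjads, and syllabaries.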

The list of 10 languages is TBD, but will probably be based mostly on search volume, subject to the constraint that the writing systems have a relatively small number of characters (Germanic, Romance, Slavic, Semitic, Turkic, Iranian, ...).

Please create sub-tasks or add details here as necessary.

Event Timeline

TJones created this task. Jan 3 2019, 8:04 PM
TJones updated the task description.
TJones triaged this task as Normal priority.
TJones updated the task description.
debt moved this task from This Quarter to [epic] on the Discovery-Search board. Jan 29 2019, 6:50 PM
TJones added a comment. Edited Feb 11 2019, 2:47 PM

What languages should we initially investigate?

Based on the language breakdown on the dashboards, the top 25 languages by search volume are: English (en), German (de), French (fr), Russian (ru), Spanish (es), Italian (it), Japanese (ja), Arabic (ar), Dutch (nl), Swedish (sv), Chinese (zh), Persian (fa), Czech (cs), Polish (pl), Finnish (fi), Hebrew (he), Portuguese (pt), Norwegian (Bokmål) (no or nb?), Ukrainian (uk), [>1%>] Indonesian (id), Hungarian (hu), Greek (el), Korean (ko), Catalan (ca), Vietnamese (vi).

A few thoughts:

  • I'm not 100% sure the language breakdown stats are exact, but I think they still provide decent guidance.
  • Chinese and Japanese are out, because their writing systems are too big, and should be covered by Method 2. (T212891)
  • Korean may be out, too. Korean syllables can be decomposed down to ~40 characters, though some Chinese characters may still occur. We can decide decomposition is too complex for now and catch Korean with Method 2.
  • English is covered by Method 0 (T212888), so should we skip it? Two further thoughts: Can we use two methods at once? Do we want to compare the performance of Method 0 against Method 1?
  • Indonesian and those after it get less than 1% of search volume, with caveats: Indonesian is at 0.997%, so it may be above or below the line in any given week; the numbers may not be exact; and the volume is, I think, across all projects, though we're currently focusing on Wikipedia. I expect Wikipedia to dominate, but that may not actually be the case for all languages.
    • Vietnamese is at 0.851%, and everything below that drops off more quickly (in the current sample)
  • If there's enough volume, I'd prefer that the initial implementation cover a broader range of language families. If it works well for French and Spanish, it'll probably work well for Italian and Catalan (modulo sufficient search volume), and we can do additional training for those languages after the initial implementation/investigation.
    • Maximizing linguistic diversity (and skipping the ones we need to skip), I'd initially reduce the list to (English), German, French, Russian, Spanish, Arabic, Persian, Czech, Finnish, Hebrew, Indonesian, Greek, (Korean), Vietnamese.
      • This keeps the biggest 4 or 5, depending on whether we include English in the list.
      • I want Finnish, Indonesian, and Vietnamese for maximal diversity (though Indonesian or Vietnamese may not have the volume needed).
      • I can't easily choose among Arabic, Persian, and Hebrew. Arabic and Persian share a script, but belong to very different language families. Arabic and Hebrew are both Semitic and their writing systems share some features, but they are still different.
      • Greek probably isn't really that different from other Indo-European languages with an alphabet, so it'd probably be the first one I'd drop, then Persian. If we can't have Korean and want to skip English, dropping those two leaves 10.
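On the Korean point above, one way to see what decomposition involves is Unicode normalization. This is only a sketch, assuming NFD is an acceptable decomposition; the comment does not say which scheme it has in mind. NFD produces conjoining jamo, which distinguish leading from trailing consonants, so the resulting inventory may not match the ~40 figure. The helper names are hypothetical.

```python
import unicodedata


def decompose_hangul(text: str) -> str:
    """Split precomposed Hangul syllables into conjoining jamo (NFD)."""
    return unicodedata.normalize("NFD", text)


def recompose_hangul(text: str) -> str:
    """Rebuild precomposed syllables from conjoining jamo (NFC)."""
    return unicodedata.normalize("NFC", text)


word = "한국어"
jamo = decompose_hangul(word)

# Three syllable blocks become eight jamo code points.
print(len(word), len(jamo))

# The round trip is lossless for Hangul syllables.
print(recompose_hangul(jamo) == word)

# A Chinese character in a Korean query is left untouched, so it still
# falls outside the small jamo inventory.
print(decompose_hangul("漢") == "漢")
```

Any similarity measure run on the decomposed form would then need to map suggestions back through recomposition before showing them to users, which is part of the extra complexity mentioned above.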

Other thoughts?

Let’s use English for both Method 0 and Method 1? It’s our most voluminous log language and will give us an opportunity to run two methods side by side and possibly in combination.

Suggested list of 10:
If we keep English and exclude Czech (we already have one Slavic language, Russian), the suggested list of 10 is English, German, French, Russian, Spanish, Arabic, Finnish, Hebrew, Indonesian, Vietnamese. That way we have languages from different groups and a good mix of large, medium, and small log volumes.

What do you think?

Sounds good to me! If it turns out that the smallest volume languages have trouble, we can fall back to larger languages on the list.

TJones renamed this task from [EPIC-ish][Milestone 2] Implement NLP Search Suggestion Method 1 for 10 languages to [EPIC-ish][Milestone 1] Implement NLP Search Suggestion Method 1 for 10 languages. Mar 20 2019, 3:44 PM