[EPIC-ish][Milestone 1] Implement NLP Search Suggestion Method 1 for 10 languages
Open, High, Public
Assigned To

None

Authored By

	TJones
	Jan 3 2019, 8:04 PM

Description

This ticket is to track work by @Julia.glen on implementing NLP Search Suggestion “Method 1” for 10 additional languages.

From the parent task:

Method 1: Mine search logs for common queries and create an efficient method for choosing candidates to make suggestions for incoming queries based on similarity to the original query, number of results, and query frequency. Only applicable for languages with relatively small writing systems (alphabets, abjads, syllabaries, etc.).
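
To make the description above concrete, here is a minimal sketch of how candidate selection might look. It is not Glent's implementation: the `LoggedQuery` record, the `suggest` function, the `max_distance` threshold, and the ranking order (closest edit distance first, then number of results, then query frequency) are all assumptions chosen for illustration. It also scans every logged query linearly, so it does not address the "efficient method" part of the description.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class LoggedQuery:
    """One distinct query mined from the search logs (hypothetical record)."""
    text: str
    frequency: int  # how often the query appeared in the logs
    hit_count: int  # how many results the query returned


def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance, computed one row at a time."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def suggest(query: str, logged: Iterable[LoggedQuery],
            max_distance: int = 2) -> Optional[str]:
    """Return the best logged query to suggest for `query`, or None."""
    candidates = []
    for entry in logged:
        # Skip the query itself and logged queries that returned nothing.
        if entry.text == query or entry.hit_count == 0:
            continue
        distance = edit_distance(query, entry.text)
        if distance <= max_distance:
            # Assumed ranking: closest first, then most results, then most frequent.
            candidates.append((distance, -entry.hit_count, -entry.frequency, entry.text))
    return min(candidates)[3] if candidates else None
```

Because the similarity measure here is character-level edit distance, it only makes sense where a one-character change is a small change, which is one way to read the restriction to relatively small writing systems.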

The list of 10 languages is TBD, but will probably be based mostly on search volume, subject to the constraint that the writing systems have a relatively small number of characters (Germanic, Romance, Slavic, Semitic, Turkic, Iranian, ...).

Please create sub-tasks or add details here as necessary.

Related Objects

Status	Assigned	Task
Open	None	T212884 [EPIC] Improve Search Suggestions with NLP (Did You Mean / Glent)
Open	None	T212889 [EPIC-ish][Milestone 1] Implement NLP Search Suggestion Method 1 for 10 languages
Resolved	TJones	T232760 Analysis of Method 1 Suggestion results
Resolved	TJones	T238151 Tune Glent Method 1 algorithm
Resolved	EBernhardson	T247469 Cirrus needs to pass along search syntax info to Glent for M1
Resolved	EBernhardson	T247898 Add new columns for Glent Method 1
Resolved	TJones	T262610 Enable ICUTokNorm() for Glent M0 and M1
Open	None	T262612 Run an A/B test using suggestions generated using glent Method 1

Event Timeline

TJones created this task. Jan 3 2019, 8:04 PM

TJones updated the task description.

TJones triaged this task as Medium priority. Jan 3 2019, 8:07 PM

TJones mentioned this in T212884: [EPIC] Improve Search Suggestions with NLP (Did You Mean / Glent).

TJones updated the task description.

TJones edited projects, added Discovery-Search; removed Discovery-Search (Current work).

EBjune moved this task from needs triage to This Quarter on the Discovery-Search board. Jan 17 2019, 6:22 PM

debt moved this task from This Quarter to [epic] on the Discovery-Search board. Jan 29 2019, 6:50 PM

What languages should we initially investigate?

Based on the language breakdown on the dashboards, the top 25 languages by search volume are: English (en), German (de), French (fr), Russian (ru), Spanish (es), Italian (it), Japanese (ja), Arabic (ar), Dutch (nl), Swedish (sv), Chinese (zh), Persian (fa), Czech (cs), Polish (pl), Finnish (fi), Hebrew (he), Portuguese (pt), Norwegian (Bokmål) (no or nb?), Ukrainian (uk), [>1%>] Indonesian (id), Hungarian (hu), Greek (el), Korean (ko), Catalan (ca), Vietnamese (vi).

A few thoughts:

- I'm not 100% sure the language breakdown stats are exact, but I think they still provide decent guidance.
- Chinese and Japanese are out, because their writing systems are too big, and should be covered by Method 2 (T212891).
- Korean may be out, too. Korean syllables can be decomposed down to ~40 characters, though some Chinese characters may still occur. We can decide decomposition is too complex for now and catch Korean with Method 2.
- English is covered by Method 0 (T212888), so should we skip it? Two further thoughts: Can we use two methods at once? Do we want to compare the performance of Method 0 against Method 1?
- Indonesian and those after it get less than 1% of search volume, with caveats: Indonesian is at 0.997%, so it may be above or below the line in any given week; the numbers may not be exact; and the volume is, I think, across all projects, though we're currently focusing on Wikipedia. I expect Wikipedia to dominate, but that may not actually be the case for all languages.
  - Vietnamese is at 0.851%, and everything below that drops off more quickly (in the current sample).
- If there's enough volume, I'd prefer the initial implementation to try a broader range of language families. If it works well for French and Spanish, it'll probably work well for Italian and Catalan (modulo sufficient search volume), and we can do additional training for those languages after the initial implementation/investigation.
  - Maximizing linguistic diversity (and skipping the ones we need to skip), I'd initially reduce the list to (English), German, French, Russian, Spanish, Arabic, Persian, Czech, Finnish, Hebrew, Indonesian, Greek, (Korean), Vietnamese.
    - This keeps the biggest 4 or 5, depending on whether we include English in the list.
    - I want Finnish, Indonesian, and Vietnamese for maximal diversity (though Indonesian or Vietnamese may not have the volume needed).
    - I can't easily choose among Arabic, Persian, and Hebrew. Arabic and Persian share a script, but belong to very different language families. Arabic and Hebrew are both Semitic and their writing systems share some features, but they are still different.
    - Greek probably isn't really that different from other Indo-European languages with an alphabet, so it'd probably be the first one I dropped, then Persian. If we can't have Korean and want to skip English, that's 10.
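
For reference on the Korean point above, here is a small sketch of what decomposition could look like using Unicode normalization from Python's standard library. The helper names `decompose_hangul` and `recompose_hangul` are made up for illustration, and nothing here reflects a decision about how (or whether) Glent would handle Korean.

```python
import unicodedata


def decompose_hangul(text: str) -> str:
    """Split precomposed Hangul syllables into conjoining jamo (NFD)."""
    return unicodedata.normalize("NFD", text)


def recompose_hangul(text: str) -> str:
    """Recombine conjoining jamo into precomposed syllables (NFC)."""
    return unicodedata.normalize("NFC", text)


syllable = "한"  # U+D55C, a single precomposed syllable
jamo = decompose_hangul(syllable)

# One syllable becomes three code points: leading consonant, vowel, trailing consonant.
print([f"U+{ord(ch):04X}" for ch in jamo])

# Chinese characters have no such decomposition and pass through unchanged.
print(decompose_hangul("漢") == "漢")
```

One caveat: NFD produces conjoining jamo, where leading and trailing consonants are separate code points, so the inventory for modern syllables is 19 + 21 + 27 = 67 code points rather than ~40 letters. Getting down to ~40 would need an extra mapping step on top of this.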

Other thoughts?

How about using English for both Method 0 and Method 1? It’s our highest-volume language in the logs, and it will give us an opportunity to run the two methods side by side and possibly in combination.

Suggested list of 10:
If we keep English and exclude Czech (we already have one Slavic language, Russian), the suggested list of 10 is English, German, French, Russian, Spanish, Arabic, Finnish, Hebrew, Indonesian, Vietnamese. That way we have languages from different groups and a good mix of large, medium, and small log volumes.

What do you think?

Sounds good to me! If it turns out that the smallest volume languages have trouble, we can fall back to larger languages on the list.

TJones renamed this task from [EPIC-ish][Milestone 2] Implement NLP Search Suggestion Method 1 for 10 languages to [EPIC-ish][Milestone 1] Implement NLP Search Suggestion Method 1 for 10 languages. Mar 20 2019, 3:44 PM

Gehel closed subtask T232760: Analysis of Method 1 Suggestion results as Resolved. Oct 29 2019, 5:51 PM

Gehel edited projects, added Discovery-Search (Current work); removed Discovery-Search. Sep 9 2020, 2:46 PM

CBogen moved this task from Incoming to Epics on the Discovery-Search (Current work) board. Sep 9 2020, 2:46 PM

CBogen subscribed.

Gehel closed subtask T262610: Enable ICUTokNorm() for Glent M0 and M1 as Resolved. Sep 28 2020, 2:36 PM

Gehel closed subtask T238151: Tune Glent Method 1 algorithm as Resolved. Oct 26 2020, 1:37 PM

Gehel edited projects, added Discovery-Search; removed Discovery-Search (Current work). Nov 4 2021, 2:51 PM

Gehel raised the priority of this task from Medium to High. Mar 17 2022, 1:20 PM