
Analysis of Method 2 Suggestion results
Closed, Resolved · Public

Description

Gather suggestion output from Elastic-based suggestions and Method 2 (CJK) suggestions for a collection of data, and analyze the results.

Analysis will include counting how often Elastic-based suggestions are made, how often Method 2 suggestions are made, how often both are made, and a manual review of a sample when both are made to see which does better—which is the same as what we did for M0.
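
As a rough illustration of the counting step, here's a minimal Python sketch. The TSV filename and the columns (query, elastic_suggestion, m2_suggestion) are hypothetical stand-ins, not the actual data layout or analysis scripts.

```python
# Sketch only: file name and column names are hypothetical stand-ins.
import csv
import random

def summarize(rows, sample_size=100):
    """Count which suggester fires for each query, and sample the overlap."""
    counts = {"elastic_only": 0, "m2_only": 0, "both": 0, "neither": 0}
    overlap = []
    for row in rows:
        has_elastic = bool(row.get("elastic_suggestion"))
        has_m2 = bool(row.get("m2_suggestion"))
        if has_elastic and has_m2:
            counts["both"] += 1
            overlap.append(row)
        elif has_elastic:
            counts["elastic_only"] += 1
        elif has_m2:
            counts["m2_only"] += 1
        else:
            counts["neither"] += 1
    # Random sample of queries where both suggesters fired, for manual
    # side-by-side review by speakers.
    review_sample = random.sample(overlap, min(sample_size, len(overlap)))
    return counts, review_sample

with open("suggestions.tsv", encoding="utf-8", newline="") as f:
    counts, review_sample = summarize(list(csv.DictReader(f, delimiter="\t")))
print(counts)
```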

I'll be getting help from speakers of Chinese, Japanese, and/or Korean to review the sample where both make suggestions, and any other review that seems to be necessary. (M1 had extra stuff that needed review, for example.)

Event Timeline

TJones triaged this task as Medium priority. Feb 10 2020, 10:43 PM
TJones created this task.

Data has been wrangled and prepped for review. I have a Japanese reviewer, a likely Korean reviewer, and I'm waiting to hear back on Chinese. Because of a technical glitch, I only have older Japanese data (from Feb), but it should be fine.

Completed analysis of Japanese and Korean suggestions, reviewed by speakers—thanks, Jerry & Lisa!

Korean and Japanese follow a similar pattern:

  • ~70–80% of queries are in the expected writing system(s) (see the writing-system classification sketch after this list).
  • ~10–20% of queries are in Latin (and the rest are a mixed bag).
  • ~8–12% of queries get suggestions from the current production DYM, and those suggestions are generally mediocre (~⅓–½ are rated as good).
    • ~⅓–½ of the suggestions made are for Latin queries, and they are generally poor (~¼–⅓ are rated as good).
    • Suggestions in the expected writing system(s) are generally mediocre (up to ½ are rated as good).
  • M2 has a small impact (~¾–2½%), but with a non-trivial increase in coverage (~8¾%–22%).
  • However, M2 suggestions are generally poor (~30% are rated as good).
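
To make the writing-system buckets concrete, here is a minimal, illustrative classification sketch based on Unicode code-point ranges. The ranges and labels are simplifying assumptions for illustration; the actual analysis used its own classification.

```python
# Sketch only: rough Unicode-range bucketing of a query's writing system(s).
import unicodedata

def classify_script(query):
    """Bucket a query as Latin, Hangul, Kana, Han, other, mixed, or empty."""
    scripts = set()
    for ch in query:
        # Skip whitespace, punctuation, and digits.
        if ch.isspace() or unicodedata.category(ch)[0] in ("P", "N"):
            continue
        cp = ord(ch)
        if cp <= 0x024F:                                         # Basic Latin + extensions
            scripts.add("Latin")
        elif 0x1100 <= cp <= 0x11FF or 0xAC00 <= cp <= 0xD7A3:   # Jamo + syllables
            scripts.add("Hangul")
        elif 0x3040 <= cp <= 0x30FF:                             # Hiragana + Katakana
            scripts.add("Kana")
        elif 0x4E00 <= cp <= 0x9FFF:                             # CJK unified ideographs
            scripts.add("Han")
        else:
            scripts.add("other")
    if not scripts:
        return "empty"
    return scripts.pop() if len(scripts) == 1 else "mixed"

print(classify_script("위키백과"))        # Hangul
print(classify_script("ウィキペディア"))  # Kana
print(classify_script("wiki 百科"))       # mixed
```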

The results aren't great, but the new M2 suggestions are largely orthogonal to the existing prod/phrase suggester suggestions, and of roughly similar quality. We should run an A/B test and then decide whether the additional effort to implement M2 is worth whatever increase in clickthrough we see.

Full write-up (for Korean and Japanese, so far) is on MediaWiki.

Chinese to come, once speaker review is done.

Under Korean Stats (and Japanese stats). Should identical be unique?

There are 312,698 queries in our test corpus. 276,745 (88.502%) of them are identical after (basic) normalization.

As to the report and its recommendations, overall this seems reasonable. As stated, the results aren't looking particularly promising, but the existing suggestions are also similarly bad. Since the two methods seem to provide suggestions for largely disjoint sets of queries, deploying it could still be an overall improvement (although continuing the trend of slightly embarrassing suggestions in some cases).

Under Korean Stats (and Japanese stats). Should identical be unique?

No, it's identical, as in the normalization had no effect. Not too surprising for CJK, since lowercasing often does nothing. Normalizing whitespace could have an effect, though. It's something I put in there in the early days to see how much normalization matters.
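
For concreteness, a minimal sketch of that check, assuming "basic" normalization is roughly lowercasing plus whitespace trimming and collapsing (the actual normalization in the analysis may include more steps). The 88.502% figure quoted above counts queries that come out of that step unchanged.

```python
# Sketch only: "basic" normalization assumed to be lowercasing plus
# whitespace trimming/collapsing; the analysis's actual steps may differ.
import re

def normalize(query):
    """Lowercase and trim/collapse whitespace."""
    return re.sub(r"\s+", " ", query.strip()).lower()

def identical_fraction(queries):
    """Fraction of queries unchanged by normalization (the 'identical' stat)."""
    unchanged = sum(1 for q in queries if normalize(q) == q)
    return unchanged / len(queries)

# Lowercasing is a no-op for CJK text; only whitespace is affected here.
print(normalize("위키  백과 사전"))    # '위키 백과 사전'
print(normalize("  Wiki  Search "))    # 'wiki search'
```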

Ahh. I was reading this as 88.5% of the queries being identical to each other, as in post-normalization 88% of queries are for the same terms.