Make ElasticSearchTTMServer results consistent enough
Closed, ResolvedPublic

Description

Context

When translating at Meta I noticed that the translation memory does not always show the suggestions I expect. Furthermore, I sometimes get a different set of suggestions for the same string.

The reason is that the TTMServer translation memory query consists of two parts: first find all matching source texts and order them by edit distance, then fetch the translations for those strings if present. To simplify a bit, let’s assume the following:

the number of results for the first query is large, say N(all) > 1000
furthermore, the number of perfect matches is also large, N(perfect) > 200
the number of results for the second query, assuming we inspect all N, is small, M < 30
Let’s say each message can be identified as Nₙ, and if it has a corresponding translation in the target language I will call that translation Mₙ. The actual code fetches 50 results, but to keep things simple let’s use 5. Let’s search for the string “user” and assume that N₁...N₂₀ are perfect matches. Since we order only by score, there is no guarantee that we will always get N₁...N₅ as the result. Let’s say we get [N₆, N₃, N₈, N₁, N₂]. Then, if only N₇ has a translation M₇, this query returns no suggestions, even though a perfect match with a translation exists.
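The scenario above can be reproduced with a small simulation. This is only a sketch in Python, not the TTMServer code: the in-memory `candidates` list, the `translations` mapping and the function names are hypothetical, and a random shuffle stands in for the unspecified order in which equally scored results come back.

```python
import random

# Hypothetical data: N1..N20 are perfect matches, so they all share one score.
candidates = [{"id": n, "score": 1.0} for n in range(1, 21)]
# Only N7 has a translation (M7) in the target language.
translations = {7: "M7"}


def first_query(limit, rng):
    """Return `limit` candidates ordered by score only.

    The shuffle models the arbitrary order of tied results; the sort is
    stable, so candidates with equal scores keep their shuffled order.
    """
    shuffled = candidates[:]
    rng.shuffle(shuffled)
    shuffled.sort(key=lambda c: -c["score"])
    return shuffled[:limit]


def suggestions(limit, rng):
    """Second step: keep only the fetched candidates that have a translation."""
    return [
        translations[c["id"]]
        for c in first_query(limit, rng)
        if c["id"] in translations
    ]


def distinct_outcomes(limit, runs=200, seed=0):
    """Collect the different suggestion lists seen over repeated identical searches."""
    rng = random.Random(seed)
    return {tuple(suggestions(limit, rng)) for _ in range(runs)}
```

With a limit of 5, repeated runs of the same search produce both an empty list and a list containing M₇, depending on whether N₇ happened to land among the first five results. Raising the limit to 20 hides the problem only because it covers every tied candidate in this toy data set.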

A simple solution is to increase the number of results we fetch for the first query, but this only moves the problem further away. A better solution is to iteratively fetch more results until some condition is met. The most inefficient condition would be to fetch all results until the score drops below a given threshold. A slightly more intelligent solution is to fetch at least all results with a score greater than or equal to the lowest score returned by the first query. This ensures that, if N stays unchanged, we will always get all the suggestions we can currently get. To be able to do this iteratively, we need to ensure that results are sorted consistently, so we need a secondary sort key besides the score.
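The iterative approach described above could look roughly like the following Python sketch. It is not the merged patch: `search_page`, `fetch_candidates` and `suggest` are hypothetical names, the index is a plain in-memory list, and the message id is assumed to be a suitable secondary sort key.

```python
def search_page(index, offset, size):
    """Hypothetical first query: one page of results ordered by score
    (descending) and then by id, so that ties always sort the same way."""
    ordered = sorted(index, key=lambda c: (-c["score"], c["id"]))
    return ordered[offset:offset + size]


def fetch_candidates(index, page_size):
    """Fetch pages until every candidate scoring at least as high as the
    lowest score on the first page has been collected."""
    first = search_page(index, 0, page_size)
    if not first:
        return []
    threshold = first[-1]["score"]  # lowest score in the first page
    results = list(first)
    offset = len(first)
    while True:
        page = search_page(index, offset, page_size)
        if not page:
            break  # ran out of results
        keep = [c for c in page if c["score"] >= threshold]
        results.extend(keep)
        if len(keep) < len(page):
            break  # scores have dropped below the threshold
        offset += len(page)
    return results


def suggest(index, translations, page_size=5):
    """Second step: return translations for the collected candidates."""
    return [
        translations[c["id"]]
        for c in fetch_candidates(index, page_size)
        if c["id"] in translations
    ]
```

Because the pages are ordered by score and then by id, consecutive pages never overlap or skip a candidate, which is what makes fetching in several rounds safe. The trade-off is extra round trips when many candidates share the threshold score.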

Narrative

As a translator, I can reliably use translation memory suggestions, so that I can translate faster and more consistently, and I can be sure that a lack of suggestions means there are no similar translations.

Acceptance Criteria

  1. All translation suggestions that could currently be shown are shown
  2. Assuming N does not change, we will always show the same results

Deliverables

Enter link to actually done tangible deliverable(s)

Event Timeline

Arrbee assigned this task to Nikerabbit.
Arrbee raised the priority of this task from to Medium.
Arrbee updated the task description.
Arrbee changed Security from none to None.

Change 176971 had a related patch set uploaded (by Nikerabbit):
Make ElasticSearchTTMServer results consistent enough

https://gerrit.wikimedia.org/r/176971

Patch-For-Review

Nikerabbit moved this task from In Review to Done on the LE-Sprint-79 board.

Change 176971 merged by jenkins-bot:
Make ElasticSearchTTMServer results consistent enough

https://gerrit.wikimedia.org/r/176971