TTMServer performance and coverage issues
Open, Normal · Public · 1 Story Points

Description

The latest fixes to TTMServer, made some months ago, are not enough. During the translation rally at translatewiki.net, translation memory was using too much CPU time. In addition, there have been reports and observations that suggestions are not found, for example when translating the tech news, which has many repeating parts.

During the Lyon hackathon I spoke with David Chan, who suggested replacing the current FuzzyLikeThis query with a check of some n-grams from the beginning and end of the strings. Those need to be stored separately at indexing time unless there is a way to instruct ES to do it for us. In any case, short strings of one to three words need special attention.
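
As a rough illustration of the n-gram idea above, here is a minimal sketch of what could be computed for each string at indexing time. It assumes word n-grams (the suggestion does not say whether word or character n-grams were meant), and the function name, the parameters `n` and `count`, and the returned field names are all hypothetical, not part of TTMServer or Elasticsearch.

```python
def edge_ngrams(text, n=3, count=2):
    """Return word n-grams from the beginning and end of `text`.

    Hypothetical helper: `n` is the n-gram size in words and `count`
    is how many n-grams to keep from each end.
    """
    words = text.lower().split()
    if len(words) <= n:
        # Strings of up to n words are too short to split into edge
        # n-grams, so they get a single whole-string key instead.
        return {"short": [" ".join(words)], "prefix": [], "suffix": []}
    grams = [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
    return {"short": [], "prefix": grams[:count], "suffix": grams[-count:]}


if __name__ == "__main__":
    print(edge_ngrams("The new version of MediaWiki will be on test wikis"))
    print(edge_ngrams("Recent changes"))
```

A lookup would then compare only these few keys instead of fuzzy-matching whole strings. Whether the keys are stored as separate fields or produced by an analyzer is left open here.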

It seems that the current performance bottleneck is fetching too many string contents for comparison and scoring, not the scoring itself.

Event Timeline

Nikerabbit raised the priority of this task from to High. · Jun 3 2015, 11:25 AM
Nikerabbit updated the task description.
Nikerabbit added a subscriber: Nikerabbit.
Restricted Application added a subscriber: Aklapper. · Jun 3 2015, 11:25 AM
Nemo_bis added a subscriber: Nemo_bis.
Arrbee set Security to None.
Stryn added a subscriber: Stryn. · Jun 3 2015, 7:01 PM
santhosh edited a custom field. · Jun 10 2015, 6:28 AM
Arrbee assigned this task to Nikerabbit. · Jun 10 2015, 8:21 AM
Arrbee added a subscriber: Arrbee.
Restricted Application added a project: Discovery. · Jun 17 2015, 5:48 AM

Change 219388 had a related patch set uploaded (by Nikerabbit):
Use Filtered query instead of post_filter for TTMServer suggestion.

https://gerrit.wikimedia.org/r/219388

Change 219388 merged by jenkins-bot:
Use Filtered query instead of post_filter for TTMServer suggestion.

https://gerrit.wikimedia.org/r/219388

Nemo_bis added a subscriber: Phoenix303.

We should probably let users know that starting next week, thanks to @Phoenix303, they should get faster translation suggestions and that they should report any weirdness.

Arrbee removed Nikerabbit as the assignee of this task. · Jul 21 2015, 9:27 PM
Arrbee removed a project: LE-Sprint-88.

I do not experience a notable speedup with TM suggestions.

At least when translating the weekly tech newsletter in MW, invariant or nearly invariant strings longer than, say, 5 characters are never found in TM. Maybe this is another issue that has to be investigated separately.

Nemo_bis added a subscriber: dcausse.

I think one of the problems with this function is that it uses very slow Elasticsearch features:

  • fuzzy like this: deprecated and due to be removed in Elasticsearch 2.0 because of performance issues
  • function score on all docs: the Levenshtein distance is computed for every doc returned by the fuzzy like this query. We could optimize this part by running the function score inside the rescore phase, which would let us compute the distance on a limited number of docs thanks to the rescore window.

As for fuzzy like this, I would suggest investigating another approach based on character n-grams.
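
To make the two points above concrete, here is a small in-memory sketch of the same idea in plain Python, not an Elasticsearch query. A cheap character-trigram overlap picks candidates, and the expensive Levenshtein distance is computed only for the top `window` of them, which is the role the rescore window would play. All names, the Jaccard overlap measure, and the 0.75 threshold are assumptions for illustration only.

```python
def char_ngrams(text, n=3):
    """Set of character n-grams of `text` (the whole text if shorter than n)."""
    text = text.lower()
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}


def levenshtein(a, b):
    """Classic dynamic-programming edit distance, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def suggest(query, memory, window=5, threshold=0.75):
    """Return (score, string) pairs for close matches, best first.

    `window` and `threshold` are hypothetical tuning knobs.
    """
    q = char_ngrams(query)

    def overlap(s):
        # Cheap first pass: Jaccard similarity of trigram sets.
        g = char_ngrams(s)
        return len(q & g) / len(q | g)

    # Only the best `window` candidates reach the expensive step.
    candidates = sorted(memory, key=overlap, reverse=True)[:window]
    results = []
    for s in candidates:
        longest = max(len(query), len(s)) or 1
        score = 1 - levenshtein(query, s) / longest
        if score >= threshold:
            results.append((score, s))
    return sorted(results, reverse=True)


if __name__ == "__main__":
    tm = ["Recent changes", "Recent change", "Delete this page", "Random article"]
    print(suggest("Recent changes", tm, window=2))
```

The cost of the edit-distance step is then bounded by `window` rather than by the number of strings the first pass matches, at the price of missing good matches that the cheap pass ranks poorly.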

Elitre added a subscriber: Elitre. · Oct 26 2015, 12:20 PM
Elitre added a comment. · Edited · Dec 21 2015, 1:29 PM

I'm having problems with the December issue of the VE newsletter. I was able to get the translation memory just once, but then fixing something on the source code and then going back to translate isn't bringing back the TM for me :/ Anything I could do?

On what kind of translation units were you looking for suggestions? I only see very long units which change every time, or very short recurring items (mostly the headers), for which TM currently works for me.

Even the headers were failing for me.

Translations:VisualEditor/Newsletter/2016/February/38/it AFAICT is the same as in the previous newsletter, and it isn't suggested by the system ATM.

Yes, it's identical: https://meta.wikimedia.org/?diff=15373013&oldid=15166609
The unit is rather long and the "Loading..." for translation memory suggestions systematically times out after 10 seconds for me.

Amire80 lowered the priority of this task from High to Normal. · Mar 30 2016, 9:09 AM
Amire80 moved this task from Backlog to Translate on the Language-Engineering April-June 2016 board.

For clarification, the Normal priority is relative to other tasks in Language-Engineering April-June 2016 and does not mean this task has suddenly become unimportant.

Restricted Application added a project: Discovery-Search. · Jun 13 2016, 4:25 PM
Deskana added a subscriber: Deskana.

This doesn't contain actionable work for Discovery-Search so I have removed that tag; let us know if you need input on this task, and we will happily provide it.

Johan added a subscriber: Johan. · Feb 17 2017, 8:32 PM

I think this is a bigger issue for e.g. Tech News than one would first assume.

We have a couple of items that are specifically designed to be exactly the same every week, to make it easier for the translators. They are more complicated the first time you translate them, because you might need to figure out how best to represent dates in your language, but they will save time and effort in the long run. Or so goes the theory. But if the translation memory doesn't suggest them, then you have complicated items where you either have to go back to another issue and copy and paste, remember what to do with the code, or be familiar enough with what it's doing that you realize you can simply remove it and replace it with dates in normal text.

Examples:

<translate><!--T:20-->
You can join the next meeting with the VisualEditor team. During the meeting, you can tell developers which bugs you think are the most important. The meeting will be on [<tvar|time>http://www.timeanddate.com/worldclock/fixedtime.html?hour=20&min=00&sec=0&day=14&month=02&year=2017</> {{#time:<tvar|defaultformat>j xg</>|<tvar|date4>2017-02-14</>|<tvar|format_language_code>{{CURRENTCONTENTLANGUAGE}}</>}} at 20:00 (UTC)]. See [[<tvar|link>mw:VisualEditor/Weekly triage meetings</>|how to join]].</translate>
<translate><!--T:41-->
The [[<tvar|version>mw:MediaWiki 1.29/wmf.12</>|new version]] of MediaWiki will be on test wikis and MediaWiki.org from {{#time:<tvar|defaultformat>j xg</>|<tvar|date1>2017-02-14</>|<tvar|format_language_code>{{CURRENTCONTENTLANGUAGE}}</>}}. It will be on non-Wikipedia wikis and some Wikipedias from {{#time:<tvar|defaultformat>j xg</>|<tvar|date2>2017-02-15</>|<tvar|format_language_code>{{CURRENTCONTENTLANGUAGE}}</>}}. It will be on all wikis from {{#time:<tvar|defaultformat>j xg</>|<tvar|date3>2017-02-16</>|<tvar|format_language_code>{{CURRENTCONTENTLANGUAGE}}</>}} ([[<tvar|calendar>mw:MediaWiki 1.29/Roadmap</>|calendar]]).</translate>
Stryn added a comment. · Feb 17 2017, 9:09 PM

I translate the tech news almost every week, and it's quite frustrating when you can't get the correct suggestions from the translation memory. Instead you have to go to the previous tech news and copy and paste the correct content from there. In my opinion this should get higher priority, as it affects many users globally and makes translating time-consuming.

There are a bunch of reports again that TTMServer doesn't work. From the API response for a frequently appearing long paragraph I can see it is spending a lot of time in TTMServer (21.37 seconds) without returning any results. Assuming nothing changed in the Elasticsearch cluster, it looks like we have crossed some kind of threshold.

Johan added a comment. · Jul 3 2018, 12:49 AM

@Nikerabbit What are the long-term implications of this, if we have indeed crossed said threshold? What can we expect?

Johan added a comment. · Jul 3 2018, 12:49 AM

(If possible to say, I mean. I do understand that "some kind of threshold" isn't very exact.)

So, the only recent change in Translate is c55a8ee. I don't see how it could cause this, but maybe @dcausse would know, or would know whether there have been any changes in the Elasticsearch cluster that could cause TTMServer to work poorly.

If this is not caused by any external changes, it means that the algorithm has stopped working (what I called the threshold) for some reason such as:

  1. The amount of data has increased to the extent that the search is now taking too long and timing out.
  2. The amount of data has increased to the extent that the algorithm, which only loads a subset of it, incorrectly guesses based on the first subset to not load a second, larger subset.
  3. The amount of data has increased to the extent that the algorithm, although it loads more data (but still only a subset), fails to find matches because it loads a poorly selected and/or too small subset of the data.

I won't be able to debug this extensively until August. Once we understand the issue, we can attempt some small tweaks, if possible. If larger changes are required, I expect that I or the Language team could start those in FYQ2 at the earliest. Maybe earlier if we get help and/or TTMServer stops working completely and this task is re-prioritized.

EBjune moved this task from needs triage to Up Next on the Discovery-Search board. · Jul 5 2018, 5:18 PM
EBernhardson added a comment. · Edited · Jul 5 2018, 10:30 PM

I pulled latency numbers from the apiaction logs. Overall it doesn't look like performance has changed in any noticeable way in the last 90 days on the wmf prod cluster. The y axis here is milliseconds for all lines except n_requests, where it is an absolute count per day:

Generated using the following HQL. This is the time for all translationaids requests, but at a glance it seems the time is spent almost entirely in ttmserver.

SELECT date(concat_ws('-', YEAR, MONTH, DAY)) AS date,
       count(1) AS n_requests,
       percentile_approx(timespentbackend, array(0.5, 0.75, 0.95, 0.99, 0.999)) AS percentiles
FROM wmf_raw.apiaction
WHERE YEAR = '2018'
  AND params['action'] = 'translationaids'
  AND (params['prop'] IS NULL
       OR params['prop'] LIKE '%ttmserver%')
GROUP BY YEAR,
         MONTH,
         DAY

I'm noting for documentation purposes that multiple people are again complaining about translation memory not working when translating tech news.