Rewrite Fuzzy Like query for Translate to use with ES > 2
Closed, ResolvedPublic


Apparently it'll need rewriting :(

Event Timeline

Reedy raised the priority of this task from to Needs Triage.
Reedy updated the task description. (Show Details)
Reedy added subscribers: dcausse, Florian, Reedy, StudiesWorld.

It's a bit frustrating that we keep running into issues with the Translate extension whenever we want to upgrade Elasticsearch. Is there something we can do to resolve these issues once and for all?

I see two main options:

  1. Stop using a product which changes APIs in incompatible ways too often.
  2. Dedicate some more resources so that TTMServer is not maintained mostly by one person (me) with the help of volunteers.

I am also a bit frustrated because this is the fourth big rewrite for TTMServer I will be doing to meet the demands of Wikimedia production infrastructure.

Also, the fact that I am not an ElasticSearch expert might make the TTMServer code less future-proof. I will need help with this current rewrite for sure, both because of the expertise required and because time is limited, and because I hope we can make the next rewrite so good that it will work for a long time.

I don't remember any major issues in previous updates. There was one case where the Elastica library changed its APIs in a very bad and backwards-incompatible manner, but I don't remember it causing significant problems. Could you elaborate on how TTMServer is making your work harder? That might help me advocate for a higher priority for TTMServer work.

Putting this in "Up Next" for the Search Team as it blocks our upgrade. @dcausse indicated that he would work on this.

Actually, we can just put this in the sprint's backlog.

@Deskana I was wondering whether you have more info on the exact timing of this work and how much time I should plan to spend collaborating on it? The next Language team sprint starts next Wednesday.

@Nikerabbit ideally we'd like to deploy elastic 2.x at the end of May.

I think we have two options here:

  1. Rewrite the feature and try to address performance issues
  2. Forward port fuzzy like to the extra plugin

I have some ideas for 1, but I don't know the data inside TTM very well and it'll require some testing to figure out if it works as expected.
Option 2 is maybe the easiest to do in time, but it will add a dependency on top of TTM, as it'll require the extra plugin to run.

Another problem with this upgrade is that it won't be possible to use TTM on elastic1.x. @Nikerabbit you told me once about which is hosted on a separate elastic cluster. Is it OK to also migrate this elastic cluster to es2.x and add the extra plugin if needed?

I would be happy if we could do (1), as it needs to be done anyway sometime soon. But I also understand that we might want to opt for option (2), or only do (1) insofar as it intersects with the rewrite work (e.g. sentence segmentation could also help, but does not necessarily intersect with rewriting the querying system), so as not to block the upgrade with too costly a rewrite.

Updating the elastic cluster (actually it's just one instance, hardly a cluster) won't be a problem. The only extensions using it are Translate and CirrusSearch.

I am currently not aware of any third parties using TTMServer with ElasticSearch, though there might be some. For them it is probably the best to stay with older version of Translate until they can upgrade.

We had an IRC discussion with @dcausse yesterday. Here is my attempt at a summary.

Short term

dcausse will attempt to port deleteByQuery and fuzzy matching as a plugin for the new ElasticSearch. This plugin is to be installed on production, and minimal changes are expected to the current Translate code. This will unblock the ElasticSearch upgrade from the Translate side.

dcausse will check if we can try filtering by length in translation memory queries as a possible quick performance win. This will likely require a schema (mapping) change and reindexing of the data. Reindexing causes some downtime for translation search and translation memory, but I think it is okay as a planned activity.
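To illustrate the length filtering idea, here is a minimal sketch. It assumes similarity is scored as 1 - edit_distance / max(length_a, length_b), and that each document stores its text length in a field; the field name `content_length` and both function names are hypothetical and not taken from the current Translate code. Since the edit distance is never smaller than the difference in length, candidates outside the computed bounds cannot reach the threshold and can be pruned before any fuzzy matching.

```python
def length_bounds(source_length, min_similarity_percent):
    """Return (lower, upper) candidate lengths that can still reach the threshold.

    Assumes similarity = 1 - edit_distance / max(len_a, len_b). The edit
    distance is at least the length difference, so candidates outside these
    bounds cannot score high enough.
    """
    if not 0 < min_similarity_percent <= 100:
        raise ValueError("min_similarity_percent must be in (0, 100]")
    # Integer arithmetic avoids floating point rounding at the boundaries.
    lower = -(-source_length * min_similarity_percent // 100)  # ceiling division
    upper = source_length * 100 // min_similarity_percent      # floor division
    return lower, upper


def length_filter(source_length, min_similarity_percent, field="content_length"):
    """Build a range filter clause; `field` is a hypothetical mapping field."""
    lower, upper = length_bounds(source_length, min_similarity_percent)
    return {"range": {field: {"gte": lower, "lte": upper}}}
```

The returned dictionary is meant to be placed in the filter part of the translation memory query, so that it only narrows the candidate set and does not affect scoring.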

Other plans

We discussed different ways to improve the performance issues documented in T101236: TTMServer performance and coverage issues

  1. Sentence level segmentation to shorten the units.
    • This is language-specific and non-trivial. Discarded for now.
  2. More aggressive pruning
    • The length based filtering above is an example of this.
  3. Altering the schema so that translations of same text are stored in one document
    • This allows filtering out documents which do not even have a translation in the target language in the translation memory query. There can be hundreds of fields per document, but dcausse thinks this is not an issue. Updates will be a bit heavier and more involved, but I think this is not an issue.
  4. Using a different query strategy for segments of different lengths, e.g. a few words, a sentence, or a paragraph
    • This allows us to do things like edit distance over words for paragraph sized segments and something else for shorter segments.
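
As a rough sketch of what (3) could look like: one document per source text, with one field per target language. The field names (`content`, `language`, the `translation_` prefix) and the helper names are hypothetical and only for illustration; the actual mapping was not decided in this discussion.

```python
def build_document(source_text, source_language, translations):
    """Combine a source text and all of its translations into one document.

    `translations` maps a language code to the translated text. All field
    names used here are assumptions, not the existing TTMServer schema.
    """
    doc = {"content": source_text, "language": source_language}
    for language, text in translations.items():
        doc["translation_" + language] = text
    return doc


def has_translation_filter(target_language):
    """Filter clause matching only documents translated into target_language."""
    return {"exists": {"field": "translation_" + target_language}}
```

With this layout the translation memory query can skip documents that have nothing to offer for the requested language, at the cost of having to update the combined document whenever any single translation changes.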

So we are doing a bit of (2) now, and would like to try (3) and perhaps also (4) later, but it is not clear when exactly.
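
For (4), "edit distance over words" could be something like the following sketch: a standard Levenshtein distance computed over whitespace-separated tokens instead of characters. The function name and the whitespace tokenization are assumptions for illustration; real segmentation would need to be language-aware.

```python
def word_edit_distance(a, b):
    """Levenshtein distance between two texts, counted in whole words.

    Tokenization by whitespace is a simplifying assumption.
    """
    words_a, words_b = a.split(), b.split()
    # previous[j] holds the distance between the words of `a` seen so far
    # and the first j words of `b`.
    previous = list(range(len(words_b) + 1))
    for i, word_a in enumerate(words_a, start=1):
        current = [i]
        for j, word_b in enumerate(words_b, start=1):
            cost = 0 if word_a == word_b else 1
            current.append(min(
                previous[j] + 1,         # delete word_a
                current[j - 1] + 1,      # insert word_b
                previous[j - 1] + cost,  # keep or substitute
            ))
        previous = current
    return previous[-1]
```

For paragraph sized segments this compares far fewer units than a character-level distance, which is where the cost savings would come from.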

Change 288378 had a related patch set uploaded (by DCausse):
Port FuzzyLikeThis to the extra plugin

Change 288378 merged by jenkins-bot:
Port FuzzyLikeThis to the extra plugin

It was initially, since elasticsearch 2.x does not include fuzzy like this, which is the source of the performance problems. Because we took a different approach to fixing the problem (backporting, or rather forward-porting, fuzzy like this into elasticsearch 2.x), that's basically no longer a blocker, although it does still need to be resolved.

@EBernhardson So if I understood correctly, you want to keep this task open to replace a temporary solution with a permanent solution?

@Nikerabbit yes. I thought there was some other ticket for the performance issues but didn't find anything, so this ticket will do. I've taken off the blocker since this no longer blocks the upgrade.

We should certainly still find a way to remove the usage of fuzzy like this though; it will become more and more trouble to maintain as we continue upgrading elasticsearch versions. (The next elasticsearch major version upgrade will happen in the Jan-Mar timeframe.)

Ahh, yes, probably T101236. Let's close this and use that one moving forward then.