It's a bit frustrating that we keep running into issues with the Translate extension whenever we want to upgrade Elasticsearch. Is there something we can do to resolve these issues once and for all?

In T124423#2229347, @Deskana wrote:

It's a bit frustrating that we keep running into issues with the Translate extension whenever we want to upgrade Elasticsearch. Is there something we can do to resolve these issues once and for all?

I see two main options:

Stop using a product which changes APIs in incompatible ways too often.
Dedicate some more resources so that TTMServer is not maintained mostly by one person (me) with the help of volunteers.

I am also a bit frustrated because this is the fourth big rewrite for TTMServer I will be doing to meet the demands of Wikimedia production infrastructure.

Also, the fact that I am not an ElasticSearch expert might make the TTMServer code less future proof. I will need help with this current rewrite for sure, both because of the expertise required and because time is limited, and because I hope we can make the next rewrite so good that it will work for a long time.

I don't remember any major issues in previous updates. There was one case where the Elastica library changed APIs in a very bad and backwards-incompatible manner, but I don't remember it causing significant problems. Could you elaborate on how TTMServer is making your work harder? That might help me to advocate for a higher priority for TTMServer work.

Nikerabbit removed a parent task: T122698: Upgrade ElasticSearch to version >=2.2. Apr 23 2016, 9:59 AM

Nikerabbit added a parent task: T133120: EPIC: Upgrade Wikimedia search cluster to use Elasticsearch 2.3.

Putting this in "Up Next" for the Search Team as it blocks our upgrade. @dcausse indicated that he would work on this.

Actually, we can just put this in the sprint's backlog.

• Deskana edited projects, added Discovery-Search (Current work); removed Discovery-Search. May 10 2016, 10:09 PM

• Deskana set Security to None.

Arrbee added a project: Language-Engineering April-June 2016. May 11 2016, 7:26 AM

@Deskana Do you have more info on the exact timing of this work, and on how much time I should plan to spend collaborating on it? The next Language team sprint starts next Wednesday.

@Nikerabbit ideally we'd like to deploy elastic 2.x at the end of May.

I think we have two options here:

1. Rewrite the feature and try to address performance issues
2. Forward port fuzzy like this to the extra plugin

I have some ideas for 1, but I don't know the data inside TTM very well, and it will require some testing to figure out whether they work as expected.
Option 2 is maybe the easiest to do in time, but it will add a dependency to TTM, as it will require the extra plugin to run.

Another problem with this upgrade is that it won't be possible to use TTM on elastic1.x. @Nikerabbit you told me once about translatewiki.net which is hosted on a separate elastic cluster. Is it OK to also migrate this elastic cluster to es2.x and add the extra plugin if needed?

Glaisher subscribed. May 11 2016, 9:54 AM

I would be happy if we could do (1), as it needs to be done anyway sometime soon. But I also understand that we might want to opt for option (2), or only do (1) insofar as it intersects with rewrite work (i.e. sentence segmentation could also help, but would not necessarily intersect with rewriting the querying system), so as not to block the upgrade with too costly a rewrite.

Updating the translatewiki.net elastic cluster (actually it's just one instance, hardly a cluster) won't be a problem. The only extensions using it are Translate and CirrusSearch.

I am currently not aware of any third parties using TTMServer with ElasticSearch, though there might be some. For them it is probably best to stay with an older version of Translate until they can upgrade.

dcausse claimed this task. May 12 2016, 8:19 AM

dcausse moved this task from Incoming to not in use - please delete on the Discovery-Search (Current work) board.

We had an IRC discussion with @dcausse yesterday. Here is my attempt at a summary.

Short term

dcausse will attempt to port deleteByQuery and fuzzy matching as a plugin for the new ElasticSearch. This plugin is to be installed on production and translatewiki.net, and only minimal changes to the current Translate code are expected. This will unblock the ElasticSearch upgrade from the Translate side.

dcausse will check whether we can try filtering by length in translation memory queries as a possible quick performance win. This will likely require a schema (mapping) change and reindexing of the data. Reindexing causes some downtime for translation search and translation memory, but I think that is okay as a planned activity.
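A rough sketch of why length filtering can prune candidates safely. It assumes similarity is defined as 1 - (edit distance / length of the longer string); that definition is an assumption for illustration and not necessarily what TTMServer computes. Under it, the edit distance is at least the difference in lengths, so a candidate can only reach similarity t if its length lies between t times the query length and the query length divided by t. The field name `length` and both function names are hypothetical.

```python
import math
from typing import Dict, Tuple


def length_bounds(query_length: int, min_similarity: float) -> Tuple[int, int]:
    """Return the (lower, upper) candidate lengths that can still reach
    min_similarity, assuming similarity = 1 - distance / max(len_a, len_b)."""
    if not 0 < min_similarity <= 1:
        raise ValueError("min_similarity must be in (0, 1]")
    lower = math.ceil(query_length * min_similarity)
    upper = math.floor(query_length / min_similarity)
    return lower, upper


def length_filter(query_length: int, min_similarity: float) -> Dict:
    """Build an Elasticsearch range clause on a hypothetical integer field
    named "length" that would have to be added to the mapping."""
    lower, upper = length_bounds(query_length, min_similarity)
    return {"range": {"length": {"gte": lower, "lte": upper}}}
```

With a 100-character query and a 0.75 threshold, only candidates between 75 and 133 characters would need to be scored; everything else could be dropped before the expensive fuzzy matching.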

Other plans

We discussed different ways to improve the performance issues documented in T101236: TTMServer performance and coverage issues

1. Sentence level segmentation to shorten the units.
- This is language-specific and non-trivial. Discarded for now.
2. More aggressive pruning.
- The length-based filtering above is an example of this.
3. Altering the schema so that translations of the same text are stored in one document.
- This allows filtering out documents which do not even have a translation in the target language in the translation memory query. There can be hundreds of fields per document, but dcausse thinks this is not an issue. Updates will be a bit heavier and more involved, but I think this is not an issue.
4. Using a different query strategy for segments of different length, i.e. some words, a sentence and a paragraph.
- This allows us to do things like edit distance over words for paragraph-sized segments and something else for shorter segments.
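To illustrate the last point, here is a minimal Python sketch of choosing the unit of comparison by segment length: characters for short segments, whole words for long ones. The function names and the `word_threshold` cut-off are made up for illustration; this is not how TTMServer or Elasticsearch scores matches.

```python
from typing import Sequence


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance over any two sequences (characters or words)."""
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(source: str, candidate: str, word_threshold: int = 20) -> float:
    """Compare word by word when either segment has at least word_threshold
    words (a hypothetical cut-off), otherwise character by character."""
    source_words, candidate_words = source.split(), candidate.split()
    if max(len(source_words), len(candidate_words)) >= word_threshold:
        a, b = source_words, candidate_words
    else:
        a, b = source, candidate
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1 - edit_distance(a, b) / longest
```

Comparing words keeps the cost of the dynamic programming table proportional to the number of words rather than the number of characters, which matters most for paragraph-sized segments.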

So we are doing a bit of (2) now, and would like to try (3) and perhaps also (4) later, but it is not clear when exactly.
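As a rough illustration of the one-document-per-source-text schema, the sketch below builds such a document and a query that skips documents without a translation in the target language. All field names (`language`, `content`, `translations`) and function names are hypothetical, and the query is only an assumption about how this could look with a `bool` query and an `exists` clause; it has not been tested against TTMServer data.

```python
from typing import Dict


def build_document(source_language: str, source_text: str,
                   translations: Dict[str, str]) -> Dict:
    """One document per source text, with one sub-field per target language."""
    return {
        "language": source_language,
        "content": source_text,
        "translations": dict(translations),
    }


def build_query(text: str, target_language: str) -> Dict:
    """Match on the source text, but only among documents that have a
    translation stored for target_language."""
    return {
        "query": {
            "bool": {
                "must": {"match": {"content": text}},
                "filter": {"exists": {"field": "translations." + target_language}},
            }
        }
    }
```

For example, `build_query("Save page", "fi")` would only consider documents that carry a `translations.fi` field, so source texts never translated into Finnish are excluded before any scoring happens.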

Change 288378 had a related patch set uploaded (by DCausse):
Port FuzzyLikeThis to the extra plugin

https://gerrit.wikimedia.org/r/288378

gerritbot added a project: Patch-For-Review. May 12 2016, 12:51 PM

dcausse moved this task from not in use - please delete to Needs review on the Discovery-Search (Current work) board. May 12 2016, 12:55 PM

Change 288378 merged by jenkins-bot:
Port FuzzyLikeThis to the extra plugin

https://gerrit.wikimedia.org/r/288378

Anomie mentioned this in T133124: Devise a plan on how to upgrade to Elasticsearch 2.3 without turning user-facing search features off. May 17 2016, 2:18 PM

dcausse moved this task from Needs review to Needs Reporting on the Discovery-Search (Current work) board. May 25 2016, 7:35 AM

Arrbee moved this task from Backlog to Translate on the Language-Engineering April-June 2016 board. May 25 2016, 8:06 AM

Arrbee added a project: Language-Q4-2016-Sprint 3.

In what sense is this blocked by T124423?

It was initially, since elasticsearch 2.x does not include fuzzy like this, which is the source of the performance problems. Because we took a different approach to fixing the problem (backporting (forwardporting?) fuzzy like this into elasticsearch 2.x), that's basically no longer a blocker, although it does still need to be resolved.

EBernhardson removed a subtask: T101236: TTMServer performance and coverage issues. May 25 2016, 5:32 PM

Arrbee edited projects, added Language-Q4-2016-Sprint 4; removed Language-Q4-2016-Sprint 3. May 30 2016, 7:15 AM

Nikerabbit moved this task from Backlog to Done on the Language-Q4-2016-Sprint 4 board. Jun 13 2016, 6:22 AM

@EBernhardson So if I understood correctly, you want to keep this task open to replace a temporary solution with a permanent solution?

@Nikerabbit yes. I thought there was some other ticket for the performance issues but didn't find anything, so this ticket will do. I've taken off the blocker since this no longer blocks the upgrade.

We should certainly still find a way to remove the usage of fuzzy like this, though. It will become more and more trouble to maintain as we continue upgrading elasticsearch versions. (The next elasticsearch major version upgrade will happen in the Jan-Mar timeframe.)

T101236: TTMServer performance and coverage issues or even T48484: Improve translation memory for larger chunks of text perhaps?

Ahh, yes, probably T101236. Let's close this and use that one moving forward then.

Arrbee mentioned this in Language-Team. Sep 26 2016, 9:30 AM

Rewrite Fuzzy Like query for Translate to use with ES > 2
Closed, Resolved
Public
