Apparently it'll need rewriting :(
|search/extra : master||Port FuzzyLikeThis to the extra plugin|
- Mentioned In
T133124: Devise a plan on how to upgrade to Elasticsearch 2.3 without turning user-facing search features off
T132076: TTMServer should support multi-dc configuration
- Mentioned Here
- T48484: Improve translation memory for larger chunks of text
T101236: TTMServer performance and coverage issues
It's a bit frustrating that we keep running into issues with the Translate extension whenever we want to upgrade Elasticsearch. Is there something we can do to resolve these issues once and for all?
I see two main options:
- Stop using a product which changes APIs in incompatible ways too often.
- Dedicate some more resources so that TTMServer is not maintained mostly by one person (me) with the help of volunteers.
I am also a bit frustrated because this is the fourth big rewrite for TTMServer I will be doing to meet the demands of Wikimedia production infrastructure.
Also, the fact that I am not an ElasticSearch expert might make the TTMServer code less future-proof. I will need help with this current rewrite for sure, both because of the expertise needed and because time is limited, and because I hope we can make the next rewrite so good that it will work for a long time.
I don't remember any major issues in previous updates. There was one case where the Elastica library changed APIs in a very bad, backwards-incompatible manner, but I don't remember it causing significant problems. Could you elaborate on how TTMServer is making your work harder? That might help me advocate for a higher priority for TTMServer work.
I think we have two options here:
- Rewrite the feature and try to address performance issues
- Forward port fuzzy like this to the extra plugin
I have some ideas for option 1, but I don't know the data inside TTM very well, and it'll require some testing to figure out whether it works as expected.
Option 2 is probably the easiest to do in time, but it will add a dependency to TTM, as it'll require the extra plugin to run.
Another problem with this upgrade is that it won't be possible to use TTM on elastic1.x. @Nikerabbit you told me once about translatewiki.net which is hosted on a separate elastic cluster. Is it OK to also migrate this elastic cluster to es2.x and add the extra plugin if needed?
I would be happy if we could do (1), as it needs to be done anyway sometime soon. But I also understand that we might want to opt for option (2), or only do (1) insofar as it intersects with the rewrite work (e.g. sentence segmentation could also help, but does not necessarily intersect with rewriting the querying system), so as not to block the upgrade with too costly a rewrite.
Updating the translatewiki.net elastic cluster (actually it's just one instance, hardly a cluster) won't be a problem. The only extensions using it are Translate and CirrusSearch.
I am currently not aware of any third parties using TTMServer with ElasticSearch, though there might be some. For them it is probably best to stay with an older version of Translate until they can upgrade.
We had an IRC discussion with @dcausse yesterday. Here is my attempt at a summary.
dcausse will attempt to port deleteByQuery and fuzzy matching as a plugin for the new ElasticSearch. This plugin is to be installed on production and translatewiki.net, and minimal changes are expected to the current Translate code. This will unblock the ElasticSearch upgrade from the Translate side.
dcausse will check whether we can try filtering by length in translation memory queries as a possible quick win in performance. This will likely require a schema (mapping) change and reindexing of the data. Reindexing causes some downtime for translation search and translation memory, but I think it is okay as a planned activity.
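To illustrate why filtering by length can prune candidates safely, here is a minimal Python sketch. It is not the Translate or Elasticsearch code. It assumes similarity is defined as 1 - edit_distance / max(len(a), len(b)), and the function names and the 0.75 threshold are hypothetical. Since the edit distance is never smaller than the difference in length, any candidate outside the computed bounds cannot reach the threshold and can be skipped before the expensive comparison.

```python
import math


def length_bounds(query_length, min_similarity):
    """Return (lower, upper) candidate lengths that can still reach min_similarity.

    Assumes similarity = 1 - edit_distance / max(len(a), len(b)).
    The edit distance is at least the absolute length difference, so:
      shorter candidates need len >= query_length * min_similarity
      longer candidates need  len <= query_length / min_similarity
    """
    if not 0 < min_similarity <= 1:
        raise ValueError("min_similarity must be in (0, 1]")
    lower = math.ceil(query_length * min_similarity)
    upper = math.floor(query_length / min_similarity)
    return lower, upper


def passes_length_filter(query, candidate, min_similarity=0.75):
    """Cheap pre-filter: True if the candidate's length alone does not rule it out."""
    lower, upper = length_bounds(len(query), min_similarity)
    return lower <= len(candidate) <= upper
```

For example, with a 20-character query and a threshold of 0.75, `length_bounds(20, 0.75)` gives `(15, 26)`. In an index this would correspond to a range filter on a stored length field, which is why a mapping change and reindexing would be needed.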
We discussed different ways to address the performance issues documented in T101236: TTMServer performance and coverage issues.
- Sentence level segmentation to shorten the units.
- This is language-specific and non-trivial. Discarded for now.
- More aggressive pruning
- The length based filtering above is an example of this.
- Altering the schema so that translations of the same text are stored in one document
- This allows filtering out documents which do not even have a translation in the target language in the translation memory query. There can be hundreds of fields per document, but dcausse thinks this is not an issue. Updates will be a bit heavier and more involved, but I think this is not an issue.
- Using a different query strategy for segments of different length, e.g. a few words, a sentence or a paragraph
- This allows us to do things like edit distance over words for paragraph sized segments and something else for shorter segments.
So we are doing a bit of (2) now, and would like to try (3) and perhaps also (4) later, but it is not clear when exactly.
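As a rough illustration of idea (3), the Python sketch below groups translations of the same source text into one document and then keeps only the documents that have a translation in the target language. The row layout, field names and function names are hypothetical and do not reflect the actual TTMServer schema. It only shows why the single-document layout makes that filter cheap.

```python
from collections import defaultdict


def group_by_source(rows):
    """Build one document per source text: {source_text: {language: translation}}.

    rows is an iterable of (source_text, language, translation) tuples
    (a hypothetical layout, not the real TTMServer schema).
    """
    docs = defaultdict(dict)
    for source_text, language, translation in rows:
        docs[source_text][language] = translation
    return dict(docs)


def candidates_for(docs, target_language):
    """Keep only sources that already have a translation in target_language."""
    return {
        source: translations[target_language]
        for source, translations in docs.items()
        if target_language in translations
    }


rows = [
    ("Save page", "fi", "Tallenna sivu"),
    ("Save page", "de", "Seite speichern"),
    ("Show preview", "de", "Vorschau zeigen"),
]
docs = group_by_source(rows)
finnish = candidates_for(docs, "fi")  # only "Save page" has a Finnish translation
```

With this layout, a source text without any translation in the target language is dropped before any fuzzy matching is attempted.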
It was initially, since Elasticsearch 2.x does not include fuzzy like this, which is the source of the performance problems. Because we took a different approach to fixing the problem (backporting (forward-porting?) fuzzy like this into Elasticsearch 2.x), that's basically no longer a blocker, although it does still need to be resolved.
@Nikerabbit yes. I thought there was some other ticket for the performance issues but didn't find anything, so this ticket will do. I've taken off the blocker since this no longer blocks the upgrade.
We should certainly still find a way to remove the usage of fuzzy like this, though. It will become more and more trouble to maintain as we continue upgrading Elasticsearch versions (the next Elasticsearch major version upgrade will happen in the Jan-Mar timeframe).