
TTMServer should support multi-dc configuration
Closed, Resolved (Public)

Description

Translate is not robust in the face of a datacenter switch. At the moment, all writes go only to eqiad, so the ttmserver index is missing from codfw.

Event Timeline

Nikerabbit triaged this task as Unbreak Now! priority.

I am marking this UBN for now, until Translate at least fails gracefully. I do not want to block the datacenter switch. Expect updates early next week. I will file blockers for individual action items and then adjust priorities accordingly.

One possible workaround would be to create an empty TTMServer index in codfw and then re-fill the existing index after the switchover, as there is no manual incremental update. I am, however, trying to make it so that an empty index is not necessary, but I am not sure if I can fix the re-fill in time.

Looking at https://gerrit.wikimedia.org/r/#/c/282153/1/wmf-config/CommonSettings.php, it seems that even if we switch mediawiki to codfw, the Translate extension should still talk to eqiad. Of course, it is impossible to actually test this before the switch, but the code looks good (code never lies, it is just misleading).
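
As a rough illustration of the behaviour described above (not the actual CommonSettings.php change, which is PHP and is not reproduced here), the idea is that the hosts used by Translate are taken from a fixed datacenter rather than from whichever datacenter currently serves MediaWiki. A minimal Python sketch, in which the host names, `TTM_PINNED_DC` and the `ttm_hosts` helper are all hypothetical:

```python
# Hypothetical host lists; the real values live in wmf-config and are not shown here.
SEARCH_HOSTS = {
    "eqiad": ["search.eqiad.example"],
    "codfw": ["search.codfw.example"],
}

# Assumption for this sketch: the TTMServer index only exists in eqiad,
# so Translate is pinned to that datacenter.
TTM_PINNED_DC = "eqiad"


def ttm_hosts(active_dc: str) -> list:
    """Return the hosts Translate should talk to.

    active_dc is validated but otherwise ignored on purpose: even when
    MediaWiki is served from codfw, TTMServer traffic keeps going to the
    pinned datacenter.
    """
    if active_dc not in SEARCH_HOSTS:
        raise ValueError(f"unknown datacenter: {active_dc}")
    return SEARCH_HOSTS[TTM_PINNED_DC]
```

With this kind of pinning, a switchover does not change where Translate reads and writes, but an outage of the pinned datacenter would still take the feature down.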

NOTE: Language team is in India for offsite during the switch

The migration seems to happen in the late afternoon IST, so in case of issues we might not be available to assist immediately.

Nikerabbit lowered the priority of this task from Unbreak Now! to High. Apr 15 2016, 7:36 AM

Okay, let's hope the config will work. I did not get the patch reviewed yet. It is in Gerrit in case it is needed; the alternative would be to create an empty index in codfw or to disable TTMServer entirely in the config.

I am not currently working on this. Comments on the blocking tasks would be useful.

We are currently serving traffic from codfw. I have not heard of any issue with Translate (not really sure what to check), so I assume the configuration that keeps it talking to eqiad is working just fine (we do see traffic in eqiad related to TTMServer).

There is still work to do to make Translate fully datacenter aware. At the moment, if eqiad is actually down, we lose Translate. This work should happen in the subtasks of this one. I'm closing this for now as the switch itself is going well. Feel free to re-open if you know otherwise.

It does work indeed.

I was planning to keep this open as a tracker. Feel free to remove some project tags you think are no longer relevant.

Translation memory and translation search are not essential functionality, so they can tolerate some downtime. Having said that, that is not a reason to skip the fixes necessary to support multi-dc. It is just a question of priority: for example, T124423: Rewrite Fuzzy Like query for Translate to use with ES > 2 might come first if the ES2 upgrade is in the plans, as that upgrade would break the whole feature.

Nikerabbit renamed this task from Make Translate extension ready for the switch to codfw to TTMServer should support multi-dc configuration. Apr 20 2016, 1:52 PM
Nikerabbit updated the task description.

I have no issue in keeping this open! And yes, I agree that the upgrade to ES > 2 is probably of higher priority. Let me know if I can help in any way (given my limited understanding of all that, probably not).

For now there is some TTMServer work planned this quarter, and given we have almost finished the other Translate work already, I think we can take some steps towards this or the upgrade. Any insight (now or later) into when the ES >2 upgrade might happen will help us plan ahead. However, the upgrade basically necessitates a rewrite of half of the TTMServer code, so I don't think we have enough time to finish that. I would like to combine that rewrite with improving the algorithm to fix the known shortcomings in performance (T101236). Code review and feedback on the rewrite strategy (when it is planned) will be crucial.

The upgrade to ES >2 should be done this quarter (it is part of our goals), but the exact strategy is still a bit unclear. In particular, the deployment path is complex: we need to find a way to write to both ES 1.7 and 2.x ... We should include you in the discussion, as you will be impacted.

@Gehel any updates on this? I guess it's going to impact our switchover this time as well?

This is not strictly a hard blocker for the Elasticsearch 5 upgrade, but @dcausse says it makes sense to try to take care of it at the same time. I've added it as a subtask for T154501: [Epic, Q3 Goal] Upgrade search systems to Elasticsearch 5 for now.

Change 335783 had a related patch set uploaded (by DCausse):
Add support for Multi-DC for TTMServices

https://gerrit.wikimedia.org/r/335783

Change 335824 had a related patch set uploaded (by DCausse):
Enable Translation memories multi-DC support

https://gerrit.wikimedia.org/r/335824

Code is merged and will roll out to the beta cluster soon. We don't have a multi-cluster setup available in the beta cluster, so it won't be directly testable beyond ensuring nothing broke. Once the train rolls forward with this code, we will need to build the codfw index and add it as a mirror (using dcausse's patch above, "Enable Translation memories multi-DC support").
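
To make the "add it as a mirror" step more concrete, here is a minimal in-memory Python sketch of the general mirroring idea. It is not the Translate implementation: the `InMemoryIndex` and `MirroredTTM` classes are hypothetical, and the choice to serve reads from the index of the local datacenter is an assumption made for the example.

```python
class InMemoryIndex:
    """Stand-in for one datacenter's TTMServer index (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self.docs = {}

    def update(self, key, text):
        self.docs[key] = text

    def search(self, needle):
        # Naive substring match; the real service uses Elasticsearch queries.
        return sorted(k for k, t in self.docs.items() if needle in t)


class MirroredTTM:
    """Send every write to the primary and all mirrors; read from one index."""

    def __init__(self, primary, mirrors=()):
        self.primary = primary
        self.mirrors = list(mirrors)

    def update(self, key, text):
        for index in [self.primary, *self.mirrors]:
            index.update(key, text)

    def search(self, needle, local_dc):
        # Assumption: prefer the index in the local datacenter,
        # falling back to the primary if there is none.
        for index in [self.primary, *self.mirrors]:
            if index.name == local_dc:
                return index.search(needle)
        return self.primary.search(needle)


eqiad = InMemoryIndex("eqiad")
codfw = InMemoryIndex("codfw")
ttm = MirroredTTM(primary=eqiad, mirrors=[codfw])
ttm.update("Main_page/fi", "Etusivu")
```

A mirror set up this way only receives writes made after it was added, which is why the codfw index has to be built from the existing data before it can be relied on.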

Change 335783 merged by jenkins-bot:
Add support for Multi-DC for TTMServices

https://gerrit.wikimedia.org/r/335783

> ensuring nothing broke

A simple test such as adding/changing some translation at https://meta.wikimedia.org/wiki/Special:Translate and seeing whether it can be found at https://meta.wikimedia.org/wiki/Special:SearchTranslations might be in order too.

Change 335824 merged by jenkins-bot:
[operations/mediawiki-config] Enable Translation memories multi-DC support

https://gerrit.wikimedia.org/r/335824

Mentioned in SAL (#wikimedia-operations) [2017-03-06T14:13:05Z] <addshore@tin> Synchronized wmf-config/CommonSettings.php: SWAT: [[gerrit:335824|Enable Translation memories multi-DC support]] T132076 1/2 (duration: 00m 50s)

Mentioned in SAL (#wikimedia-operations) [2017-03-06T14:14:59Z] <addshore@tin> Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:335824|Enable Translation memories multi-DC support]] T132076 2/2 (NOOP) (duration: 00m 42s)

The code is now live and activated in production. I'm running a refresh on codfw to catch up on updates since the last copy.