Translate is not robust in the face of a datacenter switch. At the moment, all writes go only to eqiad, so the ttmserver index is missing from codfw.
Status | Subtype | Assigned | Task
---|---|---|---
Resolved | | debt | T151324 [epic] System level upgrade for cirrus / elasticsearch
Resolved | | Deskana | T154501 [Epic, Q3 Goal] Upgrade search systems to Elasticsearch 5
Resolved | | Joe | T154658 Prepare and improve the datacenter switchover procedure
Resolved | | dcausse | T132076 TTMServer should support multi-dc configuration
Resolved | | Nikerabbit | T132254 Saving a translation fails if TTMServer index is missing
Resolved | | dcausse | T132315 Implement update freeze and/or delays for TTMServerMessageUpdateJob
Event Timeline
I am marking this UBN now until Translate at least fails gracefully. I do not want to block the datacenter switch. Expect updates early next week. I will file blockers for individual action items and then adjust priorities accordingly.
One possible workaround would be to create an empty TTMServer index in codfw and then re-fill the existing index after the switchover, as there is no manual incremental update. I am, however, trying to make it so that an empty index is not necessary, but I am not sure if I can fix the re-fill in time.
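As a rough illustration of what "failing gracefully" could mean here, the sketch below stores a translation even when the translation memory update fails because the index is missing. It is written in Python rather than Translate's PHP, and every name in it (`IndexMissingError`, `TranslationMemory`, `save_translation`) is hypothetical and not part of the Translate extension.

```python
import logging

logger = logging.getLogger("ttm-sketch")


class IndexMissingError(Exception):
    """Hypothetical error raised when the search index does not exist."""


class TranslationMemory:
    """Hypothetical stand-in for a TTMServer backend in one datacenter."""

    def __init__(self, index_exists=True):
        self.index_exists = index_exists
        self.entries = {}

    def update(self, key, text):
        if not self.index_exists:
            raise IndexMissingError("ttmserver index is missing")
        self.entries[key] = text


def save_translation(store, memory, key, text):
    """Save the translation first; treat the memory update as best effort.

    Returns True if the translation memory was updated, False otherwise.
    """
    store[key] = text  # the save itself must not depend on the index
    try:
        memory.update(key, text)
    except IndexMissingError:
        logger.warning("Skipping translation memory update for %s", key)
        return False
    return True
```

With this shape, a missing index in codfw would cost translation memory suggestions until a re-fill, but would not block translators from saving.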
Looking at https://gerrit.wikimedia.org/r/#/c/282153/1/wmf-config/CommonSettings.php, it seems that even if we switch MediaWiki to codfw, the Translate extension should still talk to eqiad. Of course, it is impossible to actually test this before the switch, but the code looks good (code never lies, it is just misleading).
The migration seems to happen late afternoon in IST, so in case of issues we might not be available to assist immediately.
Okay, let's hope the config will work. I did not get the patch reviewed yet. It is in Gerrit in case it is needed; the alternative would be to create an empty index in codfw or disable TTMServer entirely in the config.
We are currently serving traffic from codfw. I have not heard of any issue with Translate (not really sure what to check), so I assume the configuration that keeps it talking to eqiad is working just fine (we do see traffic in eqiad related to TTMServer).
There is still work to do to make Translate fully datacenter aware. At the moment, if eqiad is actually down, we lose Translate. This work should happen in the subtasks of this one. I'm closing this for now as the switch itself is going well. Feel free to re-open if you know otherwise.
It does work indeed.
I was planning to keep this open as a tracker. Feel free to remove some project tags you think are no longer relevant.
Translation memory and translation search are not essential functionality, so they can have some downtime. Having said that, that is not a reason not to implement the fixes necessary to support multi-dc. It is just a question of priority. For example, T124423: Rewrite Fuzzy Like query for Translate to use with ES > 2 might come first if the ES2 upgrade is in the plans, as that would break the whole feature.
I have no issue in keeping this open! And yes, I agree that the upgrade to ES > 2 is probably of higher priority. Let me know if I can help in any way (given my limited understanding of all that, probably not).
For now there is some TTMServer work planned this quarter, and given we have almost finished other Translate work already, I think we can take some steps towards this or the upgrade. Any insight (now or later) into when the ES >2 upgrade might happen will help us to plan ahead. However, the upgrade basically necessitates a rewrite of half of the TTMServer code, so I don't think we have enough time to finish that. I would like to combine that rewrite with improving the algorithm to fix the known shortcomings in performance (T101236). Code review and feedback on the rewrite strategy (when it is planned) will be crucial.
Upgrade to ES >2 should be done this quarter (part of our goals), but the exact strategy is still a bit unclear. In particular the deployment path is complex, we need to find a way to write to both ES 1.7 and 2.x ... We should include you in the discussion as you will be impacted.
@Gehel any updates on this? I guess it's going to impact our switchover this time as well?
This is not strictly a hard blocker for the Elasticsearch 5 upgrade, but @dcausse says it makes sense to try to take care of it at the same time. I've added it as a subtask for T154501: [Epic, Q3 Goal] Upgrade search systems to Elasticsearch 5 for now.
Change 335783 had a related patch set uploaded (by DCausse):
Add support for Multi-DC for TTMServices
Change 335824 had a related patch set uploaded (by DCausse):
Enable Translation memories multi-DC support
Code is merged and will roll out to the beta cluster soon. We don't have a multi-cluster setup available in the beta cluster, so it won't be directly testable beyond ensuring nothing broke. Once the train rolls forward with this code we will need to build the codfw index and add it as a mirror (using dcausse's patch above, "Enable Translation memories multi-DC support").
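The mirror idea can be sketched roughly as follows: every update goes to the primary index and to each mirror, and an unreachable mirror is recorded instead of failing the whole update. This is a Python illustration under assumed names (`MemoryService`, `update_all`), not the code from the patches above.

```python
class MemoryService:
    """Hypothetical translation memory backend for one datacenter."""

    def __init__(self, name, available=True):
        self.name = name
        self.available = available
        self.entries = {}

    def update(self, key, text):
        if not self.available:
            raise ConnectionError(f"{self.name} is unreachable")
        self.entries[key] = text


def update_all(primary, mirrors, key, text):
    """Write to the primary and every mirror; return the names that failed."""
    failed = []
    for service in [primary, *mirrors]:
        try:
            service.update(key, text)
        except ConnectionError:
            # Assumption: a real setup would retry or re-fill this index later.
            failed.append(service.name)
    return failed
```

For example, `update_all(MemoryService("eqiad"), [MemoryService("codfw")], key, text)` keeps both indexes in step, so either datacenter can serve suggestions after a switch.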
ensuring nothing broke
A simple test such as adding/changing some translation at https://meta.wikimedia.org/wiki/Special:Translate and seeing whether it can be found at https://meta.wikimedia.org/wiki/Special:SearchTranslations might be in order too.
Change 335824 merged by jenkins-bot:
[operations/mediawiki-config] Enable Translation memories multi-DC support
Mentioned in SAL (#wikimedia-operations) [2017-03-06T14:13:05Z] <addshore@tin> Synchronized wmf-config/CommonSettings.php: SWAT: [[gerrit:335824|Enable Translation memories multi-DC support]] T132076 1/2 (duration: 00m 50s)
Mentioned in SAL (#wikimedia-operations) [2017-03-06T14:14:59Z] <addshore@tin> Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:335824|Enable Translation memories multi-DC support]] T132076 2/2 (NOOP) (duration: 00m 42s)
The code is now live and activated in production. I'm running a refresh on codfw to catch up on updates since the last copy.