
Implement a stronger synchronization in RepoNG and Translate
Open, Medium, Public

Description

Initial requirement

This is a follow-up to T48833 where I implemented repository-state synchronization.

For fully automated exports, even stronger synchronization is required: we should make sure that source changes are processed in translatewiki before we use a particular revision.

See https://translatewiki.net/wiki/Repository_management#Repository_state_synchronization for a detailed description of the issues we have observed.

Current implementation plan

We're creating a group synchronization cache that tracks:

  • incoming changes to messages,
  • failed and passed message updates

Based on this we will display warnings to administrators, who can then manually fix failed message updates.

Eventually, based on tracking of failures and their resolution, we will stop imports and exports of messages from Translatewiki while failures remain unresolved.

Things to do

  • Track incoming changes to messages from various groups: addition, modification and deletion
  • Identify incoming message changes that failed to properly update on the wiki
  • Track failed/timed-out message updates and display them to the administrator
  • Allow administrators to mark failures as "fixed".
  • Run MessageIndexRebuild job once there are no MessageUpdate Jobs in the synchronization cache.
  • Warn translation administrators when they try to export translations while there are errors in groups.
  • Halt imports for message groups that have unresolved failures.
  • Do not allow administrators to process changes from Special:ManageMessageGroups in case of unresolved failures.
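
A minimal sketch of the tracking described in this list, written in Python for brevity rather than in the PHP of the Translate extension. The class and method names are hypothetical and are not the extension's GroupSynchronizationCache API; the sketch only shows which facts need to be recorded per group.

```python
from dataclasses import dataclass, field


@dataclass
class GroupSyncState:
    """Hypothetical per-group record of message updates (not the Translate API)."""
    pending: set = field(default_factory=set)  # updates queued but not finished
    failed: set = field(default_factory=set)   # updates that errored or timed out


class GroupSyncTracker:
    def __init__(self):
        self._groups = {}

    def add_incoming(self, group_id, message_keys):
        # Record additions, modifications and deletions that are about to be processed.
        state = self._groups.setdefault(group_id, GroupSyncState())
        state.pending.update(message_keys)

    def mark_done(self, group_id, key):
        self._groups[group_id].pending.discard(key)

    def mark_failed(self, group_id, key):
        state = self._groups[group_id]
        state.pending.discard(key)
        state.failed.add(key)

    def mark_resolved(self, group_id, key):
        # An administrator confirms the failure was fixed manually.
        self._groups[group_id].failed.discard(key)

    def is_in_sync(self, group_id):
        state = self._groups.get(group_id)
        return state is None or not (state.pending or state.failed)

    def may_import(self, group_id):
        # Imports are halted only while a group has unresolved failures.
        state = self._groups.get(group_id)
        return state is None or not state.failed
```

An export script could warn when `is_in_sync` is false, and an import script could skip a group when `may_import` is false.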

Other minor to do

  • Timeout should be based on the number of messages to be processed for a group
  • Update translatewiki configuration to pass the new flags to check group sync cache for import and exports
  • Logging when a group / message is marked as resolved by the administrator
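
One way the message-count-based timeout could look, as a sketch only: the function name and all three constants are assumptions chosen for illustration, not values from the Translate extension.

```python
def group_timeout_seconds(message_count, base=600, per_message=5, cap=3600):
    """Hypothetical timeout: a fixed base plus a per-message allowance, capped."""
    if message_count < 0:
        raise ValueError("message_count must be non-negative")
    return min(base + per_message * message_count, cap)
```

The cap keeps a very large group from being treated as "still processing" indefinitely.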

Pending decisions

  1. What should happen when a group export does not happen due to sync issues? Does the administrator have to retry after some time? They will have to ensure that they check the export logs.
  2. What should happen if a group import does not happen due to sync issues? These are run automatically. Should we increase the frequency at which the import runs?
  3. Exports and imports should not run simultaneously. This is outside the scope of the group synchronization cache, but it is something we may want in order to achieve "strong synchronization".
  4. Should exports be stopped if messages are waiting to be processed on Special:ManageMessageGroups? I would say that they should. We will have to check specifically for MediaWiki / non-MediaWiki exports.

Update log

  1. 20-01-2021: Changes for this patch caused a production error: T272428: Error 1146: Table 'mediawikiwiki.translate_cache' doesn't exist
  2. 02-02-2021: https://phabricator.wikimedia.org/T182433#6797412
  3. ....

Patches

The list of Gerrit patches submitted for this task (including subtasks) can be found here: https://gerrit.wikimedia.org/r/q/topic:%22strong-sync%22+(status:open%20OR%20status:merged)

Details

Project                         Branch             Lines +/-
mediawiki/extensions/Translate  master             +21 -0
mediawiki/extensions/Translate  master             +37 -3
mediawiki/extensions/Translate  master             +12 -0
translatewiki                   master             +2 -0
mediawiki/extensions/Translate  master             +6 -0
mediawiki/extensions/Translate  master             +48 -3
mediawiki/extensions/Translate  wmf/1.36.0-wmf.27  +48 -3
mediawiki/extensions/Translate  master             +112 -13
mediawiki/extensions/Translate  master             +2 -1
mediawiki/extensions/Translate  master             +3 -1
mediawiki/extensions/Translate  master             +9 -6
translatewiki                   master             +29 -0
mediawiki/extensions/Translate  master             +89 -0
mediawiki/extensions/Translate  master             +71 -0
mediawiki/extensions/Translate  master             +103 -0
mediawiki/extensions/Translate  master             +1 -3
mediawiki/extensions/Translate  master             +25 -2
mediawiki/extensions/Translate  master             +414 -418
mediawiki/extensions/Translate  master             +665 -0
mediawiki/extensions/Translate  master             +366 -1

Event Timeline


I came up with an alternative that might be easier to implement: check if there are any unprocessed message changes and bail out if any are found. For this we need to know which files are processed and which are not [1]. This can be combined with a simple lock taken by all auto(import|export) scripts to allow only one of them to run at a time.

[1] Currently processed files are renamed to have a timestamp to keep history. To my knowledge the old files have been looked at maybe once or twice to debug issues. We could probably just delete processed files, or have them moved elsewhere. Maybe the ones we currently have could provide test cases for T196601: Support renames in Special:ManageMessageGroups.
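
A rough sketch of the "lock plus bail out" idea, in Python. Everything here is hypothetical: the lock-file approach, the function names, and the `is_processed` predicate (the thread only says that processed files get a timestamp in their name, so the caller supplies the naming rule).

```python
import os
from contextlib import contextmanager


class AlreadyRunning(Exception):
    """Raised when another import/export script holds the lock."""


@contextmanager
def exclusive_lock(lock_path):
    # O_CREAT | O_EXCL makes creation atomic: it fails if the file already exists.
    try:
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        raise AlreadyRunning(lock_path) from None
    try:
        os.write(fd, str(os.getpid()).encode())
        yield
    finally:
        os.close(fd)
        os.unlink(lock_path)


def run_export(lock_path, changes_dir, is_processed, export):
    """Run export() only if we hold the lock and no change file is unprocessed."""
    with exclusive_lock(lock_path):
        unprocessed = [n for n in sorted(os.listdir(changes_dir)) if not is_processed(n)]
        if unprocessed:
            return False  # bail out: the wiki has not caught up with the source yet
        export()
        return True
```

A lock file like this is left behind if the script is killed, so a production version would need stale-lock handling or an operating-system lock instead.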

From this ticket, it seems that it is not easy to make this work perfectly on the first try of each export. That is a long-term objective, I guess.

@Nikerabbit Can we do something to prevent T253830 in the short-term, or at least to find out about such issues without relying on accidental manual discovery?

It's impossible to have it always succeed to push a commit. What I do want is that it will never lose other changes like in those bug reports.

When I updated the documentation a few days ago, I realized that git rebase is the reason it still fails in cases that should not fail with weak sync. To clarify, here is what I think about weak vs. strong:

  • weak synchronization prevents losing changes that are out of our control, basically that upstream repository has moved forward after our last import
  • strong synchronization prevents losing changes that are caused by unfinished or failed import, basically human errors, broken jobqueue, failed renames type of things

I would greatly appreciate help from someone with a deeper understanding of git: is it possible to instruct git rebase to use a safer merge algorithm so that the last example in the documentation could be prevented? Or please refute my theory in case I got it wrong. Otherwise, I see no option but to remove the git rebase step. This would increase the number of export failures (as any non-i18n change in between would cause it to fail too) in favor of safety.

I understand that there has to be a span of time between "checking out" latest master (to do the import) and "making the commit" (export from wiki, overwrite files on disk, sending a patch to self-merge).

I don't understand why the state of the Git repo on the server changes during this time. This kind of conflict is naturally taken care of by Git. If I check out master today, wait a week, then replace nl.json with something different and send that commit, it will work fine unless that file was changed in the meantime, because Gerrit knows the parent commit I made it with.

If there were other commits meanwhile, but they don't conflict or are trivial to three-way merge for Git, then it will self-merge just fine. There is no need to rebase it ahead of time afaik.
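
To make "trivial to three-way merge" concrete, here is a simplified analogue in Python. Git merges text line by line; this sketch instead merges at the level of message keys in a JSON-like file, which is an assumption made purely for illustration. A key that is absent is treated as `None`, so values are assumed to be strings.

```python
def three_way_merge(base, ours, theirs):
    """Merge two edited copies of a key/value file against their common ancestor.

    Returns (merged, conflicts). A key conflicts when both sides changed it
    in different ways; conflicting keys are left out of the merged result.
    """
    merged, conflicts = {}, []
    for key in sorted(set(base) | set(ours) | set(theirs)):
        b, o, t = base.get(key), ours.get(key), theirs.get(key)
        if o == t:
            value = o  # both sides agree (including "both deleted")
        elif o == b:
            value = t  # only theirs changed
        elif t == b:
            value = o  # only ours changed
        else:
            conflicts.append(key)
            continue
        if value is not None:
            merged[key] = value
    return merged, conflicts
```

With the parent commit as `base`, an export that only touches translations and an upstream commit that only touches other keys merge cleanly; only edits to the same key on both sides need a human.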

We deal with git repos using Gerrit, Github, Gitlab and also plain git. I'm going to do a few tests to verify my understanding of what is going on.

So git fails without rebase, as I thought:

developer@dev:~/git-tests/b (master)$ git push origin master
To github.com:Nikerabbit/reimagined-guacamole.git
 ! [rejected]        master -> master (fetch first)
error: failed to push some refs to 'git@github.com:Nikerabbit/reimagined-guacamole.git'
hint: Updates were rejected because the remote contains work that you do
hint: not have locally. This is usually caused by another repository pushing
hint: to the same ref. You may want to first integrate the remote changes
hint: (e.g., 'git pull ...') before pushing again.
hint: See the 'Note about fast-forwards' in 'git push --help' for details.

While trying to execute the commands manually, I was not able to create a commit using git-rebase that would have accidentally removed qqq updates, so it looks like my theory that git rebase is to blame was wrong. So I looked at T253830: L10n-bot removed needed message documentation again, and it was indeed a case that needs "strong synchronization".

Some implementation ideas:

Check that jobqueue is empty

  • Not fully reliable, jobs can fail
  • False positives

Associate state hash to commit id

For example, take hash of all the contents of the strings, compare before export

  • Slow to check
  • There are legitimate differences: e.g. trailing spaces are trimmed, translations can be updated in the wiki
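
A sketch of what such a state hash could look like, assuming the strings are available as a key-to-text mapping. The function name and the normalisation rule (stripping trailing whitespace, to absorb one of the legitimate differences listed above) are assumptions; it does not address translations that were legitimately updated in the wiki.

```python
import hashlib


def state_hash(messages):
    """Hash all message contents in a stable key order (illustrative only)."""
    digest = hashlib.sha256()
    for key in sorted(messages):
        value = messages[key].rstrip()  # ignore trailing-whitespace differences
        # NUL separators keep ("ab", "c") and ("a", "bc") from hashing alike.
        digest.update(key.encode("utf-8") + b"\0" + value.encode("utf-8") + b"\0")
    return digest.hexdigest()
```

The hash would be stored next to the commit id at import time and recomputed before export; hashing every string on every check is what makes this approach slow.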

Do atomic updates and log them in the database

When Special:ManageMessageGroups submits an update, it can include metadata state=commit id. The jobs would start a transaction to ensure either everything is updated, or nothing. Once done, the state would be stored somewhere for quick lookup by export scripts.

  • Creates big jobs, possibly runs into memory issues
  • Not sure if it is possible to have MediaWiki do edits inside one transaction
  • Could help to postpone some slow work that is currently run on every job, so that it runs once after all the imports
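
The all-or-nothing shape of this idea can be shown with plain SQLite in Python. This says nothing about whether MediaWiki page edits can be wrapped the same way, which is the open question above; the table layout and function names are invented for the example.

```python
import sqlite3


def make_db():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE messages (grp TEXT, key TEXT, text TEXT NOT NULL, "
        "PRIMARY KEY (grp, key))"
    )
    conn.execute("CREATE TABLE sync_state (grp TEXT PRIMARY KEY, commit_id TEXT NOT NULL)")
    return conn


def apply_update(conn, commit_id, group_id, changes):
    """Apply every change and record commit_id in a single transaction.

    changes maps message key -> new text, or None to delete the message.
    """
    with conn:  # commits on success, rolls back if any statement raises
        for key, text in changes.items():
            if text is None:
                conn.execute(
                    "DELETE FROM messages WHERE grp = ? AND key = ?", (group_id, key)
                )
            else:
                conn.execute(
                    "INSERT OR REPLACE INTO messages (grp, key, text) VALUES (?, ?, ?)",
                    (group_id, key, text),
                )
        conn.execute(
            "INSERT OR REPLACE INTO sync_state (grp, commit_id) VALUES (?, ?)",
            (group_id, commit_id),
        )
```

Export scripts would then only need to read `sync_state` to learn which commit the wiki content corresponds to.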

Track all update jobs separately

When Special:ManageMessageGroups submits an update, it will also write a list of jobs in some reliable storage. Only once all jobs have succeeded, would we write the state update somewhere.

  • Jobs are inherently unreliable. What if some of them fail? How do we know when the last job finishes?
  • Smaller jobs, can be run in parallel.

Every job should log that it is finished. There could either be a periodic maintenance script to check whether all jobs have completed, and update the state, or every job could check the same and if the last one sees it is done, then update the state.
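
Both variants (the last job updates the state, or a periodic script checks for stragglers) can be sketched together. The class below is hypothetical and keeps everything in memory; with real parallel jobs the "am I the last one?" check would have to be atomic in the shared storage.

```python
class UpdateJobLog:
    """Hypothetical record of the update jobs queued for one import."""

    def __init__(self, commit_id, job_ids, started_at):
        self.commit_id = commit_id
        self.started = {job_id: started_at for job_id in job_ids}
        self.synced_commit = None  # set once every job has reported completion

    def job_finished(self, job_id):
        self.started.pop(job_id, None)
        if not self.started:
            # The last job to finish publishes the new state.
            self.synced_commit = self.commit_id

    def stuck_jobs(self, now, timeout):
        # For the periodic maintenance script: jobs that never reported back.
        return sorted(j for j, t in self.started.items() if now - t > timeout)
```

Jobs that show up in `stuck_jobs` are the ones an administrator would have to inspect, since a failed job never calls `job_finished`.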

No changes means we are up to date

When we check for message changes using the script, if there are no changes for a group, we can assume it is up to date.

  • Would need the script to run twice before state is synchronized.
  • Easy, the script is already run regularly
  • Running the script itself again while sync is in progress is unsafe, as it may see incomplete state. So in a sense the script itself needs strong synchronization to prevent it running again for a group which is not yet synced. This could be solved by the two solutions above, if we also set the status of a group to "syncing" until it is finished.
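
The status handling in the last point can be written as a tiny state machine. The status names other than "syncing" and both function names are assumptions, and the step where an administrator processes the changes on Special:ManageMessageGroups is folded into "changes were found".

```python
UNKNOWN, SYNCING, SYNCED = "unknown", "syncing", "synced"


def after_script_run(status, has_changes):
    """Next status of a group after one run of the change-detection script."""
    if status == SYNCING:
        return SYNCING  # skip the group: it may be in a half-applied state
    return SYNCING if has_changes else SYNCED


def after_jobs_finished(status):
    """Once all update jobs are done, the group must be checked again."""
    return UNKNOWN if status == SYNCING else status
```

A group with incoming changes therefore goes unknown, syncing, unknown, synced, which is why the script has to run twice before the group counts as up to date.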

Change 628631 had a related patch set uploaded (by Abijeet Patro; owner: Abijeet Patro):
[mediawiki/extensions/Translate@master] Add persistent translate cache

https://gerrit.wikimedia.org/r/628631

Change 629365 had a related patch set uploaded (by Abijeet Patro; owner: Abijeet Patro):
[mediawiki/extensions/Translate@master] Update GroupSynchronizationCache to use the PersistentCache

https://gerrit.wikimedia.org/r/629365

Change 631209 had a related patch set uploaded (by Abijeet Patro; owner: Abijeet Patro):
[mediawiki/extensions/Translate@master] Add incoming messages to the group sync cache

https://gerrit.wikimedia.org/r/631209

Change 606424 had a related patch set uploaded (by Abijeet Patro; owner: Abijeet Patro):
[mediawiki/extensions/Translate@master] Use sync cache in Special:ManageMessageGroups and MessageUpdateJobs

https://gerrit.wikimedia.org/r/606424

Change 635280 had a related patch set uploaded (by Abijeet Patro; owner: Abijeet Patro):
[mediawiki/extensions/Translate@master] Remove running of MessageIndex rebuild once groups are synced

https://gerrit.wikimedia.org/r/635280

Change 638137 had a related patch set uploaded (by Abijeet Patro; owner: Abijeet Patro):
[mediawiki/extensions/Translate@master] Add script to query the group synchronization cache

https://gerrit.wikimedia.org/r/638137

Change 643026 had a related patch set uploaded (by Abijeet Patro; owner: Abijeet Patro):
[mediawiki/extensions/Translate@master] Add JsonCodec to help with serialization/deserialization

https://gerrit.wikimedia.org/r/643026

Change 646677 had a related patch set uploaded (by Abijeet Patro; owner: Abijeet Patro):
[mediawiki/extensions/Translate@master] Display groups in sync on ManageMessageGroups

https://gerrit.wikimedia.org/r/646677

Change 647007 had a related patch set uploaded (by Abijeet Patro; owner: Abijeet Patro):
[mediawiki/extensions/Translate@master] Add script to clear the group synchronization cache

https://gerrit.wikimedia.org/r/647007

Change 647195 had a related patch set uploaded (by Abijeet Patro; owner: Abijeet Patro):
[translatewiki@master] puppet: Add periodic run of completeExternalTranslation

https://gerrit.wikimedia.org/r/647195

Change 643026 merged by jenkins-bot:
[mediawiki/extensions/Translate@master] Add JsonCodec to help with serialization/deserialization

https://gerrit.wikimedia.org/r/643026

Change 628631 merged by jenkins-bot:
[mediawiki/extensions/Translate@master] Add persistent translate cache

https://gerrit.wikimedia.org/r/628631

Change 629365 merged by jenkins-bot:
[mediawiki/extensions/Translate@master] Update GroupSynchronizationCache to use the PersistentCache

https://gerrit.wikimedia.org/r/629365

Change 631209 merged by jenkins-bot:
[mediawiki/extensions/Translate@master] Add incoming safe-imports to the group sync cache

https://gerrit.wikimedia.org/r/631209

Change 606424 merged by jenkins-bot:
[mediawiki/extensions/Translate@master] Use sync cache in Special:ManageMessageGroups and MessageUpdateJobs

https://gerrit.wikimedia.org/r/606424

Change 635280 merged by jenkins-bot:
[mediawiki/extensions/Translate@master] Remove running of MessageIndex rebuild once groups are synced

https://gerrit.wikimedia.org/r/635280

Change 638137 merged by jenkins-bot:
[mediawiki/extensions/Translate@master] Add script to query the group synchronization cache

https://gerrit.wikimedia.org/r/638137

Change 646677 merged by jenkins-bot:
[mediawiki/extensions/Translate@master] Display groups in sync on ManageMessageGroups

https://gerrit.wikimedia.org/r/646677

Change 647007 merged by jenkins-bot:
[mediawiki/extensions/Translate@master] Add script to clear the group synchronization cache

https://gerrit.wikimedia.org/r/647007

Change 647195 merged by jenkins-bot:
[translatewiki@master] puppet: Add periodic run of completeExternalTranslation

https://gerrit.wikimedia.org/r/647195

Change 656873 had a related patch set uploaded (by Abijeet Patro; owner: Abijeet Patro):
[mediawiki/extensions/Translate@master] Strong sync: Fix issue with new messages not being removed

https://gerrit.wikimedia.org/r/656873

Change 656878 had a related patch set uploaded (by Abijeet Patro; owner: Abijeet Patro):
[mediawiki/extensions/Translate@master] Strong sync: Add remaining message count in completeExternalTranslation

https://gerrit.wikimedia.org/r/656878

Change 656892 had a related patch set uploaded (by Abijeet Patro; owner: Abijeet Patro):
[mediawiki/extensions/Translate@master] Strong sync: Fix incorrect queueing of MessageIndexRebuildJob

https://gerrit.wikimedia.org/r/656892

Change 656873 merged by jenkins-bot:
[mediawiki/extensions/Translate@master] Strong sync: Fix issue with new messages not being removed

https://gerrit.wikimedia.org/r/656873

Change 656878 merged by jenkins-bot:
[mediawiki/extensions/Translate@master] Strong sync: Add remaining message count in completeExternalTranslation

https://gerrit.wikimedia.org/r/656878

Change 656892 merged by jenkins-bot:
[mediawiki/extensions/Translate@master] Strong sync: Fix incorrect queueing of MessageIndexRebuildJob

https://gerrit.wikimedia.org/r/656892

Change 657229 had a related patch set uploaded (by Abijeet Patro; owner: Abijeet Patro):
[mediawiki/extensions/Translate@master] Add flag to toggle the usage of the group synchronization cache

https://gerrit.wikimedia.org/r/657229

Change 657229 merged by jenkins-bot:
[mediawiki/extensions/Translate@master] Add flag to toggle the usage of the group synchronization cache

https://gerrit.wikimedia.org/r/657229

Change 657306 had a related patch set uploaded (by Nikerabbit; owner: Abijeet Patro):
[mediawiki/extensions/Translate@wmf/1.36.0-wmf.27] Add flag to toggle the usage of the group synchronization cache

https://gerrit.wikimedia.org/r/657306

Change 657290 had a related patch set uploaded (by Abijeet Patro; owner: Abijeet Patro):
[translatewiki@master] Enable group synchronization flag

https://gerrit.wikimedia.org/r/657290

Change 657294 had a related patch set uploaded (by Abijeet Patro; owner: Abijeet Patro):
[mediawiki/extensions/Translate@master] MessageUpdateJob: Check GroupSyncCache only if its FileBasedMessageGroup

https://gerrit.wikimedia.org/r/657294

Change 657306 merged by jenkins-bot:
[mediawiki/extensions/Translate@wmf/1.36.0-wmf.27] Add flag to toggle the usage of the group synchronization cache

https://gerrit.wikimedia.org/r/657306

Mentioned in SAL (#wikimedia-operations) [2021-01-20T13:20:45Z] <urbanecm@deploy1001> Synchronized php-1.36.0-wmf.27/extensions/Translate/: 20decbd5cc3de0af655b9419cf69fc442ab056a4: Add flag to toggle the usage of the group synchronization cache (T272428; T182433) (duration: 01m 10s)

Change 657294 merged by jenkins-bot:
[mediawiki/extensions/Translate@master] MessageUpdateJob: Check GroupSyncCache only for FileBasedMessageGroup

https://gerrit.wikimedia.org/r/657294

Change 657290 merged by jenkins-bot:
[translatewiki@master] Enable group synchronization flag

https://gerrit.wikimedia.org/r/657290

Change 658791 had a related patch set uploaded (by Abijeet Patro; owner: Abijeet Patro):
[mediawiki/extensions/Translate@master] Add messages to interim cache when running safe imports

https://gerrit.wikimedia.org/r/658791

Change 658791 merged by jenkins-bot:
[mediawiki/extensions/Translate@master] Add messages to interim cache when running safe imports

https://gerrit.wikimedia.org/r/658791

Here's an update on what's done as of now:

  • We've added a group synchronization cache that tracks message updates - addition, modifications, renames and deletions.
  • We've added scripts that allow administrators to see what messages and message groups are being processed currently.
  • On Special:ManageMessageGroups we are displaying the message groups that are currently being processed.
  • A script has been put in place that runs periodically and identifies MessageUpdate jobs that are stuck or have timed out.

No decisions (such as blocking exports or imports) are made based on this tracking.

This has been deployed on Translatewiki for about 2 weeks. During this time we've identified issues and deployed fixes for them. The system appears to be reliable now, and we're ready to implement the next set of steps.

Change 676968 had a related patch set uploaded (by Abijeet Patro; author: Abijeet Patro):

[mediawiki/extensions/Translate@master] ProcessMessageChanges: Add flag to skip import on group sync error

https://gerrit.wikimedia.org/r/676968

Change 677274 had a related patch set uploaded (by Abijeet Patro; author: Abijeet Patro):

[mediawiki/extensions/Translate@master] ExportTranslationsMaintenanceScript: Add flag to skip group export

https://gerrit.wikimedia.org/r/677274