
Determine desired architecture to update localization strings for Wikimedia
Open, Normal, Public

Description

During the investigation of the instigating security incident we identified architectural issues with the systems and processes that make up the nightly localization updates. This update process is currently disabled and will remain disabled until there is a clear plan and resourcing for improving the architecture. That is the purpose of this task.

A couple of the issues brought up during our discussions:

  • the nightly job that sends updated translations to gerrit is not automated, for a few reasons
    • a potential partial solution to this is adding Keyholder support to the set of scripts that manage it
  • the wheel-warring that happens between scap and l10nupdate throughout the week
    • do we backport updated translations to the current wmf.XX branch? Do we do something different?

There's also the meta question of: Is the current model (push from twn to gerrit, scap/l10nupdate update production) the right one or should it be different (eg: have a pull from twn for Wikimedia extensions)?

Event Timeline

greg triaged this task as Normal priority. Oct 10 2018, 7:03 PM
greg created this task.

If scap now has problems with the l10n updates, it should be reverted to the version which did not have such problems, so that l10n updates can run while the rest is fixed.

siebrand updated the task description. Oct 10 2018, 7:22 PM
greg added a comment. Oct 10 2018, 7:23 PM

If scap now has problems with the l10n updates, it should be reverted to the version which did not have such problems, so that l10n updates can run while the rest is fixed.

I presume you are referring to the wheel-warring issue? No change in either piece of software 'caused' this; it is fallout from the design.

Kghbln removed a subscriber: Kghbln. Oct 10 2018, 7:25 PM

It's unclear to me why these problems have become so pressing that they merit shutting off l10nupdate indefinitely. We've known about them for years, but the value provided by l10nupdate has always outweighed the minor inconveniences (e.g. scap undoing l10nupdate).

No doubt we need to redo the architecture of l10n cache more broadly and Core Platform Team has/will resource (AIUI from our discussion a few weeks ago) design work on what it should look like, but I don't think we should let the perfect be the enemy of the good and stop l10nupdate until the whole system is fixed.

It's unclear to me why these problems have become so pressing that they merit shutting off l10nupdate indefinitely. We've known about them for years, but the value provided by l10nupdate has always outweighed the minor inconveniences (e.g. scap undoing l10nupdate).

I'm similarly confused. Some of the language in this task's description is kind of stilted and difficult to follow, but it feels like localization updates are now being held hostage.

It's standard practice to disable a service if it has serious security or performance issues. For better or worse, we do not typically disable a service due to poor architecture or neglect. If we did, we wouldn't have much running code left!

There is a mention in the description of an "instigating security incident", which is probably the cause for this halt. @greg, is any data about it public yet? Could you perhaps add a link to whatever you can share in the description?

This update process is currently disabled and will remain disabled until there is a clear plan and resourcing for improving the architecture.

My understanding was that we will take a week, or at most two, to come up with a maintenance and development plan for l10nupdate, and then make a new call. I am all for making l10nupdate play nice so we can stop having these discussions in the future, but keeping it disabled for extended periods of time does not make sense.

cscott added a subscriber: cscott. Oct 11 2018, 2:53 PM
greg added a comment. Oct 11 2018, 3:10 PM

This update process is currently disabled and will remain disabled until there is a clear plan and resourcing for improving the architecture.

My understanding was that we will take a week, or at most two, to come up with a maintenance and development plan for l10nupdate, and then make a new call.

The timeline is entirely dependent upon how long the process takes to come to a "maintenance and development plan" (aka "a clear plan and resourcing"). The faster we can turn this task into a productive conversation on the desired architecture of this collection of tooling the faster we can move forward.

I am planning to write down my thoughts and proposals and post them early next week. I don't know the current WMF setup very well, so someone needs to fill in there. I am not at the technical conference, but it's possible that this topic will be discussed there the week after next week.

greg added a comment. Oct 11 2018, 3:28 PM

@greg, is any data about it public yet? Could you perhaps add a link to whatever you can share in the description?

Sure thing, I will when it's available.

greg added a comment. Oct 11 2018, 6:19 PM

It's unclear to me why these problems have become so pressing that they merit shutting off l10nupdate indefinitely. We've known about them for years, but the value provided by l10nupdate has always outweighed the minor inconveniences (e.g. scap undoing l10nupdate).

They may have been informally known, but there haven't been any clear bug reports/plans to address the technical debt which in turn makes it invisible tech debt. I hope to identify the most pressing technical debt issues (I *think* it's at least those two I mentioned in the description) so that we can formulate a plan for addressing them. Hence this task :)

No doubt we need to redo the architecture of l10n cache more broadly and Core Platform Team has/will resource (AIUI from our discussion a few weeks ago) design work on what it should look like

I wasn't clear/sure on that given the Language Team is the de facto stewards/maintainers of the relevant code bases. Is there something more official/written down we can reference on that topic?

It's unclear to me why these problems have become so pressing that they merit shutting off l10nupdate indefinitely. We've known about them for years, but the value provided by l10nupdate has always outweighed the minor inconveniences (e.g. scap undoing l10nupdate).

They may have been informally known, but there haven't been any clear bug reports/plans to address the technical debt which in turn makes it invisible tech debt.

There are definitely bug reports: T158360#3076449. The people who should be triaging those tickets are on the Release Engineering Team I think :-)

I hope to identify the most pressing technical debt issues (I *think* it's at least those two I mentioned in the description) so that we can formulate a plan for addressing them. Hence this task :)

So is this a code stewardship review then? In any case, we usually don't turn things off when they're getting reviewed.

No doubt we need to redo the architecture of l10n cache more broadly and Core Platform Team has/will resource (AIUI from our discussion a few weeks ago) design work on what it should look like

I wasn't clear/sure on that given the Language Team is the de facto stewards/maintainers of the relevant code bases. Is there something more official/written down we can reference on that topic?

l10nupdate is part of scap. When I last reviewed the open tickets as part of the RfC last year most of the problematic stuff was more on the Deployments side, IIRC.

There is a larger issue that the MediaWiki localisation cache system doesn't meet our current deployment needs in terms of speed, but I think that's tangential to the l10nupdate issues. I think if it took seconds to deploy new i18n messages, then we'd feel more comfortable running it every day (T203737) since reverts would be easier.

I think the best problem statement so far is https://www.mediawiki.org/w/index.php?title=Wikimedia_Platform_Engineering/MediaWiki_Core_Team/Backlog&oldid=1267452#Localisation_cache_do-over - I'm not sure if there's anything formal written down about Core Platform Team doing design work besides T158360#4638816.

greg added a comment. Oct 11 2018, 9:04 PM

It's unclear to me why these problems have become so pressing that they merit shutting off l10nupdate indefinitely. We've known about them for years, but the value provided by l10nupdate has always outweighed the minor inconveniences (e.g. scap undoing l10nupdate).

They may have been informally known, but there haven't been any clear bug reports/plans to address the technical debt which in turn makes it invisible tech debt.

There are definitely bug reports: T158360#3076449. The people who should be triaging those tickets are on the Release Engineering Team I think :-)

Great, thanks for pointing those out so we can gather them together! The more holistic view we have of the situation the better.

I hope to identify the most pressing technical debt issues (I *think* it's at least those two I mentioned in the description) so that we can formulate a plan for addressing them. Hence this task :)

So is this a code stewardship review then? In any case, we usually don't turn things off when they're getting reviewed.

Maybe this *also* should be a Code Stewardship review, good point. :) The intertwining of systems and people and processes is the hard part. Technically the Language Team is already responsible for part of this code base and twn has maintainers.

No doubt we need to redo the architecture of l10n cache more broadly and Core Platform Team has/will resource (AIUI from our discussion a few weeks ago) design work on what it should look like

I wasn't clear/sure on that given the Language Team is the de facto stewards/maintainers of the relevant code bases. Is there something more official/written down we can reference on that topic?

l10nupdate is part of scap. When I last reviewed the open tickets as part of the RfC last year most of the problematic stuff was more on the Deployments side, IIRC.

Due to how it is architected, yes, requiring a larger rethink.

There is a larger issue that the MediaWiki localisation cache system doesn't meet our current deployment needs in terms of speed, but I think that's tangential to the l10nupdate issues. I think if it took seconds to deploy new i18n messages, then we'd feel more comfortable running it every day (T203737) since reverts would be easier.

The linked task does not fully support the claim (just the desire). The more relevant task would be something like T205313 or so...

Our hope is to increase the speed/cadence of deployments via the deployment pipeline program, and thus it would make sense to tie localization updates directly to the deploy in question. But that would require some re-thinking of this system as well.

I think the best problem statement so far is https://www.mediawiki.org/w/index.php?title=Wikimedia_Platform_Engineering/MediaWiki_Core_Team/Backlog&oldid=1267452#Localisation_cache_do-over - I'm not sure if there's anything formal written down about Core Platform Team doing design work besides T158360#4638816.

That's useful, thank you.

Nikerabbit added a comment. Edited Oct 16 2018, 3:29 PM

Localisation Update (LU)

Developers work on features continuously, and this means there will be new user interface messages. Because the Wikimedia Foundation's mission is explicitly heavily multilingual, it is imperative to enable users to access and modify the content in their own language. This is done by translating the user interface messages.

The core idea of LU is to deliver translation updates out-of-band with much higher frequency than regular code deployments. Underlying this idea is the notion that translations are more like data than executable code, which allows us to skip the regular deployment pipeline fully or partially. LU aims to minimize the time users see untranslated interface messages without slowing down deployment of new features. By contrast, it is common in other software projects to have code freezes before releases to give translators time to complete translations.

Benefits of LU

  • It does not slow development like code freezes would
  • It allows volunteer translators to work whenever they want without potentially stressful deadlines
  • It minimizes the time users see untranslated interface messages
    • This makes us agile (allowing us to do things like the FixCopyright campaign)
    • This makes our users have better user experience, given lack of translations is a huge accessibility issue to many users
  • It leads to faster translate-review-fix cycle producing higher quality translations
  • It discourages people from making translations locally on the wikis, which has many drawbacks:
    • Changes to translations on local wikis are not tracked
    • There is a small performance impact from local translations
    • Changes done locally are not available to other wikis

How LU works at a high level

LU requires three components which I call translation processor, comparator and integrator.

Translation processor makes new and changed messages available for translators. It is important that the processor knows which translations are missing or need updating. These days many open source projects use online platforms that reduce technical and process barriers for translators to contribute.

Comparator picks suitable new translations produced by translation processor. It uses two criteria to define what is suitable:

  1. New translation must be compatible. This is done by checking it is made against source text that is compatible with the currently deployed source text.
  2. New translation must be different from currently deployed translation.
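As a rough illustration of the two criteria above, here is a minimal Python sketch. It is not how the LocalisationUpdate extension is implemented: the `Translation` type and `pick_suitable` function are hypothetical, "compatible" is modelled as an exact match of the source text (an assumption), and the Finnish strings are only example data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Translation:
    key: str          # message key, e.g. "save"
    source_text: str  # source text the translation was made against
    text: str         # the translated text


def pick_suitable(deployed_source, deployed_translations, candidates):
    """Return {key: text} for candidates that pass both comparator criteria."""
    picked = {}
    for cand in candidates:
        # Criterion 1: compatible. Modelled here (assumption) as the candidate
        # having been made against exactly the deployed source text.
        if deployed_source.get(cand.key) != cand.source_text:
            continue
        # Criterion 2: different from the currently deployed translation.
        if deployed_translations.get(cand.key) == cand.text:
            continue
        picked[cand.key] = cand.text
    return picked


deployed_source = {"save": "Save", "cancel": "Cancel"}
deployed_fi = {"save": "Tallenna", "cancel": "Peruuta"}
candidates = [
    Translation("save", "Save", "Tallenna"),        # unchanged: skipped
    Translation("cancel", "Cancel", "Peru"),        # changed: picked
    Translation("publish", "Publish", "Julkaise"),  # source not deployed: skipped
]
print(pick_suitable(deployed_source, deployed_fi, candidates))
# {'cancel': 'Peru'}
```

Only the changed, compatible translation survives, which is why the set handed to the integrator is normally small.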

Integrator takes the set of new translations provided by comparator and deploys them.

How LU works currently

Our translation processor is translatewiki.net. It has a cron job that currently runs three times per day to check message changes in VCS repositories. The output of this job is reviewed by @Raymond to handle message renames and automatic replacements. This means the usual delay of new messages being introduced to having them available for translation is less than a day.

@Raymond also exports new translations to VCS repositories with a script. This usually happens on European evenings, once a day every day of the week.

Our comparator is the LocalisationUpdate extension. It has its own clones of MediaWiki core and deployed extensions and skins. It uses update.php (not the core update.php!) to pick suitable messages for all deployed branches, saving them as one JSON file per language.

Our integrator is a combination of LocalisationUpdate extension and a bash script. It works as follows:

  1. It runs the comparator
  2. It runs rebuildLocalisationCache.php (which picks up the files created by comparator using a hook) and outputs new cache files to a temporary location
  3. It then copies these new cache files to mediawiki-staging (this is where code changes are deployed)
  4. It then runs scap cdb-json-refresh which creates json and md5 hashes of the files
  5. It then transfers those json files to all servers, and rebuilds the cdb files on the servers based on those
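The ordering of these steps can be sketched as a small pipeline runner. This is only a Python illustration of "run in order, log each step, stop at the first failure"; the real integrator is a bash script, and `run_pipeline`, `noop` and the step names below are hypothetical placeholders rather than actual commands.

```python
import logging

log = logging.getLogger("lu-integrator")


def run_pipeline(steps):
    """Run (name, callable) steps in order and stop at the first failure.

    Returns the names of the steps that completed successfully.
    """
    done = []
    for name, step in steps:
        log.info("starting: %s", name)
        try:
            step()
        except Exception:
            # Later steps depend on earlier ones, so do not continue.
            log.exception("failed: %s; aborting remaining steps", name)
            break
        log.info("finished: %s", name)
        done.append(name)
    return done


def noop():
    """Placeholder; a real step would wrap the corresponding command."""


STEPS = [
    ("run comparator", noop),
    ("rebuild localisation cache into a temporary location", noop),
    ("copy cache files to mediawiki-staging", noop),
    ("create JSON files and md5 hashes", noop),
    ("sync JSON to servers and rebuild CDB files", noop),
]

completed = run_pipeline(STEPS)
```

Returning the list of completed steps makes it easy to report exactly where a run stopped.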

It runs (well, not running currently) daily Monday to Thursday on one of the WMF deployment servers.

What are the current problems with LU?

It is deemed fragile and unmaintained, I think mostly because only a few people understand how it works. The RFC lists some issues, which I have commented on below:

LU [...] effect is only visible until the next scap run.

Based on my reading of the code, the updated JSON files stay in the staging area, so it is not clear to me what could cause this or whether it is even true anymore. In any case this looks like a simple fix.

Deployments by LocalisationUpdate happen at a time when not many people are around.

It can be moved to a slot where more people are around. We can also find volunteers to run the script manually, but I question whether this is a good use of time.

There is the potential of staged but not-yet-deployed changes on WMF deployment servers being deployed unknowingly.

As far as I know this state is now detected automatically and causes alerts. In any case the script can be updated to check for this and abort.
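One way such a check could look is sketched below in Python. It assumes (hypothetically) that the staging tree and a copy of what is currently deployed are both available as local directories; the function names are made up for illustration, and the existing alerting may well work differently.

```python
import hashlib
from pathlib import Path


def file_digests(root):
    """Map each file path (relative to root) to a SHA-256 of its contents."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in root.rglob("*")
        if p.is_file()
    }


def undeployed_changes(staging_dir, deployed_dir):
    """Return sorted relative paths that are new or different in staging."""
    staged = file_digests(staging_dir)
    deployed = file_digests(deployed_dir)
    return sorted(
        path for path, digest in staged.items()
        if deployed.get(path) != digest
    )


def abort_if_dirty(staging_dir, deployed_dir):
    """Raise instead of deploying when staging holds unexpected changes."""
    changed = undeployed_changes(staging_dir, deployed_dir)
    if changed:
        raise RuntimeError("staged but undeployed changes: " + ", ".join(changed))
```

Calling `abort_if_dirty` before the integrator copies anything into staging would turn the "deployed unknowingly" case into an explicit failure.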

The extension has often been broken for days or weeks due to a lack of maintenance and ownership. No one is responsible for the code, those who have to support it (Release Engineering) do not have an adequate understanding of it.

Basically, Language team is supporting the extension, of which only update.php with filesystem backend (comparator) is used by the WMF setup. The GitHub backend used by 3rd parties is known to be broken at the moment. It is also super slow. Release Engineering should be responsible for the integrator part.

It is triggered by legacy shell scripts that, unlike all other deployment tools, are not part of Scap.

The legacy shell script is 150 lines of bash, which should be possible to port to scap with a reasonable effort.

However l10nupdate doesn't always run as expected, either for unlogged/unidentified issues with the deployment system

This can be improved with better logging and better documentation of the system.

Possible improvements

Reduce l10n format conversions

The dance with file formats from json to cdb to json again feels unnecessarily complicated. We should explore options such as PHP-arrays that are both fast to read (due to byte-code cache, might have issues with HHVM) and fast to sync.

Reduce the size of l10n caches

The current localisation cache flattens fallback chains. This means we are probably storing English translations hundreds of times. We should re-evaluate whether flattening the fallback chain is still worth the speed improvement on reading compared to having a smaller dataset that may require checking multiple files per message (fallback chains are usually short, 2-4 items).
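To make the trade-off concrete, here is a toy Python model of both approaches. It is not MediaWiki's LocalisationCache code; `flatten`, `lookup` and the tiny data set are hypothetical and only show why flattening duplicates data while the unflattened variant needs several reads per message.

```python
def flatten(messages_by_lang, chains):
    """Build a flattened cache: each language stores a value for every key."""
    all_keys = set()
    for msgs in messages_by_lang.values():
        all_keys.update(msgs)
    flat = {}
    for lang, chain in chains.items():
        flat[lang] = {}
        for key in all_keys:
            # Walk the language itself, then its fallbacks, in order.
            for candidate in [lang] + chain:
                if key in messages_by_lang.get(candidate, {}):
                    flat[lang][key] = messages_by_lang[candidate][key]
                    break
    return flat


def lookup(messages_by_lang, chains, lang, key):
    """Unflattened read: check each language in the chain until a hit."""
    for candidate in [lang] + chains.get(lang, []):
        msgs = messages_by_lang.get(candidate, {})
        if key in msgs:
            return msgs[key]
    return None


messages = {
    "en": {"save": "Save", "cancel": "Cancel"},
    "de": {"save": "Speichern"},
    "de-at": {},
}
chains = {"en": [], "de": ["en"], "de-at": ["de", "en"]}

flat = flatten(messages, chains)
stored_flat = sum(len(m) for m in flat.values())      # 6 entries
stored_raw = sum(len(m) for m in messages.values())   # 3 entries
print(stored_flat, stored_raw)
print(lookup(messages, chains, "de-at", "cancel"))    # Cancel
```

In this toy case the flattened form stores twice as many entries; with hundreds of languages falling back to English the duplication grows accordingly, while the unflattened lookup costs at most one read per language in the chain.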

Commit updated translations to VCS deployment branches

Instead of playing with the cache files, we could simplify the integrator by having the comparator commit the translation updates to a VCS repository (a single file per language should be sufficient, rather than committing to each repository separately). Again, a hook or loading order would be used so that these translation updates override the already deployed translations. This would remove issues with file-system permissions, and would allow rollback using regular mechanisms. It would also allow running the comparator outside the WMF deployment server if wanted.
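The "loading order" idea amounts to layering one per-language file of updates over the deployed messages. A minimal Python sketch, with a hypothetical function name and example data, and no claim about how the hook would actually be wired up:

```python
def effective_messages(deployed, updates):
    """Return deployed messages with LU updates layered on top."""
    merged = dict(deployed)  # copy, so the deployed data is left untouched
    merged.update(updates)   # the later layer wins, mimicking loading order
    return merged


deployed_fi = {"save": "Tallenna", "cancel": "Peruuta"}
lu_updates_fi = {"cancel": "Peru", "publish": "Julkaise"}

with_updates = effective_messages(deployed_fi, lu_updates_fi)
# Reverting the commit that added the update file leaves an empty layer.
rolled_back = effective_messages(deployed_fi, {})
```

Because the updates live in their own layer, rolling back is just a revert of that commit and the deployed files never need to be rewritten.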

Make LU part of all full scaps

Any time messages change, we might risk accidentally overwriting existing translations with incompatible ones. If we can make LU fast enough (more parallelization?), it could be part of the regular scap process to ensure this won't happen, and that all scaps would deploy latest translations.

Import messages more often to translatewiki.net

We could make the cron job more frequent. If we also enable --safe-import then it wouldn't depend on human effort for simple cases. This increases the time translators have for translating before deployment a little bit.

Export messages more often from translatewiki.net

We could increase and automate the exports from translatewiki.net to be more frequent than once per day. But this doesn't really help unless we plan to run LU more often than once a day.

What we can do is to try to sync LU and exports time so that they are closer to each other.

Considerations for long term

I think the current process where updates are committed to VCS repositories, and having LU read the translations from there, is good. It would be possible to skip VCS and request updates directly from translatewiki.net, but this has several downsides:

  • Additional load for translatewiki.net. Twn APIs are not prepared for this kind of request, and it wouldn't scale
  • An unnecessary dependency on a third-party service
  • Would need to figure out a way to do security checks that are currently handled by WMF CI

Having said that, the GitHub backend for 3rd party users is so slow and broken it is useless. There are plans to make a generic service that would do the comparator part for any MediaWiki, or for any software for that matter. It would need some planning to make it fast, but it can also work with files from VCSs.

Translatewiki.net used to have support for exporting translations to release branches. However, restoring this functionality would be quite a bit of work. The main difference from a generic LU service is that it allows "forking" messages when they go through incompatible changes. This way translators can still provide translations for older releases. For WMF deployments this is not going to matter much, but for 3rd party users using release branches, this might have a bigger impact.

Further reading

https://wikitech.wikimedia.org/wiki/LocalisationUpdate (partially out of date)
https://www.mediawiki.org/wiki/Extension:LocalisationUpdate
https://www.mediawiki.org/wiki/Requests_for_comment/Reevaluate_LocalisationUpdate_extension
https://www.mediawiki.org/wiki/Language_goals_and_wishlist#Continuous_translation_update_service_provided_by_translatewiki.net
https://www.mediawiki.org/wiki/Language_goals_and_wishlist#Integrate_continuous_translation_updates_to_MediaWiki_core
https://www.mediawiki.org/wiki/Language_goals_and_wishlist#Automated_exports_from_translatewiki.net
https://www.mediawiki.org/wiki/Language_goals_and_wishlist#Support_translation_of_multiple_software_branches

Conclusions

Current approximate time usage for LU process (67 minutes in total):

  • 1 minute: update git repos and run update.php (comparator)
  • 18 minutes: rebuild l10n caches (should be faster with PHP?)
  • 1 minute: CDB to JSON conversion
  • 1 minute: copy JSON files to mediawiki-staging
  • 8 minutes: sync JSON files to servers
  • 5 minutes: JSON to CDB conversion
  • Repeat above for second deployment branch
  • 11 minutes: clearing message blob storage

General improvements to MediaWiki core l10n handling ("Reduce the size of l10n caches" and "Reduce l10n format conversions") would benefit LU by making things simpler and faster, even though they are not directly related to LU.

I think we could get quick wins by doing the following:

  1. Commit updated translations to VCS deployment branches
    1. This would make things less opaque.
    2. Might have issues integrating with gerrit though.
  2. Integrate with scap
    1. This would increase maintainability, and hopefully also logging
    2. This would make it easy to have LU part of all scaps that do l10n-updates. The time spent in the comparator itself is minimal. Possibly then an LU update would be just a regular full scap, which would be a code path that is regularly exercised and thus reliable.

I see no need for a major rearchitecting of LU at this point in time to support the WMF use case (as opposed to 3rd party). Moreover, I don't see any changes required in the LocalisationUpdate extension itself, only in the WMF integrator part.

Finally, any reason to not just merge this task to the RfC and have it go through that process?

greg added a comment. Nov 9 2018, 10:36 PM

Thanks, @Nikerabbit, for the write-up. I think this is a good overview and reference point. Of course there are points with which people may disagree, but even so it is sufficient to work from.

Regarding possible improvements, we should also include addressing the bus factor inherent in the current process (namely, @Raymond's daily work); thinking of ways to better automate that seems like a good idea.

Otherwise I think the short list is good (if you add automating the daily manual twn work).

Yes, I think we can now resolve this task and work on the main RFC task.

It seems like, to do this correctly, we will need to make a planning decision in the next few months (to line up with WMF annual planning cycles) for how to effectively split the work up. Do I understand correctly that you suggest the parts related to the integrator be owned by the Release Engineering team and the other parts (comparator and translation processor) be owned by either the Language Team or twn directly?

If that's right, then that means the "integrate with scap" would logically fit within RelEng while "commit updated translations to VCS" and "automate daily twn processes" would be.... Language Team and TWN, respectively?

Thanks again.

Regarding possible improvements, we should also include addressing the bus factor inherent in the current process (namely, @Raymond's daily work); thinking of ways to better automate that seems like a good idea.

I left that part out of my write-up. I have been actively working to streamline and document the process. Fully automated exports are tracked in T103258: Automatically export translations daily without needing the TWN staff to do it (not all past progress is reflected there). Current pain points are on importing: handling message key renames and tagging optional/ignored. In any case we have a bus factor of 3 currently, and can increase that in the future.

It seems like, to do this correctly, we will need to make a planning decision in the next few months

+1

Do I understand correctly that you suggest the parts related to the integrator be owned by the Release Engineering team and the other parts (comparator and translation processor) be owned by either the Language Team or twn directly?

Translation processor is clearly a responsibility of translatewiki.net; integrator and comparator are a bit fuzzy depending on the details. We could split it the way you propose, but I would prefer if we had people from both teams at least familiar with the whole setup (to be able to triage and diagnose issues, etc.).

@Nikerabbit Do you think this should be merged with T158360? I can see a use for a "prepare the RFC" task as a separate one (this one). On the other hand, the collaboration with different stakeholders and input from others might be non-trivial and perhaps better to occur on the RFC stage :). Let me know, it's fine either way.

Also, what is the next step here, and who is expected to take that next step? E.g. Language engineering to further the proposal, or Release Engineering to implement or review part of it, or Core Platform to help implement a prerequisite?

Krinkle moved this task from Inbox to Backlog on the TechCom board. Jan 2 2019, 8:30 PM

@Krinkle I am planning to update that RFC in the near future.

I'm similarly confused. Some of the language in this task's description is kind of stilted and difficult to follow, but it feels like localization updates are now being held hostage.

Relevant quip:

But end this shutdown now. Thank you.

Hi all -- I just wanted to chime in to say that the WMF Growth team is eager for updates to go back to being nightly, which would help our team build and iterate on our features faster. Thanks for thinking about this!