Move special page alias translations to JSON
Closed, Resolved | Public | 8 Estimated Story Points

Description

The names of special pages have historically been translated using a PHP file. It's proposed that we move to a translatewiki-based approach using JSON files.

The current setup poses problems for code maintainers: engineers often merge code containing translated strings in languages they do not understand, and translators are required to understand basic PHP syntax. In addition, these files do not enjoy the same extent of translation as the translatewiki-based translations do. For example, the Special:Nearby page has 45 translations in the alias file, whereas the exact same string inside the i18n folder for the mobile-frontend-nearby-title message enjoys 148 unique translations via translatewiki. Let’s pay off this technical debt and improve translations for our projects.

Current Implementation

Added a new configuration parameter: TranslationAliasesDirs that takes a directory as input. The directory is expected to have per language JSON files containing special page aliases. Example:

i18n/aliases/ar.json
{
	"SpecialPageAliases": {
		"NotifyTranslators": [
			"إخطار_المترجمين"
		],
		"TranslatorSignup": [
			"اشتراك_المستخدمين"
		]
	}
}

The TranslationAliasesDirs can be defined in extension.json:

extension.json
...
"TranslationAliasesDirs": {
		"TranslationNotificationsAlias": "i18n/aliases/"
},
...

We've added a maintenance script (ConvertExtensionsMessagesToTranslationAlias) that can convert existing ExtensionMessagesFiles to individual per-language JSON files in the format accepted by TranslationAliasesDirs.
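
As a rough illustration of the kind of transformation such a conversion performs, here is a minimal Python sketch. It is not the maintenance script itself (that lives in MediaWiki core and is not shown in this task). It assumes the legacy aliases have already been loaded into a dict keyed by language code; the function name `write_alias_files`, the `legacy` sample data and the default wrapping key are assumptions made for the example.

```python
import json
import tempfile
from pathlib import Path


def write_alias_files(aliases_by_lang, out_dir, top_key="specialPageAliases"):
    """Write one <lang>.json file per language code.

    aliases_by_lang: {lang_code: {special_page_name: [alias, ...]}}
    top_key: assumed name of the wrapping key; adjust to what core expects.
    """
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    written = []
    for lang, pages in sorted(aliases_by_lang.items()):
        target = out_path / f"{lang}.json"
        # ensure_ascii=False keeps non-Latin aliases readable on disk.
        text = json.dumps({top_key: pages}, ensure_ascii=False, indent="\t")
        target.write_text(text + "\n", encoding="utf-8")
        written.append(target.name)
    return written


# Hypothetical input mirroring the examples in this task.
legacy = {
    "ar": {"NotifyTranslators": ["إخطار_المترجمين"]},
    "bn": {"TranslatorSignup": ["অনুবাদকের_নিবন্ধন"]},
}
demo_dir = tempfile.mkdtemp()
files = write_alias_files(legacy, demo_dir)
```

Each language ends up in its own file, so a change to one language touches only that file.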

Relevant patches:

  1. https://gerrit.wikimedia.org/r/c/mediawiki/core/+/977085 - Patch in core that adds the TranslationAliasesDirs and the ConvertExtensionsMessagesToTranslationAlias maintenance script.
  2. https://gerrit.wikimedia.org/r/c/mediawiki/extensions/TranslationNotifications/+/977084 - Patch in TranslationNotifications extension to use TranslationAliasesDirs configuration parameter.

Event Timeline

Change 787835 had a related patch set uploaded (by Jdlrobson; author: Jdlrobson):

[mediawiki/core@master] Alias support via message key via LocalisationCache

https://gerrit.wikimedia.org/r/787835

I had a look at the LocalisationCache today. Hopefully this is more what you had in mind?

@Nikerabbit any thoughts on this, or is there someone you could delegate this task to help me move this forward?

I've submitted a Localization infrastructure needs request to the language team for help with this sometime soon.

Using a string, with \n for splitting, especially for when mixing rtl and ltr, feels like a step backwards.

And performance wise, repeatedly splitting a string doesn't feel the best, especially if we can store it in an array and making it more easily editable...

Using a string, with \n for splitting, especially for when mixing rtl and ltr, feels like a step backwards.

Can you expand on this? We do this in MediaWiki:Sidebar, so I was looking to retain existing norms.

And performance wise, repeatedly splitting a string doesn't feel the best, especially if we can store it in an array and making it more easily editable...

What would you suggest instead? This seems to be the preferred option by TranslateWiki (T89947#7864804). My understanding with LocalizationCache is that it runs once on startup (but I may have misunderstood that?).

An alternative approach would be to add an additional step to the i18n bot build process where a specialPageAlias.json is generated from the i18n/en.json however that would require more work on TranslateWiki side.

I'm open to any ideas you have, but it's clear to me that the PHP-based translations are not getting the level of translation that our other messages do, which is understandable since not all translators know PHP, so I'm keen to make our software more friendly in this regard.

Using a string, with \n for splitting, especially for when mixing rtl and ltr, feels like a step backwards.

Can you expand on this? We do this in MediaWiki:Sidebar, so I was looking to retain existing norms.

Yes, but it's designed to be edited on the end-user wikis, which is a different use case from this one. And people are not entering Page1\nPage2\nPage3, they're putting newlines where they'd expect.

And the Message is then run through a parser, to create the sidebar links. They're not split up to be used separately.

Though presumably it would be edited onwiki on twn, but could be transformed inflight (or during export?) from a list of strings into the array format.

Take the string "special-page-alias-MobileOptions": "הגדרות_נייד\nגדרות_פלאפון\nהגדרות_סלולרי",

Try to remove the "last" character before a \n. Does it do what you expect?

Editing any mixed ltr and rtl string is often painful.

While some people will edit via translatewiki, some will still edit the code files directly. And people will read from it.

And performance wise, repeatedly splitting a string doesn't feel the best, especially if we can store it in an array and making it more easily editable...

What would you suggest instead? This seems to be the preferred option by TranslateWiki (T89947#7864804). My understanding with LocalizationCache is that it runs once on startup (but I may have misunderstood that?).

It was the preferred option of those presented, and Niklas said "or variant thereof", which doesn't necessarily mean "exactly that" format :).

The string would still need to be split to be used, to (eventually) work out if the called (special page) title is a valid Special page... Including any fallback chain etc...

Keeping the value as an array would seem to make the output simpler to read: a clear list of strings, rather than a human having to interpret literal newline characters in the middle of a string.

"special-page-alias-MobileOptions": [ "הגדרות_נייד", "גדרות_פלאפון", "הגדרות_סלולרי" ],

How things are going to be edited onwiki is obviously different to how it will be stored "ondisk".

My preference is still option 1 for the format, but for implementation, I think it would be best to amend LocalizationCache to read these files directly (it already does it for PHP format) and bypass MessageCache/Message completely as unnecessary coupling and complexity (e.g. the fallback rules for messages are very complex).

^ If we're wanting to read the files directly... Putting them in the existing message json files isn't going to work, as they're large files to reparse just for this purpose.

It does sound like Niklas' interpretation of how this should be implemented is different from how your patch implements it (you're using the existing messages as loaded at that point, not "amend LocalizationCache to read these files directly (it already does it for PHP format)"), which does a fairly expensive search, as it looks through *every* message (MediaWiki core's en.json has over 4000 messages, without looking at the API, exif etc) for the text special-page-alias-. Then we have extension and skin messages...
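
The difference between the two approaches can be sketched roughly as below. This is a toy Python illustration, not code from either patch: the message set is synthetic, the function names are made up, and `MobilePreferences` is a hypothetical alias used only as sample data.

```python
PREFIX = "special-page-alias-"


def aliases_by_prefix_scan(messages):
    """Look at every message key to find aliases: cost grows with all messages."""
    found = {}
    for key, value in messages.items():
        if key.startswith(PREFIX):
            # Newline-delimited values must also be split before use.
            found[key[len(PREFIX):]] = value.split("\n")
    return found


def aliases_from_dedicated_data(alias_data):
    """Read aliases from data that contains nothing but aliases."""
    return {page: list(names) for page, names in alias_data.items()}


# Synthetic stand-in for a large message file plus one alias entry.
messages = {f"message-{i}": "text" for i in range(4000)}
messages[PREFIX + "MobileOptions"] = "MobileOptions\nMobilePreferences"

# The same aliases kept in a separate, alias-only structure.
dedicated = {"MobileOptions": ["MobileOptions", "MobilePreferences"]}

scanned = aliases_by_prefix_scan(messages)
direct = aliases_from_dedicated_data(dedicated)
```

Both calls produce the same mapping, but the first has to visit all 4001 keys to get there, while the second only touches alias entries.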

It sounds like Niklas is suggesting to put them in some separate file (what exactly, or where, I'm not sure).

WMF Production:

reedy@deploy1002:/srv/mediawiki-staging/php-1.39.0-wmf.27/cache/l10n/upstream$ wc -l l10n_cache-en.cdb.json 
27413 l10n_cache-en.cdb.json

For some wikis (not the WMF), the localisation cache is built on the fly during web requests if it doesn't exist, or it's out of date. This potentially adds a performance penalty to these requests.

It would seem a little excessive (though, it makes sense) to have individual files for each language, to just have potentially one string in them.

It is unclear what Niklas means by "read these files directly", so would be helpful if he could clarify...

I also don't quite understand how "turning the files into JSON" (which most translators don't touch directly anyway; they don't care what format it is in, nor should they) suddenly re-enables translation of these. Part of the blocker is T109235: Re-enable Special:AdvancedTranslate on translatewiki.net; the interface isn't currently usable, which is part of the reason translations aren't more forthcoming, as it currently means filing tasks on Phab and making patches on Gerrit. The related logic that Niklas documents would need reimplementing too into a different workflow.

There's also tooling around that, telling people that the "Translation" of these strings doesn't just have to be a literal copy/translation of the source (English) string, synonyms would be allowed.

I'm obviously not saying we shouldn't be moving them to JSON (this is obviously a pattern we generally strive for for "data"), it just doesn't seem to necessarily help (nor does it really hinder) the process at this point, if the other requirements/tools aren't already in place.

See also: T220759: Provide an alternative to wgExtensionMessagesFiles for non-message i18n.

I think @Reedy's last comment is quite spot on. The file format is probably the easiest thing here when considering the full workflow for having these translatable in translatewiki.net

My rough idea for what that could look like is:

  • Convert the current PHP files to JSON (because JSON is easier to work with)
  • Teach LocalisationCache that special page aliases can also be in a JSON format
  • Add file format support for this new JSON format in Translate (if needed)
  • Add semi-automatic message group creation
  • Add lots and lots of validations and restrictions for translations to avoid breakage when changes are not subject to pre-review.

These should not be mixed with JSON files containing translations, that just adds unnecessary complexity.

Thanks for the suggestions. I'll take another pass :)

Change 773887 abandoned by Jdlrobson:

[mediawiki/core@master] Alias support via message key

Reason:

https://gerrit.wikimedia.org/r/773887

Change 773889 abandoned by Jdlrobson:

[mediawiki/extensions/MobileFrontend@master] MobileFrontend uses newly proposed alias format

Reason:

https://gerrit.wikimedia.org/r/773889

Change 829084 had a related patch set uploaded (by Jdlrobson; author: Jdlrobson):

[mediawiki/core@master] Allow special page aliases to load from JSON

https://gerrit.wikimedia.org/r/829084

Change 829085 had a related patch set uploaded (by Jdlrobson; author: Jdlrobson):

[mediawiki/extensions/MobileFrontend@master] Convert PHP messages to JSON

https://gerrit.wikimedia.org/r/829085

I share the aforementioned concerns. It is vital for long-term platform stability that these values not be freely translatable, locally overridable, or otherwise be subject to the general interventions and freedoms that we provide to interface messages. They form part of canonical URLs. I'd describe them as one step away from domain names. Rarely added, slowly and carefully changed, and kept indefinitely.

The previously proposed format through the messages files would introduce a long-term source of conflicts and leakage at every level, each of which would need to be plugged. The most difficult one to plug, however, is the human layer. Both translators and developer on-boarding would be complicated by conflicting expectations and prior knowledge. E.g. an entire ecosystem of documentation and expectations mentioning interface messages would be retroactively invalidated unless followed by "... but not for X, Y and Z that we treat differently".

In some cases, such compromises are hard to avoid. In even fewer cases, they might be worthwhile if there is a significant benefit to be enjoyed at a higher level from that approach. In this case, however, the compromise is neither hard to avoid (we can simply pick another file or format), nor does it result in benefits, as by design nearly every possible benefit through localisation-related systems and translatewiki would in fact be a bug; these are not meant to interact with any general message-related localisation features.

Niklas's suggestion reads to me as a summary of the 2017 RFC outlined here:
https://www.mediawiki.org/wiki/Requests_for_comment/Move_i18n_data_into_JSON

The above page has concrete details and pointers that might help you with the implementation.

I share the aforementioned concerns.

To be clear. I have abandoned the old approach using new line-separated messages. This was a misunderstanding on my part of what we wanted that Reedy has helped clarify (thanks @Reedy!).

The above page has concrete details and pointers that might help you with the implementation.

The patch I posted after my last comment (https://gerrit.wikimedia.org/r/c/mediawiki/core/+/829084) uses a single JSON file. It's quite a simple change, and might be a useful first step in that direction, with value in itself, as JSON files are easier to modify than PHP IMO. (Please see the example of how that would look in MobileFrontend.)

It's unclear to me if we want to go further than that and merge it into the existing JSON files per https://www.mediawiki.org/wiki/Requests_for_comment/Move_i18n_data_into_JSON#Proposal. I defer to @Nikerabbit on that one. That seems a bit riskier. The RFC proposal is vague as it implies addition to the existing messages but I'm not sure how compatible that is with translatewiki's existing scripts.

I'm not looking for implementation pointers at this point - simply clarification of the specification.

Jdlrobson updated the task description.

Change 787835 abandoned by Jdlrobson:

[mediawiki/core@master] Alias support via message key via LocalisationCache

Reason:

Please see https://gerrit.wikimedia.org/r/c/mediawiki/core/+/829084 for the latest iteration of this patch.

https://gerrit.wikimedia.org/r/787835

Hi @santhosh do you have any thoughts on https://gerrit.wikimedia.org/r/c/mediawiki/core/+/829084 ? In particular I'm keen to know what's blocking us from merging it since it seems a highly meaningful change that could drive more translations to special page aliases. Given localization of special page aliases is currently done manually by PHP, it seems no different to switch the format to a manual JSON. I'd like to try this out in MobileFrontend and help work on some tooling on the translatewiki side. If you want to take a completely different approach let me know and I'll back off entirely. Thanks in advance.

@Jdlrobson Sorry we haven't gotten time to look into this yet. This topic is in our short list. My main hesitation here is about reviewing this patch in isolation without having clarity on the long-term end state we want to achieve, to avoid generating unnecessary churn if we need to change things once again.

Thanks for the update! Glad to know that it's still in your minds. When would be a good realistic time to follow up on this?

Nikerabbit raised the priority of this task from Low to High. Apr 4 2023, 12:03 PM
Nikerabbit set the point value for this task to 8. Apr 4 2023, 12:15 PM

Change 829085 abandoned by Jdlrobson:

[mediawiki/extensions/MobileFrontend@master] Convert PHP messages to JSON

Reason:

I'll restore this when https://gerrit.wikimedia.org/r/q/Iba57511e334e282f684b3e64a3e6a619105baab0 has got more traction.

https://gerrit.wikimedia.org/r/829085

abi_ changed the task status from Open to In Progress. Jul 4 2023, 2:16 PM
abi_ claimed this task.

Change 977084 had a related patch set uploaded (by Abijeet Patro; author: Abijeet Patro):

[mediawiki/extensions/TranslationNotifications@master] Add special page aliases in ExtensionMessageJsonDirs

https://gerrit.wikimedia.org/r/977084

Change 977085 had a related patch set uploaded (by Abijeet Patro; author: Abijeet Patro):

[mediawiki/core@master] Add ability to define configuration parameters in JSON

https://gerrit.wikimedia.org/r/977085

I've discussed this with Niklas, and here's my proposed plan, taking into account our intention to utilize translatewiki.net for localisation:

1. Deciding a format to store the Special page aliases

  • We want to use JSON for this file format, same as
  • We want to use separate files for each language, because with a single file:
    • It is difficult to add support for it on translatewiki.net
    • It is difficult to track changes via git, as the single file goes through a lot of churn
    • Parallel exports of languages cause issues, amongst other things

2. Updating MediaWiki core

  • MediaWiki core needs to be made aware that Special page aliases are now in JSON files and use those instead.

2.1 Updating LocalisationCache to handle the said format

  • We do not want to parse the JSON file every time.
  • Update LocalisationCache to parse the file and store it in the cache
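
The "parse once, then serve from the cache" idea can be sketched in a few lines of Python. This is only an illustration of the principle under simplifying assumptions, not how LocalisationCache is implemented: the class name `AliasCache`, the in-memory dict, the `parse_count` counter and the wrapping key are all hypothetical, and there is no invalidation, persistence or language fallback here.

```python
import json
import tempfile
from pathlib import Path


class AliasCache:
    """Parse each language's alias file at most once, then serve from memory."""

    def __init__(self, aliases_dir, top_key="specialPageAliases"):
        self.aliases_dir = Path(aliases_dir)
        self.top_key = top_key  # assumed wrapping key in each file
        self._cache = {}
        self.parse_count = 0  # exposed only to make the caching observable

    def get(self, lang):
        if lang not in self._cache:
            path = self.aliases_dir / f"{lang}.json"
            if path.exists():
                data = json.loads(path.read_text(encoding="utf-8"))
                self.parse_count += 1
            else:
                data = {}  # no alias file for this language
            self._cache[lang] = data.get(self.top_key, {})
        return self._cache[lang]


# Demo with a throwaway directory and a placeholder alias.
demo_dir = Path(tempfile.mkdtemp())
(demo_dir / "bn.json").write_text(
    json.dumps({"specialPageAliases": {"TranslatorSignup": ["Example_alias"]}}),
    encoding="utf-8",
)
cache = AliasCache(demo_dir)
first = cache.get("bn")
second = cache.get("bn")   # served from memory, file is not parsed again
missing = cache.get("xx")  # language without a file yields an empty mapping
```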

3. Update Translatewiki.net

Custom message group

  • Setup custom message group to handle this new format.

Validations

  • Strict validations in place for the custom message group

For the 2nd point above, I've submitted a proof-of-concept:

  1. 977085: Allow defining configuration parameters in JSON | https://gerrit.wikimedia.org/r/c/mediawiki/core/+/977085
  2. 977084: Add special page aliases in ExtensionMessageJsonDirs | https://gerrit.wikimedia.org/r/c/mediawiki/extensions/TranslationNotifications/+/977084 (Uses the patch above)

Sample format

bn.json
{
	"specialPageAliases": {
		"NotifyTranslators": [
			"অনুবাদককে_বিজ্ঞপ্তি",
			"অনুবাদকের_জন্য_বিজ্ঞপ্তি"
		],
		"TranslatorSignup": [
			"অনুবাদকের_নিবন্ধন"
		]
	}
}

Why are we introducing a new configuration parameter?

There are two other variables that we reviewed: MessagesDirs and ExtensionMessagesFiles

  • MessagesDirs: Only works for interface messages and expects message keys to be the base key in the JSON. This would be a problem if we wanted to use a JSON file for localisation of more than one configuration parameter.
  • ExtensionMessagesFiles: Expects to read a PHP file and all the different language codes are present in a single file.

Why are we using an array as values?

If you consider the two strings:

  1. "הגדרות_נייד\nגדרות_פלאפון\nהגדרות_סלולרי"
  2. [ "הגדרות_נייד", "גדרות_פלאפון", "הגדרות_סלולרי" ]

It's much easier to read the second form.
In addition, ingesting the array is easier and does not require a split on \n.
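
A small Python sketch of the ingestion difference, using placeholder Latin-script aliases (`Alias_one` etc. are made up for the example) so that the escaping is easy to follow:

```python
import json

# Form 1: one string with embedded newline escapes.
as_string = json.loads('{"MobileOptions": "Alias_one\\nAlias_two\\nAlias_three"}')
# Form 2: a JSON array of strings.
as_array = json.loads('{"MobileOptions": ["Alias_one", "Alias_two", "Alias_three"]}')

# The string form needs an extra split step before the aliases can be used.
from_string = as_string["MobileOptions"].split("\n")
# The array form is usable as soon as the JSON is decoded.
from_array = as_array["MobileOptions"]
```

Both end up as the same list of three aliases; the array form simply skips the consumer-side split and keeps each alias visibly separate in the file.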

Change 829084 abandoned by Jdlrobson:

[mediawiki/core@master] Allow special page aliases to load from JSON

Reason:

See https://gerrit.wikimedia.org/r/c/mediawiki/core/+/977085

https://gerrit.wikimedia.org/r/829084

The proposal makes sense to me. How can I help? What do you see as the blockers for doing this?

I think this would be of interest to wikitech-l so would recommend pointing people towards this Phabricator ticket!

Thanks.

I don't see any blockers, but would like some help with code reviews of the proof-of-concept patch. We can iterate on it to turn it into something that can be merged in.

Also I'm not so sure of the new config variable name...so suggestions on that front will be helpful.

I'll post this to wikitech-l.

With regards to naming:

  • "extension" is bad since I think this should work for MediaWiki core as well
  • "json" does not differentiate from "messages"

Some alternatives to consider:

  • Localization(Dirs) like Messages(Dirs) with messages being subset of Localization
  • ComplexMessages(Dirs) or SpecialMessages(Dirs) highlighting the fact that these require special attention and work differently from plain UI messages
  • Aliases(Dirs), given those are called namespace aliases, special page aliases. We don't usually call magic words as magic word aliases, but I don't think it's wrong since the English Names always work. Aliases may be hard to understand alone though, so could also be AliasTranslations(Dirs) or TranslationAliases(Dirs)

I've updated the patch to use TranslationAliasesDirs

I share the aforementioned concerns. It is vital for long-term platform stability that these values not be freely translatable, locally overridable, or otherwise be subject to the general interventions and freedoms that we provide to interface messages. They form part of canonical URLs. I'd describe them as one step away from domain names. Rarely added, slowly and carefully changed, and kept indefinitely.

How is this being handled in the current proposal? As @Krinkle states, these messages should usually only be translated once, or URLs will break.

I'm a bit concerned that we are exposing the ability to break site functionality to TW users, where usually such a change would have to go through code review.

Strictly speaking, translatewiki.net integration is out of scope for this task, though that remains the end goal. For this step, the main benefit is storing data in non-executable format. Integration with translatewiki.net will not be added until we have validations at least as strict as we used to have (see e.g. my list in T89947#7685561).

In the current proposal, these translations won't appear as interface messages in the local wikis. They are kept separate as they currently are.

Change 1003029 had a related patch set uploaded (by Winston Sung; author: Winston Sung):

[mediawiki/extensions/Wikibase@master] [POC] Convert PHP array of extension messages to JSON

https://gerrit.wikimedia.org/r/1003029

Change 998274 had a related patch set uploaded (by Winston Sung; author: Winston Sung):

[mediawiki/core@master] [POC] Add extension, skin, config for alias directories

https://gerrit.wikimedia.org/r/998274

Change 1003034 had a related patch set uploaded (by Winston Sung; author: Winston Sung):

[mediawiki/extensions/Scribunto@master] [POC] Convert PHP array of extension messages to JSON

https://gerrit.wikimedia.org/r/1003034

Change 1009153 had a related patch set uploaded (by Abijeet Patro; author: Abijeet Patro):

[mediawiki/extensions/TwnMainPage@master] Use TranslationAliasesDirs to specify special page aliases in JSON

https://gerrit.wikimedia.org/r/1009153

Change 977085 merged by jenkins-bot:

[mediawiki/core@master] Add TranslationAliasesDirs to specify special page aliases in JSON

https://gerrit.wikimedia.org/r/977085

Change 1009153 merged by jenkins-bot:

[mediawiki/extensions/TwnMainPage@master] Use TranslationAliasesDirs to specify special page aliases in JSON

https://gerrit.wikimedia.org/r/1009153

We've updated the TwnMainPage (https://gerrit.wikimedia.org/r/c/mediawiki/extensions/TwnMainPage/+/1009153) to use the new TranslationAliasesDirs configuration option. The relevant patches have been deployed on translatewiki.net. Did not notice any issues.

Leaving this open to ride the train next week and monitor for issues.

Change 977084 merged by jenkins-bot:

[mediawiki/extensions/TranslationNotifications@master] Define special page aliases with TranslationAliasesDirs

https://gerrit.wikimedia.org/r/977084

The latest changes have been deployed on production wikis for a few days now, and no issues have been reported. The TranslationNotifications extension is using the new format of defining special page aliases.

Note: mergeMessageFileList.php needs to be updated for WMF-hosted sites and wiki farms using manualRecache for $wgLocalisationCacheConf.