Split core en.json to several files
Open, Low, Public

Description

Some time in 2010 or so I raised the idea of splitting the core translation file (then MessagesEn.php) into several files to make it easier for translators. The basic idea is that it's easier to approach the translations as several smaller groups than as one large group.

Back then it had about 2700 messages. @siebrand and @Nikerabbit were not enthusiastic about it, and said that it's not worth the effort. (We discussed it in person at the 2011 Berlin Hackathon, and possibly in writing on some mailing lists or Bugzilla tasks, but I cannot find it now.)

A few things changed since then:

  • The message count went up from about 2700 to 3800. In fact, it's over 4000 if you count the optional and ignored messages.
  • We transitioned from PHP to JSON.
  • In practice we already have several separate en.json files: the core itself is split into Core, API, and Installer, and there are also separate repos for skins.
  • translatewiki.net configuration files are not that hard to work with. (I don't quite know how they looked in 2010, to be honest, but I do know them now, and they aren't terrible.)

As far as I know, splitting a group is a matter of:

  • Finding a group of closely related messages, making sure that no information is lost compared to the current subgroups of messages en.json contains (T162172#3280030).
  • In the core repository (example):
    • Moving the relevant messages to a new en.json and qqq.json while keeping all the message keys identical. Unless there's a reason to do it differently, the new files should be under languages/i18n/new-group-name/en.json.
    • Adding an entry for the new file to function getMessagesDirs() in includes/cache/localisation/LocalisationCache.php.
    • Adding an entry for the new file to the banana section in Gruntfile.js.
  • In the translatewiki repository:
    • Adding a new group in groups/MediaWiki/MediaWiki.yaml and moving the ignored and optional messages into it (example).
    • Adding the new group to the appropriate aggregate "used by Wikimedia" group, such as groups/MediaWiki/WikimediaMainAgg.yaml or WikimediaTechnicalAgg.yaml (example).
    • Adding the new group to the mediawiki:/group: section in repoconfig.yaml (example).
  • Doing a new export so that the translations are moved as well.
  • (Did I miss anything? Does anything also need to be updated in the scripts for syncing translatewiki with Gerrit?)
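The "moving the relevant messages" step above can be sketched as a small script. This is only an illustration, not the tooling that was actually used: the function names and the idea of selecting messages by key prefix are assumptions, and copying the "@metadata" block into the new file is also an assumption about what the new file should contain.

```python
import json
from pathlib import Path


def split_messages(messages, prefix):
    """Split a message dict into (remaining, moved) by key prefix.

    Keys are kept identical. The "@metadata" block, if present, is copied
    into both results (an assumption made for this sketch).
    """
    remaining, moved = {}, {}
    metadata = messages.get("@metadata")
    if metadata is not None:
        remaining["@metadata"] = metadata
        moved["@metadata"] = metadata
    for key, value in messages.items():
        if key == "@metadata":
            continue
        target = moved if key.startswith(prefix) else remaining
        target[key] = value
    return remaining, moved


def split_file(src_path, dst_path, prefix):
    """Move messages whose keys start with `prefix` from src_path to dst_path."""
    with open(src_path, encoding="utf-8") as f:
        messages = json.load(f)
    remaining, moved = split_messages(messages, prefix)
    # Create the new group directory if it does not exist yet.
    Path(dst_path).parent.mkdir(parents=True, exist_ok=True)
    for path, data in ((src_path, remaining), (dst_path, moved)):
        with open(path, "w", encoding="utf-8") as f:
            # Tab indentation and unescaped non-ASCII text.
            json.dump(data, f, ensure_ascii=False, indent="\t")
            f.write("\n")
```

The same call would be run for both en.json and qqq.json, for example split_file("languages/i18n/en.json", "languages/i18n/exif/en.json", "exif-"), where the "exif-" prefix is a hypothetical selection rule.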

I'm not talking about splitting en.json into 50 groups, but some initial groups I can think of are:

  • definitely the exif tags (about 380 messages)
  • maybe calendars (not only Gregorian, but also Hebrew, Persian, days of week, etc.)
  • maybe log messages
  • I haven't given this much thought yet, but perhaps the ignored messages could be moved to a separate file. That file would simply not be loaded into translatewiki, and then we could remove the long "ignored" list from the translatewiki configuration (269 items at the moment). But that's really a separate issue to discuss.
  • Possibly some more.

I can do it myself some time as a pet project. This task is a kind of RFC: Are there caveats that I am missing? Is it harder than I imagine? Is anybody opposed to it for any reason?

Event Timeline

Amire80 created this task. Jun 13 2017, 8:07 AM
Restricted Application added a subscriber: Aklapper. Jun 13 2017, 8:07 AM
Nikerabbit added a comment. Jun 13 2017, 8:20 AM (edited)

For the developer, it should be very obvious which file new and existing messages belong to. Something like 7 groups would still be very manageable.

The location of these new directories deserves some thought. If they are all together, it would be easy to see the options. If they are in different places (like now with core and installer), the alternatives are harder to find, but the locations would naturally guide which i18n file to use for messages related to that component.

Technically, splitting is rather trivial, except for the initial effort of moving messages and translations to the correct files.

I am assuming no message keys will be renamed. There is a slightly increased chance of accidental message key collisions, but those are rare currently with core and all the extensions sharing a single namespace.
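A collision check along these lines could look like the following minimal sketch. The function name and the shape of its input (a mapping from group name to that group's already loaded message dict) are assumptions for illustration, not part of any existing tool.

```python
from collections import defaultdict


def find_key_collisions(groups):
    """Return {message key: [group names]} for keys defined in more than one group.

    `groups` maps a group name to its message dict. The "@metadata" entry
    is skipped because every i18n file is expected to have one.
    """
    owners = defaultdict(list)
    for group_name, messages in groups.items():
        for key in messages:
            if key != "@metadata":
                owners[key].append(group_name)
    return {key: names for key, names in owners.items() if len(names) > 1}
```

An empty result means every key lives in exactly one file, which is the property the split needs to preserve.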

The benefit would mostly be for translators: they could better focus on which parts to translate.

As a developer, it should be very obvious which file new and existing messages belong to. Something like 7 groups would still be very manageable.

Yes, that's more or less the number I'm thinking about.

The location of these new directories deserves some thought. If they are all together, it would be easy to see the options.

Yes, probably.

I am assuming no message keys will be renamed. There is a slightly increased chance of accidental message key collisions, but those are rare currently with core and all the extensions sharing a single namespace.

Yeah, and most messages recently have a prefix anyway.

The benefit would mostly be for translators: they could better focus on which parts to translate.

Indeed, that's the intention.

Nemo_bis updated the task description. Jun 13 2017, 8:36 AM
Nemo_bis added a subscriber: Nemo_bis.
waldyrious added a subscriber: waldyrious.
Nemo_bis triaged this task as Low priority. Oct 15 2017, 7:25 PM
Af420 added a subscriber: Af420. Mar 19 2018, 2:33 PM
jhsoby added a subscriber: jhsoby. Aug 8 2018, 12:00 AM
jhsoby added a comment. Aug 8 2018, 12:06 AM (edited)

Some suggestions for possible groupings:

  • Reader messages – messages that are seen by casual readers who don't touch the edit buttons or special pages at all
  • Editor messages – messages that are seen by people who edit, but are not necessarily logged in
  • User messages – messages that are seen by normal registered users (like preferences, etc)
  • Privileged messages – messages that are seen by people with special rights (patrollers, admins, etc)
    • (Maybe even split this one into different rights – for example, the vast majority of checkuser and abusefilter messages (and there are many of them!) are only for a very few select users, so translating those should probably be lowest priority, even much lower than messages for admins)
  • API messages – messages that are never seen by anyone 😜
Amire80 updated the task description. Sep 5 2018, 12:04 PM

Change 458165 had a related patch set uploaded (by Amire80; owner: Amire80):
[mediawiki/core@master] WIP Move exif messages to a separate i18n file

https://gerrit.wikimedia.org/r/458165

One thing to consider here is that the XMP parser, which is now a separate library, very softly depends on these messages (along with stuff in core which depends on them). It'd be cool if we could split this out in some way that you still get these messages if you use the XMP library independently.

Thanks for the comment! If I understand correctly, this sounds sensible, but I'm really not familiar with this. Who is developing it? (You?)

It was my GSoC project in 2010, but I'm not really maintaining it anymore, so I think the answer is nobody... (https://github.com/wikimedia/XMPReader for reference). There are some complicating factors though, in that MW still needs to have those messages (for the non-XMP exif support) and have them integrated into the MediaWiki namespace and friends. Not all of those messages are related to XMP (but most are, and that probably doesn't matter). I guess the way to do that would be to have a separate library containing the messages that both XMPReader and MediaWiki depend on, and have some magic to make messages from this library show up in the MediaWiki namespace. So actually doing this might derail this task, which I wouldn't want. In any case, splitting the exif messages into a separate JSON file is definitely the first step towards doing something like that.

If you can keep the format of the i18n files, all that is needed to use them in MediaWiki is to have them registered in $wgMessagesDirs. They can live in the XMPReader repo, which is then brought into MediaWiki in some manner (composer?).

Change 481489 had a related patch set uploaded (by Amire80; owner: Amire80):
[translatewiki@master] Split exif messages from MediaWiki core

https://gerrit.wikimedia.org/r/481489

Change 458165 merged by jenkins-bot:
[mediawiki/core@master] Move exif messages to a separate i18n file

https://gerrit.wikimedia.org/r/458165

Change 481489 merged by jenkins-bot:
[translatewiki@master] Split exif messages from MediaWiki core

https://gerrit.wikimedia.org/r/481489

It looks like the TWN bot has now removed the EXIF messages from the non-English .json files in the main languages/i18n directory, but hasn't yet created any non-English .json files in the languages/i18n/exif directory?
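A quick way to spot this kind of gap is to compare which language files exist in each directory. The sketch below is only an illustration with a hypothetical function name. It over-reports, since a language with a file in the main directory does not necessarily have any translations for the split-out messages.

```python
from pathlib import Path


def missing_translation_files(main_dir, group_dir):
    """List language codes with a JSON file in main_dir but none in group_dir.

    Only files directly inside each directory are considered, so a group
    directory nested under main_dir does not affect the main listing.
    """
    main_codes = {p.stem for p in Path(main_dir).glob("*.json")}
    group_codes = {p.stem for p in Path(group_dir).glob("*.json")}
    return sorted(main_codes - group_codes)
```

For the situation described here, the two arguments would be languages/i18n and languages/i18n/exif.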

Amire80 added a subscriber: Raymond. Jan 4 2019, 2:59 PM

Indeed. @Raymond, @Nikerabbit, do you have any idea about this? Did I do anything incorrectly? It's a bit concerning to see this just a few days before the next train.

Legoktm added a subscriber: Legoktm. Jan 4 2019, 7:18 PM

If you can keep the format of the i18n files, all that is needed to use them in MediaWiki is to have them registered in $wgMessagesDirs. They can live in the XMPReader repo, which is then brought into MediaWiki in some manner (composer?).

Yeah, this seems pretty doable. My main concern is that the library is not updated on a regular basis as the PHP code is fairly stable, so new i18n messages wouldn't be pulled in unless someone does a release. What is the acceptable delay from message being translated to being deployed? We could probably automate the release process for i18n changes, but I don't think we want to be tagging a new release for every day's updates...

Change 482360 had a related patch set uploaded (by Nikerabbit; owner: Nikerabbit):
[translatewiki@master] Export mediawiki-exif with core

https://gerrit.wikimedia.org/r/482360

Change 482360 merged by jenkins-bot:
[translatewiki@master] Export mediawiki-exif with core

https://gerrit.wikimedia.org/r/482360

Amire80 updated the task description. Jan 5 2019, 12:48 PM

It looks like the TWN bot has now removed the EXIF messages from the non-English .json files in the main languages/i18n directory, but hasn't yet created any non-English .json files in the languages/i18n/exif directory?

Indeed. @Raymond, @Nikerabbit, do you have any idea about this? Did I do anything incorrectly? It's a bit concerning to see this just a few days before the next train.

Fixed with https://gerrit.wikimedia.org/r/482360 and https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/482409/ . I also documented the fix in this task's description.

Amire80 updated the task description.

Change 542007 had a related patch set uploaded (by Amire80; owner: Amire80):
[mediawiki/core@master] Split rest messages from the main en.json

https://gerrit.wikimedia.org/r/542007