
Explore translation options for Toolhub dynamic content
Closed, Resolved · Public

Description

Toolhub is currently envisioned as a Django application rather than a MediaWiki extension. As such it will not have a direct connection to the existing on-wiki translation communities on mediawiki.org and Meta. We do, however, want to provide multilingual support in the project.

We should be able to register the Django project itself at translatewiki.net and receive user interface translations from that community. The bigger area in need of exploration is what, if anything, we can do to provide translations of toolinfo.json content strings, as well as any additional "annotations" that are added by users via the Toolhub APIs and web interface.

Questions:

  • Is there a way to feed selected strings from the user-generated content to translatewiki.net and get translations back?
    • Yes, we have identified two ways this might be accomplished: via a git repo of exported strings, or via Action API calls to TWN. A "tech spike" is needed to get a better idea of the feasibility of an API-driven approach. We would need to make both TWN and ourselves comfortable with the latencies and load that an API-based approach would place on TWN.
  • Is there a robust Django translation app that can be reused to provide in-app translations? Review https://github.com/bbmokhtari/django-translations and look for other competing solutions.
  • Can we leverage Wikidata Qs and their translations as a way to localize content on the fly? This may be the nicest approach for "tag" or "category" elements in the system.

Event Timeline

bd808 triaged this task as High priority. Aug 6 2020, 8:43 PM
bd808 added subscribers: siebrand, Nikerabbit.

This feels like a pretty high-priority thing to look into early in the code design phase. I don't think we need to have all the answers, but it would be good to have some general idea of what the initial approach will be for localization of the dynamic user-generated content.

Reaching out to @Nikerabbit and @siebrand seems like a good place to start as they are both deep knowledge experts in localization and translation within the Wikimedia movement.

AFAIK Django supports gettext. translatewiki.net supports gettext. So I don’t really expect issues. Regular i18n concerns will always apply. See https://mediawiki.org/wiki/Localisation for some MediaWiki-specific hints; most are applicable to i18n of any code base.

I did a quick review of https://github.com/bbmokhtari/django-translations and found it to be useful, but a very low-level building block. It provides extensions to Django's model layer which can be used to store and retrieve localized strings associated with a model instance. It also provides a basic administrative interface for recording translations. It does not, however, appear to offer any sort of translation memory or versioning to flag stale translations. Its data model is basically storing (model, field, language, value) tuples for each localized string.
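
In Django terms, that data model amounts to something like this sketch (hypothetical names, not the library's actual classes):

```
from django.contrib.contenttypes.fields import GenericForeignKey
from django.contrib.contenttypes.models import ContentType
from django.db import models

class Translation(models.Model):
    """One (model, field, language, value) tuple per localized string."""
    content_type = models.ForeignKey(ContentType, on_delete=models.CASCADE)
    object_id = models.PositiveIntegerField()
    content_object = GenericForeignKey("content_type", "object_id")
    field = models.CharField(max_length=64)     # e.g. "description"
    language = models.CharField(max_length=32)  # e.g. "de"
    text = models.TextField()

    class Meta:
        unique_together = ("content_type", "object_id", "field", "language")
```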

https://github.com/deschler/django-modeltranslation is another Django app for translating models. It looks to be more actively developed than django-translations, but has other drawbacks. The main one is that it stores localized strings in the same table as the base model using a <base field>_<lang code> naming convention. This in turn means that if you wanted to support 10 languages (a pretty small number in the Wikimedia world) the backing database table would end up needing 11 columns for each localizable element (for example: name, name_en, name_de, name_fr, name_ja, ...).
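
For illustration, registering a model with django-modeltranslation looks roughly like this (the Tool model and field names are invented):

```
# translation.py
from modeltranslation.translator import register, TranslationOptions

from .models import Tool  # hypothetical Toolhub model

@register(Tool)
class ToolTranslationOptions(TranslationOptions):
    fields = ("name", "description")

# With LANGUAGES = [("en", "English"), ("de", "German"), ("fr", "French")]
# in settings.py, a migration then adds name_en, name_de, name_fr,
# description_en, description_de, description_fr next to the base columns.
```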

AFAIK Django supports gettext. translatewiki.net supports gettext. So I don’t really expect issues. Regular i18n concerns will always apply. See https://mediawiki.org/wiki/Localisation for some MediaWiki-specific hints; most are applicable to i18n of any code base.

Yes, I expect TWN integration with Django's gettext 'po' files to be relatively easy to set up.
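
That side is the stock Django gettext workflow; a minimal sketch (the message and function are made up):

```
# Mark strings for translation with Django's gettext wrappers.
from django.utils.translation import gettext as _

def not_found_header():
    return _("Tool not found")

# Extraction and compilation then use the standard commands:
#   django-admin makemessages -l fi   # writes locale/fi/LC_MESSAGES/django.po
#   django-admin compilemessages      # compiles .po -> .mo for runtime use
```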

The more interesting problem to consider is localization of dynamic content (think articles instead of messages). The toolinfo.json data for each tool contains some potentially localizable freeform text values (mostly "title", "subtitle", and "description" in the current spec). The application will also (eventually) allow some "annotations" to be added to the toolinfo data by the users of Toolhub. We don't have a solid specification for what annotations will be allowed, but it is reasonable to assume that there will be some freeform text fields involved here as well. The thing I want to investigate under this task is technical options for providing some means of translation for these freeform text fields.
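
To make that concrete, here is an invented toolinfo record showing the freeform fields named above (values are illustrative only, and other fields are omitted):

```
{
  "name": "example-tool",
  "title": "Example Tool",
  "subtitle": "Does an example thing",
  "description": "A longer freeform explanation of what the tool does and why you might want it.",
  "url": "https://example.toolforge.org/"
}
```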

One random idea I floated in the initial description is building some export/import pipeline that would work to get translatable strings into TWN and then pull the translations back into the application. I'll do some hand waving here about what implementing that would take on the Toolhub side, but I think it is possible. A "direct" approach to this would be something like a daily dump/load cycle to a git repo in a TWN supported file format. I was worried that the number of messages that kind of export might generate could be too large for TWN to comfortably handle. Looking at the translation stats for MediaWiki however I see that it has more than 38K messages in total at TWN, so it feels like that concern is unfounded, or at least that it would take a lot of toolinfo records to push the limits on the TWN side of things, assuming that there are only a few freeform strings to translate per tool.

Can I dream of a somewhat more API-based real-time integration?

Can I dream of a somewhat more API-based real-time integration?

Sure! Dreams are what make us find better solutions, right? I looked at the docs on TWN a bit and did not see an obvious page about using an API other than the git import/export system.

On the TWN side (as far as I understand it), the messages end up being normal pages like https://translatewiki.net/wiki/Wikimedia:Wsa-404-header/en, https://translatewiki.net/wiki/Wikimedia:Wsa-404-header/qqq, and https://translatewiki.net/wiki/Wikimedia:Wsa-404-header/pt. Could integration at an API level be as direct as using the Action API to create or update a message to be localized? That seems to be functionally what the import process for a git-based source file does. I also see action=managemessagegroups in the API list which looks like it could be used to create new messages, but it is marked as "internal". Getting translated things back out of TWN looks to be as direct as calling action=query&meta=messagetranslations?
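
If that pans out, the read side might be as small as this sketch (assuming meta=messagetranslations behaves the way the API listing suggests; I have not verified the response shape):

```
# Fetch all translations of one TWN message via the Action API.
import requests

resp = requests.get(
    "https://translatewiki.net/w/api.php",
    params={
        "action": "query",
        "meta": "messagetranslations",
        "mttitle": "Wikimedia:Wsa-404-header",
        "format": "json",
    },
    timeout=10,
)
resp.raise_for_status()
# Each item should carry the language subpage and translated text; the
# exact key names would need to be confirmed against the live API.
for item in resp.json()["query"]["messagetranslations"]:
    print(item)
```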

Related question: does TWN support setting the origin language per message or only per message group? Most (all?) of the toolinfo records I have seen myself are written in English, but as we get more usage of that specification through this new project I can certainly imagine that a tool primarily written for a Wikimedia project in language QQQ could (and probably should) have its documentation written in that human language. (Note to self, this feels like a missing bit of metadata in the toolinfo 1.1.1 spec. There is a "supported languages" collection element, but not a "language" code for the json file itself, which means we would have to heuristically guess the input language, which is gross.)

bd808 renamed this task from Explore translation options for Toolhub records to Explore translation options for Toolhub dynamic content. Aug 6 2020, 11:26 PM

Can I dream of a somewhat more API-based real-time integration?

Translatewiki.net is not suitable for on-demand fetching of translations through an API: our setup is not highly available, nor designed with those kinds of performance considerations in mind.

Is there a way to feed selected strings from the user-generated content to translatewiki.net and get translations back?

From my point of view, the path of least resistance is (and this is of course up for discussion):

1. regular dump from toolhub database

  • Go over all the toolinfo entries, pick up the fields that contain translatable texts
  • Dump them in an en.json file with stable keys, e.g. toolname-fieldname (see the sketch after this list)

2. store the dump in a VCS repository

  • You can do this as often as you want, but currently we process incoming updates every two hours
  • "safe" updates are applied immediately (only additions), if there are non-safe updates (deletions, changes of content) they are applied after a human checks for need to fuzzy or rename

3. utilize translations from the same VCS repository

  • We will regularly push translations (currently twice a week, hoping to make it fully automated in the future)
  • You can either add glue code to read translations from json files directly, or add some process to convert them into a format and storage suitable for toolhub

This process will automatically handle additions, deletions and changes (with tracking of which translations are outdated). It is zero extra overhead for us, as all other projects follow the same pattern. This would be just one new message group in one new VCS repo, something we have thousands of.
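
A minimal sketch of what the step 1 dump could look like on the Toolhub side (the Tool model, its fields, and the key scheme are assumptions based on the discussion above):

```
# Dump translatable toolinfo fields to i18n/en.json with stable keys.
import json

from toolhub.models import Tool  # hypothetical Toolhub model

TRANSLATABLE_FIELDS = ("title", "subtitle", "description")

def export_messages():
    """Build a flat {key: source_text} dict, e.g. {"mytool-title": "..."}."""
    messages = {}
    for tool in Tool.objects.all():
        for field in TRANSLATABLE_FIELDS:
            value = getattr(tool, field, "")
            if value:
                messages[f"{tool.name}-{field}"] = value
    return messages

with open("i18n/en.json", "w", encoding="utf-8") as f:
    json.dump(export_messages(), f, ensure_ascii=False, indent=2)
```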

Can we leverage Wikidata Qs and their translations as a way to localize content on the fly? This may be the nicest approach for "tag" or "category" elements in the system.

Yeah, I would try this. I know it works for news articles. Not sure if there are tags/categories that would not make sense to include in Wikidata; that would be a problem.

Related question: does TWN support setting the origin language per message or only per message group?

Per message group is fully supported. All new code is written to support source language per message, but the system as a whole does not support it.

Can I dream of a somewhat more API-based real-time integration?

Translatewiki.net is not suitable for on-demand fetching of translations through an API: our setup is not highly available, nor designed with those kinds of performance considerations in mind.

I certainly would not want to put undue stress on the TWN servers, but assuming we put reasonable caching in place on the Toolhub side, I wonder how much the product would have to grow before performance actually became a concrete issue. It is hard to make educated guesses about page views per language or even total page views, but Toolhub is most definitely going to have access patterns more like a tool than a project wiki. By this I mean that page views are going to be on the order of N thousand per day rather than N thousand per second. I think it would be reasonable to cache lookups for up to 24 hours on the Toolhub side. I don't think it makes sense for the stability of either system to think about real-time lookup of all messages.
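
In code, the caching I have in mind is roughly this, using Django's cache framework (fetch_translation_from_twn is a placeholder for whatever API call we settle on):

```
# Cache TWN lookups for 24 hours on the Toolhub side.
from django.core.cache import cache

def get_translation(message_key, language):
    return cache.get_or_set(
        f"twn:{message_key}:{language}",
        lambda: fetch_translation_from_twn(message_key, language),
        timeout=60 * 60 * 24,  # 24 hours
    )
```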

Is there a way to feed selected strings from the user-generated content to translatewiki.net and get translations back?

From my point of view, the path of least resistance is (and this is of course up for discussion):

nod. I read this list as a better-described version of what I was thinking might work when I wrote T259838#6367427.

Related question: does TWN support setting the origin language per message or only per message group?

Per message group is fully supported. All new code is written to support source language per message, but the system as a whole does not support it.

This is something I have been thinking more and more about. I guess one thing we could do in the case of messages that start in a source language other than English is flag them on the Toolhub side and require an English translation to happen in Toolhub, which could then be exported to TWN. Otherwise it sounds like we would need to either have a new message group on the TWN side for each origin language (which sounds like an operational pain on both sides) or work with others on getting the system as a whole to the point where it understands per-message source languages (which sounds useful and also like non-trivial effort).
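
As a strawman, the Toolhub-side flag could look something like this (entirely hypothetical model and logic):

```
from django.db import models

class Annotation(models.Model):
    """Freeform user-added text plus the language it was written in."""
    text = models.TextField()
    origin_language = models.CharField(max_length=32, default="en")
    # An English rendering is required before the string goes to TWN.
    text_en = models.TextField(blank=True)

    def ready_for_twn_export(self):
        return self.origin_language == "en" or bool(self.text_en)
```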

Is there a robust Django translation app that can be reused to provide in-app translations? Review https://github.com/bbmokhtari/django-translations and look for other competing solutions.

  • django-translations (https://github.com/bbmokhtari/django-translations) is ultimately one table holding (model, id, field, language, text) tuples and some fancy Django ORM glue to query that table for translated strings at model lookup time. I have some concerns about performance here, but they may be unfounded. We are not going to try to build a local translation community within the Toolhub app itself, but django-translations might end up being useful in integrating TWN provided translations with the application.
  • django-modeltranslation (https://github.com/deschler/django-modeltranslation) stores translated strings by adding a new database column for each (field, language) pair. With a potential of 210 languages for each translation, this feels like it would get out of hand really quickly on the database side. It would be more reasonable for use in a project that is only translating into a small number of languages.
  • django-parler (https://github.com/django-parler/django-parler) stores translations by making a separate translation model for each translatable model. These translation models contain the translatable fields plus a language code. This is a bit like the model used by django-translations, but with a separate lookup table for each model.
  • django-vinaigrette (https://github.com/ecometrica/django-vinaigrette) connects Django models to the GNU gettext system used by default in Django. It accomplishes this by extending Django's makemessages command to dump model data into the generated po files and extending the models registered for translation at runtime to lookup translated strings using gettext or pgettext as appropriate. This library's own documentation says "Vinaigrette is designed for database content that is: ... edited by site administrators, not users". The method of hooking translation lookups into the models that the library uses is interesting though and would be something to think about borrowing from if we end up rolling our own solution.
  • django-nece (https://github.com/tatterdemalion/django-nece) stores translations in a JSONField attached to the translatable model. This JSONField ends up holding a dict of dicts of translated strings where the top level keys are language codes and the second level keys are field names (see the sketch after this list). This ends up working a lot like django-modeltranslation's column per (field, language) pair, but with the extra data stuffed in a blob field that requires database-specific support to query. This one is out if for no other reason than that it is Postgres-specific.
  • django-i18nfield (https://github.com/raphaelm/django-i18nfield) is very similar to django-nece, but it uses a TextField for storage. This does not allow searching on translated strings via Django's ORM. It does, however, remove the Postgres runtime restriction of nece.
  • I found several other ancient, unmaintained Django projects related to dynamic translations: django-multilingual, django-model-i18n, transdb, django-multilingual-model, django-transmeta, django-hvad, django-multilingual-ng.
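
For reference, the per-row blob shape described for django-nece (and, with a TextField instead of a JSONField, django-i18nfield) is roughly:

```
# Language code -> field name -> translated text (illustrative values).
translations = {
    "de": {"title": "Beispielwerkzeug", "description": "..."},
    "fr": {"title": "Outil d'exemple", "description": "..."},
}
```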

(Note to self, this feels like a missing bit of metadata in the toolinfo 1.1.1 spec. There is a "supported languages" collection element, but not a "language" code for the json file itself, which means we would have to heuristically guess the input language, which is gross.)

After closer review, the v1.1.1 schema does include this. The "toolinfo_language" property defaults to "en" but can be provided explicitly to document the natural language used in the toolinfo record's freeform text values.
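
So a record whose freeform text is written in, say, Finnish can declare that explicitly (illustrative values only):

```
{
  "name": "esimerkki",
  "toolinfo_language": "fi",
  "title": "Esimerkkityökalu",
  "description": "Pidempi vapaamuotoinen kuvaus työkalusta."
}
```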

bd808 claimed this task.

Preliminary research recorded at https://meta.wikimedia.org/wiki/Toolhub/Decision_record#Translations_for_dynamic_content.

Thank you @siebrand and @Nikerabbit for the input you have provided so far. Further investigation will happen in T263303: [Tech spike] Action API integration with TWN for dynamic content translations before we ultimately "answer" the core question here.