Add Wikidata Lexeme Forms to translatewiki.net
Closed, Resolved (Public)

Description

Project information

Name: Wikidata Lexeme Forms
Homepage: https://lexeme-forms.toolforge.org/
Project link: https://www.wikidata.org/wiki/Wikidata:Wikidata_Lexeme_Forms, maybe? I’m not sure what the difference between Homepage and Project link is supposed to be, to be honest.
Code repository:

OS License: AGPLv3+ for source code, CC BY-SA 3.0 for translations
Issue Tracker: Mixture of different channels, so far; https://github.com/lucaswerkmeister/tool-lexeme-forms/issues might be the best one.
Project contact: User:Lucas Werkmeister

Logo:

  • Without text: None so far.
  • With text: Ditto.

Project description:
Wikidata Lexeme Forms is a tool to create and edit the Forms of Wikidata Lexemes, based on templates that define what different kinds of Lexemes look like – for instance, there are forms for English nouns, German verbs, Bengali nouns (animate), etc. The templates are maintained on wiki pages (see Wikidata:Wikidata Lexeme Forms#Language support), and currently, translations are maintained on the same pages. Both the templates and the translations are manually synchronized from the wiki pages to the tool’s source code (Python) by the developers (mainly me).

The subject of moving those translations to translatewiki.net has been brought up before (cc @Amire80), but I don’t even remember where that was, so I figure it makes sense to kick it off again here. From my point of view as the developer of the tool, the main expected benefit would be to make it easier to add new messages. Currently, if I add a new feature (or some other improvement, e.g. nicer error pages) that introduces a new message, this will require me to add the message to all the language pages on-wiki, so that it can be translated; I’m starting to notice that the prospect of having to update dozens of wiki pages discourages me from such new developments, which doesn’t seem like a healthy incentive. I hope that moving the translations to translatewiki.net will remove that barrier. From a translator’s point of view, I’m told that editing translatewiki.net is more convenient than the current workflow.

The main problem I see is the message syntax. The tool is written in Python, so the standard MediaWiki formatting ({{PLURAL:}}, {{GENDER:}}) is not available; I wrote some Python formatters that recognize a more Pythonic syntax, and manually translate the MediaWiki syntax to the Python one when moving the translations to the source code, for instance:

$1, $2, {{PLURAL:$3|0=no statements|one statement|$3 statements}}

{form_link}, {grammatical_feature_labels!l}, {statements!p:0=no statements:one=one statement:other={statements} statements}

Ci dispiace, ma non sei {{GENDER:$1|autorizzato|autorizzata|autorizzato/a}} a usare il caricamento di massa.

Ci dispiace, ma non sei {user!g:m=autorizzato:f=autorizzata:n=autorizzato/a} a usare il caricamento di massa.

The first example demonstrates plural (!p, {{PLURAL:}}) and list (!l, no MediaWiki equivalent) handling, whereas the second example demonstrates gender (!g, {{GENDER:}}) handling. All Python examples also use variable names (form_link etc.) rather than numeric indices ($1, $2 etc.), though it’s probably possible to migrate the Python code to positional arguments if adding the variable names to the translatewiki.net export proves unfeasible.
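To make the mapping between the two syntaxes concrete, here is a minimal sketch of such a conversion. It is not the tool's actual code: the function name `mediawiki_to_python` and the parameters `names`, `list_params`, and `plural_categories` are made up for illustration. It assumes that magic words are not nested, that the message contains no literal braces, and that the caller knows which parameters are lists (the MediaWiki syntax has no list marker) and which plural categories the language uses.

```python
import re

# {{PLURAL:$n|...}} or {{GENDER:$n|...}}; nested magic words are not handled.
MAGIC = re.compile(r"\{\{(PLURAL|GENDER):\$(\d+)\|([^{}]*)\}\}")
PARAM = re.compile(r"\$(\d+)")
GENDER_KEYS = ("m", "f", "n")  # MediaWiki order: male, female, neutral


def mediawiki_to_python(message, names, list_params=(),
                        plural_categories=("one", "other")):
    """Convert $n / PLURAL / GENDER syntax to the Python-style syntax.

    names: variable names for $1, $2, ... (hypothetical, supplied by caller).
    list_params: names that should get the !l (list) conversion.
    plural_categories: plural keys for positional forms (assumed per language).
    """
    def param(match):
        name = names[int(match.group(1)) - 1]
        return "{" + name + ("!l" if name in list_params else "") + "}"

    def magic(match):
        kind, index, body = match.group(1), int(match.group(2)), match.group(3)
        keys = iter(GENDER_KEYS if kind == "GENDER" else plural_categories)
        parts = []
        for form in body.split("|"):
            if kind == "PLURAL" and re.match(r"\d+=", form):
                key, text = form.split("=", 1)  # explicit number, e.g. 0=...
            else:
                key = next(keys, None)
                if key is None:
                    raise ValueError("too many forms in " + match.group(0))
                text = form
            parts.append(key + "=" + PARAM.sub(param, text))
        flag = "g" if kind == "GENDER" else "p"
        return "{" + names[index - 1] + "!" + flag + ":" + ":".join(parts) + "}"

    # Convert magic words first, then the remaining bare $n parameters.
    return PARAM.sub(param, MAGIC.sub(magic, message))
```

Called with the first MediaWiki example, the names `form_link`, `grammatical_feature_labels`, `statements`, and `list_params={"grammatical_feature_labels"}`, this sketch reproduces the Python-style message shown above; the Italian example works the same way with `names=["user"]`.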

Is it possible to make translatewiki.net translate the MediaWiki syntax to the Python syntax on export? If not, is it acceptable to require translators to write the Python syntax directly when translating?

Another peculiarity of this tool is that the translations are directly tied to the template: there’s no way to select the user interface language – rather, the interface language is the same as the language of the template you’re currently using. (The index page, where the template is selected, has no interface messages at all; the nav bar at the top is always in English.) For this reason, I think it would make sense to limit translations into new languages on translatewiki.net: there’s no point in people spending time on writing, say, Japanese translations of all messages, if nobody is going to see them because the tool has no Japanese templates yet.

NOTE: Section below will be filled by twn staff

Project setup checklist

Project configuration (for translation admins)

Namespace: NS_WIKIMEDIA
Prefix: wikidata-lexeme-forms-
Validators:

  1. MediaWikiPlural
  2. MediaWikiParameter
  3. HtmlTagInsertablesSuggester (Insertables)

Concerns

Event Timeline

Thanks for the very detailed request. A few quick comments and questions:

Is it possible to make translatewiki.net translate the MediaWiki syntax to the Python syntax on export?

Both parsing the files and doing structural transformations for message contents (and keys) are done by our file format support classes, and they are accompanied by appropriate insertables and validators.

The question here is rather who is going to write that code (in PHP) and maintain it, and how complex it is, and is the format unique to this project or is it reusable for others as well. A specification of some sort for the syntax would be a good starting point.


If not, is it acceptable to require translators to write the Python syntax directly when translating?

It looks rather complicated, and slightly different from all the other formats. This increases the chances that translators will make mistakes, though validators may help.


I see translations are currently here: https://github.com/lucaswerkmeister/tool-lexeme-forms/blob/main/translations.py. Would it be possible to have them in one file per language? Also something like JSON would be easier to parse.


I think it would make sense to limit translations into new languages on translatewiki.net

We can set the list of allowed languages in the message group configuration. The question would be how to keep it up to date. Would you be willing to submit patches to add languages?

I think it would make sense to limit translations into new languages on translatewiki.net

We can set the list of allowed languages in the message group configuration. The question would be how to keep it up to date. Would you be willing to submit patches to add languages?

I was confused initially about this, too, but now that I've carefully read what Lucas wrote, I think that what he's trying to say is that translating into new languages shouldn't be allowed until the software explicitly supports them. This makes sense. I'm pretty sure that there is such a setting in the YAML group definition files. Existing translations will be imported into translatewiki, and translating into those languages will be allowed. When the tool starts supporting Japanese, the YAML will be updated to allow Japanese. Does this make sense?

I see translations are currently here: https://github.com/lucaswerkmeister/tool-lexeme-forms/blob/main/translations.py. Would it be possible to have them in one file per language? Also something like JSON would be easier to parse.

Sure, not a problem.

The question here is rather who is going to write that code (in PHP) and maintain it, and how complex it is, and is the format unique to this project or is it reusable for others as well.

I looked a bit at the PHP code, and maybe we don’t need custom code for this tool at all? Apparently the JsonFFS employs a class called ArrayFlattener, which has the ability to parse CLDR plural syntax in both directions; if translatewiki.net exports this parsed CLDR syntax into the JSON file, I can probably turn that JSON back into Python-style messages on my end. We would need a corresponding option to parse the {{GENDER:}} magic word into a structured form as well, but that might be useful for other tools too?

A specification of some sort for the syntax would be a good starting point.

The formatters have some documentation in formatters.py; apart from that, the messages are plain text – there has been no need for any kind of markup (wikitext, markdown, …) so far.

The formatters have some documentation in formatters.py; apart from that, the messages are plain text – there has been no need for any kind of markup (wikitext, markdown, …) so far.

I just remembered that that’s not true – one message contains an <abbr> element. (In other words, after applying the formatters, the result is taken to be HTML.)


I just pushed an i18n branch to Diffusion and GitHub which saves all the messages in JSON files in an i18n/ directory, in MediaWiki format; the tool then converts the syntax back to Python syntax when loading the files. This should hopefully mean that no special handling on translatewiki.net’s side will be necessary.

Any chance you can take a look at that i18n branch and see if the message files look alright?

Is there anything I can do to help move this along? As far as I can tell, with the work in the i18n branch, no tool-specific changes to translatewiki.net should be needed anymore – the JSON files should use the same format as everywhere else.

@LucasWerkmeister - Apologies for the delay, I will take a look at this today or tomorrow.

abi_ triaged this task as Medium priority.
abi_ updated the task description.

@LucasWerkmeister I've added some concerns, please let me know if they make sense.

Thanks!

  1. I forgot about qqq.json, I added some now. (I also noticed the i18n files were using underscores in the message keys, when hyphens would probably be better for you. Changed to hyphens now.)
  2. Is this a requirement? I would prefer if translatewiki opened pull requests against the repository, rather than pushing directly. I need to manually do something to deploy the changes either way, and this way, test failures due to translation issues won’t immediately break the build on the main branch.

@LucasWerkmeister

  1. Thanks, this is good to start with. Hyphens are preferred but translatewiki handles underscores as well.
  2. We would still need push access to the repo. The bot can push to the twn branch, and create a PR to the main branch. The main branch could be marked as protected.

Change 667760 had a related patch set uploaded (by Abijeet Patro; owner: Abijeet Patro):
[translatewiki@master] Add Wikidata Lexeme Forms to translatewiki.net

https://gerrit.wikimedia.org/r/667760

  1. Alright, that makes sense. (I’m more used to opening pull requests from forks, but I guess that wouldn’t scale well for you – I didn’t consider that.) Access should be granted now.

We've received the invite. Niklas has permissions to accept invites, but he is currently on leave. Let's wait till 8th March, 2021.

@LucasWerkmeister - We are a go from our side once you merge the i18n branch to the main branch.

We will read translations from the main branch, but push them to the twn branch.

Mentioned in SAL (#wikimedia-cloud) [2021-03-08T11:43:47Z] <wm-bot> <lucaswerkmeister> deployed ea7cd3ac71 (i18n from translatewiki.net – T272243)

Done, thank you! I’ll update the wiki pages on Wikidata now.

I just noticed that https://translatewiki.net/w/i.php?title=Special:Translate&group=wikidata-lexeme-forms&language=ku says:

This code is for compatibility purposes only. Localise in 'ku-latn'

Should I make the tool use ku-latn instead of ku? (Similarly for tg/tg-cyrl, though I haven’t imported that language’s translations at all yet.)

Side note: the project page says “license: AGPL 3 or later”, but so far I’ve been treating the translations and templates of the tool as CC BY-SA 3.0 (since they’re defined on-wiki), and AGPL3 is only for code. Is that field meant to document the license of the translations or of the rest of the software?

I apologize if it was not clear enough; but the language code mapping is done on translatewiki.net. I'll be updating the patch to take care of it.

Language code mapping that we will be using:

aeb-latn: aeb
bbc-latn: bbc
gan-hant: gan
gom-latn: gom
hif-latn: hif
ike-cans: iu
kbd-cyrl: kbd
kk-cyrl: kk
ks-arab: ks
ku-latn: ku
ruq-latn: ruq
sr-ec: sr
tg-cyrl: tg
tt-cyrl: tt
ug-arab: ug
zh-hans: zh

That means that aeb-latn from translatewiki.net will be exported as aeb, and when reading files from the repository, translatewiki.net will look for aeb and map it to aeb-latn.
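The real mapping lives in translatewiki.net's configuration, but the two-way lookup it describes can be sketched in a few lines. The names `TWN_TO_REPO`, `export_code`, and `import_code` are invented for this illustration, and it assumes that codes not in the table pass through unchanged.

```python
# translatewiki.net code -> code used in the repository's file names
TWN_TO_REPO = {
    "aeb-latn": "aeb", "bbc-latn": "bbc", "gan-hant": "gan", "gom-latn": "gom",
    "hif-latn": "hif", "ike-cans": "iu", "kbd-cyrl": "kbd", "kk-cyrl": "kk",
    "ks-arab": "ks", "ku-latn": "ku", "ruq-latn": "ruq", "sr-ec": "sr",
    "tg-cyrl": "tg", "tt-cyrl": "tt", "ug-arab": "ug", "zh-hans": "zh",
}
# Reverse direction, used when reading files from the repository.
REPO_TO_TWN = {repo: twn for twn, repo in TWN_TO_REPO.items()}


def export_code(twn_code):
    """Code to write to the repository for a translatewiki.net language."""
    return TWN_TO_REPO.get(twn_code, twn_code)  # unmapped codes are unchanged


def import_code(repo_code):
    """translatewiki.net language for a code found in the repository."""
    return REPO_TO_TWN.get(repo_code, repo_code)
```

So the tool can keep using ku and tg: under this sketch `export_code("ku-latn")` gives `ku`, and `import_code("ku")` gives `ku-latn`.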

Alright, thanks! I just saw the “Pywikibot” in the URL / path and didn’t notice that it was translatewiki.net config ^^

I’ve pushed a revert of that commit; once Diffusion is done processing it, it’ll appear at R2362:b7b55e1b3372: Revert "Fix ku and tg language codes for translatewiki.net".

Side note: the project page says “license: AGPL 3 or later”, but so far I’ve been treating the translations and templates of the tool as CC BY-SA 3.0 (since they’re defined on-wiki), and AGPL3 is only for code. Is that field meant to document the license of the translations or of the rest of the software?

The common assumption is that translations are dual licensed under CC BY and the project's overall license(s). This is mentioned in our about page:

Translations by translators are licensed CC BY 3.0, and derivative works may also be licensed under the licenses of the respective Free and Open Source projects the translations have been or will be added to.

In this case the license for translations is different from the code, but given they are already under CC BY, I think there is no copyright issue. If you want to keep it this way, we can amend the project page to list licenses for code and translations separately to avoid confusion.

Thanks, I didn’t know that. In the case of the existing translations (that will now be imported into translatewiki.net), they were contributed on Wikidata under CC BY-SA 3.0 exclusively, and according to Creative Commons’ Compatible Licenses page, that license is not one-way compatible with any other license (unlike CC BY-SA 4.0, which allows licensing under GPLv3… but not AGPLv3, nor GPLv3 or later? it’s confusing). So if I understand this correctly, while future contributions on translatewiki.net might be licensed under AGPLv3 (thanks to the note in the about page), the whole of the translations will still have to remain CC BY-SA 3.0 only, unless we ask the authors of the pre-translatewiki.net translations to relicense, or discard the old translations. Amending the project page sounds like the best option to me.

I've mentioned the project and translation licenses separately on the project page.

Change 667760 merged by jenkins-bot:
[translatewiki@master] Add Wikidata Lexeme Forms to translatewiki.net

https://gerrit.wikimedia.org/r/667760

The project is now available for translation at: https://translatewiki.net/w/i.php?title=Special:Translate&group=wikidata-lexeme-forms

Translations will be exported out from translatewiki.net on Thursday, 11th March, 2021.

Excellent, thanks a lot! I look forward to the first export :)

Hm, I tried pushing some extra commits to the twn branch (to fix the failing CI), but they’re not showing up in the pull request. Any ideas why? (I can see the commits if I look at the twn branch outside the pull request.)

Ah, now it’s working. Maybe GitHub had a hiccup (there’s an incident with potentially matching timestamps, I’m too lazy to double-check).

If you try it in the translation editor you can see that GENDER is not an issue. The issue is that the variable $1 is not known. In MediaWiki this is usually handled by having a no-op GENDER in the definition. We can also disable the check for variables (unknown & missing) for the group or per message.

You mean like this?

{{GENDER:$1|You}} are not allowed to use bulk mode. Sorry.
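A simplified model of the variable check may help show why the no-op GENDER matters. This is only a sketch of the idea, not translatewiki.net's actual validator; the function name `check_parameters` is made up, and the earlier wording of the source message (without GENDER) is assumed.

```python
import re

PARAM = re.compile(r"\$\d+")


def check_parameters(definition, translation):
    """Compare the $n parameters of a source message and a translation."""
    expected = set(PARAM.findall(definition))
    used = set(PARAM.findall(translation))
    return {
        "unknown": sorted(used - expected),  # in the translation only
        "missing": sorted(expected - used),  # in the definition only
    }


# Assumed earlier definition, without any reference to $1.
old_definition = "You are not allowed to use bulk mode. Sorry."
new_definition = "{{GENDER:$1|You}} are not allowed to use bulk mode. Sorry."
italian = ("Ci dispiace, ma non sei {{GENDER:$1|autorizzato|autorizzata|"
           "autorizzato/a}} a usare il caricamento di massa.")

print(check_parameters(old_definition, italian))
print(check_parameters(new_definition, italian))
```

Against the old definition, the Italian translation's $1 is reported as unknown; once the definition mentions $1 through the no-op GENDER, it is accepted. The flip side of such a check is that a translation which does not use $1 at all is then reported as missing the parameter.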

Alright, thanks. I updated the source message and will check how the translation editor behaves tomorrow, and if that looks alright, then I think we can close this issue.

Well, it looks like people are adding unnecessary {{GENDER}} to messages now, presumably because if they don’t the tool warns about the unused $1 parameter:

Screenshot from 2021-03-16 20-49-59.png (342×597 px, 37 KB)

Can you disable that check for this message? It’s expected that most languages won’t need $1.

Yes, I'll update the configuration to ignore this validation

Change 674022 had a related patch set uploaded (by Abijeet Patro; owner: Abijeet Patro):
[translatewiki@master] Remove parameter validation for Wikidata-lexeme-forms-bulk-not-allowed

https://gerrit.wikimedia.org/r/674022

Change 674022 merged by jenkins-bot:

[translatewiki@master] Remove parameter validation for Wikidata-lexeme-forms-bulk-not-allowed

https://gerrit.wikimedia.org/r/674022

I deployed the configuration change, but due to a bug it is not taking effect yet. The fix is going to be deployed this Wednesday.

Alright, I’ve removed those unused {{GENDER}}s from most translations now.