⚓ T221211 Parameters matching on Templates: ML Exploration

diego created this task.Apr 17 2019, 9:43 AM

diego moved this task from Backlog to In Progress on the Research board.Apr 22 2019, 4:27 PM

diego updated the task description.Apr 29 2019, 3:55 PM

diego updated the task description.

diego triaged this task as High priority.May 16 2019, 4:21 PM

Adding Language-Team tags for visibility in our Phabricator boards.

diego updated the task description.May 20 2019, 9:57 AM

diego updated the task description.

You can find the results of the experiments here.

In summary:

TemplateData is incomplete: for example, among the top 50 most-used templates in Spanish, 46% have no TemplateData information. So relying on this metadata is not a good option.
The parameter name-based solution proves to be strong and simple enough. See the example results for Spanish-to-English alignments.

We have agreed with the Language-Team to use the latter strategy and run it for the 15 most common language pairs in the CX tool:

en -> es (36543 translations)
en -> fr (29655 translations)
es -> ca (19600 translations)
en -> ar (15772 translations)
ru -> uk (15352 translations)
en -> pt (13676 translations)
en -> vi (10620 translations)
en -> zh (10087 translations)
en -> ru (9557 translations)
en -> he (9512 translations)
en -> it (9424 translations)
en -> ta (9259 translations)
en -> id (9008 translations)
en -> fa (8811 translations)
en -> ca (8512 translations)

I'm currently creating the alignment vectors for those pairs.

Pginer-WMF mentioned this in T224234: Research support for cross-wiki content propagation.May 23 2019, 4:05 PM

dr0ptp4kt subscribed.May 23 2019, 4:32 PM

I was wondering, how well does the parameters name-based approach apply to the set of TemplateData-backed templates themselves?

Sorry, I put the wrong link to the experiments in the previous comment; it is now updated.

In T221211#5208601, @dr0ptp4kt wrote:

I was wondering, how well does the parameters name-based approach apply to the set of TemplateData-backed templates themselves?

About this: if you use just the parameter names declared in TemplateData, the results would be almost the same, perhaps a bit better, because there is less noise (typos or errors in parameter names). So, in general, I would say that precision might be better, but you would cover fewer templates.

Maybe the right solution is to use the TemplateData-backed templates whenever they exist, and the name-based solution for the rest.
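As a rough sketch of that fallback rule (not code from the repository): prefer the parameter names declared in TemplateData when a template has them, and otherwise fall back to the names observed in actual template usage. The function name, the dictionary shapes, and the template names in the usage note are all hypothetical.

```python
from typing import Dict, List, Tuple


def parameter_names(
    template: str,
    templatedata_params: Dict[str, List[str]],
    observed_params: Dict[str, List[str]],
) -> Tuple[List[str], str]:
    """Return the parameter names to align for a template, plus their source.

    Assumption: both inputs map a template title to a list of parameter names.
    """
    declared = templatedata_params.get(template)
    if declared:
        # TemplateData exists: use the curated names (less noise from typos).
        return list(declared), "templatedata"
    # No TemplateData: fall back to names seen in real usage.
    return list(observed_params.get(template, [])), "observed"
```

For example, with `templatedata_params = {"Infobox person": ["name", "birth_date"]}`, a call for "Infobox person" would use the declared names, while a template without TemplateData would use whatever names were observed.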

Pginer-WMF moved this task from Needs Triage to Upstream/Other teams on the ContentTranslation board.May 24 2019, 12:36 PM

When we discussed labelled data about templates, we focused mainly on TemplateData, which is often incomplete. Another source of multilingual metadata is Wikidata labels. For example, the author parameter for a template in French ("auteur") may be mapped to the Wikidata P50 (author) property, since that property has a French label. However, the naming schema used for the templates may not match the Wikidata properties or how those are translated. So I don't know how useful this can be as an additional source of information to improve the magic of aligning words in a multi-dimensional space.

@JKatzWMF was pointing to the role of Wikidata as a way to help with template mapping, and although I think the comment was mainly about editors providing such metadata (T221534#5221573), I thought it was also worth mentioning in relation to our automatic approach.

Hi,
I have created and uploaded the full experiments and aligned parameters for these languages:

["es", "en", "fr", "ar", "ru", "uk", "pt", "vi", "zh", "ru", "he", "it", "ta", "id", "fa", "ca"]

I chose those languages based on @Pginer-WMF's request for the top 15 language pairs in the translation tool. I've included those 15 pairs, plus all the combinations between the languages on that list. I'm not sure whether the other alignments will be useful, but it was trivial to add them.
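To make "all the combinations" concrete, here is a minimal sketch that expands the language list above into ordered source/target pairs. It is an illustration, not the repository's script: it assumes each direction is generated separately, and it drops the repeated "ru" entry before pairing.

```python
from itertools import permutations
from typing import List, Tuple

# Language list as posted above ("ru" appears twice).
LANGS = ["es", "en", "fr", "ar", "ru", "uk", "pt", "vi", "zh", "ru",
         "he", "it", "ta", "id", "fa", "ca"]


def unique_in_order(items: List[str]) -> List[str]:
    """Drop duplicates while keeping first-seen order."""
    return list(dict.fromkeys(items))


def all_pairs(langs: List[str]) -> List[Tuple[str, str]]:
    """Every ordered (source, target) pair of distinct languages."""
    return list(permutations(unique_in_order(langs), 2))


pairs = all_pairs(LANGS)
print(len(unique_in_order(LANGS)), "languages,", len(pairs), "ordered pairs")
```

With 15 distinct languages this yields 15 × 14 = 210 ordered pairs, which include the 15 pairs requested for CX (for example `("en", "es")` and `("ru", "uk")`).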

You can find all the alignments here.

All the code and details about how to reproduce them are in this repository.

To reproduce these alignments or create new ones, follow the instructions I have added to the repository. Just keep in mind that:

You need around 10GB of disk space per language included. This is the space needed for the fasttext models.
You also need enough RAM, considering that you need to keep two models in memory at the same time.
I ran all the experiments on stat1007, and I still have the models there. Currently, I'm using 128G to store the full data. I can move this to a more suitable host if needed. Maybe @elukey can suggest the best place to keep and work with this data. I don't think we need to run this process often, but I'll certainly keep the data and models in a safe place. We may also want to add more languages in the near future (@Pginer-WMF @santhosh ?)
There is a list of hyperparameters that I have assigned arbitrarily and hard-coded in the scripts. With some ground-truth data (human-annotated alignments) we could learn them, improving the quality of the alignments. That list is in the repository's Readme.md.

Pginer-WMF mentioned this in T224721: Integrate template parameter alignments in Content Translation to improve automatic template support.May 31 2019, 9:54 AM

Thanks @diego. This is great!

I created a follow-up ticket (T224721) for the Language team to integrate the alignments in Content Translation to improve the way templates are handled by the tool. This will allow us to see the effects in practice.

@KartikMistry may be interested in taking a look at this, since we may want to automate this process in order to run it twice a year to update the alignments and expand to new languages.

@diego I was processing the JSON files and trying to understand the values.

In the above mapping, I would expect params that are literally the same to have a score (annotated as d in the JSON) that is either high, or 0 if d means distance in vector space.

Another one:

Can you please help me understand the values here?

If I understood correctly, you are asking why two identical strings do not have distance = 0. This is because there is no string-matching mechanism in this approach. Every language is trained separately, and the models are then aligned using some words or sentences that we know are equivalent. This is not necessarily bad, because you will find some words that are written exactly the same but mean different things in each language. However, in the examples that you show, this is just part of the noise introduced by the model.
We could add a second step, for example using Levenshtein distance, that would take advantage of string similarity, but it would work only for languages that share the same script. If we had some training data, we could learn how to mix these two approaches and how useful the latter would be.
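A minimal sketch of what such a second step could look like, assuming a simple linear blend of the embedding distance with a length-normalized Levenshtein distance. The `blended_distance` function and its `weight` parameter are hypothetical; as noted above, a sensible weight would have to be learned from annotated data, and it should be set to 0 for pairs in different scripts.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def normalized_edit_distance(a: str, b: str) -> float:
    """Edit distance scaled to [0, 1] by the longer string's length."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def blended_distance(embedding_d: float, source: str, target: str,
                     weight: float = 0.5) -> float:
    """Mix the embedding distance with string distance (weight is a guess)."""
    return (1 - weight) * embedding_d + weight * normalized_edit_distance(source, target)
```

With this blend, two identical parameter names get a string distance of 0, so their combined distance is pulled towards 0 even when the embedding distance is noisy.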

I will clarify my question. We wanted to use a threshold score. Any mapping below that score won't be used. When referimento and referencia have a d value of 0.40, and nom and nombre have 0.29, what do these numbers mean? Can they be interpreted as quality of match between 0 and 1? Which is the best match and which is a bad match based on these values? Are they really an indicator of good or bad matching, or just an internal system value that is not useful?

Oh, I see! That number is a distance, so 0 would be a perfect match and 1 no match at all. I've already applied an upper bound of 0.45, so you will only see values lower than that.
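Since d is a distance, a consumer that wants a stricter cutoff keeps only the entries at or below its own threshold. The sketch below assumes a hypothetical record shape with "source", "target" and "d" keys; the real JSON layout may differ, and the two sample rows only reuse the values quoted in this discussion.

```python
from typing import Dict, List, Union

Candidate = Dict[str, Union[str, float]]

# Upper bound already applied when the alignments were generated.
GENERATION_UPPER_BOUND = 0.45


def filter_alignments(candidates: List[Candidate],
                      max_distance: float = GENERATION_UPPER_BOUND) -> List[Candidate]:
    """Keep candidates with d <= max_distance, best (smallest d) first."""
    kept = [c for c in candidates if c["d"] <= max_distance]
    return sorted(kept, key=lambda c: c["d"])


# Illustrative rows only; the field names are an assumption.
sample = [
    {"source": "referimento", "target": "referencia", "d": 0.40},
    {"source": "nom", "target": "nombre", "d": 0.29},
]

strict = filter_alignments(sample, max_distance=0.30)
```

Here `strict` would contain only the nom/nombre row, while the default bound keeps both rows, ordered from closest to farthest.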

Pginer-WMF mentioned this in T227183: Generate template parameter alignments for the selected small wikis.Jul 3 2019, 10:43 AM

Pginer-WMF edited projects, added Language-Team (Language-2019-July-September); removed Language-Team (Language-2019-April-June).Jul 9 2019, 1:47 PM

@Pginer-WMF, I'm going to mark this task as resolved from my side, and we can continue the follow-up somewhere else, ok?

diego closed this task as Resolved.Jul 10 2019, 11:24 AM

In T221211#5320582, @diego wrote:

@Pginer-WMF, I'm going to mark this task as resolved from my side, and we can continue the follow-up somewhere else, ok?

Yes. Thanks for the great work, @diego.
Regarding follow-up tickets:

T224721: Integrate template parameter alignments in Content Translation to improve automatic template support. We are already working on this, and expect to have the mappings live in Content Translation soon.
T227183: Generate template parameter alignments for the selected small wikis. We may need your input if we have questions when dealing with the scripts to generate new mappings.

@diego @Pginer-WMF thanks for the earlier replies on this task. I had some annual planning and vacation stuff that drew me away for a while, but just wanted to say thanks!

Pginer-WMF updated the task description.Jul 12 2019, 11:17 AM

• Petar.petkovic moved this task from Backlog to Done on the Language-Team (Language-2019-July-September) board.Jul 12 2019, 3:32 PM

diego mentioned this in T230348: What are your experiences with templates?.Aug 14 2019, 8:31 AM

awight mentioned this in T221534: Define template parameter mapping between languages as a wiki page.Aug 27 2019, 8:13 AM

Pginer-WMF mentioned this in T248096: Provide two columns template/references editing.Mar 20 2020, 1:13 PM

Pginer-WMF mentioned this in T286473: Generate template parameter alignments for additional wikis.Jul 12 2021, 12:52 PM

Pginer-WMF mentioned this in T290847: Generate template parameter alignments for languages of interest to Section Translation.Sep 13 2021, 9:55 AM

Pginer-WMF mentioned this in T304828: Enable Section Translation in 13 wikis where Content Translation is already available as default.Mar 28 2022, 10:14 AM

Pginer-WMF mentioned this in T298239: Enable Content and Section Translation for Korean Wikipedia.Mar 28 2022, 10:59 AM

Pginer-WMF mentioned this in T304834: Enable Content and Section Translation for Zulu Wikipedia.Mar 28 2022, 12:30 PM

Pginer-WMF mentioned this in T296475: Enable Content and Section Translation for Persian Wikipedia.Mar 28 2022, 12:35 PM

Pginer-WMF mentioned this in T304853: Enable Content and Section Translation for Turkish Wikipedia.Mar 28 2022, 3:52 PM