
Parameters matching on Templates: ML Exploration
Closed, Resolved, Public

Description

One challenge for the Content Translation tool is translating templates. While the corresponding template in the target language can be found through Wikidata, finding the mapping between template parameters is not trivial. Currently this task is done heuristically. Here we explore what ML and NLP can offer to improve it.

Problem: Given a template with named parameters in language X, find the corresponding parameters in language Y.

Methodology: Considering the lack of labeled data (ground truth to learn from), we scope this as an unsupervised 1-to-1 matching problem. In some cases the matching is actually 1-to-n or n-to-n (for example, a date can be split into day/month/year in one language and kept as a single variable in another), but for simplicity we focus on 1-to-1 matching.
Given this matching problem, we explore three alternatives to match parameters:

  1. Metadata-based solution: We use the metadata contained in TemplateData to create the matchings.
  2. Value-based solution: For each usage of a template, we check the values assigned to each parameter in languages X and Y, assuming that equivalent parameters point to the same values.
  3. Parameter name-based solution: We simply compare the parameter names in each language (e.g. Name in English will point to Nombre in Spanish).

In all three cases, we need to compare pieces of text in two languages. To do so, we build our solution on top of aligned word embeddings. Specifically, we use the fastText multilingual embeddings, but create our own Wikidata-based alignments.
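To make the 1-to-1 matching idea concrete, here is a minimal sketch that pairs parameter names by distance between their vectors. Everything in it is an assumption for illustration: the toy 3-dimensional vectors stand in for real aligned fastText embeddings, cosine distance and greedy pairing are one plausible choice rather than what the repository necessarily does, and the function names are hypothetical. The 0.45 cut-off mirrors the upper bound mentioned later in this thread.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def match_one_to_one(source_vecs, target_vecs, max_distance=0.45):
    """Greedily pair each source parameter with at most one target parameter.

    Candidate pairs are visited from closest to farthest; a pair is kept only
    if neither side has been used yet and its distance is within the bound.
    """
    candidates = sorted(
        (cosine_distance(sv, tv), s, t)
        for s, sv in source_vecs.items()
        for t, tv in target_vecs.items()
    )
    used_sources, used_targets, pairs = set(), set(), {}
    for distance, s, t in candidates:
        if distance > max_distance:
            break  # everything after this point is even farther away
        if s in used_sources or t in used_targets:
            continue
        pairs[s] = (t, round(distance, 3))
        used_sources.add(s)
        used_targets.add(t)
    return pairs


# Toy vectors (hypothetical); real ones would come from the aligned models.
en = {"name": [0.9, 0.1, 0.0], "birth_date": [0.1, 0.9, 0.1]}
es = {
    "nombre": [0.8, 0.2, 0.0],
    "fecha_nacimiento": [0.0, 0.9, 0.2],
    "imagen": [0.0, 0.1, 0.9],
}
matches = match_one_to_one(en, es)
print(matches)
```

In this toy case "imagen" stays unmatched, which is the intended behaviour when one template has parameters the other lacks.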

Evaluation: Given the lack of ground truth, we do an expert-based evaluation, giving the results to the Language-Team for review.

Code: Available on GitHub.

Related Objects

Mentioned Here
T227183: Generate template parameter alignments for the selected small wikis
T224721: Integrate template parameter alignments in Content Translation to improve automatic template support
T221534: Define template parameter mapping between languages as a wiki page
P50 wmf config inherit settings

Event Timeline

diego updated the task description.
diego triaged this task as High priority. May 16 2019, 4:21 PM
diego updated the task description.

You can find the results of the experiments here.

In summary:

  • TemplateData is incomplete: for example, among the top-50 templates in Spanish (the most used ones), 46% have no TemplateData information. So using this metadata is not a good option.
  • The parameter name-based solution proves to be strong and simple enough. Check the example results for Spanish to English alignments.

We have agreed with the Language-Team to use the latter strategy and run it for the top-15 most common language pairs in the CX tool:

en -> es (36543 translations)
en -> fr (29655 translations)
es -> ca (19600 translations)
en -> ar (15772 translations)
ru -> uk (15352 translations)
en -> pt (13676 translations)
en -> vi (10620 translations)
en -> zh (10087 translations)
en -> ru (9557 translations)
en -> he (9512 translations)
en -> it (9424 translations)
en -> ta (9259 translations)
en -> id (9008 translations)
en -> fa (8811 translations)
en -> ca (8512 translations)

I'm currently creating the alignment vectors for those pairs.

I was wondering, how well does the parameter name-based approach apply to the set of TemplateData-backed templates themselves?

Sorry, I put the wrong link to the experiments in the previous comment; it is now updated.

I was wondering, how well does the parameter name-based approach apply to the set of TemplateData-backed templates themselves?

About this: if you use just the parameter names, the results would be almost the same, maybe a bit better, because there is less noise (typos or errors in parameter names). So in general I would say that precision might be better, but you would cover fewer templates.

Maybe the right solution is to use the TemplateData-backed templates whenever they exist, and the other solution for the rest.
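A minimal sketch of that fallback idea, assuming both kinds of alignment have already been computed and stored per template. The dictionaries, template names, and the `align_template` helper are hypothetical, used only to illustrate the selection logic:

```python
def align_template(template, templatedata_alignments, name_based_alignments):
    """Prefer the TemplateData-backed alignment; fall back to the name-based one.

    Returns a (strategy, mapping) tuple so callers can tell which source was used.
    """
    if template in templatedata_alignments:
        return "templatedata", templatedata_alignments[template]
    return "name-based", name_based_alignments.get(template, {})


# Hypothetical precomputed alignments for two Spanish templates.
templatedata_alignments = {"Ficha de persona": {"nombre": "name"}}
name_based_alignments = {
    "Ficha de persona": {"nombre": "name", "imagen": "image"},
    "Cita web": {"título": "title", "autor": "author"},
}

with_metadata = align_template(
    "Ficha de persona", templatedata_alignments, name_based_alignments
)
without_metadata = align_template(
    "Cita web", templatedata_alignments, name_based_alignments
)
print(with_metadata)
print(without_metadata)
```

Note the trade-off visible in the toy data: the TemplateData-backed mapping may be more precise but can cover fewer parameters than the name-based one.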

When we discussed labelled data about templates, we mainly discussed TemplateData, which is often incomplete. Another source of multilingual metadata is Wikidata labels. For example, the author parameter for a template in French ("auteur") may be mapped to the Wikidata P50 (author) property, since that property has a French label. However, the naming schema used for the templates may not match the Wikidata properties or how those are translated. So I don't know how useful this can be as an additional source of information to improve the magic of aligning words in a multi-dimensional space.

@JKatzWMF was pointing to the role of Wikidata as a way to help with template mapping, and although I think the comment was mainly about editors providing such metadata (T221534#5221573), I thought it was also worth mentioning in relation to our automatic approach.

Hi,
I have created and uploaded the full experiments and aligned parameters for these languages:

["es", "en", "fr", "ar", "ru", "uk", "pt", "vi", "zh", "ru", "he", "it", "ta", "id", "fa", "ca"]

I chose those languages based on @Pginer-WMF's request for the top 15 language pairs in the translation tool. I've included those 15 pairs, plus all the combinations between the languages on that list. I'm not sure whether the other alignments will be useful, but it was trivial to add them.
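For a sense of scale, here is a small sketch that enumerates the combinations for the list above. It assumes that "all the combinations" means directed (source, target) pairs, which may not be how the alignments were actually generated, and it drops the repeated "ru" entry before pairing:

```python
from itertools import permutations

listed = ["es", "en", "fr", "ar", "ru", "uk", "pt", "vi", "zh", "ru",
          "he", "it", "ta", "id", "fa", "ca"]

# dict.fromkeys removes the duplicate "ru" while keeping the original order.
languages = list(dict.fromkeys(listed))

# Directed pairs (assumption): ("en", "es") and ("es", "en") count separately.
pairs = list(permutations(languages, 2))

print(len(languages), "languages,", len(pairs), "directed pairs")
# → 15 languages, 210 directed pairs
```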

You can find all the alignments here.

All the code and details about how to reproduce them are in this repository.

To reproduce these alignments or create new ones, I have added instructions to the repository. Just keep in mind that:

  • You need around 10GB of disk space per language included. This is the space needed for the fastText models.
  • You also need enough RAM, considering that two models have to be kept in memory at the same time.
  • I've done all the experiments on stat1007, and I still have the models there. Currently, I'm using 128G to store the full data. I can move this to a more appropriate host if needed. Maybe @elukey can suggest the best place to keep and work with this data. I don't think we need to run this process often, but I'll certainly keep the data and models in some safe place. Maybe we will also want to add more languages in the near future (@Pginer-WMF @santhosh ?)
  • There is a list of hyperparameters that I assigned arbitrarily and hard-coded in the scripts. With some ground-truth data (human-annotated alignments) we could learn them, improving the quality of the alignments. That list is in the repository's Readme.md.

Thanks @diego. This is great!

I created a follow-up ticket (T224721) for the Language team to integrate the alignments in Content Translation to improve the way templates are handled by the tool. This will allow us to see the effects in practice.

@KartikMistry may be interested in taking a look at this, since we may want to automate this process in order to run it twice a year to update the alignments and expand to new languages.

@diego I was processing the JSON files and trying to understand the values.

image.png (342×659 px, 57 KB)

In the above mapping, I would expect params that are literally the same to have a score (annotated as d in the JSON) that is either high, or 0 if d means distance in vector space.

Another one:

image.png (440×646 px, 71 KB)

Can you please help me understand the values here?

If I understood correctly, you are asking why two identical strings do not have distance = 0. This is because there is no string-matching mechanism in this approach. Every language is trained separately, and the models are then aligned using some words or sentences that we know are equivalent. This is not necessarily bad, because you will find some words that are written exactly the same but mean different things in each language. However, in the examples that you show, this is just part of the noise introduced by the model.
We could add a second step, for example using Levenshtein distance, that would take advantage of string similarity, but it would work only for languages that share the same script. If we had some training data, we could learn how to mix these two approaches and how useful the latter would be.
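As a rough sketch of what such a second step could look like: the code below computes a normalized Levenshtein distance and mixes it with the embedding distance, skipping the string part when the two names appear to use different scripts. The 0.5 weight is an arbitrary placeholder (as noted above, it would have to be learned from training data), the script check is a crude heuristic based on Unicode character names, and none of the function names come from the repository.

```python
import unicodedata


def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


def normalized_edit_distance(a, b):
    """Edit distance scaled to [0, 1] by the length of the longer string."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def script_of(text):
    """Rough script guess: first word of the Unicode name of the first letter."""
    for ch in text:
        if ch.isalpha():
            return unicodedata.name(ch, "UNKNOWN").split()[0]
    return "UNKNOWN"


def combined_distance(a, b, embedding_distance, weight=0.5):
    """Mix embedding and string distance; ignore strings across scripts."""
    if script_of(a) != script_of(b):
        return embedding_distance
    string_distance = normalized_edit_distance(a, b)
    return weight * embedding_distance + (1 - weight) * string_distance


# Identical names are pulled towards 0 even if the embedding distance is noisy.
same = combined_distance("url", "url", 0.40)
# Different scripts: the string distance is ignored (0.30 is a made-up value).
cross_script = combined_distance("name", "имя", 0.30)
print(same, cross_script)
```

With this kind of mix, literally identical parameter names would no longer end up with a large d purely because of model noise, at the cost of also favouring identically spelled words that mean different things.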

I will clarify my question. We wanted to use a threshold score; any mapping below that score won't be used. When referimento and referencia have a d value of 0.40, and nom and nombre have 0.29, what do these numbers mean? Can they be interpreted as the quality of the match between 0 and 1? Which is the best match and which is a bad match based on these values? Are they really an indicator of good or bad matching, or just an internal system value that is not useful?

Oh, I see! That number is a distance, so 0 would be a perfect match and 1 no match at all. I've already set an upper bound of 0.45, so you will only see values lower than that.
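Since d is a distance, a stricter threshold on the consumer side means keeping only entries at or below a chosen value. A minimal sketch follows; the JSON layout with "source" and "target" keys is a guess made for illustration (only the d field is confirmed in this thread), and the 0.35 cut-off is an arbitrary example value:

```python
import json

# Hypothetical layout; only the "d" field name comes from the thread.
raw = """
[
  {"source": "nom", "target": "nombre", "d": 0.29},
  {"source": "referimento", "target": "referencia", "d": 0.40}
]
"""


def filter_alignments(alignments, max_distance):
    """Keep mappings whose distance d is at most max_distance (lower is better)."""
    return [a for a in alignments if a["d"] <= max_distance]


alignments = json.loads(raw)
kept = filter_alignments(alignments, max_distance=0.35)
print([(a["source"], a["target"]) for a in kept])
```

Anything passed with max_distance at 0.45 or above would keep every entry, because the published files are already capped at that bound.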

@Pginer-WMF, I'm going to mark this task as resolved from my side, and we can continue the follow-up somewhere else, ok?

@Pginer-WMF, I'm going to mark this task as resolved from my side, and we can continue the follow-up somewhere else, ok?

Yes. Thanks for the great work, @diego.
Regarding follow-up tickets:

@diego @Pginer-WMF thanks for the earlier replies on this task. I had some annual planning and vacation stuff that drew me away for a while, but just wanted to say thanks!