
Parameters matching on Templates: ML Exploration
Closed, Resolved · Public


One challenge for the Content Translation tool is translating templates. While finding the equivalent template in the target language can be done through Wikidata, finding the mapping between template parameters is not trivial. Currently this task is done heuristically. Here we are going to explore what possibilities ML and NLP offer to improve this task.

Problem: Given a template with named parameters in language X, find the corresponding parameters in language Y.

Methodology: Considering the lack of labeled data (ground truth to learn from), we scope this as an unsupervised
1-to-1 matching problem. Although in some cases the matching is 1-to-n or n-to-n (for example, a date can be split into day/month/year in one language and kept as a single variable in another), for simplicity we focus on 1-to-1 matching.
Given this matching problem we explore three alternatives to match parameters:

  1. Metadata-based solution: We use the metadata contained in TemplateData to create the matches.
  2. Value-based solution: For each usage of a template, we check the values assigned to each parameter in languages X and Y, assuming that matching parameters must point to the same values.
  3. Parameter name-based solution: We simply compare parameter names in each language (e.g. Name in English will map to Nombre in Spanish).
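The 1-to-1 scoping above can be illustrated with a toy sketch (not the actual implementation, which is in the repository): given a matrix of distances between source and target parameters, a greedy pass takes the globally closest unmatched pair until every parameter is assigned at most once.

```python
def greedy_match(dist):
    """Greedy 1-to-1 matching over a |X| x |Y| distance matrix.

    Repeatedly takes the globally closest unmatched (source, target)
    pair, so each parameter ends up in at most one match.
    Returns a list of (source_index, target_index, distance) tuples.
    """
    pairs = sorted(
        (d, i, j) for i, row in enumerate(dist) for j, d in enumerate(row)
    )
    used_i, used_j, matches = set(), set(), []
    for d, i, j in pairs:
        if i not in used_i and j not in used_j:
            matches.append((i, j, d))
            used_i.add(i)
            used_j.add(j)
    return matches

# Toy example: 3 source parameters vs 3 target parameters.
dist = [
    [0.10, 0.80, 0.90],
    [0.70, 0.20, 0.85],
    [0.60, 0.75, 0.30],
]
print(greedy_match(dist))  # [(0, 0, 0.1), (1, 1, 0.2), (2, 2, 0.3)]
```

An optimal (rather than greedy) assignment could instead be computed with `scipy.optimize.linear_sum_assignment`; for mostly well-separated distances the two give the same result.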

For all cases, we need to compare pieces of text in two languages. To do so, we build our solution on top of aligned word embeddings. Specifically, we use the fastText multilingual embeddings, but create our own Wikidata-based alignments.
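Once the embeddings for two languages live in a shared aligned space, comparing two parameter names reduces to a vector distance. The sketch below uses cosine distance with tiny made-up 3-dimensional vectors standing in for the real 300-dimensional fastText embeddings; the vectors and the lookup dict are purely illustrative.

```python
import numpy as np

def cosine_distance(u, v):
    """1 - cosine similarity; 0 means identical direction."""
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Toy stand-ins for aligned embeddings: after alignment, translations
# should be close in the shared space, unrelated words far apart.
aligned = {
    ("en", "name"):   np.array([0.9, 0.1, 0.0]),
    ("es", "nombre"): np.array([0.8, 0.2, 0.1]),
    ("es", "fecha"):  np.array([0.0, 0.1, 0.9]),
}

d_match = cosine_distance(aligned[("en", "name")], aligned[("es", "nombre")])
d_nonmatch = cosine_distance(aligned[("en", "name")], aligned[("es", "fecha")])
assert d_match < d_nonmatch  # "name" is closer to "nombre" than to "fecha"
```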

Evaluation: Given the lack of ground truth, we do an expert-based evaluation, giving the results to the Language team.

Code: available on GitHub.

Event Timeline

diego triaged this task as High priority. May 16 2019, 4:21 PM

You can find the results of the experiments here.

In summary:

  • TemplateData is incomplete: for example, of the top-50 templates in Spanish (the most used ones), 46% have no TemplateData information. So, using this metadata is not a good option.
  • The parameter name-based solution proves to be strong and simple enough. Check the example results of the Spanish-to-English alignments.

We have agreed with the Language team to use the latter strategy, and run it for the top-15 most common language pairs in the CX tool:

en -> es (36543 translations)
en -> fr (29655 translations)
es -> ca (19600 translations)
en -> ar (15772 translations)
ru -> uk (15352 translations)
en -> pt (13676 translations)
en -> vi (10620 translations)
en -> zh (10087 translations)
en -> ru (9557 translations)
en -> he (9512 translations)
en -> it (9424 translations)
en -> ta (9259 translations)
en -> id (9008 translations)
en -> fa (8811 translations)
en -> ca (8512 translations)

I'm currently creating the alignment vectors for those pairs.

I was wondering, how well does the parameters name-based approach apply to the set of TemplateData-backed templates themselves?

Sorry, I put the wrong link to the experiments in the previous comment; it is now updated.

> I was wondering, how well does the parameters name-based approach apply to the set of TemplateData-backed templates themselves?

About this: if you use just the parameter names, the results would be almost the same, or a bit better, because you have less noise (typos or errors in parameter names). So, in general, I would say that precision might be better, but you would cover fewer templates.

Maybe the right solution is to use the TemplateData-backed templates whenever they exist, and the other solution for the rest.

When we discussed labelled data about templates we discussed TemplateData mainly, which is often incomplete. Another source of multilingual metadata is Wikidata labels. For example, the author parameter for a template in French ("auteur") may be mapped to the Wikidata P50 (author) property, since it contains a French version of the label. However, the naming schema used for the templates may not match the Wikidata properties or how those are translated. So I don't know how useful this can be as an additional source of information to improve the magic of aligning words in a multi-dimensional space.

@JKatzWMF was pointing to the role of Wikidata as a way to help with template mapping, and although I think the comment was mainly about editors providing such metadata (T221534#5221573), I thought it was worth mentioning here in relation to our automatic approach.

I have created and uploaded the full experiments and aligned parameters for these languages:

["es", "en", "fr", "ar", "ru", "uk", "pt", "vi", "zh", "ru", "he", "it", "ta", "id", "fa", "ca"]

I chose those languages based on @Pginer-WMF's request for the top 15 language pairs in the translation tool. I've included those 15 pairs, plus all the combinations between the languages on that list. I'm not sure whether the other alignments will be useful, but it was trivial to add them.
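"All the combinations" here means every ordered source → target pair between the listed languages, which is easy to enumerate (deduplicating the repeated "ru" in the pasted list gives 15 distinct languages):

```python
from itertools import permutations

# The 15 distinct languages from the list above.
langs = ["es", "en", "fr", "ar", "ru", "uk", "pt", "vi",
         "zh", "he", "it", "ta", "id", "fa", "ca"]

# Every ordered source -> target pair between the listed languages.
pairs = list(permutations(langs, 2))
print(len(pairs))  # 15 * 14 = 210 ordered pairs
```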

You can find all the alignments here.

All the code and details about how to reproduce them are in this repository.

To reproduce these alignments or create new ones, I have added instructions to the repository. Just keep in mind that:

  • You need to budget around 10 GB of disk per language included. This is the space needed for the fastText models.
  • You also need enough RAM, considering that you need to keep two models in memory at the same time.
  • I've done all the experiments on stat1007, and I still have the models there. Currently, I'm using 128 GB to store the full data. I can move this to a more proper host if needed. Maybe @elukey can suggest the best place to keep and work with this data. I don't think that we need to run this process often, but for sure I'll keep the data and models in some safe place. Maybe we would also want to add more languages in the near future (@Pginer-WMF @santhosh ?)
  • There is a list of hyperparameters that I have assigned arbitrarily and hard-coded in the scripts. With some ground-truth data (human-annotated alignments) we could learn them, improving the quality of the alignments. That list is in the repository's

Thanks @diego. This is great!

I created a follow-up ticket (T224721) for the Language team to integrate the alignments in Content translation to improve the way templates are handled by the tool. This will allow us to see the effects in practice.

@KartikMistry may be interested in taking a look at this, since we may want to automate this process in order to run it twice a year to update the alignments and expand to new languages.

@diego I was processing the JSON files and trying to understand the values.

In the above mapping, I expect params that are literally the same to have a score (annotated as d in the JSON) that is either high, or 0 if d means distance in vector space.

Another one:

Can you please help understanding the values here?

If I understood correctly, you are asking why two identical strings do not have distance = 0; this is because there is no string-matching mechanism in this approach. Every language is trained separately and then aligned using some words or sentences that we know are equivalent. This is not necessarily bad, because you will find some words that are written exactly the same but mean different things in each language. However, in the examples that you show, this is just part of the noise introduced by the model.
We could add a second step, for example using Levenshtein distance, that would take advantage of string similarity, but it would work only for languages that share the same script. If we had some training data, we could learn how to mix these two approaches and how useful the latter would be.
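The proposed second step could look like the sketch below: a plain dynamic-programming Levenshtein distance, normalized into a [0, 1] similarity so it could later be mixed with the embedding distance. This is only an illustration of the idea, not code from the repository.

```python
def levenshtein(a, b):
    """Classic edit distance via dynamic programming (two rows)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def string_similarity(a, b):
    """Normalized similarity in [0, 1]; 1.0 means identical strings."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

# Related-looking parameter names score high even when the
# embedding distance is noisy.
print(round(string_similarity("nom", "nombre"), 2))  # 0.5
```

Mixing it with the embedding distance (e.g. a weighted sum of the two scores) would only make sense for language pairs sharing a script, and the weight is exactly what some annotated training data would let us learn.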

I will clarify my question. We wanted to use a threshold score: any mapping below that score won't be used. When referimento and referencia have a d value of 0.40, and nom, nombre have 0.29, what do these numbers mean? Can they be interpreted as a quality of match between 0 and 1? Which is the best match and which is a bad match based on these values? Are they really an indicator of good or bad matching, or just an internal system value that is not useful?

Oh! I see. That number is a distance, so 0 would be a perfect match and 1 no match at all. I've already put an upper bound of 0.45, so you will only see values lower than that.
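Consuming the published files then reduces to keeping pairs under the distance bound. The d values 0.29 and 0.40 below come from the examples discussed in this thread; the third pair and the tuple layout are hypothetical, since the exact JSON schema isn't shown here.

```python
# Hypothetical shape of one alignment entry: (source_param, target_param, d),
# where d is a distance: 0.0 is a perfect match, values near 1.0 mean no match.
alignments = [
    ("nom", "nombre", 0.29),             # value discussed in the thread
    ("referimento", "referencia", 0.40), # value discussed in the thread
    ("autor", "fecha", 0.62),            # made-up bad match, for illustration
]

THRESHOLD = 0.45  # the upper bound already applied to the published files

kept = [(s, t, d) for s, t, d in alignments if d < THRESHOLD]
print(kept)  # the ("autor", "fecha") pair is filtered out
```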

@Pginer-WMF, I'm going to mark this task as resolved from my side, and we can continue the follow-up somewhere else, ok?

> @Pginer-WMF, I'm going to mark this task as resolved from my side, and we can continue the follow-up somewhere else, ok?

Yes. Thanks for the great work, @diego.
Regarding follow-up tickets:

@diego @Pginer-WMF thanks for the earlier replies on this task. I had some annual planning and vacation stuff that drew me away for a while, but I just wanted to say thanks!