
Explore and evaluate template parameter alignment based on language agnostic embedding similarity
Open, Needs Triage, Public

Description

Template adaptation between two languages in cxserver works as follows:

  1. Find the corresponding template in the target language using Wikidata connections
  2. If the target template does not exist, exit and inform the user in the UI
  3. If the target template exists, extract parameters from the source and target templates
    1. If templatedata for the template exists, use that to extract template parameters
    2. If templatedata does not exist, try extracting params from the source code of the template using regex
  4. Find alignment between these two sets of parameters by
    1. String matching heuristics: exact match, case-insensitive match, comparison after removing punctuation and symbols, etc.
    2. A template alignment database that the research team created for us, based on parameter alignment with a multilingual fastText model. This alignment happens offline, hence the database of alignments. (source code)
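
The regex fallback in step 3 and the string matching heuristics in step 4 can be sketched roughly as below. This is a minimal illustration, not the cxserver code: the function names (`extract_params`, `normalize`, `match_by_string`) and the exact regex are assumptions made for this example. It relies only on the fact that template parameters appear in wikitext as `{{{name}}}` or `{{{name|default}}}`.

```python
import re
import unicodedata

# Assumed pattern: a parameter name is whatever follows "{{{" up to the
# first "|" or "}". Not the regex cxserver actually uses.
PARAM_RE = re.compile(r"\{\{\{\s*([^{}|]+?)\s*[|}]")


def extract_params(wikitext):
    """Return parameter names found in template source, in order, deduplicated."""
    params = []
    for name in PARAM_RE.findall(wikitext):
        if name not in params:
            params.append(name)
    return params


def normalize(name):
    """Case-fold and drop everything that is not a letter or digit."""
    folded = unicodedata.normalize("NFKC", name).casefold()
    return "".join(ch for ch in folded if ch.isalnum())


def match_by_string(source_params, target_params):
    """Align parameters by exact, case-insensitive, then normalized match."""
    exact = set(target_params)
    by_case = {}
    by_norm = {}
    for target in target_params:
        by_case.setdefault(target.casefold(), target)
        by_norm.setdefault(normalize(target), target)

    alignment = {}
    for source in source_params:
        key = normalize(source)
        if source in exact:
            alignment[source] = source
        elif source.casefold() in by_case:
            alignment[source] = by_case[source.casefold()]
        elif key and key in by_norm:
            alignment[source] = by_norm[key]
    return alignment
```

For example, `match_by_string(["Image", "birth_date", "honours"], ["image", "birth date", "hommage"])` aligns `Image` and `birth_date` but leaves `honours` unmatched, which is the kind of gap described further down in this task.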

We have been using this approach for many years. At the same time, template adaptation continues to be one of the biggest complaints about CX. See T102964: [Epic] Better support for templates in Content Translation.

Apart from the challenging issue of missing templates, the difficulty in mapping template parameters properly is that it requires both lexical and semantic matching.

To illustrate: if we take Template:Infobox person from English Wikipedia and try to match it with French Wikipedia, the parameter birth_date needs to be mapped to date de naissance, honours to hommage, and family to famille. The notable work parameter of Template:Infobox writer in English Wikipedia is magnum opus in another wiki. At present, none of these mappings are in our database, and string matching heuristics are not enough to find them either.

Based on a recent exploration, I found that there is an opportunity to improve the adaptation using the Sentence Transformer LaBSE model, which is the state-of-the-art model for bitext mining. It supports roughly 110 languages. LaBSE works less well for assessing the similarity of sentence pairs that are not translations of each other.

LaBSE, and sentence-transformer-based models in general, can now be used on CPUs, thanks to the recent addition of OpenVINO and ONNX backends. https://embed.toolforge.org/ hosts the LaBSE model with an OpenVINO backend.

Using that API, we can get a semantic similarity matrix for template parameters between two languages. By applying a sensible threshold, we can find the best-matching target parameter for each source parameter.
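
A minimal sketch of the matrix-and-threshold idea, under stated assumptions: `embed` is a hypothetical callable that returns one embedding vector per input string (in practice it would wrap whatever embedding service is used; its request format is not specified here), the threshold value is a placeholder, and the greedy one-to-one assignment is one possible reading of "best match", not necessarily what the demo does.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def similarity_matrix(source_vecs, target_vecs):
    """matrix[i][j] is the similarity of source i and target j."""
    return [[cosine(s, t) for t in target_vecs] for s in source_vecs]


def align_params(source_params, target_params, embed, threshold=0.8):
    """Greedy one-to-one alignment of parameters above a similarity threshold.

    `embed` is a hypothetical callable: list[str] -> list[list[float]].
    The default threshold is an arbitrary placeholder, not a tuned value.
    """
    matrix = similarity_matrix(embed(source_params), embed(target_params))
    candidates = sorted(
        (
            (matrix[i][j], i, j)
            for i in range(len(source_params))
            for j in range(len(target_params))
        ),
        reverse=True,
    )
    used_sources, used_targets = set(), set()
    alignment = {}
    for score, i, j in candidates:
        if score < threshold:
            break  # remaining pairs score lower still
        if i in used_sources or j in used_targets:
            continue
        alignment[source_params[i]] = (target_params[j], score)
        used_sources.add(i)
        used_targets.add(j)
    return alignment
```

Source parameters with no target above the threshold are simply left out of the result, so they can fall through to the existing heuristics or be left unmapped for the translator.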

A demo of this system is available at https://people.wikimedia.org/~santhosh/template-alignment/.

Based on this proof of concept:

  • Evaluate the effectiveness of semantic similarity in comparison with the existing API and report objective measures of the improvement
  • Evaluate on very small languages and report the effectiveness
  • If the evaluation succeeds, replace the existing alignment system with the new one (this will require an additional task for engineering work, especially on model inference scaling)
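
One way to make the "objective measures" concrete is to score each system's output against a hand-checked gold alignment using precision, recall and F1. The sketch below assumes alignments are represented as dicts from source parameter to target parameter; the function name `evaluate_alignment` and that representation are assumptions for illustration, and the task does not prescribe a metric.

```python
def evaluate_alignment(predicted, gold):
    """Precision, recall and F1 of predicted pairs against gold pairs.

    Both arguments map a source parameter name to a target parameter name.
    """
    predicted_pairs = set(predicted.items())
    gold_pairs = set(gold.items())
    correct = len(predicted_pairs & gold_pairs)

    precision = correct / len(predicted_pairs) if predicted_pairs else 0.0
    recall = correct / len(gold_pairs) if gold_pairs else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```

Running the same gold set through both the existing alignment and the embedding-based one, per language pair, would give directly comparable numbers, including for the very small languages mentioned above.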

Event Timeline

eamedina renamed this task from Explore and evaluate template parameter alignement based on language agnostic embedding similarity to Explore and evaluate template parameter alignment based on language agnostic embedding similarity. Jan 14 2026, 12:41 PM