Integrate template parameter alignments in Content Translation to improve automatic template support
Open, HighPublic

Description

Content translation looks for different metadata to transfer the contents from a template in the source document to the equivalent template in the translation.

A machine learning approach was applied in this context (T221211) to identify the mappings/alignments of parameters for the most used templates and language pairs (generated alignments).

This ticket proposes to integrate the generated alignments into Content Translation as an additional criterion during the parameter mapping process. When a template is added to the translation for a language pair with alignment data available, the alignments will be used to identify additional mappings that the default approaches could not identify. That is, metadata from TemplateData and Parsoid will still be used; the alignments will surface additional possible mappings that were not considered before.
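
The proposal above can be sketched roughly as follows. This is only an illustration, not cxserver code: the function name `mergeMappings`, the shape of its arguments, and the rule that default mappings always win over alignments are assumptions made for the example.

```javascript
/**
 * Combine mappings found by the default approaches with alignment data.
 * Assumption: alignments are consulted only for source parameters that
 * the default approaches (TemplateData, Parsoid) left unmapped.
 *
 * @param {string[]} sourceParams Parameter names used in the source template
 * @param {Object} defaultMapping Source param -> target param (default approaches)
 * @param {Object} alignmentMapping Source param -> target param (generated alignments)
 * @return {Object} Source param -> { target, origin }
 */
function mergeMappings(sourceParams, defaultMapping, alignmentMapping) {
    const has = (obj, key) => Object.prototype.hasOwnProperty.call(obj, key);
    const merged = {};
    for (const param of sourceParams) {
        if (has(defaultMapping, param)) {
            merged[param] = { target: defaultMapping[param], origin: 'default' };
        } else if (has(alignmentMapping, param)) {
            merged[param] = { target: alignmentMapping[param], origin: 'alignment' };
        }
        // Otherwise the parameter stays unmapped.
    }
    return merged;
}

// Illustrative input: 'url' is found by the default approach, 'fecha' only
// by the alignments, and 'formato' by neither.
const result = mergeMappings(
    ['url', 'fecha', 'formato'],
    { url: 'url' },
    { url: 'url', fecha: 'data' }
);
console.log(result);
```

Recording the origin of each mapping is optional, but it would make it easier to measure later how many mappings came from the alignments.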

Since the alignment information comes with probability data, we need to define a reasonable threshold. In this case I think it makes more sense to err on the side of the information being copied to the wrong parameter (high coverage) rather than being lost (high accuracy), but we may need to experiment and iterate on the exact value.
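
A minimal sketch of how such a threshold could be applied. The names `selectAlignments` and `DEFAULT_THRESHOLD`, the value 0.5, and the scores in the sample data are all made up for illustration; the task leaves the real value open.

```javascript
// Hypothetical cut-off; to be tuned by experiment.
const DEFAULT_THRESHOLD = 0.5;

/**
 * Keep, for each source parameter, the highest-scoring candidate whose
 * score is at or above the threshold.
 *
 * @param {Array<{source: string, target: string, score: number}>} candidates
 * @param {number} [threshold]
 * @return {Map<string, {target: string, score: number}>}
 */
function selectAlignments(candidates, threshold = DEFAULT_THRESHOLD) {
    const best = new Map();
    for (const { source, target, score } of candidates) {
        if (score < threshold) {
            continue;
        }
        const current = best.get(source);
        if (!current || score > current.score) {
            best.set(source, { target, score });
        }
    }
    return best;
}

// Made-up scores, only to show the effect of the threshold.
const candidates = [
    { source: 'fecha', target: 'data', score: 0.56 },
    { source: 'fecha', target: 'any', score: 0.31 },
    { source: 'formato', target: 'format', score: 0.42 }
];
const strict = selectAlignments(candidates, 0.5);
const lenient = selectAlignments(candidates, 0.4);
console.log([...strict.keys()], [...lenient.keys()]);
```

Lowering the threshold favors coverage: in this sample the lenient run also keeps `formato`, at the cost of accepting a less certain mapping.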

Regarding metrics, it would be great to measure how many templates can be adapted with this method, in general, and compared to those incomplete or not adapted. Depending on the complexity of this, a separate ticket can be created.
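
One possible way to tally this, as a sketch only. The category definitions (complete when every source parameter in use is mapped, not adapted when none is) and the names `classifyAdaptation` and `summarize` are assumptions, not an agreed metric.

```javascript
/**
 * Classify one template adaptation by how many of its source
 * parameters received a mapping.
 *
 * @param {string[]} sourceParams Parameters used in the source template
 * @param {string[]} mappedParams Source parameters that were mapped
 * @return {string} 'complete', 'incomplete' or 'notAdapted'
 */
function classifyAdaptation(sourceParams, mappedParams) {
    const mapped = sourceParams.filter((p) => mappedParams.includes(p)).length;
    if (mapped === sourceParams.length) {
        return 'complete';
    }
    return mapped === 0 ? 'notAdapted' : 'incomplete';
}

/**
 * Count templates per category.
 *
 * @param {Array<{sourceParams: string[], mappedParams: string[]}>} templates
 * @return {{complete: number, incomplete: number, notAdapted: number}}
 */
function summarize(templates) {
    const counts = { complete: 0, incomplete: 0, notAdapted: 0 };
    for (const { sourceParams, mappedParams } of templates) {
        counts[classifyAdaptation(sourceParams, mappedParams)]++;
    }
    return counts;
}

// Illustrative data only.
const summary = summarize([
    { sourceParams: ['url', 'fecha'], mappedParams: ['url', 'fecha'] },
    { sourceParams: ['url', 'formato'], mappedParams: ['url'] },
    { sourceParams: ['formato'], mappedParams: [] }
]);
console.log(summary);
```

Running the same tally with and without the alignments would give the comparison described above.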

Event Timeline

Pginer-WMF triaged this task as High priority. May 31 2019, 9:54 AM
Pginer-WMF updated the task description. May 31 2019, 10:20 AM

After a brief analysis of the generated JSON mapping, this is what I am planning to do:

The JSON files are big, so parsing them and doing lookups is going to be slow. Even if we load them once and keep them in a cache, that is 12 MB for the 210 files covering 15 languages, and it is expected to grow as we add more languages. So I plan to load all this data into an SQLite database and use queries to check whether a mapping exists and to retrieve it.

I wrote a script to do that:

const fs = require('fs'),
    ArgumentParser = require('argparse').ArgumentParser,
    sqlite = require('sqlite'); // https://github.com/kriasoft/node-sqlite

// Return the rowid of the (from, to, templateName) row, inserting it if needed.
async function createTemplate(db, from, to, templateName) {
    const mapping = await db.get(`SELECT rowid FROM templates
        WHERE source_lang = ? AND target_lang = ? AND template = ?`,
    from, to, templateName);
    if (mapping && mapping.rowid) {
        return mapping.rowid;
    }
    const result = await db.run(`INSERT OR IGNORE INTO templates
        (source_lang, target_lang, template) VALUES(?,?,?)`,
    from, to, templateName);
    return result.lastID;
}

// Create the tables if they are missing and load one language pair's mapping.
async function main(databaseFile, mapping, from, to) {
    const db = await sqlite.open(databaseFile, { Promise });

    await db.run(`CREATE TABLE IF NOT EXISTS templates (
        source_lang TEXT NOT NULL,
        target_lang TEXT NOT NULL,
        template TEXT NOT NULL,
        UNIQUE(source_lang, target_lang, template)
    )`);
    await db.run(`CREATE TABLE IF NOT EXISTS mapping (
        template_mapping_id INTEGER NOT NULL,
        source_param TEXT NOT NULL,
        target_param TEXT NOT NULL,
        score REAL NOT NULL,
        UNIQUE(template_mapping_id, source_param, target_param)
    )`);

    for (const templateName in mapping) {
        const mappingData = mapping[templateName];
        const mappingId = await createTemplate(db, from, to, templateName);
        console.log(`${mappingId} ${from} ${to} ${templateName}`);
        for (const index in mappingData) {
            const paramMapping = mappingData[index];
            // Skip entries that lack a parameter name on either side.
            if (!mappingId || !paramMapping[from] || !paramMapping[to]) {
                continue;
            }

            await db.run(`INSERT OR IGNORE INTO mapping
                (template_mapping_id, source_param, target_param, score)
                VALUES(?,?,?,?)`,
            mappingId, paramMapping[from], paramMapping[to], paramMapping.d);
            console.log(`${paramMapping[from]} -> ${paramMapping[to]} [${paramMapping.d}]`);
        }
    }
    await db.close();
}

const argparser = new ArgumentParser({
    addHelp: true,
    description: 'Prepare template mapping database'
});

argparser.addArgument(
    ['-d', '--database'],
    {
        help: 'template mapping database file',
        defaultValue: 'templatemapping.db'
    }
);
argparser.addArgument(
    ['-i', '--input'],
    {
        help: 'JSON file with mapping.',
        required: true
    }
);
argparser.addArgument(
    ['--from'],
    {
        help: 'Source language',
        required: true
    }
);
argparser.addArgument(
    ['--to'],
    {
        help: 'Target language',
        required: true
    }
);
const args = argparser.parseArgs();
const databaseFile = args.database;
const input = args.input;
if (!fs.existsSync(input)) {
    throw new Error(`File ${input} does not exist`);
}

const mapping = JSON.parse(fs.readFileSync(input, 'utf8'));
// Report failures instead of leaving an unhandled promise rejection.
main(databaseFile, mapping, args.from, args.to).catch((err) => {
    console.error(err);
    process.exit(1);
});
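
Once the database is populated, a lookup against the two tables created by the script could look like the sketch below. The function name `getParamMapping` and the `threshold` argument are hypothetical, and the sketch assumes a `db` object exposing `all(sql, ...params)` as in the node-sqlite library used by the script. To keep the example runnable without a database file, it uses a stand-in `fakeDb` with canned rows; the template name key is also only illustrative.

```javascript
/**
 * Retrieve the parameter mapping for a template and language pair.
 *
 * @param {Object} db Database handle with an all(sql, ...params) method
 * @param {string} from Source language
 * @param {string} to Target language
 * @param {string} templateName Template name as stored in the templates table
 * @param {number} [threshold] Minimum score to accept (hypothetical parameter)
 * @return {Promise<Map<string, string>>} Source param -> target param
 */
async function getParamMapping(db, from, to, templateName, threshold = 0) {
    const rows = await db.all(
        `SELECT m.source_param, m.target_param, m.score
        FROM mapping m
        JOIN templates t ON t.rowid = m.template_mapping_id
        WHERE t.source_lang = ? AND t.target_lang = ? AND t.template = ?
            AND m.score >= ?
        ORDER BY m.score DESC`,
        from, to, templateName, threshold
    );
    const bySource = new Map();
    for (const row of rows) {
        // Rows arrive best score first, so keep the first target seen.
        if (!bySource.has(row.source_param)) {
            bySource.set(row.source_param, row.target_param);
        }
    }
    return bySource;
}

// Stand-in for a real database handle, returning canned rows.
const fakeDb = {
    async all() {
        return [
            { source_param: 'año', target_param: 'any', score: 0.62 },
            { source_param: 'fecha', target_param: 'data', score: 0.56 }
        ];
    }
};

getParamMapping(fakeDb, 'es', 'ca', 'Cita conferencia', 0.5)
    .then((found) => console.log([...found.entries()]));
```

An empty map means no alignment data is available for that template and language pair, in which case only the default approaches apply.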

Change 517056 had a related patch set uploaded (by Santhosh; owner: Santhosh):
[mediawiki/services/cxserver@master] Add scripts to load the template mapping json to sqlite database

https://gerrit.wikimedia.org/r/517056

Change 517057 had a related patch set uploaded (by Santhosh; owner: Santhosh):
[mediawiki/services/cxserver@master] Integrate template parameter alignments

https://gerrit.wikimedia.org/r/517057

santhosh added a comment. Edited Jun 14 2019, 12:16 PM

For review and testing, I will walk through a template and its mapping using the database. This is the same template used in the unit test of the patch above.

Let us use the Cite Conference template, which is a reference template.

Source template: https://es.wikipedia.org/wiki/Plantilla:Cita_conferencia
Target template: https://ca.wikipedia.org/wiki/Plantilla:Citar_confer%C3%A8ncia

Neither of them has template data.

A real example of content that uses this template:

{{Cita conferencia |
autor=Naciones Unidas |
enlaceautor=Naciones Unidas |
fecha=18 de marzo de 2015 |
año=2015 |
mes=marzo |
título=Tercera Conferencia Mundial de las Naciones Unidas sobre la Reducción del Riesgo de Desastres (Manual de la Conferencia) |
conferencia=Tercera Conferencia Mundial de las Naciones Unidas |
editor=WCDRR |
ubicación=Sendai, Japón |
páginas=21 |
url=http://www.wcdrr.org/uploads/UN-WCDRR-CH-Es.pdf |
formato=PDF |
fechaacceso=8 de mayo de 2015}}

The template alignment system gave us the following mapping and scores:

| Source param | Target param | Score |
| --- | --- | --- |
| apellido | cognom | 0.733333677522359 |
| año | any | 0.621203767666562 |
| conferencia | conferència | 0.79699115204052 |
| fecha | data | 0.562465710685373 |
| nombre | nom | 0.770534231312654 |
| título | títol | 0.674103936218413 |
| url | url | 0.599459685454743 |

| Source param (es) | Source value | Target param (ca) | Explanation |
| --- | --- | --- | --- |
| autor | Naciones Unidas | autor | Even though these params are the same in source and target, the alignment tool did not give this mapping. CXServer's mapping algorithm was used |
| enlaceautor | Naciones Unidas | - | Not able to map |
| fecha | 18 de marzo de 2015 | data | Using the database |
| año | 2015 | any | Using the database |
| mes | marzo | mes | CXServer algorithm |
| título | Tercera Conferencia Mundial de las Naciones Unidas sobre la Reducción del Riesgo de Desastres (Manual de la Conferencia) | títol | Using the database |
| conferencia | Tercera Conferencia Mundial de las Naciones Unidas | conferència | Using the database |
| editor | WCDRR | editor | CXServer algorithm. The alignment tool did not give this mapping |
| ubicación | Sendai, Japón | - | Not able to map. Expected: location |
| páginas | 21 | - | Not able to map. Expected: pages |
| url | http://www.wcdrr.org/uploads/UN-WCDRR-CH-Es.pdf | url | CXServer algorithm. The alignment tool also gave the mapping |
| formato | PDF | - | Not able to map. Expected: format |
| fechaacceso | 8 de mayo de 2015 | - | Not able to map. Expected: consulta |

Great work, and thanks for the clear example, @santhosh. It is great to see that this is automatically providing extra mappings that we were not finding before.

It's interesting that the mapping was found for word pairs that are very different such as apellido/cognom, or fecha/data; but it failed to find common similar words such as formato/format or páginas/pàgines. Maybe @diego can confirm whether these were cut out because of the threshold, because those words were not available in the corpora used, or something else.

Given that there are no false positives in the obtained mappings, if this example were representative, we may even consider making the threshold a bit less strict to get some more mappings.

santhosh claimed this task. Jun 17 2019, 8:17 AM

Change 517056 merged by jenkins-bot:
[mediawiki/services/cxserver@master] Add scripts to load the template mapping json to sqlite database

https://gerrit.wikimedia.org/r/517056

Change 517057 merged by jenkins-bot:
[mediawiki/services/cxserver@master] Integrate template parameter alignments

https://gerrit.wikimedia.org/r/517057