Page MenuHomePhabricator

Define template parameter mapping between languages as a wiki page
Open, NormalPublic

Description

Adapting template parameter mappings between languages is one of the toughest problems ContentTranslation has attempted to solve. We used TemplateData extensively to find corresponding parameters in the target language, using relaxed parameter name matching.

But this approach has the following limitations:

  1. It does not work when TemplateData is missing or incomplete. If TemplateData is missing, we try to parse the template code in wikitext and extract the parameters. This is very tricky and does not always work.
  2. When parameter names are translated or transliterated in a language, our parameter matching does not work. A simple example: translated_title in one language and transtitle in another. The parameter name may also be in a different script.
  3. Defining a TemplateData alias that matches the English parameter name can help, but if the actual template does not support that alias, it is a problem.
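To make the limitation concrete, here is a minimal sketch of the kind of relaxed name matching described above, and where it breaks down. The helper names are illustrative, not the actual cxserver implementation:

```python
import re

def normalize(name):
    # Lowercase and strip punctuation, spaces, and underscores
    # so that cosmetic differences don't block a match.
    return re.sub(r"[\W_]+", "", name).lower()

def match_param(source_param, target_params):
    # Return the first target parameter whose normalized form matches.
    wanted = normalize(source_param)
    for target in target_params:
        if normalize(target) == wanted:
            return target
    return None

# Case and separator differences are handled...
print(match_param("Translated_Title", ["translated-title", "url"]))  # translated-title
# ...but a renamed or abbreviated parameter is not (limitation 2):
print(match_param("translated_title", ["transtitle", "url"]))  # None
```

This is why the approach cannot connect parameters whose names were translated or transliterated: there is no surface-level similarity left to normalize away.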

Considering all these issues, we need to think about a more reliable, long-term solution.

  1. Global templates can be a solution. But:
    • they do not exist yet
    • even when Global templates become a reality, the migration timeline is unclear
    • wikis may decide to keep a local variant of a template for various reasons, such as a culturally appropriate variant
  2. We can explore more sophisticated solutions to find template parameter mappings, such as a predictor based on statistical learning.

We see that the above solutions are not enough to meet our immediate goals.

Our goals:

  1. A structured mapping between parameter names of a template in different languages.
  2. A way to extend this data easily so that we can cover more and more languages and templates. Ideally, a one-time edit of this data should be enough to support a template between a language pair forever.
  3. Editing this data should be very easy for users.

Proposed solution

  1. Use Extension:JsonConfig and define the template adaptation details on meta.wikimedia.org. (Details of the namespace and data format to be defined later.)
  2. CXServer reads this data and uses it for adapting templates.
  3. The CX UI provides an easy way for users to reach this data page and edit it.
  4. Editing the data can be raw data editing, or we can try to build a UI for it.

Details

Related Gerrit Patches:
mediawiki/services/cxserver (master): Read template mapping from a customizable wikipage

Event Timeline

santhosh created this task. Apr 22 2019, 6:55 AM
Restricted Application added a subscriber: Aklapper. Apr 22 2019, 6:55 AM

Global templates (T121470) definitely are the right solution. Global templates are not just about translatability; they are also about easing the burden of maintaining code that should be common, and about making the templates semantic. ("Semantic" means, for example, that it will be possible to define the function of the template in a way that is machine-readable and cross-project, and that it will be possible to add a button to the VE toolbar that inserts the template in a meaningful and easy way, without having to write complex ultra-custom gadgets.)

But indeed, we don't know when true Global templates will happen, and improving things in the meantime is desirable.

The proposed solution sounds right, but it really should be seen as a temporary solution, and as a step towards global templates. So, for example, if it helps editors in different languages map parameters and functionalities in a structured way, then it's a good step because it will help properly unify the templates later.

Making it editable similarly to how wiki pages are edited is a good idea, because that's the main strength of templates to begin with: they feel like part of the wiki, and they don't require a difficult code review and deployment process. It may be a good idea to make this as structured as possible so that people won't write over-customized and possibly invalid JSON.

Pginer-WMF triaged this task as Normal priority. Apr 22 2019, 11:37 AM
Pginer-WMF moved this task from Needs Triage to Enhancements on the ContentTranslation board.

Change 506611 had a related patch set uploaded (by Santhosh; owner: Santhosh):
[mediawiki/services/cxserver@master] Read template mapping from a customizable wikipage

https://gerrit.wikimedia.org/r/506611

In https://gerrit.wikimedia.org/r/506611, I started the first step towards this goal.

It allows cxserver to read the template parameter mapping from a configured wiki domain, from a customizable title path.

The only requirement for now is that the content should be JSON with the following structure:

{
  "params": {
    "source_param_name": "target_param_name"
  }
}

This can exist at a URL such as MediaWiki:TemplateMapping/en/Cite_web/es, but the order of the path components and the title prefix are customizable.
An example page: https://meta.wikimedia.org/wiki/User:Santhosh.thottingal/MediaWiki:TemplateMapping/en/Cite_web/es.json
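As a sketch of how a consumer might work with such pages, the helpers below build the page title from the customizable prefix and parse the expected JSON content. These helper names are hypothetical, not cxserver's actual code:

```python
import json

def mapping_title(source_lang, template, target_lang,
                  prefix="MediaWiki:TemplateMapping"):
    # Page title for the mapping, e.g. MediaWiki:TemplateMapping/en/Cite_web/es.
    # The prefix and component order are configurable on the server side.
    return f"{prefix}/{source_lang}/{template}/{target_lang}"

def parse_mapping(raw_json):
    # The page content is expected to be JSON of the form
    # {"params": {"source_param_name": "target_param_name"}}.
    return json.loads(raw_json)["params"]

print(mapping_title("en", "Cite_web", "es"))
# MediaWiki:TemplateMapping/en/Cite_web/es
print(parse_mapping('{"params": {"url": "url", "title": "título"}}'))
```

Fetching the page content itself would use the wiki's raw-content endpoint on the configured domain.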

(I noticed that if an en→es mapping exists, we can more or less derive the es→en mapping from the JSON.)
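A minimal sketch of that derivation, assuming the mapping is one-to-one (which is not guaranteed in general):

```python
def invert_mapping(params):
    # Swap keys and values to derive the reverse-direction mapping.
    # Caveat: if two source parameters map to the same target name,
    # one of them is silently lost in the inversion.
    return {target: source for source, target in params.items()}

en_to_es = {"title": "título", "author": "autor"}
print(invert_mapping(en_to_es))  # {'título': 'title', 'autor': 'author'}
```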

Is it possible to use Wikidata properties as a canonical representation of template parameters?
For example, the author parameter for a template in French ("auteur") can be mapped to the Wikidata property P50 (author). This could help reduce the number of mappings to define, since each language version only needs to be mapped once to the Wikidata properties, not once for each possible language pair combination.

It is unclear whether Wikidata properties can support all template parameters. For example, some templates use name1, name2, etc. parameters for the names of multiple authors. Nevertheless, even if Wikidata cannot support all cases, it can still be useful if it covers the frequent and important cases.
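The pivot idea can be sketched as follows. P50 (author) and P1476 (title) are real Wikidata properties mentioned in this thread; the helper and the per-language data are illustrative:

```python
def pairwise_mapping(source_params, target_params):
    # source_params / target_params map local parameter names to a canonical
    # Wikidata property. Composing the two gives a source -> target mapping
    # without ever defining that language pair explicitly.
    by_property = {prop: name for name, prop in target_params.items()}
    return {name: by_property[prop]
            for name, prop in source_params.items()
            if prop in by_property}

fr = {"auteur": "P50", "titre": "P1476"}
es = {"autor": "P50", "título": "P1476"}
print(pairwise_mapping(fr, es))  # {'auteur': 'autor', 'titre': 'título'}
```

Parameters with no canonical property (the name1, name2 cases above) simply drop out of the composed mapping, which is exactly the coverage gap discussed here.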

This also opens the question of where the mapping metadata makes more sense: on a centralized wiki, as part of TemplateData for each template in each wiki, or as part of Wikidata.

@JKatzWMF was pointing to the role of Wikidata as a way to help with template mapping. Jon, feel free to comment if you have further thoughts in this area.

I'm in support of the canonical identifier of template parameter keys being Wikidata properties in the cases where they're available. I lean toward TemplateData for the time being as the place to forge the connection between template keys and Wikidata properties. As people inevitably copy TemplateData across wikis in the current state, that ought to make for more consistency on the semantic meaning of the templates, at least.

@santhosh would it be possible to add support for that to TemplateData and snap into CX? If not, would there be a twist on the JsonSchema approach you're using (e.g., a qqq.json for the source wiki's template) that could accomplish something similar?

@Lydia_Pintscher:

  • The properties on Wikidata are pretty stable, right? So there wouldn't be much risk of the meaning of a property changing out from under wikis if there were this sort of mapping, correct?
  • I believe this sort of mapping would be forward compatible with Wikidata client editing, but are there gotchas?
  • Could this be a contribution vector to Wikidata on the property internationalization? @Pginer-WMF I think this might be a nice aspect of the UX's persistence approach if you're open to it, although looking to @Lydia_Pintscher for advice on data normalization/federation approach.
  • Also, is there a desired convention for pointing to the identifier of a composite grouped statement? Maybe this part is too complex to try to represent in a mapping in something like TemplateData, but if it's possible to essentially point to the abstract class or duck typed array representing the composite grouped statement, at least it would make the parameter key mapping semantically valid and less ambiguous (plus again, potential contribution vector).

As for modeling of templates themselves and addressing content sufficiency against public ontologies I think things along the lines of ShEx and probably some mainstreaming of ontology support into TemplateData can eventually make that possible. As noted on the global templates task that's more a matter of inter- and intra-wiki governance for different wikis - important conversations to have and be involved with, to be sure. To be clear, I think those sorts of developments would very much be a positive step for user-facing workflows and machine readability, but one step at a time, too, I think.

@Sj curious to hear your thoughts in general and any insights as they pertain specifically to this task.

ContentTranslation already uses all the information found in TemplateData for the source and target language pair. The parameters are matched after normalization (removing punctuation and spaces, making the comparison case-insensitive), using the defined aliases, etc. But none of these can substitute for a "canonical parameter name or id" that connects parameters in one language with those in another. A Wikidata ID could be an additional item in TemplateData that helps us make these connections. It would not be difficult to add that to the TemplateData schema and its editing UI.
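For instance, a TemplateData parameter entry could carry such an id as an extra field. The `wikidataId` key below is hypothetical; it is not part of the current TemplateData schema:

```json
{
    "params": {
        "autor": {
            "label": "Autor",
            "aliases": ["author"],
            "wikidataId": "P50"
        }
    }
}
```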

We were asking community members to add an alias to parameters that works as a canonical id. If this alias is in English and used as an alias for the parameter in every language, CX can use it to connect them. To illustrate: if the template parameter in Malayalam "വയസ്സ്" has the alias "age" and the Spanish parameter "edad" also has the alias "age", we can connect these parameters. But just adding a canonical alias has a problem: the template source code should also respect it.

If these canonical ids for parameters are Wikidata IDs (provided they exist; many of these parameters are like programming-language variables: name1, name2, author1, author2, etc.), they are helpful, but this will work only if editors are willing to update TemplateData in all wikis in a consistent way. This would require some communication between editors on different wikis to make sure each parameter in each template points to the same idea. This is obviously a hard social problem. Note that editing TemplateData requires non-default roles on a wiki.

We should also note that TemplateData coverage for templates is nowhere near 100%. Especially on smaller wikis, many templates do not have up-to-date TemplateData. When TemplateData does not exist, cxserver tries to extract the parameters by parsing the template code(!)

About this ticket: we paused it for a while since we were experimenting with a machine learning approach to find template parameter mappings. The status reports in T224721: Integrate template parameter alignments in Content Translation to improve automatic template support may be interesting to you.

It seems that there are two different aspects being considered:

  1. Where should users provide the mappings:
    • In an existing place (e.g., TemplateData, Wikidata...). This avoids creating another place for metadata, but may have limitations because those places were not designed for this kind of mapping (e.g., data fragmented across different language wikis).
    • In a new place specifically designed for this purpose. Better support and instructions can be provided for the task, but it is a new place for users to learn.
  2. How to provide the mapping:
    • Making one mapping for each language pair. Supporting a parameter translated across 4 languages requires 12 mappings (or 6 if we consider mappings bidirectional). It is a more verbose but more explicit process: defining a particular mapping defines how data is transferred for that case.
    • Mapping to a canonical value. Supporting a parameter translated across 4 languages requires only 4 mappings, connecting each language version to the common value (e.g., P50 for author). It is a more efficient but more abstract process: defining a mapping means describing it in a common language (Wikidata properties) that the user needs to know and navigate, and it may not cover all cases (e.g., supporting author2 vs. author3, or author-link) or may generate ambiguities (the "name" parameter can be mapped in different languages to P2561: name, P1448: official name, or P1559: name in native language).
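The difference in effort grows quickly with the number of languages. A quick sanity check of the counts, taking one mapping per ordered pair of distinct languages for the pairwise scheme:

```python
def pairwise_count(n):
    # One mapping per ordered pair of distinct languages,
    # or n*(n-1)/2 if mappings are treated as bidirectional.
    return n * (n - 1)

def canonical_count(n):
    # One mapping per language, each to the shared canonical value.
    return n

for n in (4, 10, 50):
    print(n, pairwise_count(n), pairwise_count(n) // 2, canonical_count(n))
# For 4 languages: 12 directed pairs (6 bidirectional) vs. 4 canonical mappings.
```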

Please let me know if there are other aspects and considerations not captured by the above summary.

Thanks @santhosh and @Pginer-WMF. Lydia and I spoke. Just to close the loop on one of my earlier questions: template key-to-Wikidata property association does not clash with Wikidata client editing plans as of now. Lydia also reinforced the value of data mining in working through potential futures (some of which is happening, I see). There are no simple approaches, to be sure, but I did want to close the loop on this one item.

awight added a subscriber: awight. Aug 27 2019, 7:58 AM

I'm curious to learn more about the attempt documented in T221211 and mentioned here,

  1. We can explore more sophisticated solutions to find template parameter mappings, such as a predictor based on statistical learning.

We see that the above solutions are not enough to meet our immediate goals.

In the machine learning task, I see that evaluation will be done manually,

Evaluation: Given the lack of ground truth, we do an expert-based evaluation, giving the results to the Language team.

There are a few more indications that the model was promising, for example "that there are no false positives in the obtained mappings" when spot-checking a sample in T224721#5259225, so I'm wondering in what way this model falls short of the goals. Is the full evaluation public, or would you mind sharing it? I can imagine a thorough evaluation would be possible by comparing the initial, machine-aligned draft translations with the final, human-adjusted, published article text; maybe this was done already?

awight updated the task description. Aug 27 2019, 8:13 AM

There are a few more indications that the model was promising, for example "that there are no false positives in the obtained mappings" when spot-checking a sample in T224721#5259225, so I'm wondering in what way this model falls short of the goals. Is the full evaluation public, or would you mind sharing it? I can imagine a thorough evaluation would be possible by comparing the initial, machine-aligned draft translations with the final, human-adjusted, published article text; maybe this was done already?

The machine learning approach from T221211 is very promising, and we are working on integrating it into Content translation (T224721). Once that is complete, we expect that more information from templates will be successfully transferred across languages. Our plan is to enable the machine learning approach and evaluate how well the automatic process works in practice with samples from real articles, user feedback, and instrumentation of templates/parameters that fail to adapt.

However, this approach is not expected to cover all cases. On the one hand, since the alignments are generated in advance, we are doing it for the most common templates and languages, but there may be a longer tail that is unsupported. On the other hand, from the initial evaluation (T224721#5258865) made by @santhosh on a particular case, we saw that we are getting mappings that were not supported before, but there are still parameters missing. Thus, providing users a way to correct mappings or supply missing ones manually may still be useful.