
Math formulas computed as unmodified MT, preventing the article from being published
Closed, Resolved, Public

Description

Although warnings are not shown for math formulas at a paragraph level (T225118), they seem to still be counted when deciding to prevent the publication of the article.
A user in this comment reports problems when translating the article "Prime omega function" from English to French.

I tried to reproduce the scenario: I modified text in the initial paragraph to stay under the 99% limit on unmodified MT for the whole document, then kept adding paragraphs and publishing the translation until the tool prevented me from publishing.
According to the current translation limits, the publishing error should occur on reaching 50 paragraphs with a warning. However, the image below shows that only 25 paragraphs have a warning, while many other paragraphs consisting of a math formula do not show any warning. Math still seems to be computed in some way:

screencapture-fr-wikipedia-org-wiki-Special-ContentTranslation-2020-02-21-14_50_16.png (11×1 px, 1 MB)

(This is a long full-page screenshot, download and open locally for a better view)

Event Timeline

Pginer-WMF triaged this task as Medium priority. (Feb 21 2020, 2:15 PM)
Pginer-WMF updated the task description.
Pginer-WMF moved this task from Needs Triage to Bugs on the ContentTranslation board.

To investigate the issue, I followed the same steps as in the issue description:

  1. Translate the Prime omega function article from English.
  2. Edit the first paragraph by adding a new sentence.
  3. Translate all sections up to the section containing "The computations expanded in Chapter 22.11 of Hardy and"

I observed 23 sections with warnings for unmodified content; they are shown in the tools column.

I have 50 sections in the target article.

The publishing fails. Why?

The publishing step has two validations for MT abuse:

  1. Whether the number of sections with MT abuse is greater than 50. In this case it is 23. (For debugging, this is the output of mw.cx.translation.translationController.translationTracker.sectionsWithMTAbuse().length.) The number 23 confirms that math formula sections were not considered in the MT abuse check.
  2. Unmodified MT content in the translation. This algorithm does not consider the number of sections, only the actual content of the translation. The content of each translated section is tokenized, and the tokens are compared against the raw MT for that section. The percentage of unmodified tokens should be less than 99% (configured by wgContentTranslationUnmodifiedMTThresholdForPublish). For our test article, mw.cx.translation.translationController.translationTracker.getTargetSectionModels() gives 27 sections to consider for this token-based calculation (27 + 23 = 50). The token-based calculation gave mw.cx.translation.translationController.translationTracker.getUnmodifiedMTPercentageInTranslation() = 99.28507596067918, which is greater than 99, so publishing fails.
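The two checks above can be sketched roughly as follows. This is a simplified illustration, not the ContentTranslation code: the function names, the shape of the `sections` and `limits` objects, the whitespace tokenizer, and the choice of translated tokens as the denominator are all assumptions made for the example. Only the two limits (50 sections, 99%) come from the discussion above.

```javascript
// Hypothetical sketch; names and tokenization are assumptions,
// not the actual ContentTranslation implementation.

// Split text into tokens (assumption: plain whitespace tokenization).
function tokenize(text) {
  return text.trim().split(/\s+/).filter(Boolean);
}

// Count translated tokens that also occur in the raw MT output.
// Each MT token can be matched at most once (multiset matching).
function countUnmodifiedTokens(mtText, translatedText) {
  const remaining = new Map();
  for (const token of tokenize(mtText)) {
    remaining.set(token, (remaining.get(token) || 0) + 1);
  }
  let unmodified = 0;
  for (const token of tokenize(translatedText)) {
    const left = remaining.get(token) || 0;
    if (left > 0) {
      remaining.set(token, left - 1);
      unmodified++;
    }
  }
  return unmodified;
}

// Percentage of unmodified MT tokens over all sections.
// Returns 0 when there are no tokens at all.
function unmodifiedMTPercentage(sections) {
  let total = 0;
  let unmodified = 0;
  for (const { mt, translation } of sections) {
    total += tokenize(translation).length;
    unmodified += countUnmodifiedTokens(mt, translation);
  }
  return total === 0 ? 0 : (unmodified / total) * 100;
}

// Check 1: number of sections flagged for MT abuse.
// Check 2: token-based percentage for the whole translation.
// Treating exactly-at-threshold as a failure is an assumption.
function canPublish(sections, sectionsWithMTAbuse, limits) {
  if (sectionsWithMTAbuse > limits.maxSectionsWithMTAbuse) {
    return false;
  }
  return unmodifiedMTPercentage(sections) < limits.unmodifiedMTThreshold;
}
```

With `limits = { maxSectionsWithMTAbuse: 50, unmodifiedMTThreshold: 99 }`, a translation can pass the first check (23 flagged sections) and still be rejected by the second, which matches the behaviour seen in the test article.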

I added a sentence to the 50th section, and mw.cx.translation.translationController.translationTracker.getUnmodifiedMTPercentageInTranslation() is now 98.57904085257549.

I am able to publish after that.

Summary: Math formulas are ignored in the calculation. Unmodified MT content in 27 sections was causing the failure.

(Technical note: mw.cx.translation does not exist by default. For debugging purposes I exposed it in the global scope by editing mw.cx.init.js.)