Math formula should not be checked for unmodified content issues
Closed, Resolved (Public)

Description

Elements that users are not expected to edit are excluded from the checks for MT abuse. This is the case for block images, block templates, tables, lists, section headings and, more recently, reference lists (T219468).

Based on this user report it seems that math formulas are also considered "unmodified content". Math formulas should not be counted for this purpose, so that they can be published unchanged in the translation without counting towards the publishing limits or causing the article to be added to a tracking category.

Event Timeline

In 16c7f13cadb246c1e5ff2dd2467c3b057d8ff0ac, I added math formulas to the list of nodes excluded from MT validation. However, math formulas are wrapped in an additional <p>, so they are not excluded right now; they should be excluded once the additional wrapper is removed.

Thanks for the details, @Petar.petkovic. Do we need a separate ticket for removing the additional <p> wrapping for math formulas?

No, this one is sufficient for the purpose.

Pginer-WMF triaged this task as Medium priority. Jun 7 2019, 2:37 PM

We cannot remove the wrapping tags for math, since that would cause unexpected rendering and editing behavior. It is better to keep the DOM as it is. I propose excluding <dl> tags from validation. Math is nested as dl -> dd -> p.

Change 528312 had a related patch set uploaded (by Santhosh; owner: Santhosh):
[mediawiki/extensions/ContentTranslation@master] Exclude definitionList from MT usage validation

https://gerrit.wikimedia.org/r/528312

Change 528312 merged by jenkins-bot:
[mediawiki/extensions/ContentTranslation@master] Exclude definitionList from MT usage validation

https://gerrit.wikimedia.org/r/528312

Here is an example where the MT calculation does not happen. From es.wiki

image.png (347×1 px, 38 KB)

Here is an example where the MT calculation happens. From en.wiki

image.png (347×1 px, 82 KB)

Difference? es.wiki uses a block-level template, while en.wiki has math inside a paragraph. Our code identifies the section as a paragraph on en.wiki and fails to detect it as a block-level math formula.

Fix? I think we will have to look deeper into the paragraph and check whether it has only one child and whether that child is ignorable.
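To make the proposed fix concrete, here is a minimal Python sketch of that check. It is not the ContentTranslation code: the `Node` class, the `EXCLUDED_TYPES` set and the type names in it are hypothetical stand-ins for the real list of excluded nodes.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# Hypothetical type names; the real exclusion list lives in ContentTranslation.
EXCLUDED_TYPES = {"math", "definitionList", "table", "list", "heading"}


@dataclass
class Node:
    type: str
    children: list[Node] = field(default_factory=list)


def is_excluded(node: Node) -> bool:
    """Return True if the node, or the single node nested inside it, is excluded."""
    if node.type in EXCLUDED_TYPES:
        return True
    # Only follow a wrapper that has exactly one child.
    if len(node.children) == 1:
        return is_excluded(node.children[0])
    return False


# A paragraph that wraps nothing but a formula is treated like the formula itself.
wrapped_math = Node("paragraph", [Node("math")])
# A paragraph with no children has nothing to look into and is not excluded.
empty_paragraph = Node("paragraph")
```

With these assumptions, `is_excluded(wrapped_math)` is true and `is_excluded(empty_paragraph)` is false.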

Change 529727 had a related patch set uploaded (by Santhosh; owner: Santhosh):
[mediawiki/extensions/ContentTranslation@master] Recursively check the section for nodes excluded from MT validation

https://gerrit.wikimedia.org/r/529727

Screenshot after the above patch - from en:Grüneisen_parameter

image.png (237×1 px, 62 KB)

Change 529727 merged by jenkins-bot:
[mediawiki/extensions/ContentTranslation@master] Recursively check the section for nodes excluded from MT validation

https://gerrit.wikimedia.org/r/529727

For some reason these two formulas are still being counted for MT abuse.

image.png (612×1 px, 126 KB)

Good catch. I copied the problematic formulas from the article into a test page to make it easier for developers to investigate.

The math formulas in this screenshot have an additional dot at the end, which causes the problem. T231835 claims the same problem exists without the dot at the end, but I was not able to reproduce it.

QA NOTE: try T231835 in production and also the one with the dot at the end

@Pginer-WMF the one with the dot at the end still has the issue (it counts for MT abuse). Should we fix it on this ticket or open a new one?

I created a separate ticket: T232718: Inline templates should not bypass the minimum length consideration for unmodified content issues
I tried to describe the more general issue as I understood it (not just specific to math formulas followed by a dot). @Jpita, @Petar.petkovic feel free to add further details there if I missed something. Thanks!

The same problem can happen without math formulas, so it's good that your task tries to be generic.

What I want to explain is how we ended up in this situation. We have a list of excluded nodes, which don't go through MT abuse validation. Math nodes are on that list. Once it is determined that a node isn't on the blacklist, we proceed with validation, but we stop early if the word count is below 10.

For the step where we check whether a node is on the blacklist: since math formulas can be nested in other elements, we look deeper as long as the nesting is linear, that is, each node has only one child. So a paragraph with a single math child is matched for exclusion. A paragraph whose children are math and text is not matched, because our traversal stops when there are siblings.
So math with an additional dot goes on to validation, and the next line of defense is the check on the number of tokens (usually words). Due to how text is obtained from such structures, we end up with every character in the math formula being counted as one token. If that number is above 10, an MT abuse warning is registered. Some syntax structures also add to the number of tokens, so we usually see the warning. If the math formula were really short, MT validation would be skipped, which could confuse us even more.
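To illustrate the second step, here is a rough Python sketch of how the token check behaves once a paragraph has failed the exclusion check. The `(kind, text)` representation and the function names are hypothetical, and counting exactly one token per formula character is a simplification of what is described above. Whether a count of exactly 10 passes is also an assumption, since the thread only says "below 10" and "above 10".

```python
MIN_TOKENS = 10  # threshold mentioned in the comment above


def count_tokens(parts):
    """Count tokens in a list of (kind, text) pairs, where kind is "math" or "text"."""
    total = 0
    for kind, text in parts:
        if kind == "math":
            # Simplified model of the reported behaviour:
            # every character of the formula counts as one token.
            total += len(text)
        else:
            # Plain text is counted in whitespace-separated words.
            total += len(text.split())
    return total


def goes_through_mt_validation(parts):
    """Validation is skipped early when the token count is below the threshold."""
    return count_tokens(parts) >= MIN_TOKENS


# A formula followed by a dot, as in the screenshot.
long_formula = [("math", r"\frac{a}{b}+c^{2}"), ("text", ".")]
# A very short formula followed by a dot.
short_formula = [("math", "x=1"), ("text", ".")]
```

Under these assumptions `long_formula` reaches validation and can trigger the warning, while `short_formula` stays under the threshold and is skipped, which matches the inconsistent behaviour described above.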

I guess your reasoning is that we should split the math node and the dot and treat them separately, so that the math is ignored and the dot goes through text validation.
This idea is worth exploring.
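As a starting point for that exploration, here is a small Python sketch of the split. It is only an illustration of the idea, not a proposed patch: the `(kind, text)` pairs, `EXCLUDED_KINDS` and the function names are hypothetical, and the threshold of 10 is taken from the word-count check described earlier.

```python
EXCLUDED_KINDS = {"math"}  # hypothetical; stands in for the list of excluded nodes
MIN_WORDS = 10


def validatable_text(parts):
    """Drop excluded children and keep only the text that should be validated."""
    return " ".join(text for kind, text in parts if kind not in EXCLUDED_KINDS)


def needs_validation_after_split(parts):
    """Apply the minimum-length check to the remaining text only."""
    return len(validatable_text(parts).split()) >= MIN_WORDS


# Formula plus trailing dot: only "." is left, so validation is skipped.
formula_with_dot = [("math", r"\frac{a}{b}+c^{2}"), ("text", ".")]
# Formula inside a longer unmodified sentence: the sentence is still validated.
formula_in_sentence = [
    ("text", "one two three four five"),
    ("math", "x=1"),
    ("text", "six seven eight nine ten"),
]
```

With this split the formula never contributes to the count, so `formula_with_dot` is skipped and `formula_in_sentence` is still checked on its ten words of text.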