Threshold to prevent publishing needs more precision
Closed, Resolved (Public)

Description

The measurement of unmodified contents for the whole document (T190283) is not currently calculating the total percentage of unmodified content for the document accurately. Adding one paragraph to the translation and modifying a few words is sometimes allowed even when the limit is set to a strict threshold as mentioned in T219851#5139596.

The threshold should consider the initial content for all the paragraphs and count all the modifications made to obtain the percentage of modification for the whole document. Some paragraphs such as the list of references will be skipped from the calculation, as they are currently skipped for the calculation of the limits on a paragraph basis.

Some examples:

  • A section title with 2 words that remain unmodified, followed by a 99-word paragraph that is rewritten completely, will have about 2% unmodified content for the whole document (2 words out of 101 total words remain unchanged).
  • A document with 2 paragraphs, one with 30 words where 10 remain unmodified and another with 70 words where 5 remain unmodified, will have a total of 15% unmodified content (15 words out of 100 total words).
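
The document-level calculation described above can be sketched as follows. This is only an illustration, not the extension's actual code: the function name `unmodifiedPercentage` and the `{ total, unmodified }` data shape are assumptions made for this example, and paragraphs that are skipped from the check (such as the list of references) are assumed to be filtered out beforehand.

```javascript
// Hypothetical helper, not part of ContentTranslation.
// Each entry holds the word counts for one paragraph:
//   total      - words in the initial (machine-translated) content
//   unmodified - words that the translator left unchanged
function unmodifiedPercentage( paragraphs ) {
	let total = 0;
	let unmodified = 0;
	for ( const paragraph of paragraphs ) {
		total += paragraph.total;
		unmodified += paragraph.unmodified;
	}
	// An empty document has nothing to measure; report 0 instead of dividing by zero.
	return total === 0 ? 0 : ( unmodified * 100 ) / total;
}

// The two-paragraph example: 10 of 30 and 5 of 70 words remain unmodified.
const twoParagraphs = [
	{ total: 30, unmodified: 10 },
	{ total: 70, unmodified: 5 }
];
console.log( unmodifiedPercentage( twoParagraphs ) ); // 15
```

Summing the word counts before dividing weights each paragraph by its length, so a short, untouched paragraph cannot dominate the result the way it would if per-paragraph percentages were averaged.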

Event Timeline

Change 506705 had a related patch set uploaded (by Petar.petkovic; owner: Petar.petkovic):
[mediawiki/extensions/ContentTranslation@master] Change the way we calculate total unmodified MT

https://gerrit.wikimedia.org/r/506705

Change 506705 merged by jenkins-bot:
[mediawiki/extensions/ContentTranslation@master] Change the way we calculate total unmodified MT

https://gerrit.wikimedia.org/r/506705

Change 506971 had a related patch set uploaded (by KartikMistry; owner: Petar.petkovic):
[mediawiki/extensions/ContentTranslation@wmf/1.34.0-wmf.1] Change the way we calculate total unmodified MT

https://gerrit.wikimedia.org/r/506971

Change 506971 merged by jenkins-bot:
[mediawiki/extensions/ContentTranslation@wmf/1.34.0-wmf.1] Change the way we calculate total unmodified MT

https://gerrit.wikimedia.org/r/506971

Mentioned in SAL (#wikimedia-operations) [2019-04-29T11:56:24Z] <kartik@deploy1001> Synchronized php-1.34.0-wmf.1/extensions/ContentTranslation: SWAT: [[gerrit|506971|Change the way we calculate total unmodified MT (T221930)]] (duration: 00m 56s)

Checked in testwiki wmf.4

Some issues
(1)

The threshold should consider the initial content for all the paragraphs and count all the modifications made to obtain the percentage of modification for the whole document.

A document with 2 paragraphs, one with 30 words where 10 remain unmodified and another with 70 words where 5 remain unmodified, will have a total of 15% unmodified content (15 words out of 100 total words).

The information about how much of the overall content is modified is not communicated to users. A user sees the percentage of unmodified content only per paragraph.
It should be taken into account how much of the overall content has been translated. I noticed that short articles much more often display incorrect behavior in the count of unmodified content upon publishing.
Another confusing point is the progress bar (on 'In progress').

The specific example: en:RK Pet to español.
a) The first two paragraphs are machine-translated. The first paragraph has 18 words and the second has 41, so 59 in total.
b) The 18-word paragraph is completely modified, and the second (41-word) paragraph has only two words modified.
Now we have 18+2=20 words modified out of 59. That gives 34% modified and 66% unmodified. I cannot publish it; there is a Console error, as mentioned in T219851:

TypeError: string is undefined
tokenise
getUnmodifiedMTPercentageInTranslation

(2) Per paragraph the percentage count is really accurate.

(3) The progress bar (on 'In progress') often shows a lower MT percentage than the issue card. Is the calculation different there? Does it calculate the modified percentage against the whole content, even though the rest of the content has not been translated?
For example, the article has 100 words and 2 paragraphs: one was MT translated and one was not touched. So, the issue card will show 100% for the MT translated paragraph and the progress bar will show 50%?

In my experience, at least in production, I was seeing the percentages in both cases, for the warning (at paragraph level) and for the error (the whole document):

May-08-2019 11-52-27.gif (362×756 px, 2 MB)

Is the card not appearing in your case?

Since it was not completely clear which message refers to the paragraph and which to the document, I proposed clarifying it in T222779, but the cards are expected to appear showing the percentages.

I made the following test: start a translation, add 2 paragraphs with MT, delete one word from each paragraph, and go back to the dashboard. The tooltip shows 0% MT, which should not be the case:

Screenshot 2019-05-09 at 12.40.00.png (451×932 px, 94 KB)

I repeated the test without deleting any content, and it correctly showed 100% MT.

@Petar.petkovic do you think this is something specific to the progress bar or a general issue with the percentage calculation? If it is the former we can create a separate ticket, but I wanted to confirm first.

This is specific to how progress bar calculations are made.

Before the patch for this ticket, the MT percentage for the whole document was calculated by dividing:
sections_with_any_modification / total_number_of_translated_sections.
Even deleting whitespace counted as a modification. We changed the approach to take all sections that are verified for MT abuse (excluding section titles, the reference list, tables, (un)ordered lists, images and block templates) and calculate the number of unmodified words.

For the progress bar calculation, the number of sections with any modification is still used when displaying the MT percentage.
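
To make the difference between the two approaches concrete, here is a minimal sketch. It is not the extension's code: the function names, the `{ total, unmodified, touched }` data shape and the 40-word paragraph sizes are all assumptions for illustration. It also assumes that the MT percentage shown by the progress bar is the share of sections without any modification, which is consistent with the 0% reported in the test above.

```javascript
// Hypothetical data shape: one entry per section that is checked for MT abuse.
//   total      - words in the initial machine translation
//   unmodified - words left unchanged
//   touched    - true if the section has any modification at all
const sections = [
	{ total: 40, unmodified: 39, touched: true }, // one word deleted
	{ total: 40, unmodified: 39, touched: true } // one word deleted
];

// Section-based approach: a section stops counting as MT as soon as
// anything in it changes.
function mtPercentageBySections( list ) {
	if ( list.length === 0 ) {
		return 0;
	}
	const untouched = list.filter( ( section ) => !section.touched ).length;
	return ( untouched * 100 ) / list.length;
}

// Word-based approach: count unmodified words across all checked sections.
function mtPercentageByWords( list ) {
	const total = list.reduce( ( sum, section ) => sum + section.total, 0 );
	const unmodified = list.reduce( ( sum, section ) => sum + section.unmodified, 0 );
	return total === 0 ? 0 : ( unmodified * 100 ) / total;
}

console.log( mtPercentageBySections( sections ) ); // 0
console.log( mtPercentageByWords( sections ) ); // 97.5
```

With one word removed from each paragraph, the section-based figure drops to 0% while the word-based figure stays close to 100%, which is the gap between the dashboard tooltip and the issue card described in this thread.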

Ok. Thanks for the clarification. I created a separate ticket: T222892: Progress bar in the translation dashboard not representing accurately the percentage of machine translation
Feel free to add further details there if you see anything missing.

@Pginer-WMF - thanks for creating the phab task(s)!
I re-checked in wmf.5 - the card correctly shows the percentage of modified text in different paragraphs.

Except for two edge cases (I'll probably follow up with a phab task)

  • A paragraph is 100% MT translated, but some text is deleted - the text is considered modified, and the MT warning is not triggered.
  • A paragraph is 100% MT translated, and some text is cut and then pasted back - the text is considered modified and does not count toward the MT threshold.

I am closing the task as Resolved for now, and I have marked it on my QA task list as "needs more testing".