
MinT for Wikipedia Readers: Large sections fails to translate
Closed, Resolved · Public · 4 Estimated Story Points · BUG REPORT

Assigned To:
Authored By: santhosh
Oct 10 2024, 6:12 AM

Description

Visit https://ig.m.wikipedia.org/wiki/Special:AutomaticTranslation?page=Tokyo&from=en&to=hi&step=translation

Notice that some sections cannot be expanded, and their requests fail with the following error:

image.png (592×1 px, 81 KB)

Since there is no error handling, the loading indicator is shown forever.

Event Timeline

It makes sense to improve error handling. In addition to that, I wonder if there is a broader aspect to consider. Based on the screenshot, the error seems to be produced by some limits in the size of the content requested. If I recall correctly, those limits were introduced in CXServer to avoid issues with external translation services, and they were creating issues when translating large elements such as tables.

If this were the case, I wonder if it would make sense to consider a separate ticket to make the length limitations less strict for services running on Wikimedia infrastructure, such as MinT and Apertium. That would apply to all products using them, not specific to MinT for Wiki Readers.

Since we prioritize user experience, and a larger chunk takes proportionally longer to translate, having cxserver accept larger chunks will not help users. The clients need to send smaller chunks of content in sequential batches.

For example, if a section has 5 paragraphs, send them paragraph by paragraph rather than as the full section. If a section consists of li or similar block tags, send them by block tag. This is how the Community Wishlist team implemented their MT feature. In this particular example of Tokyo, we are sending a references section with 229 reference items in one go. First of all, reference support is already broken. Secondly, references are unrelated units and can be independently translated/adapted, so they can either be skipped for translation or sent as smaller chunks.
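As an illustration, a client-side splitting step could look like the following sketch. The function name, data shape, and threshold here are assumptions for illustration, not the actual ContentTranslation or cxserver client code:

```javascript
// Hypothetical sketch: split a section's block-level children into chunks so
// each translation request stays below a size threshold. Each input block is
// assumed to be { tag, html }; the 10,000-char default mirrors the cxserver
// limit discussed in this task.
function chunkSection(blocks, maxChars = 10000) {
  const chunks = [];
  let current = [];
  let size = 0;
  for (const block of blocks) {
    // Start a new chunk if adding this block would exceed the limit.
    if (size + block.html.length > maxChars && current.length > 0) {
      chunks.push(current);
      current = [];
      size = 0;
    }
    current.push(block);
    size += block.html.length;
  }
  if (current.length > 0) chunks.push(current);
  return chunks;
}
```

Each chunk would then be sent as its own sequential translation request, so no single request carries a whole references section at once.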

Reference resolution in HTML is very hard, since the reference pointer and the reference content will be in two different locations, and one of those locations is yet to be processed/parsed/translated.

It may be too late, but this special page is, in a sense, a read-only version of the second column of CX, where we have solved all of these issues over the past several years.

Since we prioritize user experience, and a larger chunk takes proportionally longer to translate, having cxserver accept larger chunks will not help users. The clients need to send smaller chunks of content in sequential batches.

We definitely want clients to make requests in the best possible way. I think that this ticket is a valid one.

I also remember that some pieces of content cause problems in existing tools such as Content Translation in T216583: [wmf.18] Large table cannot be translated - 'Automatic translation failed' is displayed. Quoting the final summary below, since it describes some of the current limits:

Content Translation uses a service we call cxserver (which is developed mainly to serve Content Translation) to translate and adapt content between languages. It splits the source article into pieces (sections) that you can translate individually. That means those 17k of wikitext are not translated as a whole.
Furthermore, section content is sent to cxserver as HTML, not as wikitext, which means it is bigger than the wikitext because of the accompanying HTML markup. Sometimes that difference in length is significant.

When we send the HTML of the section we want to translate, in your case from Russian to Ukrainian, cxserver can reject the request as too big for two similar but different reasons:

  1. The first line of checks is that the content is not bigger than 500,000 bytes (0.5 MB)
  2. If the content is smaller than 0.5 MB but the number of characters is greater than 10,000, we don't even try sending it to engines like Yandex; cxserver rejects that request as well

In case #1, we're stuck with "Start with an empty paragraph", as the "Copy original content" option also requires content smaller than 0.5 MB.
Translating the table under the section header Актёр, which you added as an example, falls under case #2: the content is rejected because it exceeds 10,000 characters (it is 40,591 characters), but "Copy original content" works as a fallback option.

I don't have all the details fresh, since this was reported in 2019, and I got a bit lost in the technical details of why it was complex to reduce the request size in that case. However, I think there may be an opportunity to make the translation size limit less strict for services where we don't have hard constraints. I think it is possible to increase the maximum allowed request size (to avoid some corner cases) while still encouraging requests to be minimal, in order to provide the best possible user experience: fast translation for most content, while still getting a translation for cases like the table above.
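For reference, the two checks quoted above could be sketched as follows. The constants match the quoted summary, but the function name and shape are illustrative assumptions, not cxserver's actual code:

```javascript
// Illustrative sketch of the two size limits described in the quoted summary.
const MAX_BYTES = 500000; // hard limit for any request (0.5 MB)
const MAX_CHARS = 10000;  // character limit before sending to external engines

function checkContentSize(html, isExternalEngine) {
  // Check #1: hard byte-size limit on the request body.
  const bytes = Buffer.byteLength(html, 'utf8');
  if (bytes > MAX_BYTES) {
    return { ok: false, reason: 'content exceeds 0.5 MB' };
  }
  // Check #2: character limit applied only for external engines like Yandex.
  if (isExternalEngine && html.length > MAX_CHARS) {
    return { ok: false, reason: 'content exceeds 10,000 characters' };
  }
  return { ok: true };
}
```

Under this model, relaxing the second check for services on Wikimedia infrastructure (such as MinT) would be a one-line policy change, while the hard byte limit could stay in place.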

Reference resolution in HTML is very hard, since the reference pointer and the reference content will be in two different locations, and one of those locations is yet to be processed/parsed/translated.

It may be too late, but this special page is, in a sense, a read-only version of the second column of CX, where we have solved all of these issues over the past several years.

This is not clear to me. Rendered references don't seem very different from other pieces of content with text and external links, so it may be useful to provide more detail in T376860, since it seems more specific to reference support.

If we look at the [1] citation and the corresponding reference we have the following:

Screenshot 2024-10-30 at 16.09.30.png (96×259 px, 17 KB)
Screenshot 2024-10-30 at 16.09.03.png (112×488 px, 21 KB)
  • A link with "[1]" as text and a link target pointing to "#cite_note-1", which refers to the reference below
  • A text paragraph with two links to external sites

Using the HTML translation of the MinT test instance and pasting the contents from the reference, MinT seems able to translate the contents:

translate.wmcloud.org_html(Wiki Tablet).png (768×1 px, 90 KB)

The only apparent issue is that a link gets lost, but that seems more of a general issue with the algorithm that re-applies links (shared with Content Translation and reported in tickets such as T314127).

For a reader context, it seems more straightforward to translate the rendered contents of a reference than to apply the whole adaptation process. Template adaptation makes sense in the editing context of Content Translation, since the final contents can only use templates defined in the target wiki. However, in a reader context, there is no problem using the source templates with translated content. Doing template adaptation for readers seems more problematic, since the template may not exist in the target language, references may be inside another template (which is unsupported, T209266), or they may fall into the cases listed under T200786: Better support for References in Content Translation (epic).

Triage meeting notes: Currently the whole section is sent. It is possible to split section requests when they exceed the limit. However, that would increase the number of requests, as reported in T378326. As an initial measure, we could try to remove the artificial limit that we don't need to apply to MinT, then check the resulting performance and decide whether splitting sections (and generating more requests) is worth it.

Pginer-WMF triaged this task as Medium priority.Oct 31 2024, 8:54 AM
Pginer-WMF moved this task from Backlog to Product integration on the MinT board.

24 out of the 48 requests mentioned in T378326: MinT for Wikipedia Readers: Reduce parallel MT requests on page load are pre-flight (OPTIONS) requests.

I'd still prefer to do this:

Triage meeting notes: Currently the whole section is sent. It is possible to split section requests when they exceed the limit.

I think if we limit the number of parallel requests and instead send 1 or 2 requests at a time, it might not result in a large number of requests. I find responses from MinT for smaller sections to be much faster.
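Capping parallelism can be sketched as a small worker-pool helper that runs at most a fixed number of translation requests at a time, starting the next one as each finishes. The helper name and shape below are illustrative assumptions, not the extension's actual code:

```javascript
// Hypothetical sketch: run async tasks with at most `limit` in flight at once.
// `tasks` is an array of zero-argument functions returning promises.
async function runWithConcurrency(tasks, limit = 2) {
  const results = new Array(tasks.length);
  let next = 0;
  async function worker() {
    // Each worker pulls the next unstarted task until none remain.
    while (next < tasks.length) {
      const i = next++;
      results[i] = await tasks[i]();
    }
  }
  const workers = Array.from(
    { length: Math.min(limit, tasks.length) },
    () => worker()
  );
  await Promise.all(workers);
  return results; // results stay in the original task order
}
```

With `limit = 2`, a page with many small sections still issues the same total number of requests, but never floods cxserver, and results arrive in a predictable order for rendering.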

Change #1127943 had a related patch set uploaded (by Nik Gkountas; author: Nik Gkountas):

[mediawiki/extensions/ContentTranslation@master] AX: Refactor section-by-section translation

https://gerrit.wikimedia.org/r/1127943

abi_ changed the task status from Open to In Progress.Mar 17 2025, 11:31 AM
abi_ set the point value for this task to 4.Mar 19 2025, 4:25 AM

Some notes from my tests

  1. Can we add some margin between the error messages?
    image.png (1×567 px, 155 KB)
  2. I'm noticing that later subsections load before the lead subsection inside a section. This UX is not great: a section is not really readable unless the lead subsection loads first. While we translate subsections one by one, would it be easy to display the initial subsections first and then the subsequent ones? OK to do this in a follow-up patch.
  3. As a user, it's not immediately understandable when it says "Translation failed" yet I see some content appear. Can we make the message a little clearer that a subsection inside the section failed to translate?
  4. If there are multiple failed subsections, clicking reload on a subsection in the middle does not show the loading symbol. Checked on Firefox.
  5. On article: Special:AutomaticTranslation?page=Moon&from=en&to=hi&step=translation. Potentially caused by 81672461317b3c317e73ef9ab8ce7acd33dc1a0c
    1. Expand बाहरी लिंक section
    2. I notice the following error in the console: Error while translating section 'External links' and subsection with index 0 TypeError: dataCX.sourceTitle is undefined --- adaptLinks useSectionTranslate.js:41
  6. When the lead section is loading and the last subsection has finished loading, the loading symbol disappears. As a user, that gives me the impression that there is nothing more in the article. Can we show the loading indicator until the section headings are visible?
    image.png (1×596 px, 147 KB)
  7. Point of discussion: One of the drawbacks of the approach we are taking here is that we end up firing a lot of requests to cxserver, each with some minor overhead. I wonder whether it is desirable to skip subsection-level translation when the number of subsections is low or the content size is small.
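That last discussion point could be decided with a small heuristic: only split a section into per-subsection requests when splitting is likely to pay off. The function name and thresholds below are illustrative assumptions, not settled values:

```javascript
// Hypothetical heuristic: split by subsection only when the section is large
// enough that per-subsection requests are worth the per-request overhead.
// `subsections` is an array of subsection HTML strings.
function shouldSplitBySubsection(
  subsections,
  minSubsections = 3,
  minTotalChars = 5000
) {
  const totalChars = subsections.reduce((sum, s) => sum + s.length, 0);
  return subsections.length >= minSubsections && totalChars >= minTotalChars;
}
```

Small sections would then go to cxserver as a single request, while long ones (like the Tokyo references section) would be split.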

Submitted 1129174: AX: ViewTranslationPage: Avoid adapting links without sourceTitle | https://gerrit.wikimedia.org/r/c/mediawiki/extensions/ContentTranslation/+/1129174 to fix point 5 above.

Change #1127943 merged by jenkins-bot:

[mediawiki/extensions/ContentTranslation@master] AX: Refactor section-by-section translation

https://gerrit.wikimedia.org/r/1127943

Tested; all sections load as expected.

Exception:

  • The reference section still has that error.

Screenshot 2025-03-25 at 09.21.37.png (1×3 px, 449 KB)
Screenshot 2025-03-25 at 09.25.39.png (1×3 px, 438 KB)


Thanks for catching that. I noticed that while testing T376860: MinT for Wikipedia Readers: All references are missing. I'd recommend we scope this work to that task. I'll leave a comment there.


Since we are tracking that as part of T376860: MinT for Wikipedia Readers: All references are missing, marking this as done.