In ContentTranslation, some references are still going missing...
Closed, Resolved (Public)

Description

We at Wiki Project Med are working to relaunch our translation efforts. The problem of ContentTranslation dropping references, however, still persists.

Here was the source content being translated and here was the target content that resulted. https://ja.wikipedia.org/w/index.php?title=%E8%A1%80%E6%B8%85%E7%97%85&oldid=83393053 Here is the edit restoring all the missing / dropped references. https://ja.wikipedia.org/w/index.php?title=%E8%A1%80%E6%B8%85%E7%97%85&type=revision&diff=83393098&oldid=83393085

I previously believed that the issue occurred when the full definition of the reference was inside the infobox, so we had a bot move the full definitions out of the infobox for all our leads that are ready to translate. Yet the problem persists.


A more precise description, added by @Amire80: I've tried to translate https://en.wikipedia.org/w/index.php?oldid=1022115153 to Japanese. Three footnotes appear at the bottom, albeit with some error messages. However, some footnote numbers don't appear in the published translation, for example the "Stat2020" reference at the beginning of the first paragraph. I do see it while I'm translating, but not in the published page.


Reproducible test cases

This example page contains 3 instances of the same reference. When it is translated from English to Japanese (using Google translate for each paragraph) the published result only contains 2 instances of the reference (i.e., one got lost).

The expected result would be for the three instances to be present in the published article.

Below you can see the source, translation and target screenshots. Notice that the reference next to the "kidney failure" link is only missing in the published content (see the red zone highlighted):

[Screenshots: source example content; source and target in Content Translation; published content]

(The published page also shows some reference errors because of the known issue of duplication of references by name: T203772: CX2: Improve support for references that are reused in multiple places)

Another test case to verify the solution against:

Here is a smaller amount of content where the refs do not come through

4 refs in the starting text https://en.wikipedia.org/wiki/User:Doc_James/CTX3

2 refs in the end text https://ja.wikipedia.org/wiki/%E5%88%A9%E7%94%A8%E8%80%85:Doc_James/CTX3
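
A comparison like the one in these test cases (N references in the source, fewer in the published text) could be checked with a small script. The following is only a minimal sketch, not part of ContentTranslation or cxserver: the function names `count_named_refs` and `missing_refs` are hypothetical, the regular expression is a simplification that only looks at the `name` attribute, and the sample wikitext is made up for illustration.

```python
import re
from collections import Counter

# Matches <ref name=X/>, <ref name="X" /> and the opening tag <ref name=X>.
# Simplified pattern (assumption): other attributes such as group= are ignored.
REF_TAG = re.compile(
    r'<ref\s+name\s*=\s*(?:"([^"]+)"|\'([^\']+)\'|([^\s/>]+))\s*(/?)>',
    re.IGNORECASE,
)


def count_named_refs(wikitext: str) -> Counter:
    """Count how many times each named reference is used in the wikitext."""
    counts = Counter()
    for match in REF_TAG.finditer(wikitext):
        name = match.group(1) or match.group(2) or match.group(3)
        counts[name] += 1
    return counts


def missing_refs(source: str, target: str) -> dict:
    """Return {name: number of uses lost} for refs used less often in target."""
    src = count_named_refs(source)
    tgt = count_named_refs(target)
    return {name: n - tgt[name] for name, n in src.items() if tgt[name] < n}


# Made-up sample: three uses of one reference in the source, two in the target.
source = (
    'A<ref name=Stat2020>{{Cite journal|title=T}}</ref> '
    'B<ref name=Stat2020/> C<ref name="Stat2020" />'
)
target = (
    'A<ref name=Stat2020>{{Cite journal|title=T}}</ref> '
    'B C<ref name="Stat2020" />'
)
print(missing_refs(source, target))
```

The script counts uses per reference name, so it reports which named reference lost instances rather than only the total.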

Event Timeline

santhosh renamed this task from "In CTX2 references are still going missing..." to "In ContentTranslation references are still going missing..." (May 10 2021, 4:44 AM)
santhosh updated the task description. (Show Details)
santhosh added a subscriber: santhosh.

Changed the title slightly, since CTX and CTX2 are not acronyms used for ContentTranslation

Here we have another article and language in which the translation tool struggles

https://or.wikipedia.org/w/index.php?title=%E0%AC%AC%E0%AD%8D%E0%AD%9F%E0%AC%AC%E0%AC%B9%E0%AC%BE%E0%AC%B0%E0%AC%95%E0%AC%BE%E0%AC%B0%E0%AD%80:Doc_James/Tricuspid_valve_stenosis&oldid=428053

A bunch of instances of the reference "<ref name=Stat2020/>" have gone missing, and some of them have been misformatted into "<ref name="Stat2020">{{Cite journal|last=Golamari|first=R|last2=Bhattacharya|first2=PT|date=January 2020|title=Tricuspid Stenosis|pmid=29763166}}<cite class="citation journal cs1" data-ve-ignore="true" id="CITEREFGolamariBhattacharya2020">Golamari, R; Bhattacharya, PT (January 2020). "Tricuspid Stenosis". [[PMID (ପରିଚାୟକ)|PMID]]&nbsp;[//pubmed.ncbi.nlm.nih.gov/29763166 29763166].</cite><span data-ve-ignore="true"> </span><span class="cs1-hidden-error error citation-comment" data-ve-ignore="true">Cite journal requires <code class="cs1-code">&#x7C;journal=</code> ([[ସହଯୋଗ:CS1 errors|help]])</span></ref>"

These are the manual changes I made to fix the issue https://or.wikipedia.org/w/index.php?title=%E0%AC%9F%E0%AD%8D%E0%AC%B0%E0%AC%BE%E0%AC%87%E0%AC%95%E0%AC%B8%E0%AC%AA%E0%AC%BF%E0%AC%A1_%E0%AC%95%E0%AC%AA%E0%AC%BE%E0%AC%9F%E0%AC%BF%E0%AC%95%E0%AC%BE_%E0%AC%B8%E0%AD%8D%E0%AC%9F%E0%AD%87%E0%AC%A8%E0%AD%8B%E0%AC%B8%E0%AC%BF%E0%AC%B8&type=revision&diff=428054&oldid=428020

We at Wiki Project Med will try to build a workaround. Basically, we will expand all the references in a userspace copy of the content to be translated. https://en.wikipedia.org/w/index.php?title=User:Mr._Ibrahem/Tropical_sprue2&diff=1023306255&oldid=1023305378&diffmode=source

People will translate using Content Translation.

And then we will have a bot shrink all the references once the translation is live. https://or.wikipedia.org/w/index.php?title=%E0%AC%AC%E0%AD%8D%E0%AD%9F%E0%AC%AC%E0%AC%B9%E0%AC%BE%E0%AC%B0%E0%AC%95%E0%AC%BE%E0%AC%B0%E0%AD%80:Doc_James/Tropical_sprue3&diff=428092&oldid=428088

This appears to solve the missing references issues.
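
The "expand all the references" step could look roughly like the sketch below. This is not the actual Wiki Project Med bot; `expand_named_refs` is a hypothetical name, and the regular expressions assume reference names without spaces and no nested `<ref>` tags.

```python
import re

# Full definition: <ref name=X>...</ref> (assumption: names contain no spaces).
FULL_REF = re.compile(
    r'<ref\s+name\s*=\s*["\']?([^"\'\s/>]+)["\']?\s*>(.*?)</ref>',
    re.IGNORECASE | re.DOTALL,
)
# Reuse by name: <ref name=X/> or <ref name="X" />.
SHORT_REF = re.compile(
    r'<ref\s+name\s*=\s*["\']?([^"\'\s/>]+)["\']?\s*/>',
    re.IGNORECASE,
)


def expand_named_refs(wikitext: str) -> str:
    """Replace every <ref name=X/> with the full definition of X, if known."""
    definitions = {}
    for match in FULL_REF.finditer(wikitext):
        # Keep the first full definition found for each name.
        definitions.setdefault(match.group(1), match.group(0))

    def replace(match):
        # Leave the short form untouched when no definition exists on the page.
        return definitions.get(match.group(1), match.group(0))

    return SHORT_REF.sub(replace, wikitext)


# Made-up sample: one defined reference reused once, one undefined reference.
text = (
    'X<ref name=Stat2020>{{Cite journal|title=T}}</ref> '
    'Y<ref name=Stat2020/> Z<ref name="Other"/>'
)
print(expand_named_refs(text))
```

After expansion, every use of a defined reference carries its full definition, so the translation tool never has to resolve a reference by name.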

Amire80 renamed this task from "In ContentTranslation references are still going missing..." to "In ContentTranslation, some references are still going missing..." (May 15 2021, 6:15 PM)
Amire80 updated the task description. (Show Details)
Amire80 added a subscriber: Amire80.

I tried to capture the problematic content in this test page. However, the published result includes all the references from the test page.

Does anyone have any suggestion about which content to include in the test page so that we can have a minimal reproducible test case?

Although I could not reproduce the main issue I noticed that:

  • The published page for the example has references with red errors, because references by name get duplicated. This is a known issue captured in T203772. We got details from the Editing team on how this can be solved, but @santhosh can confirm whether we have everything that is needed or whether further details are required.
  • One reference (Stat2020) in the original article had a mandatory parameter missing (the journal title). Content Translation shows a warning about it. This warning was motivated by community requests complaining of malformed templates when parameters could not be adapted (although in this case the parameter is missing from the beginning, not lost in translation). I wonder if the issue persists when that parameter is filled in, either in the translation tool or in the source article. See images below:
[Screenshots: original reference; warning in Content Translation]

Here is an example

Source: https://en.wikipedia.org/wiki/User:Doc_James/CTX2

Results: https://ja.wikipedia.org/w/index.php?title=%E5%88%A9%E7%94%A8%E8%80%85:Doc_James/CTX2&action=edit

Not a single reference of the format "<ref name=ABC/>" comes through as such.

The full references come through. And the "<ref name=ABC/>" references occasionally come through after having a bunch of error text added to them (such as your example).

Here is a smaller amount of content where the refs do not come through

4 refs in the starting text https://en.wikipedia.org/wiki/User:Doc_James/CTX3

2 refs in the end text https://ja.wikipedia.org/wiki/%E5%88%A9%E7%94%A8%E8%80%85:Doc_James/CTX3

I found a reproducible test case. This example page contains 3 instances of the same reference. When it is translated from English to Japanese (using Google translate for each paragraph) the published result only contains 2 instances of the reference (i.e., one got lost).

The expected result would be for the three instances to be present in the published article.

Below you can see the source, translation and target screenshots. Notice that the reference next to the "kidney failure" link is only missing in the published content (see the red zone highlighted):

[Screenshots: source example content; source and target in Content Translation; published content]

(The published page also shows some reference errors because of the known issue of duplication of references by name: T203772: CX2: Improve support for references that are reused in multiple places)

Thanks! This is super useful. I'll add it in the description in addition to the one I reproduced, since sometimes apparently similar issues with references are caused by different underlying issues.

Pginer-WMF raised the priority of this task from Medium to High. (May 28 2021, 9:17 AM)

Change 701349 had a related patch set uploaded (by Santhosh; author: Santhosh):

[mediawiki/services/cxserver@master] Reference adaptaton: Support named references

https://gerrit.wikimedia.org/r/701349

Change 701349 merged by jenkins-bot:

[mediawiki/services/cxserver@master] Reference adaptaton: Support named references

https://gerrit.wikimedia.org/r/701349

Change 704659 had a related patch set uploaded (by KartikMistry; author: KartikMistry):

[operations/deployment-charts@master] Update cxserver to 2021-07-14-124232-production

https://gerrit.wikimedia.org/r/704659

Change 704659 merged by jenkins-bot:

[operations/deployment-charts@master] Update cxserver to 2021-07-14-124232-production

https://gerrit.wikimedia.org/r/704659

Mentioned in SAL (#wikimedia-operations) [2021-07-15T05:50:09Z] <kart_> Updated cxserver to 2021-07-14-124232-production (T282369, T284450)

Jpita added a subscriber: Jpita.

I now see all the references from @Doc_James example.

Can you confirm?

I made another test and it seems to be working. This is the published page showing the three instances of the reference, as expected:

[Screenshot: published page on ja.wikipedia.org showing the three instances of the reference]

(the red text at the bottom is because of a separate issue: T203772: CX2: Improve support for references that are reused in multiple places)

Yah, so it is no longer dropping the references. But it is now duplicating the reference metadata, with some duplicates differing from the original :-(

You can see the same issue here today https://ja.wikipedia.org/wiki/%E5%88%A9%E7%94%A8%E8%80%85:Doc_James/Short_bowel_syndrome

What we at Wiki Project Med did was duplicate all the references before the MediaWiki markup was put into Content Translation, and then we have a bot that gets rid of the duplicates afterwards. Not sure if we need a new ticket for this.
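
The clean-up step after publishing could be sketched as follows. Again, this is only an illustration and not the real bot: `collapse_duplicate_refs` is a hypothetical name, the pattern assumes names without spaces and no nested `<ref>` tags, and duplicates are matched by name only, so a later copy whose body differs from the first one is simply discarded.

```python
import re

# Full definition: <ref name=X>...</ref> (assumption: names contain no spaces).
FULL_REF = re.compile(
    r'<ref\s+name\s*=\s*["\']?([^"\'\s/>]+)["\']?\s*>(.*?)</ref>',
    re.IGNORECASE | re.DOTALL,
)


def collapse_duplicate_refs(wikitext: str) -> str:
    """Keep the first full definition per name; shrink later ones to <ref name="X" />."""
    seen = set()

    def replace(match):
        name = match.group(1)
        if name in seen:
            # A definition for this name already exists earlier in the page.
            return f'<ref name="{name}" />'
        seen.add(name)
        return match.group(0)

    return FULL_REF.sub(replace, wikitext)


# Made-up sample: the second copy of "S" differs slightly from the first.
text = (
    'A<ref name="S">{{Cite|t=1}}</ref> '
    'B<ref name="S">{{Cite|t=1}} extra</ref> '
    'C<ref name=T>x</ref>'
)
print(collapse_duplicate_refs(text))
```

Matching by name rather than by content is a deliberate choice here, since the duplicated definitions do not always come back identical to the original.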

I think this issue is the one captured in a different ticket: T203772: CX2: Improve support for references that are reused in multiple places

Please let us know if you think it is not the same or some key detail may be missing. The process of de-duplicating the references is not trivial, but we plan to work in this area as part of the Content Translation maintenance work.

Yah that looks like the issue. Thanks