
[BUG] Wikidata description for the specific Chinese language variant should be shown
Closed, Resolved · Public · Bug Report

Description

Steps to reproduce
  1. Go to the article on China using Simplified Chinese and note the Wikidata description.
  2. Change to view the article in Traditional Chinese and note the Wikidata description.

Expected

When viewing the article in Traditional Chinese, the description shown should come from the "Traditional Chinese" row in the Wikidata entry for China. Likewise, when viewing in Simplified Chinese, it should come from the "Simplified Chinese" row.

Actual

The description shown is pulled from the "Chinese" row in Wikidata, so characters from one variant are displayed when the language is set to the other. (In the example of the "China" article, the description "中华人民共和国", written in Simplified characters, is shown on the Traditional character variant of the article.)

Event Timeline

bearND subscribed.

MCS is not used for zhwiki until Parsoid and RESTBase can handle language variants.

So wait... Wikidata has descriptions in Traditional Chinese, Simplified Chinese, and another one called "Chinese"? What does that third one mean? Is it another variant all to itself, or does it "default" to traditional or simplified?

The China Wikidata entry actually has different descriptions for multiple variants, but only the "original" Chinese one is used.

image.png (1×2 px, 251 KB)

This can be better illustrated by taking an example Chinese wiki article with no description yet, which looked like this:

image.png (420×2 px, 123 KB)

After populating the description in Wikidata, the table is updated only for the Chinese-only article:
image.png (170×2 px, 38 KB)

TL;DR: It looks like the behavior is similar to Simple English vs. English, in that a row can be created for each variant in Wikidata, but we are only pulling in whatever is in the "Chinese" row, without any character transforms being applied.

Here are some screenshots that might help:

  • When reading an article in Traditional Chinese, the app does not load the description from Wikidata, even though the description has been published on the Wikidata page.
  • ==>
    • The Android app PUSHes the description to Wikidata with the "zh-hant" language code ==> correct
    • The Android app GETs the description from Wikidata with the "zh" language code ==> not correct

Traditional.png (1×2 px, 735 KB)

  • After updating the description in the Chinese row on Wikidata, the description appears once the article page is refreshed in the Android app.

Chinese.png (1×2 px, 795 KB)

When I kept only the Traditional Chinese label and description on Wikidata, the app did not show the description.

To repeat (and expand) relevant discussion from T177342:

Neither API currently provides support for specifying a language variant for Wikidata descriptions. Even the mobileview API's 'variant' parameter has no effect here. It could be added, but we'll run into the same issue as with T176678 that the mobileview API is deprecated and in principle we shouldn't be spending time on it.

For MCS/RESTBase, we're waiting on T159985. (Note that this in turn depends on T43716, which is triaged at low priority.)

For the mobileview API, this is the method I think we'd need to update for variant support: https://github.com/wikimedia/mediawiki-extensions-MobileFrontend/blob/27599dfbdaecf1ca0a12164e648a14facefd00d2/includes/MobileFrontend.body.php#L189-L209

As a side note, I get the sense no one would object to moving the mobileview API into the MobileApp extension as discussed on T176678, which would remove the need for the Reading Web team to be involved, though that would still leave an open question about how much work we should be putting into the mobileview API on an ongoing basis. (Also, such a change should probably be announced in advance on mobile-l and wikitech-l.)

Some of the problem here is that historically LanguageConverter does not specifically tag the source language variant of the text, since it is assumed it can be inferred from the character set. This is more-or-less true for Serbian (latin/cyrillic) and Chinese (simplified/traditional) but falls down badly with (say) British/American English. And it doesn't work 100% even for Serbian and Chinese, depending on the exact input text. The original article text in Wikipedia is a mix of variants, again with the assumption that you can determine on a word by word basis what the original variant is and what needs conversion.
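To make the point about inferring the source variant from the character set concrete, here is a minimal sketch. The character tables are hypothetical and deliberately tiny (real converters rely on large conversion dictionaries), and `guess_variant` is an illustrative name, not a LanguageConverter function. It shows that text made only of characters shared by both scripts cannot be classified, and that mixed text is possible.

```python
# Hypothetical, deliberately tiny tables of characters that differ between scripts.
# Real conversion tables are far larger; these exist only for illustration.
SIMPLIFIED_ONLY = set("华国湾东县于")
TRADITIONAL_ONLY = set("華國灣東縣於")


def guess_variant(text: str) -> str:
    """Guess the script of `text` purely from its characters."""
    has_hans = any(ch in SIMPLIFIED_ONLY for ch in text)
    has_hant = any(ch in TRADITIONAL_ONLY for ch in text)
    if has_hans and has_hant:
        return "mixed"      # both scripts present, as in mixed-variant article text
    if has_hans:
        return "zh-hans"
    if has_hant:
        return "zh-hant"
    return "unknown"        # only shared characters: the variant cannot be inferred


print(guess_variant("中华人民共和国"))  # description quoted in this task
print(guess_variant("位於苗栗縣"))
print(guess_variant("中文"))
```

The "unknown" and "mixed" results are the cases where inference breaks down, which is why explicitly recording the source variant is safer.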

Anyway, Parsoid is getting the ability to do language variant conversion, but going forward we need to be careful to accurately record the source language variant. For example, Wikidata should really be taking appropriate care and not following MediaWiki's (bad) example.

There doesn't seem to be a task tracking Chinese language variants for Wikidata, but I'm not sure. I'd be grateful if anyone could link one or create one.

This has also been a problem when importing data to Wikidata. I'm always confused by the difference between Chinese, Simplified Chinese and Traditional Chinese there.

I think we should:

  • remove Chinese in Wikidata
  • have zh-cn and zh-sg fall back to zh-hans
  • have zh-tw, zh-hk and zh-mo fall back to zh-hant.

In this way, we can map all language variants to Wikidata precisely without any other rules.
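As a rough sketch of the proposal above, assuming descriptions are available as a plain mapping from language code to text: the `FALLBACKS` table mirrors the bullets, while `resolve_description` and the dictionary layout are hypothetical and not Wikibase's actual data model or API.

```python
from typing import Dict, List, Optional, Tuple

# Fallback chains taken from the proposal: regional codes fall back to a script code.
# There is deliberately no fallback to plain "zh".
FALLBACKS: Dict[str, List[str]] = {
    "zh-cn": ["zh-hans"],
    "zh-sg": ["zh-hans"],
    "zh-tw": ["zh-hant"],
    "zh-hk": ["zh-hant"],
    "zh-mo": ["zh-hant"],
}


def resolve_description(
    descriptions: Dict[str, str], variant: str
) -> Optional[Tuple[str, str]]:
    """Return (code, text) for the first code in the chain that has a description."""
    for code in [variant, *FALLBACKS.get(variant, [])]:
        if code in descriptions:
            return code, descriptions[code]
    return None  # nothing suitable: show no description rather than the wrong script


descriptions = {"zh-hant": "位於苗栗縣"}
print(resolve_description(descriptions, "zh-tw"))
print(resolve_description(descriptions, "zh-cn"))
```

With only a zh-hant description present, a zh-tw reader gets it via the chain, and a zh-cn reader gets nothing instead of Traditional text.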

I agree with @fantasticfears's comments. Allowing users to mark a label as plain zh is not accurate enough. I have even thought, more aggressively, that we should use zh-cn/tw/hk/mo/sg instead of plain zh-hans/hant when specifying the label and description for an entity. After all, aside from fallbacks (e.g. zh-tw --> zh-hant), Wikidata can automatically use, say, the zh-tw label when a user requests the zh-hant label but no zh-hant label is directly assigned to the entity.
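The reverse lookup described here could look roughly like the following. This is only a sketch: the `REGIONAL` grouping and its ordering are assumptions made for illustration, `label_for_script` is a hypothetical helper, and the label strings are placeholders rather than real Wikidata values.

```python
from typing import Dict, Optional

# Assumed grouping of regional codes under each script code; the order is arbitrary.
REGIONAL = {
    "zh-hant": ["zh-tw", "zh-hk", "zh-mo"],
    "zh-hans": ["zh-cn", "zh-sg"],
}


def label_for_script(labels: Dict[str, str], script_code: str) -> Optional[str]:
    """Prefer a label stored under the script code, else borrow a regional one."""
    if script_code in labels:
        return labels[script_code]
    for regional in REGIONAL.get(script_code, []):
        if regional in labels:
            return labels[regional]
    return None


# Placeholder labels: only regional codes are filled in for this entity.
labels = {"zh-tw": "label-tw", "zh-cn": "label-cn"}
print(label_for_script(labels, "zh-hant"))
print(label_for_script(labels, "zh-hans"))
```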

I want to help with this. Wikibase doesn't cover this now. It's much more reasonable to fix this on their end. Maybe you can chime in and ask Wikidata people? @RHo

Hi @fantasticfears, I have just tagged Wikidata again so they can comment first. It seems the tag was removed for some reason after the original ticket was filed...

Yup, it looks like this needs to be updated to also account for the user language variant. Currently it just uses the site content language.
This description is used in onOutputPageParserOutput (not sure if the stuff there is cached), so it might require a cache split based on the user content language variant (not sure if it is already split on that).
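To illustrate what a cache split on the variant means, here is a language-neutral sketch in Python. It does not reflect how MobileFrontend or the MediaWiki parser cache actually builds keys; `cache_key` and the key format are hypothetical.

```python
from typing import Dict, Optional


def cache_key(page_id: int, content_lang: str, user_variant: Optional[str]) -> str:
    """Build a cache key that differs per variant (hypothetical key format)."""
    # Fall back to the site content language when no variant was requested.
    variant = user_variant or content_lang
    return f"description:{page_id}:{variant}"


cache: Dict[str, str] = {}
cache[cache_key(1, "zh", "zh-hant")] = "description rendered for zh-hant"
cache[cache_key(1, "zh", "zh-hans")] = "description rendered for zh-hans"

# Two entries: without the variant in the key, one would overwrite the other
# and readers of one variant could be served the other variant's description.
print(len(cache))
```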

Should be a pretty smallish patch to MobileFrontend

@Addshore I was actually proposing a storage/data model change. If I had some information about how the data gets rendered from the DB, I might be able to submit patches.

Should I also add iOS-app-Bugs and Wikipedia-Android-App-Backlog? Or does this bug also happen on iOS?

@Liuxinyu970226 Mm... I don't use iOS, only Android. I think the result is the same.

IMG_20190304_082557.jpg (2×1 px, 374 KB)

@Liuxinyu970226 Sorry... I mean the result is that no conversion happens.

Isn't this task to merge the entries into the single "Chinese" entry??

Merging does not seem suitable, because some things have different names in different regions, such as the title of a movie or a drama series.

At least please leave those variants alone as before, such as "zh-CN", "zh-HK", "zh-MO", "zh-TW", "zh-SG", "zh-MY" etc.

Wikidata_edit_labels_Chinese_only.PNG (815×1 px, 65 KB)

Those variants are better off with fallback chains, as in MediaWiki (that's a nice system).

Restricted Application changed the subtype of this task from "Task" to "Bug Report". · Jul 31 2019, 8:19 PM

#product-infrastructure-team-backlog and Platform Engineering and cc @JoeWalsh

The issue is still there.

A request to the following API should return the Wikidata description that corresponds to the language code sent in Accept-Language.

https://zh.wikipedia.org/w/api.php?action=query&format=json&prop=description&titles=%E4%B8%89%E7%81%A3%E9%84%89%20(%E5%8F%B0%E7%81%A3)

According to the wikidata page: https://www.wikidata.org/wiki/Q713793

When sending Accept-Language: zh-hant, it should show 位於苗栗縣 from the Traditional Chinese column.
When sending Accept-Language: zh-hans, it should show an empty description.

And we should no longer use the Wikidata description from the generic Chinese column.
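For anyone who wants to reproduce this, the request can be built as in the sketch below, using only the Python standard library. The query parameters are the ones from the URL above; `build_description_request` is an illustrative helper name. Whether the API honours the Accept-Language header is exactly the behavior in question, so no expected response is shown.

```python
from urllib.parse import urlencode
from urllib.request import Request

API = "https://zh.wikipedia.org/w/api.php"


def build_description_request(title: str, variant: str) -> Request:
    """Build (but do not send) the prop=description query for a given variant."""
    params = {
        "action": "query",
        "format": "json",
        "prop": "description",
        "titles": title,
    }
    # The variant is passed only through the Accept-Language header.
    return Request(f"{API}?{urlencode(params)}", headers={"Accept-Language": variant})


for variant in ("zh-hant", "zh-hans"):
    req = build_description_request("三灣鄉 (台灣)", variant)
    print(variant, req.full_url)
```

Passing each request to `urllib.request.urlopen` and comparing the two JSON responses would show whether the description differs by variant.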

@LGoto I would say to set it as Medium or High since it should not happen anyways.

@cooltey Just to be clear, am I correct that the remaining issue here is that the service should NOT fall back to using the description from another variant if the preferred variant is unavailable? Using your example title, I see that the description on https://zh.wikipedia.org/api/rest_v1/page/mobile-html/%E4%B8%89%E7%81%A3%E9%84%89_(%E5%8F%B0%E7%81%A3) uses the correct variant when Accept-Language: zh-hant is sent. But when Accept-Language: zh-hans is sent or no language variant is specified, the generic "Chinese" description is shown, which (IIUC) is incorrect.

If what you're asking is to fix ApiQueryDescription (?action=query&prop=description) to handle language variants generally, then that's a separate task, and one probably best performed by someone with in-depth knowledge of Wikibase concepts and architecture (hint: probably someone from WMDE and not WMF Product Infrastructure). But I wouldn't agree with the prioritization in that case, because descriptions in the correct variant are already available by other means.

Charlotte lowered the priority of this task from High to Medium. · Jul 23 2020, 3:31 PM

If what you're asking is to fix ApiQueryDescription (?action=query&prop=description) to handle language variants generally, then that's a separate task, and one probably best performed by someone with in-depth knowledge of Wikibase concepts and architecture (hint: probably someone from WMDE and not WMF Product Infrastructure).

This API module is pretty far away from Wikibase (despite the code being in Wikibase currently).
The API module exists as part of a WMF Product feature (see T184000), so probably doesn't need to be coordinated with us (WMDE) much.

Mholloway claimed this task.

The request in this task was for the Wikipedia Android app to show the article description in the correct language variant if a description exists in Wikidata for that variant. That's long-since resolved. Please file requests for refining the current behavior (or for fixing ApiQueryDescription) as new tasks.