Wikitext editors don’t honor language settings or language variants
Open, Needs Triage · Public · BUG REPORT

Description

Steps to replicate the issue (include links if applicable):

  • Go to Mandarin Wikipedia and click “View Source” or “Edit” on any random page.
  • Inspect #wpTextbox1

What happens?:

The lang attribute is set to zh regardless of language variant settings.

On pages with <html lang="zh-Hant-TW">, this redundant <textarea lang="zh"> attribute would override CJK rendering on wikitext editors and break punctuation renderings.
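To make the override concrete: an element's language is taken from its own lang attribute if it has one, otherwise from the nearest ancestor that sets one. The following is only a rough sketch of that lookup using Python's standard html.parser. The two markup snippets are simplified stand-ins for the real edit page, and the class and function names are made up for illustration.

```python
from html.parser import HTMLParser


class EffectiveLangFinder(HTMLParser):
    """Find the language that applies to the element with a given id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.lang_stack = []  # effective lang of every currently open element
        self.result = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        inherited = self.lang_stack[-1] if self.lang_stack else ""
        # The element's own lang attribute wins over the inherited value.
        effective = attrs.get("lang", inherited)
        self.lang_stack.append(effective)
        if attrs.get("id") == self.target_id:
            self.result = effective

    def handle_endtag(self, tag):
        # Simplification: assumes every element is closed explicitly,
        # so the sample markup below avoids void elements such as <br>.
        if self.lang_stack:
            self.lang_stack.pop()


def effective_lang(markup, element_id):
    finder = EffectiveLangFinder(element_id)
    finder.feed(markup)
    return finder.result


# Simplified stand-ins for the edit page, with and without the extra attribute.
PAGE_WITH_ATTR = (
    '<html lang="zh-Hant-TW"><body>'
    '<textarea id="wpTextbox1" lang="zh"></textarea>'
    '</body></html>'
)
PAGE_WITHOUT_ATTR = (
    '<html lang="zh-Hant-TW"><body>'
    '<textarea id="wpTextbox1"></textarea>'
    '</body></html>'
)

print(effective_lang(PAGE_WITH_ATTR, "wpTextbox1"))
print(effective_lang(PAGE_WITHOUT_ATTR, "wpTextbox1"))
```

With the attribute present the textarea resolves to zh; without it, the textarea inherits zh-Hant-TW from the root element.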

What should have happened instead?:

It should follow page variant or user settings.

Software version (on Special:Version page; skip for WMF-hosted wikis like Wikipedia):

1.43.0-wmf.7 (b6ef248)

Other information (browser name/version, screenshots, etc.):

The current wikitext editor declares an unnecessary lang HTML attribute on its internal source editor. For example, on MediaWiki, it’s en regardless of your language setting; on Mandarin Wikipedia, it’s zh regardless of your variant setting.

This seemingly harmless redundancy actually breaks the editing experience on the Mandarin Wikipedia: CJK fonts require a proper lang attribute to determine which set of characters and punctuation rules to apply, which is why Mandarin Wikipedia has a very delicate system for distinguishing and converting between, say, zh-Hans-MY, zh-Hant-HK and zh-Hant-TW. (Adobe explained this difference very clearly in the README file of Source Han Sans.)

Upon further inspection, it seems that MediaWiki\EditPage\TextBoxBuilder is the cause of this problem. The current behavior overrides the root <html lang> setting and makes the editor a pain to read. It should either copy the attribute from the root <html> element or just remove this redundancy altogether.

Event Timeline

I'm not an expert in this area, so you might want to ask someone else for advice.

That said, I think that this is intentional, because the software allows you to write the content of the pages in a mix of language variants. I'm not able to provide an example in Chinese, but here is one in Serbian:

In this case, the software uses lang="sr" for the textbox in both pages because it doesn't detect which variant the content is written in. (And it's allowed to mix them within one article.)

lang="sr-Cyrl" and lang="sr-Latn" are only used for the converted output, for example:

I understand that this may be a bigger problem for you, but I don't know how to improve it without making the situation worse in other languages.

@matmarex Thanks for the swift reply!

I guess the Serbian example works because it’s one language written in two different scripts with different Unicode code points. It makes sense not to second-guess which variant the content is written in (from what I read in the source code comments, that seems very problematic).

The situation in Mandarin Chinese is a bit tricky due to Han Unification. CJK languages share the same Unicode code points for similar characters, even when they’re written in different scripts (zh-Hans, zh-Hant). When lang="zh" is specified, the user agent has no way to know which variant to use, and because Simplified Chinese users are usually the majority, zh-Hans-CN rules apply.

(Imagine if all Cyrillic scripts were unified and all lang="cyrl" text were rendered in cyrl-ru unless cyrl-sr was specified. It would be very messed up.)
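The contrast with Serbian can be checked directly from the code points. This is a small sketch using Python's unicodedata module; the helper name script_hint and the sample strings are my own, and taking the first word of each character name is only a crude stand-in for a real script lookup.

```python
import unicodedata


def script_hint(text):
    """Return the first word of the Unicode name of every letter in text."""
    return {unicodedata.name(ch).split()[0] for ch in text if ch.isalpha()}


# "Београд" / "Beograd": one Serbian word in its two scripts.
serbian_cyrillic = "\u0411\u0435\u043e\u0433\u0440\u0430\u0434"
serbian_latin = "Beograd"

# 你好，世界。 is spelled with the same code points in zh-Hans and zh-Hant.
chinese_sentence = "\u4f60\u597d\uff0c\u4e16\u754c\u3002"

print(script_hint(serbian_cyrillic))
print(script_hint(serbian_latin))
print(script_hint(chinese_sentence))

# The punctuation marks are single shared code points as well.
print([f"U+{ord(ch):04X}" for ch in chinese_sentence if not ch.isalpha()])
```

The Serbian strings identify their own script, so a generic lang="sr" loses little. The Chinese sentence, including its full-width comma (U+FF0C) and ideographic full stop (U+3002), carries no hint of Simplified versus Traditional, so the choice of glyphs and punctuation rules rests entirely on the lang attribute.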

I guess there might be a few ways to solve this particular issue:

  1. we could remove the lang attribute on the text editor altogether, letting the document lang kick in;
  2. we could check if the page language matches the current UI language, and skip the extra lang attribute if the two are the same;
  3. we could treat this as a Chinese-specific issue and special-case lang="zh".
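To illustrate option (2), here is a minimal sketch of the decision in Python. It is not MediaWiki code: the function names are hypothetical, and it assumes that "matches" means the primary language subtag is the same (zh versus zh-Hant-TW), which is my reading rather than something the software defines.

```python
def primary_subtag(tag):
    """Return the primary language subtag, e.g. 'zh' for 'zh-Hant-TW'."""
    return tag.split("-", 1)[0].lower()


def textbox_lang_attr(page_lang, document_lang):
    """Return the lang value for the textarea, or None to omit the attribute.

    Assumption: two tags "match" when their primary subtags are equal.
    """
    if primary_subtag(page_lang) == primary_subtag(document_lang):
        # Same language: let the textarea inherit the more specific
        # document-level tag instead of overriding it.
        return None
    # Different language (e.g. a Serbian page viewed with an English UI):
    # keep the explicit attribute so the content is still tagged correctly.
    return page_lang


print(textbox_lang_attr("zh", "zh-Hant-TW"))
print(textbox_lang_attr("sr", "en"))
```

One trade-off to keep in mind: under this rule a Serbian page viewed with an sr-Cyrl interface would also drop the attribute and inherit sr-Cyrl, even though its wikitext may mix both scripts.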

I would say (2) is preferable, but (3) would also be fine. What do you think?

I don't think this is plausibly fixable given the point of MediaWiki's "language converter" script transliteration tool; the content might more properly be marked up as zh-hani, but I doubt that will work as expected either?

At least in theory, Special:PageLanguage allows you to mark the page as being written in e.g. zh-tw instead of the default zh (it uses the nonstandard codes for some reason), and that should probably be reflected in the lang attributes in the wikitext editor. I haven't tested how well that works, or how it interacts with the language converter.

“override CJK rendering on wikitext editors and break punctuation renderings”

@RSChiang, could you provide a screenshot and an example URL? Which punctuation marks are affected, specifically?