
Comma separated lists should always use unicode-bidi: isolate
Open, Needs Triage, Public

Description

Look at this example from a set of blocks I did using CheckUser on fawiki today.

Screen Shot 2020-05-12 at 1.00.44 PM.png (106×2 px, 79 KB)

And compare it with this:

Screen Shot 2020-05-12 at 1.00.55 PM.png (116×2 px, 78 KB)

The language is Persian, which is right-to-left. The comma character is ، and should appear to the left of each username, without any spaces to its right, and with a single space to its left.

Top view is the current output. Bottom view is when I manually applied unicode-bidi: isolate CSS rule to all of the <a> tags for those usernames listed.

That text is generated using the checkuser-block-success message. The comma separated list itself comes from applying Language::listToText() to an array of HTML tags (see code). What listToText() currently does is to just glue the items together using a localized comma except for the last two items which are glued using the localized "and" phrase (see current code). What is missing is to ensure that the individual elements are wrapped inside a tag that has the correct unicode-bidi attribute.
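The joining behaviour described above can be sketched as follows. This is a minimal Python illustration, not the MediaWiki implementation: the function name `list_to_text` and the `comma` and `and_phrase` parameters are hypothetical stand-ins for the localized messages.

```python
def list_to_text(items, comma=", ", and_phrase=" and "):
    """Glue items with a comma, except the last two, which are glued with "and".

    `comma` and `and_phrase` are placeholders for the localized separators
    (assumption: the real code looks these up from interface messages).
    """
    items = list(items)
    if not items:
        return ""
    if len(items) == 1:
        return items[0]
    # Everything but the last item is comma-joined; the last is attached with "and".
    return comma.join(items[:-1]) + and_phrase + items[-1]


if __name__ == "__main__":
    print(list_to_text(["<a>Alice</a>", "<a>Bob</a>", "<a>Carol</a>"]))
```

Nothing in this joining step isolates the items from one another, which is why an item's direction can affect how its neighbours and the separators are displayed.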

This is a major i18n issue, impacting MW core and all extensions, not just CheckUser. The key questions are: should listToText() wrap the individual elements, or should we hunt down all usages and make sure that they pass in an array of already-wrapped HTML tags? Also, where should the CSS live?

Event Timeline

@Amire80 can I ask you to opine on this?

The only way to create a valid list of user names (or autonyms for language names, or lists of translated terms) is to "isolate" each displayed name (inside a "bdi" element, or using equivalent Unicode controls in plain text). Then you can use whichever comma and other separators you want, keep the list logically ordered, and let each item contain arbitrary text (in any script and direction).

Note that the "bdi" element is supported by MediaWiki. It is the element of choice to encapsulate the arbitrary content generated by any template or parser function (prefer it to "span" or "div", because a "bdi" can safely contain any inline or block element(s) without implicitly creating an extra block element). The "bdi" element already has the implicit "unicode-bidi: isolate" CSS style, so you never need to specify this style; just use this HTML element. It does not require a dir="ltr/rtl" attribute and should not even contain one: the internal direction should be determined automatically rather than forced, and this direction does not affect what precedes the start or follows the end of the "bdi" element itself.

The "idea" of forcing "unicode-bidi" on HTML "a" elements for links is very bad (in fact, in MediaWiki wikitext you can't even use this element, as it is prohibited). So just insert <bdi>...</bdi> elements around the displayed text (the user name only, or the full wikilink between [[PAGENAME|text]] or [[PAGENAME]] containing this user name, or the full external link between [URI text], or the full content generated by a template or parser function call like {{TEMPLATE|optional parameters|...}} or {{#PARSERFUNCTION:optional parameters|...}}) and your problem is instantly solved, without any modification of MediaWiki.

In all cases, using Bidi overrides (LRM/RLM, U+200E/U+200F; "bdo" elements acting like U+202D/U+202E; or "unicode-bidi: bidi-override" in CSS) is harmful and should be strongly deprecated. These are valid only in the middle of a known text (which is itself isolated in its own document, or in its own block element like "p", "div", "li", "td", "th", "caption", "blockquote"...), and they should never occur at the end of any text which is not explicitly "isolated" or terminated by an explicit end of block.

Interesting report about UBA v2 support in browsers:
https://caniuse.com/#search=unicode-bidi

Only IE (which has not been supported for years) does not support UBAv2 (however, there were some independent plugins/patches to fix it, not made by Microsoft). It has been replaced by Edge, itself updated recently to use the Chromium engine (Blink, a fork of WebKit) instead of the legacy engine, which had broken support for overrides and was only supported in Windows 10. WebKit has supported UBAv2 since 2011. Almost all users of browsers on Windows and Linux have up-to-date browsers, or no longer use these OSes for web browsing but only for running legacy/proprietary apps, sometimes isolated in a VM (for security reasons), and use another OS for browsing.

The only remaining ones are those using old versions of Android 4, which is already no longer supported by Google Play and most apps. Most of these phones were not even capable of displaying many web pages, due to lack of memory, and these smartphones are now almost all dead (battery worn out). If they still work, they are extremely slow; they also no longer have any market value, and nobody repairs them because it is cheaper to buy a new phone, even an entry-level model. A vast majority of Android users have at least Android 6.

And it is possible to create a JavaScript or server-side converter that resolves bidi isolates into legacy bidi overrides by parsing the web page, so that the display is correct on these legacy browsers.

Wrapping in <bdi> is also a reasonable approach. For the reasons you mentioned above, it may be superior to setting unicode-bidi of each <a> tag.

One approach is to add a second argument, $bdiWrap, to the Language::listToText() and Language::commaList() functions and set it to true by default. This would wrap each list element in <bdi>, while still allowing callers to opt out by setting $bdiWrap to false.

It seems like we have around 80 use cases of listToText() and around 130 use cases of commaList(). Another approach is to go through every single one of them and add the <bdi> tags there, in which case $bdiWrap would not be necessary.

A third approach is to set the default value of $bdiWrap to false and only set it to true in places where we know it has been causing errors (e.g. CheckUser, as described in this task). In some use cases, what is passed to listToText() or commaList() is a list of localized messages, in which case the <bdi> is not necessary. For example, the list of actions taken by an AbuseFilter is shown as a comma-separated list in Special:AbuseLog, but action names are always localized, so this line of code would not need to inject <bdi> tags.

Overall, I think the third approach is the most reasonable; it centralizes the work of adding <bdi> tags and allows it to be optionally used only when needed.
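As a rough illustration of the third approach, here is a Python sketch of an opt-in wrapping flag. It is only a model of the idea: `list_to_text_bdi`, `bdi_wrap`, `comma` and `and_phrase` are hypothetical names mirroring the proposed $bdiWrap parameter, and the items are assumed to be already-escaped HTML fragments.

```python
def list_to_text_bdi(items, bdi_wrap=False, comma=", ", and_phrase=" and "):
    """Join HTML fragments into a list, optionally isolating each one in <bdi>.

    Assumption: every item is already safe HTML (for example an <a> tag),
    so no escaping is done here.
    """
    items = list(items)
    if bdi_wrap:
        # Wrapping happens in one central place instead of at every call site.
        items = [f"<bdi>{item}</bdi>" for item in items]
    if not items:
        return ""
    if len(items) == 1:
        return items[0]
    return comma.join(items[:-1]) + and_phrase + items[-1]


if __name__ == "__main__":
    users = ['<a href="#u1">User1</a>', '<a href="#u2">User2</a>']
    print(list_to_text_bdi(users, bdi_wrap=True))   # user names: isolate them
    print(list_to_text_bdi(["block", "warn"]))      # localized labels: no wrapping
```

With the default left at false, existing callers that pass localized messages keep their current output, and only callers that pass arbitrary names need to change.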

Note that using "bdi" HTML elements is not the only solution.

If you want to allow a "raw" format in plain text, you can also use Bidi controls:

  • U+2068 FIRST STRONG ISOLATE (FSI), at start
  • U+2069 POP DIRECTIONAL ISOLATE (PDI), at end

These controls also work in HTML browsers that support "bdi" elements! The main difference is that they remain part of a single text element, which cannot be styled separately from the surrounding text. That is still useful for formatting a multilingual text in a consistent style (same set of fonts, same line-height, consistent metrics for Latin letters when they follow non-Latin characters from different scripts: look at how Latin letters can change when they follow CJK sinograms, Arabic, or Burmese, and note that the same Latin letters may be used in languages that frequently mix scripts: Chinese, Japanese, Arabic, Hebrew, Khmer...). Those changes of metrics and appearance for Latin are annoying.
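For plain text, the isolate controls can be applied with a few lines of code. The following Python sketch assumes the standard code points (FSI is U+2068 and PDI is U+2069); the helper names `isolate` and `comma_list` and the default separator are hypothetical.

```python
FSI = "\u2068"  # FIRST STRONG ISOLATE
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE


def isolate(text: str) -> str:
    """Wrap text in FSI...PDI so its direction cannot leak into its surroundings."""
    return f"{FSI}{text}{PDI}"


def comma_list(names, comma=", "):
    """Join names after isolating each one; `comma` stands in for a localized separator."""
    return comma.join(isolate(name) for name in names)


if __name__ == "__main__":
    # repr() makes the otherwise invisible control characters visible.
    print(repr(comma_list(["Example", "\u0639\u0644\u06cc"])))
```

The separators stay outside the isolates, so they follow the direction of the surrounding paragraph rather than that of any individual name.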


You can also encode the language code inside the FSI...PDI sequence:

  • U+E0001 LANGUAGE TAG, just after FSI, followed by Unicode TAG characters in U+E0020..U+E007E (remapping the ASCII bytes of the language code, which normally includes only basic Latin letters [a-z], or [A-Z] but case is not significant, ASCII digits [0-9], and ASCII hyphens or underscores; you may also want to normalize the ASCII code to use only lowercase letters and hyphens instead of underscores)
  • then the normal Unicode text
  • U+E007F CANCEL TAG, at end just before PDI
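The encoding steps listed above can be sketched in Python as follows. This only illustrates the scheme as described in this comment: the function names are hypothetical, and current versions of the Unicode Standard still list U+E0001 LANGUAGE TAG itself as deprecated, so this should not be read as a recommended practice.

```python
FSI, PDI = "\u2068", "\u2069"
LANGUAGE_TAG = "\U000E0001"  # U+E0001 LANGUAGE TAG
CANCEL_TAG = "\U000E007F"    # U+E007F CANCEL TAG


def tag_chars(code: str) -> str:
    """Map an ASCII language code onto tag characters in U+E0020..U+E007E."""
    normalized = code.lower().replace("_", "-")
    if not normalized or not all(
        c.isascii() and (c.isalnum() or c == "-") for c in normalized
    ):
        raise ValueError(f"not a plain ASCII language code: {code!r}")
    # Each tag character is the ASCII code point shifted by 0xE0000.
    return "".join(chr(0xE0000 + ord(c)) for c in normalized)


def isolate_with_language(text: str, code: str) -> str:
    """Build FSI + LANGUAGE TAG + tag characters + text + CANCEL TAG + PDI."""
    return f"{FSI}{LANGUAGE_TAG}{tag_chars(code)}{text}{CANCEL_TAG}{PDI}"


if __name__ == "__main__":
    print(repr(isolate_with_language("\u0639\u0644\u06cc", "fa")))
```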

Note that these Unicode language tag characters are zero-width (normally invisible in browsers, except when using a "visible controls" rendering mode) and Bidi neutral. They were once "deprecated", but this is no longer the case because their use has been reallowed and reapproved for other purposes. Notably, they are also used *after* emojis to change their appearance, or to create variants for national/regional flags when the "regional indicators" are not sufficient, because those work only in pairs mapped on ISO 3166-1. They have also been proposed to encode emoji variants based on Wikidata IDs "Qnnn"; in fact, "regional indicators" could just as well be deprecated in favor of "language tag" characters, allowing arbitrary codes for all flags or variants of other emojis like "speech bubbles", "country flags", or flying flags, by encoding them after a "blank" emoji base.


Finally, you can also optionally force the inner content of the isolate to use a specific default direction (as with the dir="rtl/ltr" attribute of "bdi" elements, which is *optional*: in its absence, the direction of the "bdi" element is "auto", determined by the first character that has a *strong* direction).

For that purpose, you can use LRM or RLM controls at start of the isolate, but generally this is undesirable (e.g. for user names or page names when you cannot really guess what is their intended rendering direction); as well the use of the dir="ltr/rtl" attribute should generally be avoided in "bdi" elements (the same is true with the use of CSS style direction: rtl/ltr which have the same effect as they are also overrides; this CSS style is what is used in browsers to remap the dir="ltr/rtl" attribute).
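To make the "auto" behaviour mentioned above more concrete, here is a simplified Python sketch of first-strong detection using the standard `unicodedata` module. It is an approximation only: the function name and the `default` parameter are hypothetical, and the sketch ignores refinements of the real algorithm such as skipping over nested isolates.

```python
import unicodedata


def first_strong_direction(text: str, default: str = "ltr") -> str:
    """Return "ltr" or "rtl" based on the first character with a strong Bidi class.

    Strong classes are L (left-to-right) and R or AL (right-to-left).
    `default` is returned when the text has no strong character at all.
    """
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            return "ltr"
        if bidi_class in ("R", "AL"):
            return "rtl"
    return default


if __name__ == "__main__":
    for sample in ["Example", "\u0639\u0644\u06cc", "123 \u05e9\u05dc\u05d5\u05dd", "123"]:
        print(repr(sample), first_strong_direction(sample))
```

Digits and spaces are not strong characters, so a name that starts with digits takes its direction from the first letter that follows them.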

The LRM/RLM Unicode controls (U+200E/U+200F) are overrides which should only be present inside the actual text values, only at start or before at least one base character which is not zero-width, but never at end of these values.