
Case-insensitive parser functions/magic words should be deprecated
Open, Needs Triage, Public

Description

Quoting from https://www.b-list.org/weblog/2018/nov/26/case/:

...the complexity of case in Unicode means that case-insensitive comparisons should not be done using standard lowercasing or uppercasing functions common to many programming languages. For purposes of case-insensitive comparisons, Unicode provides the concept of case folding, and §3.13 of the Unicode Standard defines a toCaseFold case mapping and an isCaseFolded function.

It’s tempting to think of case folding as similar to lowercasing — and I call out Python’s documentation for this mistake in the talk — but it isn’t. The Unicode Standard cautions that a case folded string is not necessarily lowercase, and points out Cherokee as an example of a script where case folding produces a result containing uppercase characters.

Another aspect of case mapping and case folding is that it can be language sensitive. One example of this are Turkic languages written in the Latin script, where the proper case folding for I is "lowercase dotless i" ı which differs from the default "English" case folding for I which is "lowercase dotted i" i.
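To make the quoted points concrete, here is a small illustrative sketch in Python (chosen only because the quote mentions it; MediaWiki itself is PHP). The helper names `naive_equal` and `folded_equal` are invented for this example. It relies on Python's built-in `str.lower()` and `str.casefold()`, which apply Unicode's default, language-independent mappings, and the Cherokee result assumes a Python build shipping Unicode 8.0 or later data.

```python
def naive_equal(a: str, b: str) -> bool:
    """Case-insensitive comparison via lowercasing (the tempting approach)."""
    return a.lower() == b.lower()


def folded_equal(a: str, b: str) -> bool:
    """Case-insensitive comparison via Unicode default case folding."""
    return a.casefold() == b.casefold()


# German sharp s: lowercasing leaves it alone, case folding expands it to "ss".
print(naive_equal("STRASSE", "straße"))   # False
print(folded_equal("STRASSE", "straße"))  # True

# Turkic dotless i: the default (non-Turkic) folding maps "I" to dotted "i",
# so "I" and "ı" do not match without language-specific tailoring.
print(folded_equal("I", "ı"))  # False

# Cherokee: folding goes *toward* uppercase. Assuming Unicode 8.0+ data,
# CHEROKEE SMALL LETTER A (U+AB70) folds to CHEROKEE LETTER A (U+13A0).
cherokee_small_a = "\uab70"
cherokee_folded = cherokee_small_a.casefold()
print(f"U+{ord(cherokee_folded):04X}")
```

None of these comparisons is "right" for every language, which is the point: any case-insensitive match bakes in a choice of folding rules.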

The bottom line is that, due to the complexities of proper localization, the W3C recommends case-sensitive matching for syntactic content: https://www.w3.org/TR/charmod-norm/#sec_case_sensitive

Further, an audit of case-insensitive magic words turned up a large number of issues: https://gerrit.wikimedia.org/r/c/mediawiki/core/+/891896

For these reasons, all new parser functions, magic words, and syntactic constructs should be specified as case-sensitive, and existing case-insensitive magic words should be deprecated where possible. For example, if FOO exists as a case-insensitive syntactic token, case-sensitive synonyms FOO, foo, and (perhaps) Foo can often be added to cover the vast majority of usage, with the remaining cases linted away.
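As a rough sketch of how that migration could behave during a deprecation period (this is not MediaWiki's actual MagicWord implementation; the `SYNONYMS` table and `resolve` function are hypothetical), exact spellings match directly, while any other casing still resolves but is flagged so a linter can report it:

```python
from typing import Optional, Tuple

# Hypothetical registry: each exact, case-sensitive spelling maps to a
# canonical magic word ID. Only FOO, foo, and Foo are registered.
SYNONYMS = {
    "FOO": "FOO",
    "foo": "FOO",
    "Foo": "FOO",
}


def resolve(token: str) -> Tuple[Optional[str], bool]:
    """Return (canonical_id, needs_lint) for a token found in wikitext."""
    # Preferred path: exact, case-sensitive lookup.
    if token in SYNONYMS:
        return SYNONYMS[token], False

    # Legacy path (assumed to exist only while the deprecation is in effect):
    # accept other casings, but mark them so they can be linted away.
    folded = token.casefold()
    for spelling, canonical in SYNONYMS.items():
        if spelling.casefold() == folded:
            return canonical, True

    return None, False


print(resolve("foo"))  # ('FOO', False)
print(resolve("fOo"))  # ('FOO', True)
print(resolve("bar"))  # (None, False)
```

Once lint reports show no remaining flagged uses, the legacy branch could be removed, leaving only the case-sensitive lookup.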

Event Timeline

Change #891896 had a related patch set uploaded (by C. Scott Ananian; author: C. Scott Ananian):

[mediawiki/core@master] WIP: audit magic words for anomalous parser function translations

https://gerrit.wikimedia.org/r/891896

Isn't this just returning the responsibility for case-folding to the programmer? (I wonder if there should be a Unicode character "Upper Case Dotless I"...)

Bugreporter2 subscribed.

OK but what do you intend to do with all the existing ones listed in https://www.mediawiki.org/wiki/Help:Magic_words ?