
Default $linkPrefixCharset is too broad
Open, Needs Triage, Public

Description

The link prefix feature is enabled in 13 languages (ar, cu, cv, hy, is, ka, kaa, lbe, ln, mzn, pnb, uk, uz) via $linkPrefixExtension = true; in Messages<Code>.php.

This feature pulls characters immediately before a link into the label, e.g. be[[holden]] -> [[holden|beholden]].
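As a rough illustration of that rewrite, the sketch below folds a run of prefix characters into the link label. It is a Python approximation, not MediaWiki's parser code: apply_link_prefix is a hypothetical helper, the character class is the default quoted under Status rewritten with Python escapes, and links that already carry a label are ignored for simplicity.

```python
import re

# Python spelling of the default 'a-zA-Z\\x{80}-\\x{10ffff}' (an approximation
# of the PHP/PCRE class, written with Python escapes).
DEFAULT_PREFIX_CHARSET = "a-zA-Z\u0080-\U0010ffff"


def apply_link_prefix(text: str, charset: str = DEFAULT_PREFIX_CHARSET) -> str:
    """Fold prefix characters directly before [[target]] into the link label.

    Hypothetical helper: only handles unlabelled links such as [[holden]].
    """
    pattern = re.compile(r"([" + charset + r"]+)\[\[([^\[\]|]+)\]\]")
    return pattern.sub(
        lambda m: f"[[{m.group(2)}|{m.group(1)}{m.group(2)}]]", text
    )


print(apply_link_prefix("be[[holden]]"))   # [[holden|beholden]]
print(apply_link_prefix("@[[User:foo]]"))  # unchanged: '@' is not in the class
```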

The link suffix feature, which is enabled by default in all languages, has a default pattern of just letters a-z. Other languages extend this to include their alphabets.

The link prefix feature has a much broader default set of characters: a-z, A-Z, or any character from U+0080 upwards. While this includes every letter in every non-Latin alphabet, it also includes punctuation and various other non-letters. It is also inconsistent with the link suffix behaviour.

This means the effective list of characters that aren't matched is: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~, plus digits 0-9 and basic whitespace.
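The unmatched list can be checked mechanically. The sketch below is a Python approximation of the default class (the PHP value uses \\x{80}-\\x{10ffff}; Python escapes are substituted here) and enumerates the printable ASCII characters it rejects.

```python
import re
import string

# Python spelling of the default prefix class; an approximation of the PCRE one.
DEFAULT_PREFIX_CLASS = re.compile("[a-zA-Z\u0080-\U0010ffff]")

# Printable, non-space ASCII runs from 0x21 ('!') to 0x7E ('~').
unmatched = "".join(
    chr(cp)
    for cp in range(0x21, 0x7F)
    if not DEFAULT_PREFIX_CLASS.fullmatch(chr(cp))
)

print(unmatched)
# The rejected characters are exactly ASCII punctuation plus the digits.
print(set(unmatched) == set(string.punctuation + string.digits))
```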

Some examples of inconsistencies this creates:

  • be[[holden]] correctly matches
  • @[[User:foo]] correctly doesn't match
  • ▶[[User:foo]] incorrectly matches
  • $[[1000]] correctly doesn't match
  • £[[1000]] incorrectly matches (The A in ASCII is American, so they only included '$' :) )
  • ¿[[Que]]? the ¿ incorrectly matches, but the ? doesn't.
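One way to see why these cases are inconsistent is to compare what the default class accepts against each character's Unicode general category. This is a minimal sketch in Python, assuming a "letter" means a category starting with L; the class is a Python approximation of the default, and flag_non_letters is a hypothetical name.

```python
import re
import unicodedata

# Python spelling of the default prefix class; an approximation of the PCRE one.
DEFAULT_PREFIX_CLASS = re.compile("[a-zA-Z\u0080-\U0010ffff]")


def flag_non_letters(chars):
    """Return the characters the default class accepts that are not letters."""
    return [
        ch
        for ch in chars
        if DEFAULT_PREFIX_CLASS.fullmatch(ch)
        and not unicodedata.category(ch).startswith("L")
    ]


examples = ["e", "@", "▶", "$", "£", "¿"]
for ch in examples:
    matched = bool(DEFAULT_PREFIX_CLASS.fullmatch(ch))
    print(f"U+{ord(ch):04X} category={unicodedata.category(ch)} matched={matched}")

print(flag_non_letters(examples))  # the accepted non-letters
```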

Some languages have set $linkPrefixCharset to something sensible, e.g. Icelandic matches it to their $linkTrail (suffix) character set, such that only letters and hyphens will be matched.
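For comparison, the sketch below runs the same problem characters through the Icelandic prefix set as listed under Status in this task. It is a Python approximation for illustration only; the variable names are made up here.

```python
import re

# Value copied from the Icelandic entry under Status in this task.
ICELANDIC_PREFIX_CHARSET = "áÁðÐéÉíÍóÓúÚýÝþÞæÆöÖA-Za-z–-"
icelandic_class = re.compile("[" + ICELANDIC_PREFIX_CHARSET + "]")

# Letters and hyphens are accepted...
accepted = [ch for ch in "aþÖ-" if icelandic_class.fullmatch(ch)]
# ...while the symbols the broad default lets through are rejected.
rejected = [ch for ch in "▶£¿" if not icelandic_class.fullmatch(ch)]

print(accepted)
print(rejected)
```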

However many languages use the overly broad default character set (presumably because it includes their alphabet, and implementors didn't bother to test if it included undesirable characters).

We should audit these 13 languages and restrict the prefix character set (by default to be the same as the suffix character set).

We will probably need to warn affected communities as some existing link labels may change (although visually any changes will be small and easily fixable).

Once this is done, we can reset the default to null so that, in future, languages turning on this feature have to explicitly define a character set.

References

Status

Default prefix = 'a-zA-Z\\x{80}-\\x{10ffff}'
Default linkTrail = '/^([a-z]+)(.*)$/sD'

| Language | linkTrail | Prefix | Notes |
| --- | --- | --- | --- |
| ar Arabic | a-zء-ي | Default | Should be fixed to match linkTrail |
| cu Church Slavic | a-zабвгде<!--truncated-->ќуўџэ҄я“»]+ | „« | Appears to be deliberately set to just quotation marks. Suffix setting is broader. Can be left alone |
| cv Chuvash | a-zа-яĕçăӳ"» | a-zA-Z"\\x{80}-\\x{10ffff} | Default + ", should be fixed to match linkTrail but with inverted quotes. |
| hy Armenian | a-zաբգդեզէըթժիլխծկհձղճմյնշոչպջռսվտրցւփքօֆև«» | Default | Should be fixed to match linkTrail |
| is Icelandic | áðéíóúýþæöa-z-– | áÁðÐéÉíÍóÓúÚýÝþÞæÆöÖA-Za-z–- | Nothing to fix! |
| ka Georgian | a-zაბგდევზთიკლმნოპჟრსტუფქღყშჩცძწჭხჯჰ“» | Default | Should be fixed to match linkTrail |
| kaa Karakalpak | [a-zıʼ’“»]'(?!') | a-zıA-Zİ\\x80-\\xff | Probably OK to be left as-is, but could include opening quotes as close quotes are in suffix. Not sure the 80-ff range is required. |
| lbe Lak | a-zабвгдеёжзийклмнопрстуфхцчшщъыьэюяӀ1“» | Default | Should be fixed to match linkTrail |
| ln Lingala | Default | Default | Should be fixed to match linkTrail (which is also empty). Not sure this was turned on deliberately, so could also be turned off. |
| mzn Mazanderani | Default | Default | Should probably be turned off as linkTrail is only set to a-z and this language doesn't use the Latin alphabet. |
| pnb Western Punjabi | Default | Default | Same as mzn |
| uk Ukrainian | a-zабвгґдеєжзиіїйклмнопрстуфхцчшщьєюяёъы“» | „« | Appears to be deliberately set to just quotation marks. Suffix setting is broader. Can be left alone |
| uz Uzbek | a-zʻʼ“» | a-zA-Z\\x80-\\xffʻʼ«„ | Okay to be left as is. Not sure the 80-ff range is required. |

Event Timeline

Once all the languages listed as "Default" above are fixed, we can remove the default value in MessagesEn.php and instead throw an error if a $linkPrefixCharset is not set while $linkPrefixExtension is enabled.
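The proposed check could look roughly like the following. This is only a sketch of the logic in Python, not MediaWiki code: resolve_link_prefix_charset and LinkPrefixConfigError are hypothetical names, and where such a check would actually live is left open.

```python
from typing import Optional


class LinkPrefixConfigError(ValueError):
    """Hypothetical error for a language that enables prefixes without a charset."""


def resolve_link_prefix_charset(
    lang_code: str,
    link_prefix_extension: bool,
    link_prefix_charset: Optional[str],
) -> Optional[str]:
    """Return the charset to use, or None when the feature is off."""
    if not link_prefix_extension:
        return None
    if not link_prefix_charset:
        # No silent fallback to a broad default: the language must opt in explicitly.
        raise LinkPrefixConfigError(
            f"{lang_code}: $linkPrefixExtension is enabled "
            "but $linkPrefixCharset is not set"
        )
    return link_prefix_charset
```

With this shape, a language that leaves the feature off is unaffected, while one that turns it on without a character set fails loudly instead of inheriting the broad range.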