
The default (English) link trail ($linkTrail) only matches [a-z]
Open, Low, Public

Description

Using a more extensive default $linkTrail regex was tried in 416c84480 (although I don't know why it used \p{L&} rather than [[:alpha:]] or \p{L}), but this was reverted in 5b97a5bb due to T17035: Link trail uses PHP 5.1 only feature. We could probably try that again (without the apostrophe, see T16655) now that we require a more modern version of PHP with a more modern version of PCRE. It may not work correctly for alphabets other than Latin.


Please have a look at
https://www.wikidata.org/wiki/MediaWiki:Gadget-rightsfilter/hu
where the link has the form [[:hu:Reguláris kifejezés|regex]]alapú
and "ú" is not linked.
This is a multilingual project, so all letters should be included in links, or at least all Hungarian letters. :-)

Details

Reference
bz45126

Event Timeline

bzimport raised the priority of this task to Low. Nov 22 2014, 1:36 AM
bzimport set Reference to bz45126.
bzimport added a subscriber: Unknown Object (MLST).

This is about link trails, which are part of the localization and would be very difficult to get right for everyone. In this case I would say put the text you want linked inside the link itself and do not depend on the link trails.

It could also be that the content language is still English in some cases, but it doesn't seem so from the content source.

Lydia_Pintscher removed a subscriber: Unknown Object (MLST).
Lucie set Security to None.
matej_suchanek renamed this task from Non-English letters aren't linked to Non-English letters aren't linked on multilingual projects. Dec 22 2015, 12:53 PM
matej_suchanek updated the task description.
matej_suchanek removed a subscriber: Wikidata-bugs.

This seems to be a problem on all multilingual projects...

For now the workaround is to place the full text of the link within the brackets and not expect it to expand over the letters touching it outside (this expansion only works for ASCII letters, not for non-ASCII letters that are UTF-8 encoded).

There's a regexp that controls how many characters are appended to the link, but it only accepts ASCII digits [0-9], ASCII letters [A-Za-z] and a few "continuing" punctuation characters such as [_].
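
For illustration, a link trail is applied roughly like this (a minimal sketch, not the actual MediaWiki parser code; the trail value shown is an assumption based on the task title, which says the English default only matches [a-z]):

  <?php
  // Sketch of how a link trail is consumed after "]]" (assumed behaviour).
  // The default English trail is assumed to be '/^([a-z]+)(.*)$/sD',
  // i.e. only ASCII lowercase letters are pulled into the link.
  $linkTrail = '/^([a-z]+)(.*)$/sD';

  // Text that follows the closing "]]" of "[[Dog]]s running":
  $afterBrackets = 's running';

  if ( preg_match( $linkTrail, $afterBrackets, $m ) ) {
      $linkedText = 'Dog' . $m[1];  // "Dogs": the "s" is folded into the link
      $remainder  = $m[2];          // " running" stays outside the link
  }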

To cover non-ASCII content (including for English!) we should still not include all non-ASCII characters, but only characters that fall into some simple class. The problem is that regexps matching UTF-8 characters can be fairly complex to write (though still possible), and the set is not a closed subset of the Unicode repertoire, because Unicode is extensible: it regularly adds letters, digits and other symbols which may or may not belong in links (such as emoji). We should exclude whitespace, but include all combining characters, plus a few controls such as the CGJ/ZWJ/ZWNJ/WJ joiners needed in the standard orthography of some languages.
At each revision of Unicode, we would need to regenerate this matching regexp.

Unicode provides data for character subsets useful for identifiers (ID_Start, ID_Continue) that we could use for matching these textual extensions of wikilinks (after the two closing brackets). It is easily recompilable into a UTF-8 matching regexp. Such a regexp is already used internally by JavaScript and HTML5 for recognizing their valid identifiers.
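
A rough approximation of ID_Continue can be written with PCRE Unicode property classes (a sketch only: it omits the Other_ID_* exceptions and is not what MediaWiki uses today):

  <?php
  // Approximate ID_Continue with general categories: letters, letter
  // numbers, decimal digits, combining marks and connector punctuation.
  // This character set is an assumption for discussion, not the exact
  // Unicode definition.
  $idContinueTrail = '/^([\p{L}\p{Nl}\p{Nd}\p{Mn}\p{Mc}\p{Pc}]+)(.*)$/suD';

  preg_match( $idContinueTrail, 'alapú szöveg', $m );
  // $m[1] === 'alapú': the accented letter is captured; the space ends the run.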

However this could cause problems with languages that usually don't use whitespace to separate words in their scripts (notably Chinese, Japanese or Korean with their sinographic scripts, and a few other Asian scripts such as Thai): we should not extend links across all touching sinograms, which could swallow a full sentence that is not really relevant; it would obscure the page with overlong links everywhere.

We could also break the link if there's a strong script change (e.g. any switch from CJK sinograms to Latin, including the "wide" Latin characters used in CJK, or the reverse), but not for characters belonging to weak scripts (e.g. ASCII digits, shared across many scripts). Using regexps in this case would be more complex, as we would need a distinct regexp depending on the strong script of the last character of the link text within the brackets.

So I think that writing "[[Link]]s" (or "[[Target|Link]]s") instead of "[[Target|Links]]" (or "[[Link|Links]]") is just an old (deprecated) trick of MediaWiki used by (lazy) wiki editors; we cannot recommend using it:

The longer form is preferable (with all characters to be included in the link placed within the brackets), even if there is still some support for ASCII-only trail extensions (after the closing brackets) matching /[0-9A-Z_a-z]+/.

And I see no reason to handle this case only to cover extensions of the Basic Latin script (or even just for Hungarian), or even only other alphabetic scripts (Cyrillic, Greek), abjads (Hebrew, Arabic) and some Indic abugidas (not all of which use whitespace for word separation!), or simple syllabaries (such as Ethiopic, or Canadian Syllabics: should we include Hangul, Bopomofo and Katakana/Hiragana when they are mixed with sinograms and not necessarily whitespace-separated?).

OK. Thanks for the detailed explanation.

matmarex subscribed.

No, this is a perfectly valid bug. Unfortunately "link trails" are currently specified per-language, and the link trail for English doesn't include any accented characters. For example it's defined like this for Hungarian: $linkTrail = '/^([a-záéíóúöüőűÁÉÍÓÚÖÜŐŰ]+)(.*)$/sDu';, but these characters will only work on a Hungarian-language wiki.
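
To make the difference concrete, here is a comparison of the two trails against the word from the original report (the English trail value is an assumption based on the task title; the Hungarian one is quoted above):

  <?php
  // Sketch: the same trailing text processed by the (assumed) default
  // English trail and by the Hungarian one. Not actual parser code.
  $english   = '/^([a-z]+)(.*)$/sD';
  $hungarian = '/^([a-záéíóúöüőűÁÉÍÓÚÖÜŐŰ]+)(.*)$/sDu';

  preg_match( $english, 'alapú', $m );
  // $m[1] === 'alap': the "ú" is dropped, as reported on wikidata.org

  preg_match( $hungarian, 'alapú', $m );
  // $m[1] === 'alapú': works, but only where Hungarian is the content language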

matmarex renamed this task from Non-English letters aren't linked on multilingual projects to Non-English letters aren't linked on multilingual projects (because they're not included in link trail). Jul 27 2016, 12:59 PM
matmarex added subscribers: Thgoiter, TTO, Luke081515 and 5 others.

Essentially I think the linktrail for en should include all Latin-alphabet characters in the regex. Would /^(\p{Latin}+)(.*)$/ work?

Although, oddly enough, most uppercase letters are left out of the linktrails. For the Hungarian example above, only accented uppercase letters are matched, not regular ones. Most languages don't include any uppercase letters in their linktrail string. Is this strange behaviour something we want to preserve, or is it just a historical quirk?

"Is this strange behaviour something we want to preserve, or is it just a historical quirk?"
Or it just survived testing, being a rare case.

Historical note: Using a more extensive regex was tried in 416c84480 (although I don't know why it used \p{L&} rather than [[:alpha:]] or \p{L}), but this was reverted in 5b97a5bb due to T17035: Link trail uses PHP 5.1 only feature. We could probably try that again (without the apostrophe, see T16655) now that we require a more modern version of PHP with a more modern version of PCRE.

In my opinion [[:alpha:]] or \p{L} would be wrong for Chinese and Thai at least! They would produce overlong links stuck side by side everywhere and covering most content in all paragraphs. In Wikipedia, almost all the text of paragraphs would be blue, except the start of paragraphs without significant words.

Such link trails should contain only letters or digits in the same script as the last strong script used within the inner text of the link, and only if that script is alphabetic or an abjad, but probably not Indic abugidas and not any sinographic script. (A good question: can we extend a run of kana after a kanji in the inner text of the link? I don't think so, but we can probably extend runs of kana.)

The Unicode rule for matching identifiers is not usable for this (we are not in the context of a programming language but in the context of natural-language text, so the punctuation/spacing rules are very different).

If we want a regexp, for now let's focus only on Latin, Greek, Cyrillic, Arabic, Hebrew, and possibly Hangul, breaking the run if there's any change of script (but not for the "weak" digits belonging to all scripts). Extending the regexp to other scripts should be tested first (the hint here is to look at the existing CLDR rules for *word breakers* and at the part of the Unicode standard covering this complex aspect). And maybe also include the line-break rules to restrict the link trails further and get shorter runs.

However, this should not be very language-specific: it should rather be script-specific.
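
As a rough sketch of a script-aware trail (assuming a whitelist of scripts and treating decimal digits as script-neutral; this only keeps the trail itself script-uniform and does not yet compare it against the last script inside the link text):

  <?php
  // Sketch: the trail may only continue within a single strong script,
  // chosen from a whitelist, with decimal digits allowed as "weak"
  // characters shared by all scripts. Assumed behaviour for discussion,
  // not an existing MediaWiki option.
  $scripts = [ 'Latin', 'Greek', 'Cyrillic', 'Arabic', 'Hebrew' ];
  $alternatives = [];
  foreach ( $scripts as $script ) {
      $alternatives[] = sprintf( '[\p{%s}\p{Nd}]+', $script );
  }
  $trail = '/^(' . implode( '|', $alternatives ) . ')(.*)$/suD';

  preg_match( $trail, 'alapúλέξη', $m );
  // $m[1] === 'alapú': the run stops where the script switches to Greek.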

"Such link trails should contain only letters or digits in the same script as the last strong script used within the inner text of the link [...]"

There are some rules for semi-sensible word breaking in CJK languages. VisualEditor implements some of them to be able to, e.g., extend the selection to the word around the cursor when inserting an internal link. I'm not sure if they're simple enough to implement as a regexp; @dchan can probably advise.

Note also that a few languages written in the Latin script also use "agglutination" to create very long custom "words" using many prefixes/infixes/suffixes (and often some internal mutations). Such extensions may break, for example, in Finnish or Hungarian (Hungarian was cited at the top of this bug).

Anyway, I don't think that link trails are a very good idea: users can't control them easily (and should not have to insert some arbitrary separation token such as "<nowiki/>" to limit these trails). Link trails were added only for lazy English writers who want to create links to a singular noun by placing the final "s" of the displayed plural in the trail, or for placing some common suffixes ("ed", "s") of conjugated verbs (but it does not work when there are more complex mutations such as "*y -> *ies").

This concept of link trails does not even exist in HTML. And anyway, with the VisualEditor we need it much less now, as it is simpler to edit links separately from their visible text content.

I can say that link trails work well in Polish, which has fairly complex conjugation and declension rules. I'd make a wild guess that it works in around 50% of cases (in English it's more like 99%, as you say). There is a very popular gadget on Polish Wikimedia projects that applies wikitext code cleanup, and one of the cleanups it does is converting things like [[Dog|dogs]] to [[dog]]s. I would be very sad if we wanted to get rid of them.

Nemo_bis renamed this task from Non-English letters aren't linked on multilingual projects (because they're not included in link trail) to The default (English) link trail ($linkTrail) only matches [a-z]. Jul 31 2016, 8:26 AM
Nemo_bis updated the task description.

I don't think that "cleanup" operations such as the automated replacement of [[Dog|dogs]] with [[dog]]s are useful. In fact they are counterproductive (and unnecessary for the VisualEditor, which should completely avoid using these link trails).

My opinion is that we should deprecate these old link trails and in fact correct them in the other direction (link trails are just a minor facility for lazy English contributors using the Wiki code editor).

In fact, even the wiki editor would do better to automatically convert these trails (only [a-z]) into regular links without them. It would also simplify the parsers for bots and analysers. Contributors should always use regular links, which are more readable and more explicit about the effective range of text covered by the link.

For years it was corrected in the other direction. It would be mad to change this now.

Hungarian was cited because the current conservative linktrail rules break Hungarian words. \p{L} or \p{L&} would work well for Hungarian (I imagine the latter was used to avoid affecting CJK languages; no idea how well that works). Even \p{Latin} would be fine.

\p{Latin} would not be sufficient to avoid breaking words. There are languages that need other characters besides Latin letters, notably combining sequences. See for example Vietnamese, which can use letters with two diacritics! See also Chinese pinyin text and some African languages.

In fact the minimum would require adding all combining characters (really needed to avoid breaking default grapheme clusters), all joiner controls, digits (including their wide versions used in CJK contexts, because \p{Latin} also includes wide letters), and a few punctuation characters (and the soft hyphen, but probably not the hyphen-minus).
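
A hedged sketch of what an extended Latin trail could look like with combining marks and joiners added (the exact character set here is only an assumption for discussion):

  <?php
  // Sketch: Latin letters plus combining marks, decimal digits, the
  // ZWNJ/ZWJ joiners and the soft hyphen. The chosen set is an
  // assumption, not a vetted proposal.
  $latinTrail = '/^([\p{Latin}\p{M}\p{Nd}\x{200C}\x{200D}\x{AD}]+)(.*)$/suD';

  // Vietnamese keeps its diacritics, even in decomposed (NFD) form:
  preg_match( $latinTrail, 'Việt Nam', $m );
  // $m[1] === 'Việt': combining marks no longer break the trail.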

Handling the ASCII apostrophe is complex (even more complex in MediaWiki due to its use to mark up bold and italic styles: an initial parse of that wiki syntax would be needed to know whether these are apostrophes or markup...).

Note also that the apostrophe may also be used as a quote (this is covered in more detail in the Unicode word-break rules), and that this concerns not only the ASCII apostrophe but also the curly apostrophe.

Again, we really need to look at the Unicode standard:

  1. A MUST-HAVE in all cases, for all languages: never break a combining sequence before any character with a non-zero combining class (and not even before CGJ), and ensure that canonically equivalent strings always break identically (needed for strict process conformance as required by the standard in all algorithms).
  2. The rules for default grapheme cluster breaking (this adds some unbreakable sequences for Latin text); these rules are language-neutral (there are a few rules to take into account for Latin text, notably "spacing diacritics" and joiners; we could exclude here some "letterlike" symbols that are not really Latin).
  3. The rules for word breaking (even if we consider only Latin-written languages); these rules, however, are language-sensitive (English is simple at this step, but not all Latin-written languages; this is where apostrophes, hyphens... come in).

In summary, it will be much more complex than just a regular expression: it requires specific parser code (look at the ICU implementation).
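
For what it's worth, PHP already exposes ICU's word-break rules through the intl extension; a minimal sketch of using them to decide how far a trail could extend (assuming the intl extension is available; just an illustration, not a proposal for the parser):

  <?php
  // Sketch: let ICU's word-break rules (UAX #29, as exposed by the
  // intl extension) find the first word after "]]" instead of a regexp.
  $afterBrackets = 'alapú szöveg';

  $it = IntlBreakIterator::createWordInstance( 'hu' );
  $it->setText( $afterBrackets );

  foreach ( $it->getPartsIterator() as $part ) {
      $trailWord = $part;   // 'alapú': the first segment up to a word boundary
      break;
  }

Whether these rules would give acceptable link lengths for unspaced scripts such as Chinese or Thai would still need testing, which is exactly the concern raised above.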

"link trails are just a minor facility for lazy English contributors using the Wiki code editor."

Such links can be used in many languages. On the German Wikipedia they are used extensively!