
linktrail for digraphs with apostrophe or grave accent
Closed, ResolvedPublic

Description

Author: alefzet

Description:
[[:en:Karakalpak language]] uses digraphs with apostrophes like A', N', O', U'
[[:en:Uzbek language]] uses digraphs with a grave accent, like G` and O`, and the apostrophe (') as a separate letter.
See [[:en:Alphabets derived from the Latin]]

r36253 introduced $linkTrail = '/^(\'?\p{L&}+)(.*)$/usD'; which works well for many languages, but not for Karakalpak and Uzbek.

[[a'bc]]de becomes <u>a'bcde</u>
[[abc]]'de => <u>abc'de</u>
[[abc]]d'e => <u>abcd</u>'e rather than <u>abcd'e</u>

[[a`bc]]de becomes <u>a`bcde</u>
[[abc]]`de => <u>abc</u>`de rather than <u>abc`de</u>
[[abc]]d`e => <u>abcd</u>`e rather than <u>abcd`e</u>
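
To make the behaviour above easier to reproduce, here is a minimal sketch in Python rather than PHP. It is only an approximation of the r36253 pattern: Python's `re` module has no `\p{L&}`, so `[^\W\d_]` (roughly "any Unicode letter") stands in for it, and the name `split_trail` is made up for this illustration.

```python
import re

# Approximation of '/^(\'?\p{L&}+)(.*)$/usD'.
# Assumption: [^\W\d_] is used in place of \p{L&}, which Python's re lacks.
LINK_TRAIL = re.compile(r"^('?[^\W\d_]+)(.*)$", re.DOTALL)

def split_trail(text):
    """Split the text following ]] into (part pulled into the link, remainder)."""
    m = LINK_TRAIL.match(text)
    if m is None:
        # No trail at all: the link ends at ]] and the text stays outside.
        return ("", text)
    return (m.group(1), m.group(2))

for trail in ["'de", "d'e", "`de", "d`e"]:
    print(repr(trail), "->", split_trail(trail))
```

An apostrophe is accepted only as the very first character of the trail, and a grave accent is never accepted, which is why the trail stops before `'e`, `` `de `` and `` `e `` in the examples.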

I am not an expert on regular expressions. What would a correct regex look like?

Maybe we need to introduce a new variable listing specified punctuation characters (dependent on language) that are treated as letters or letter elements?


Version: 1.13.x
Severity: enhancement

Details

Reference
bz14539

Event Timeline

bzimport raised the priority of this task to Medium. Nov 21 2014, 10:09 PM
bzimport set Reference to bz14539.
bzimport added a subscriber: Unknown Object (MLST).

Wait, what is the issue?

A) [[Foo]]Bar's does not link the 's currently.
B) ` should be included as a punctuation character.
C) [[Abc]]'de includes the 'de as part of the link but it shouldn't in Karakalpak or Uzbek.

A ToDo of mine was to move the definition of the flat character pattern for things that match a letter into a constant. That way we can create slightly altered $linkTrails for languages which need special exceptions (like » in Ba, which can't be included because it would break other languages), and it would also simplify the creation of more complex regexes.

If the issue is A), that's a ToDo of mine. The more complex regex noted above is one which uses the character classes twice to allow for a single inclusion of ', but not restricted to the start.
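
A rough sketch of what such a regex could look like, written in Python for illustration. This is a guess at the shape of the idea, not a pattern that was ever committed. `LETTER` and `split_trail` are hypothetical names, and `[^\W\d_]` is an assumed stand-in for `\p{L&}`, which Python's `re` module does not support.

```python
import re

# Assumption: [^\W\d_] approximates the PCRE letter class \p{L&}.
LETTER = r"[^\W\d_]"

# The letter class appears twice: optional letters, at most one apostrophe,
# then at least one letter. The apostrophe may sit at the start or inside
# the trail, but never at its end.
LINK_TRAIL = re.compile(r"^(%s*'?%s+)(.*)$" % (LETTER, LETTER), re.DOTALL)

def split_trail(text):
    """Split the text following ]] into (part pulled into the link, remainder)."""
    m = LINK_TRAIL.match(text)
    return (m.group(1), m.group(2)) if m else ("", text)

print(split_trail("Bar's"))   # the 's is now part of the trail
print(split_trail("d'e'f"))   # only the first apostrophe is absorbed
```

Because the pattern must end on a letter, a trailing apostrophe as in `d'` is left outside the link.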

If the issue is B), that's the reason why I created [http://lists.wikimedia.org/pipermail/wikitech-l/2008-June/038323.html this thread] in wikitech-l. It would be helpful to create a good list there, and also to get input on which characters should be common and which shouldn't. (The intent is to make linkTrails as language-independent as possible.)

If the issue is C), then I can fix that by working on the note above, and creating a definition for those two languages but without the punctuation.

alefzet wrote:

The issue is described above. Please read carefully:

[[abc]]d'e => <u>abcd</u>'e (incorrect) rather than <u>abcd'e</u> (correct)

[[abc]]`de => <u>abc</u>`de (incorrect) rather than <u>abc`de</u> (correct)
[[abc]]d`e => <u>abcd</u>`e (incorrect) rather than <u>abcd`e</u> (correct)

Solutions may be:
a. include ' and ` as characters in the regex, or
b. introduce a special per-language variable in the Messages file that adds the above characters to the regex (when needed)
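
Option b might look roughly like the following Python sketch. Everything here is hypothetical: the table `EXTRA_TRAIL_CHARS`, the functions `build_link_trail` and `split_trail`, and the use of `[^\W\d_]` as a stand-in for `\p{L&}` are assumptions for illustration, not existing MediaWiki settings.

```python
import re

# Assumption: [^\W\d_] approximates the PCRE letter class \p{L&}.
LETTER = r"[^\W\d_]"

# Hypothetical per-language setting: punctuation to treat as letter elements.
EXTRA_TRAIL_CHARS = {
    "kaa": "'",    # Karakalpak: A', N', O', U'
    "uz": "'`",    # Uzbek: G`, O` and ' as a separate letter
}

def build_link_trail(lang):
    """Build a link trail regex, widened by the language's extra characters."""
    extra = EXTRA_TRAIL_CHARS.get(lang, "")
    if extra:
        unit = "(?:%s|[%s])" % (LETTER, re.escape(extra))
    else:
        unit = LETTER
    return re.compile(r"^(%s+)(.*)$" % unit, re.DOTALL)

def split_trail(lang, text):
    """Split the text following ]] into (part pulled into the link, remainder)."""
    m = build_link_trail(lang).match(text)
    return (m.group(1), m.group(2)) if m else ("", text)

print(split_trail("uz", "d`e"))
print(split_trail("en", "d`e"))
```

Languages without an entry in the table keep a letters-only trail, so the change would stay local to the languages that ask for it.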

Oh right, need to close this one.
Bug 14655 ended up causing me to remove the feature of ' being included inside of the LinkTrail in r36693.