Page MenuHomePhabricator

ISBN, RFC and PMID "magic links" might support non-breaking spaces ( ) too
Closed, ResolvedPublic

Description

Author: vlakoff

Description:
Hello,

ISBN, RFC and PMID "magic links" might support non-breaking spaces ( ) too, like:

ISBN 978-1-2345-6789-0

see also:
http://www.mediawiki.org/wiki/Markup_spec/BNF/Magic_links


Version: unspecified
Severity: enhancement

Event Timeline

bzimport raised the priority of this task from to Medium.Nov 21 2014, 11:29 PM
bzimport added a project: MediaWiki-Parser.
bzimport set Reference to bz28950.
bzimport added a subscriber: Unknown Object (MLST).

Might as well do it when fixing bug 29025
(a literal 0160 char, not when done with entities)

vlakoff wrote:

patch proposal for Bug 28950 and Bug 29025

First patch proposal for Bug 28950 and Bug 29025. Seems to be working great, nevertheless any suggestion would be very welcome.

The benefits of this patch are:

  • (Bug 28950) non-breaking spaces (both literal char and HTML entities) support
  • (Bug 29025) no surprising link creation if several \n's (like "ISBN\n\n1234567890")

The only limitation I am aware of is that \n isn't implemented (yet), so for example "ISBN\n1234567890" doesn't produce a link. But don't forget cases like "ISBN \n123...", "ISBN\n 123..." (<pre> insertion!), "ISBN\n&nbsp;123...", and so on.

\n support is feasible, I don't know if it would be that useful, however I'd like to be as close as possible to "normal" wikicode parsing.

attachment magic_links.diff ignored as obsolete

sumanah wrote:

I've added the "patch" and "need-review" tags so developers know to review this. Thanks for the patch! Next time you can do that yourself to speed things up.

vlakoff wrote:

patch proposal (v2) for Bug 28950 and Bug 29025

New patch proposal for Bug 28950 and Bug 29025.

Now also perfectly manages \n's :-)

Here's my approach: the \n's are captured by the regex, as does the production code, but no link is created if the spaces between "ISBN" and "123..." met any of these conditions:

  • several \n's (not necessarily consecutive); which forcibly leave the current <p>
  • \n followed by a normal space; which trigger <pre> creation

Also, I removed the following lines I added in my previous patch:
$spaces = preg_replace( '![ \t]+!', ' ', $spaces );
$spaces = preg_replace( '!&\#0?160;|\x{0160}!u', '&nbsp;', $spaces );

I noticed the parser usually doesn't do this cleanup (it only replaces entities with &#160;), so for consistency and performances I just let the spaces unmodified.

Please inform me of anything I would have forgotten.

Sumana, you're welcome; I'm keeping it in mind for the next time.

Attached:

The entities look a bit like cluttering. Also, why do you need to
replace the spaces?

vlakoff wrote:

I'm not sure to understand what you mean?

I just support the different ways of providing a non-break space in the wikisource, which are the entities &nbsp;, &#160;, &#0160;, as well as the literal \u00A0 char. It is important for consistency to recognize all of them and not just an arbitrary part.

Also, I said I don't replace anymore the spaces ;-) I realized after my first patch that it was useless...

wicke wrote:

The latest patch would remove linking for five links in the English Wikipedia found by the following regexp:

'(?:(?:RFC|PMID)[ \t\n\r\f]*(?:[\n\f\r][ \t\n\r\f]*){2,}([0-9]+)|ISBN[ \t\n\r\f]*(?:[\n\f\r][ \t\n\r\f]*){2,}(\b(?:97[89][ -]?)?(?:[0-9][ -]?){9}[0-9Xx]\b))'


Match: [[Ridda wars]]

on, Trident Press Ltd., 2001, p. 81-84. ISBN

1-900724-47-2.</ref>

Buzakha

On receiving i

Match: [[Hiroyuki Agawa]]

Imperial Navy<br>ISBN 9780870113550<br>ISBN

9784770025395

[[Biography]]; translation by John Bes

Match: [[USS Fulton (AS-11)]]

Ships'', pg. 377. Amber Books, London. ISBN

9781905704439 </ref>, launched on 27 December 1940 by

Match: [[Leinster League Division Two]]

1997/1998 Ashbourne

1998/1999 Coolmine RFC

1999/2000 [[Garda RFC|Garda]]

2000/2001 Ark

Match: [[Bluff Cove Air Attacks]]

sic primer''. Diane publishing, p. 235. ISBN

1585660914</ref>

A total of 56 British servicemen

This is rare enough (and can easily be fixed manually). In the patch, the checks for the number of newlines and the leading space (pre) could also be folded into the regexp. Apart from that, it would be a good idea to benchmark the performance impact from switching the regexp to unicode mode.

vlakoff wrote:

Thank you for this deep research. These articles should be corrected manually. Also, your 4th match seems to be wrong ;)

About the unicode regex mode, if the performance impact is worth it, we could remove that "\x{00A0}" part. It shouldn't be a big problem since directly using the nbsp char is a bad idea: not visible, and removed from textarea by some browsers.

wicke wrote:

I have not tested the performance impact yet, but have doubts about promoting manual non-breaking space insertion.

An alternative to consider would be to insert non-breaking spaces automatically in these links, which would avoid cluttering the source with ugly HTML entities. The length of IDs and especially of RFC/ISBN/PMID prefixes is quite short, so a non-breaking space should not cause major display problems, even on small displays. ISBN links in narrow table columns (infoboxes maybe) could still be an issue, and would need to be checked.

Independent of the non-breaking space issue I am in favor of restricting the number of newlines between link prefix and ID just as you did in your patch. Could you rework the patch to include the count directly in the regexp, similar to my regexp above?

wicke wrote:

The trunk version of this regexp already uses Unicode regexp mode, so your patch should not affect performance significantly.

The current patch no longer applies on current trunk though. Could you refresh and tweak it?

sumanah wrote:

(In reply to comment #10)

The current patch no longer applies on current trunk though. Could you refresh
and tweak it?

Thus marking "reviewed" - thanks, Gabriel.

Of the 5 exceptions I have corrected 3, one was corrected already (but there was an instance of a tab character, and one was a false positive (Rugby Football Club, not Request For Comments).

audreyt wrote:

Hi Vlakoff, thank you for the patch!

As you may already know, MediaWiki is currently revamping its PHP-based parser
into a "Parsoid" prototype component, to support the rich-text Visual Editor
project:

https://www.mediawiki.org/wiki/Parsoid
https://www.mediawiki.org/wiki/Visual_editor

Folks interested in enhancing the parser's capabilities are very much welcome
to join the Parsoid project, and contribute patches as Git branches:

https://www.mediawiki.org/wiki/Git/Tutorial#How_to_submit_a_patch

Compared to .diff attachments in Bugzilla tickets, Git branches are much easier
for us to review, refine and merge features together.

Each change set has a distinct URL generated by the "git review" tool, which
can be referenced in Bugzilla by pasting its gerrit.wikimedia.org URL as a
comment.

If you run into any issues with the patch process, please feel free to ask on
irc.freenode.net #wikimedia-dev and the wikitext-l mailing list. Thank you!

Patch at https://gerrit.wikimedia.org/r/133651

I've also got parsoid patches for this behavior in progress, but I'll make the changes to parsoid only after we've pushed the change to PHP land and parser tests.

Change 133651 had a related patch set uploaded (by Cscott):
Don't allow embedded newlines in magic links, but do allow &nbsp;

https://gerrit.wikimedia.org/r/133651

Patch-For-Review

Change 133651 merged by jenkins-bot:
Don't allow embedded newlines in magic links, but do allow &nbsp;

https://gerrit.wikimedia.org/r/133651