Page MenuHomePhabricator

[extlink] parsing - link cannot contain language variant or extension tags
Open, Needs TriagePublic

Description

In pegTokenizer.pegjs, we have

extlink_preprocessor_text_parameterized
  = r:( $[^'<~[{\n\r|!\]}\-\t&="' \u00A0\u1680\u180E\u2000-\u200A\u202F\u205F\u3000]+
    / !inline_breaks s:( directive / no_punctuation_char / [&|{\-] ) { return s; }
    // !inline_breaks no_punctuation_char
    / $([.:,] !(space / eolf))
    / $(['] ![']) // single quotes are ok, double quotes are bad

and

directive
  = comment
  / extension_tag
  / tplarg_or_template
  / & "-{" v:lang_variant_or_tpl { return v; }
  / & "&" e:htmlentity { return e; }
  / include_limits

However, we cannot have language variant or extension tags in links. The code should be changed to

extlink_preprocessor_text_parameterized
  = r:( $[^'<~[{\n\r|!\]}\-\t&="' \u00A0\u1680\u180E\u2000-\u200A\u202F\u205F\u3000]+
    / !inline_breaks s:( directive_in_extlink / no_punctuation_char / [&|{\-] ) { return s; }
    // !inline_breaks no_punctuation_char
    / $([.:,] !(space / eolf))
    / $(['] ![']) // single quotes are ok, double quotes are bad
    )+ {
        return tu.flattenString(r);
    }

and add

directive_in_extlink
  = !extension_tag
  comment
  / tplarg_or_template
  / & "&" e:htmlentity { return e; }
  / include_limits

Test commands:

echo '[http://www.google.com/search?q=-{zh-cn:中国;zh-tw:中華民國}- google search]' | node bin/parse.js --normalize=parsoid --prefix=zhwiki
echo '[http://www.google.com/search?q=<nowiki>a</nowiki>]' | node bin/parse.js --normalize=parsoid
echo '[http://www.google.com/search?q=<pre>a</pre>]' | node bin/parse.js --normalize=parsoid

Event Timeline

Dan1wang created this task.May 2 2019, 4:00 AM
Restricted Application added subscribers: jeblad, Cosine02, Aklapper. · View Herald TranscriptMay 2 2019, 4:00 AM
cscott added a subscriber: cscott.May 2 2019, 1:50 PM

This is a weird corner case of legacy parser behavior; it's not entirely clear Parsoid should follow it. But it's not clear that what Parsoid is doing is totally consistent either.

(Setting $wgLanguageCode='zh':)

$ echo '[http://www.google.com/search?q=-{zh-cn:中国;zh-tw:中華民國}- -{zh-cn:baidu;zh-tw:google}- search]' | php maintenance/parse.php 
<p><a rel="nofollow" class="external text" href="http://www.google.com/search?q=-{zh-cn:中国;zh-tw:中華民國}-">baidu search</a>

$ echo '[http://www.google.com/search?q=<b>foo</b> search]' | php maintenance/parse.php 
<p><a rel="nofollow" class="external text" href="http://www.google.com/search?q="><b>foo</b> search</a>
</p>

$ echo '[http://www.google.com/search?q={{1x|<b>foo</b>}} search]' | php maintenance/parse.php 
<p><a rel="nofollow" class="external text" href="http://www.google.com/search?q="><b>foo</b> search</a>

vs Parsoid:

$ echo '[http://www.google.com/search?q=-{zh-cn:中国;zh-tw:中華民國}- -{zh-cn:baidu;zh-tw:google}- search]' | bin/parse.js  --normalize=parsoid --domain=zh.wikipedia.org
<p><a typeof="mw:ExpandedAttrs" rel="mw:ExtLink" href="http://www.google.com/search?q=" class="external text" data-mw='{"attribs":[[{"txt":"href"},{"html":"http://www.google.com/search?q=&lt;span typeof=\"mw:LanguageVariant\" data-mw-variant=&apos;{\"twoway\":[{\"l\":\"zh-cn\",\"t\":\"中国\"},{\"l\":\"zh-tw\",\"t\":\"中華民國\"}]}&apos;>&lt;/span>"}]]}'><span typeof="mw:LanguageVariant" data-mw-variant='{"twoway":[{"l":"zh-cn","t":"baidu"},{"l":"zh-tw","t":"google"}]}'></span> search</a></p>

$ echo '[http://www.google.com/search?q=<b>foo</b> search]' | bin/parse.js --normalize=parsoid
<p><a rel="mw:ExtLink" href="http://www.google.com/search?q=" class="external text"><b>foo</b> search</a></p>

$ echo '[http://www.google.com/search?q={{1x|<b>foo</b>}} search]' | bin/parse.js --normalize=parsoid
<p><a typeof="mw:ExpandedAttrs" rel="mw:ExtLink" href="http://www.google.com/search?q=foo" class="external text" data-mw='{"attribs":[[{"txt":"href"},{"html":"http://www.google.com/search?q=&lt;b typeof=\"mw:Transclusion\" data-mw=&apos;{\"parts\":[{\"template\":{\"target\":{\"wt\":\"1x\",\"href\":\"./Template:1x\"},\"params\":{\"1\":{\"wt\":\"&amp;lt;b>foo&amp;lt;/b>\"}},\"i\":0}}]}&apos;>foo&lt;/b>"}]]}'>search</a></p>

$ echo '[http://www.google.com/search?q=<nowiki>foo</nowiki> search]' | bin/parse.js --normalize=parsoid
<p><a typeof="mw:ExpandedAttrs" rel="mw:ExtLink" href="http://www.google.com/search?q=" class="external text" data-mw='{"attribs":[[{"txt":"href"},{"html":"http://www.google.com/search?q=&lt;span typeof=\"mw:Nowiki\">foo&lt;/span>"}]]}'>search</a></p>

The legacy parser does template expansion before tokenization. This means that it stops matching the extlink as soon as it encounters a < *after template expansion*. The language converter syntax is an even more unusual special case: it is being preserved untouched so that variant conversion can transform it in a later stage -- but variant conversion skips attributes.

Parsoid does tokenization before template expansion. It grabs the template (but not the html tag), then stringifies the output, stripping the tags. This means that {{echo|<b>}} behaves differently from just <b> -- the latter splits the extlink token, the former doesn't.

In Parsoid Language conversion is not broken -- the information about the language converter markup is preserved in the mw:ExpandedAttrs property, and would presumably be localized to the correct variant and inserted into the href by the variant conversion stage. This is what would happen in core PHP if variant conversion reached into attribute values.

The proposed fix doesn't make Parsoid fully consistent with PHP, because of the different ways template contents are restringified. And it's not clear to me that we *want* to break language variants in the extlink; it's inconsistent with how templates are handled. But there's definitely some nonuniformity here that's worth thinking about.

test case: https://zh.wikipedia.org/wiki/User:Dan1wang/SandboxT222328

I don't know what "legacy parser" is, but it looks like pretty consistent except for case 4 and 5