[extlink] parsing - link cannot contain language variant or extension tags
Open, LowPublic
Actions

Assigned To

None

Authored By

	Dan1wang
	May 2 2019, 4:00 AM

Description

In pegTokenizer.pegjs, we have

extlink_preprocessor_text_parameterized
  = r:( $[^'<~[{\n\r|!\]}\-\t&="' \u00A0\u1680\u180E\u2000-\u200A\u202F\u205F\u3000]+
    / !inline_breaks s:( directive / no_punctuation_char / [&|{\-] ) { return s; }
    // !inline_breaks no_punctuation_char
    / $([.:,] !(space / eolf))
    / $(['] ![']) // single quotes are ok, double quotes are bad

and

directive
  = comment
  / extension_tag
  / tplarg_or_template
  / & "-{" v:lang_variant_or_tpl { return v; }
  / & "&" e:htmlentity { return e; }
  / include_limits

However, we cannot have language variant or extension tags in links. The code should be changed to

extlink_preprocessor_text_parameterized
  = r:( $[^'<~[{\n\r|!\]}\-\t&="' \u00A0\u1680\u180E\u2000-\u200A\u202F\u205F\u3000]+
    / !inline_breaks s:( directive_in_extlink / no_punctuation_char / [&|{\-] ) { return s; }
    // !inline_breaks no_punctuation_char
    / $([.:,] !(space / eolf))
    / $(['] ![']) // single quotes are ok, double quotes are bad
    )+ {
        return tu.flattenString(r);
    }

and add

directive_in_extlink
  = !extension_tag
  comment
  / tplarg_or_template
  / & "&" e:htmlentity { return e; }
  / include_limits

Test commands:

echo '[http://www.google.com/search?q=-{zh-cn:中国;zh-tw:中華民國}- google search]' | node bin/parse.js --normalize=parsoid --prefix=zhwiki
echo '[http://www.google.com/search?q=<nowiki>a</nowiki>]' | node bin/parse.js --normalize=parsoid
echo '[http://www.google.com/search?q=<pre>a</pre>]' | node bin/parse.js --normalize=parsoid

Related Objects
Search...

View Standalone Graph

This task is connected to more than 200 other tasks. Only direct parents and subtasks are shown here. Use View Standalone Graph to show more of the graph.

Status	Assigned	Task
		· · ·
Open	None	T43716 [EPIC] Support language variant conversion in Parsoid
Open	None	T222328 [extlink] parsing - link cannot contain language variant or extension tags
		· · ·

Event Timeline

Dan1wang created this task.May 2 2019, 4:00 AM

Restricted Application added subscribers: jeblad, Stang, Aklapper. · View Herald TranscriptMay 2 2019, 4:00 AM

Dan1wang added a project: Parsoid.May 2 2019, 4:00 AM

This is a weird corner case of legacy parser behavior; it's not entirely clear Parsoid should follow it. But it's not clear that what Parsoid is doing is totally consistent either.

(Setting $wgLanguageCode='zh':)

$ echo '[http://www.google.com/search?q=-{zh-cn:中国;zh-tw:中華民國}- -{zh-cn:baidu;zh-tw:google}- search]' | php maintenance/parse.php 
<p><a rel="nofollow" class="external text" href="http://www.google.com/search?q=-{zh-cn:中国;zh-tw:中華民國}-">baidu search</a>

$ echo '[http://www.google.com/search?q=<b>foo</b> search]' | php maintenance/parse.php 
<p><a rel="nofollow" class="external text" href="http://www.google.com/search?q="><b>foo</b> search</a>
</p>

$ echo '[http://www.google.com/search?q={{1x|<b>foo</b>}} search]' | php maintenance/parse.php 
<p><a rel="nofollow" class="external text" href="http://www.google.com/search?q="><b>foo</b> search</a>

vs Parsoid:

$ echo '[http://www.google.com/search?q=-{zh-cn:中国;zh-tw:中華民國}- -{zh-cn:baidu;zh-tw:google}- search]' | bin/parse.js  --normalize=parsoid --domain=zh.wikipedia.org
<p><a typeof="mw:ExpandedAttrs" rel="mw:ExtLink" href="http://www.google.com/search?q=" class="external text" data-mw='{"attribs":[[{"txt":"href"},{"html":"http://www.google.com/search?q=&lt;span typeof=\"mw:LanguageVariant\" data-mw-variant=&apos;{\"twoway\":[{\"l\":\"zh-cn\",\"t\":\"中国\"},{\"l\":\"zh-tw\",\"t\":\"中華民國\"}]}&apos;>&lt;/span>"}]]}'><span typeof="mw:LanguageVariant" data-mw-variant='{"twoway":[{"l":"zh-cn","t":"baidu"},{"l":"zh-tw","t":"google"}]}'></span> search</a></p>

$ echo '[http://www.google.com/search?q=<b>foo</b> search]' | bin/parse.js --normalize=parsoid
<p><a rel="mw:ExtLink" href="http://www.google.com/search?q=" class="external text"><b>foo</b> search</a></p>

$ echo '[http://www.google.com/search?q={{1x|<b>foo</b>}} search]' | bin/parse.js --normalize=parsoid
<p><a typeof="mw:ExpandedAttrs" rel="mw:ExtLink" href="http://www.google.com/search?q=foo" class="external text" data-mw='{"attribs":[[{"txt":"href"},{"html":"http://www.google.com/search?q=&lt;b typeof=\"mw:Transclusion\" data-mw=&apos;{\"parts\":[{\"template\":{\"target\":{\"wt\":\"1x\",\"href\":\"./Template:1x\"},\"params\":{\"1\":{\"wt\":\"&amp;lt;b>foo&amp;lt;/b>\"}},\"i\":0}}]}&apos;>foo&lt;/b>"}]]}'>search</a></p>

$ echo '[http://www.google.com/search?q=<nowiki>foo</nowiki> search]' | bin/parse.js --normalize=parsoid
<p><a typeof="mw:ExpandedAttrs" rel="mw:ExtLink" href="http://www.google.com/search?q=" class="external text" data-mw='{"attribs":[[{"txt":"href"},{"html":"http://www.google.com/search?q=&lt;span typeof=\"mw:Nowiki\">foo&lt;/span>"}]]}'>search</a></p>

The legacy parser does template expansion before tokenization. This means that it stops matching the extlink as soon as it encounters a < *after template expansion*. The language converter syntax is an even more unusual special case: it is being preserved untouched so that variant conversion can transform it in a later stage -- but variant conversion skips attributes.

Parsoid does tokenization before template expansion. It grabs the template (but not the html tag), then stringifies the output, stripping the tags. This means that {{echo|<b>}} behaves differently from just <b> -- the latter splits the extlink token, the former doesn't.

In Parsoid Language conversion is not broken -- the information about the language converter markup is preserved in the mw:ExpandedAttrs property, and would presumably be localized to the correct variant and inserted into the href by the variant conversion stage. This is what would happen in core PHP if variant conversion reached into attribute values.

The proposed fix doesn't make Parsoid fully consistent with PHP, because of the different ways template contents are restringified. And it's not clear to me that we *want* to break language variants in the extlink; it's inconsistent with how templates are handled. But there's definitely some nonuniformity here that's worth thinking about.

test case: https://zh.wikipedia.org/wiki/User:Dan1wang/SandboxT222328

I don't know what "legacy parser" is, but it looks like pretty consistent except for case 4 and 5

Shizhao added a project: Chinese-Sites.May 6 2019, 3:05 AM

Dan1wang mentioned this in T222770: Parser rule clean-up: extlink_preprocessor_text_parameterized.May 8 2019, 9:26 AM

LGoto moved this task from Needs Triage to Backlog on the Parsoid board.Feb 15 2020, 12:05 AM

LGoto triaged this task as Low priority.Apr 17 2020, 4:25 PM

LGoto moved this task from Backlog to Known Differences on the Parsoid board.

cscott added a parent task: T43716: [EPIC] Support language variant conversion in Parsoid.Apr 17 2020, 4:25 PM

I don't know what "legacy parser" is

It's the core MediaWiki wikitext parser

Shizhao moved this task from Backlog to Extensions/Skins on the Chinese-Sites board.Jul 6 2020, 1:12 AM

Similarly T207710

[extlink] parsing - link cannot contain language variant or extension tagsOpen, LowPublicActions

Description

Related ObjectsSearch...

Event Timeline

[extlink] parsing - link cannot contain language variant or extension tags
Open, LowPublic
Actions

Related Objects
Search...