Handling extension tags in HTML attributes (edge cases or otherwise)
Open, Low, Public

Description

T259676 is a production crasher. While https://gerrit.wikimedia.org/r/618565 fixes it, @Arlolra raised the question of why the extsrc property is missing in the first place.

Turns out that the wikitext in question is:

<i <ref>a</ref>>...

...</i>

The HTML i-tag is split across a newline, which breaks it across paragraph boundaries. The tree builder then fixes this up, and in doing so duplicates the HTML attribute, which happens to contain the ref-tag.
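To make the duplication concrete, here is a minimal sketch in Python of a toy tree builder that reopens unclosed formatting elements, loosely modelled on the "reconstruct the active formatting elements" step of HTML5 tree construction. It is not Parsoid's code: the token shapes, the `toy_tree_builder` name, and the `data-x` attribute standing in for wherever the ref output ends up are all assumptions made for illustration.

```python
from html import escape

FORMATTING = {"i", "b", "em", "strong"}


def render_open(name, attrs):
    # Serialize a start tag with its attributes.
    parts = [name] + [f'{k}="{escape(v, quote=True)}"' for k, v in attrs.items()]
    return "<" + " ".join(parts) + ">"


def toy_tree_builder(tokens):
    stack = []   # currently open elements
    active = []  # formatting elements not yet explicitly closed
    out = []

    def close_through(name):
        # Pop and close elements up to and including the named one.
        while stack:
            entry = stack.pop()
            out.append(f"</{entry['name']}>")
            if entry["name"] == name:
                break

    for tok in tokens:
        kind = tok[0]
        if kind == "start":
            _, name, attrs = tok
            entry = {"name": name, "attrs": dict(attrs)}
            stack.append(entry)
            if name in FORMATTING:
                active.append(entry)
            out.append(render_open(name, attrs))
        elif kind == "text":
            # Reopen formatting elements that were closed implicitly.
            # The clone carries a copy of every attribute of the original.
            for i, entry in enumerate(active):
                if not any(e is entry for e in stack):
                    clone = {"name": entry["name"], "attrs": dict(entry["attrs"])}
                    active[i] = clone
                    stack.append(clone)
                    out.append(render_open(clone["name"], clone["attrs"]))
            out.append(escape(tok[1]))
        elif kind == "end":
            name = tok[1]
            if any(e["name"] == name for e in stack):
                close_through(name)
            if name in FORMATTING:
                # An explicit end tag retires the most recent matching entry.
                for i in range(len(active) - 1, -1, -1):
                    if active[i]["name"] == name:
                        del active[i]
                        break
    return "".join(out)


# An <i> opened in one paragraph and closed in the next.
tokens = [
    ("start", "p", {}),
    ("start", "i", {"data-x": "ref output"}),  # hypothetical attribute
    ("text", "first"),
    ("end", "p"),
    ("start", "p", {}),
    ("text", "second"),
    ("end", "i"),
    ("end", "p"),
]
html_out = toy_tree_builder(tokens)
print(html_out)
```

The single `<i>` in the input ends up as two `<i>` elements in the output, each carrying its own copy of the attribute. Anything in that attribute that was meant to be unique, such as a reference to ref content, now appears twice.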

To be clear, this looks like broken wikitext and so doesn't merit a lot of attention on its own. But in terms of consistent handling of scenarios like these, there are two questions to answer here:

  1. What is a sensible way to handle extension tags in HTML attribute positions? Typed templates / typed wikitext offer a clear strategy for the future (i.e. enforce output constraints based on embedding context), but we need a solution before we get there.
  2. How do we handle tree builder fixup and HTML attributes of this nature?
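As a rough illustration of what "enforce output constraints based on embedding context" in question 1 could mean, here is a small Python sketch. Everything in it is hypothetical: the `Context` enum, the `constrain_output` function, and the crude regex-based tag stripping are stand-ins for whatever a typed-wikitext design would actually specify.

```python
import re
from enum import Enum


class Context(Enum):
    BLOCK = "block"          # output is embedded as regular content
    ATTRIBUTE = "attribute"  # output is embedded in an HTML attribute value


# Crude tag matcher, good enough for a toy example only.
TAG_RE = re.compile(r"<[^>]*>")


def constrain_output(html, context):
    """Reduce extension output to what the embedding context allows."""
    if context is Context.ATTRIBUTE:
        # Attribute values cannot hold elements, so keep only the text.
        return TAG_RE.sub("", html).strip()
    return html


ref_output = '<sup class="ref">[1]</sup>'  # made-up extension output
print(constrain_output(ref_output, Context.BLOCK))
print(constrain_output(ref_output, Context.ATTRIBUTE))
```

The point is only that the same extension output is treated differently depending on where it lands. In attribute position, nothing structured survives that a later fixup step could duplicate.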

I'll include a transcript of the IRC conversation in a comment below, but that conversation effectively raises the above two questions.

Event Timeline

ssastry renamed this task from Handling extension tags in HTML attributes (edge cases or oherwise) to Handling extension tags in HTML attributes (edge cases or otherwise).Aug 10 2020, 6:41 PM
ssastry triaged this task as Low priority.
ssastry moved this task from Needs Triage to Tech Debt / Big changes on the Parsoid board.

Change 654717 had a related patch set uploaded (by Arlolra; owner: Arlolra):
[mediawiki/services/parsoid@master] More papering over in References.php

https://gerrit.wikimedia.org/r/654717

Change 654717 merged by jenkins-bot:
[mediawiki/services/parsoid@master] More papering over in References.php

https://gerrit.wikimedia.org/r/654717

Change 655482 had a related patch set uploaded (by C. Scott Ananian; owner: C. Scott Ananian):
[mediawiki/vendor@master] Bump wikimedia/parsoid to 0.13.0-a22

https://gerrit.wikimedia.org/r/655482

Change 655482 merged by jenkins-bot:
[mediawiki/vendor@master] Bump wikimedia/parsoid to 0.13.0-a22

https://gerrit.wikimedia.org/r/655482

Change 849097 had a related patch set uploaded (by Isabelle Hurbain-Palatin; author: Isabelle Hurbain-Palatin):

[mediawiki/services/parsoid@master] Fix sealed fragment duplication on node cloning

https://gerrit.wikimedia.org/r/849097

Followed a trail of closed-as-duplicates here for the following, seen in 1.40.0-wmf.20:

Error
normalized_message
[{reqId}] {exception_url}   PHP Notice: Undefined index: mwf5
exception.trace
from /srv/mediawiki/php-1.40.0-wmf.20/vendor/wikimedia/parsoid/src/Config/Env.php(852)
#0 /srv/mediawiki/php-1.40.0-wmf.20/vendor/wikimedia/parsoid/src/Config/Env.php(852): MWExceptionHandler::handleError(integer, string, string, integer, array)
#1 /srv/mediawiki/php-1.40.0-wmf.20/vendor/wikimedia/parsoid/src/Ext/ParsoidExtensionAPI.php(300): Wikimedia\Parsoid\Config\Env->getDOMFragment(string)
#2 /srv/mediawiki/php-1.40.0-wmf.20/vendor/wikimedia/parsoid/src/Ext/Cite/RefGroup.php(102): Wikimedia\Parsoid\Ext\ParsoidExtensionAPI->getContentDOM(string)
#3 /srv/mediawiki/php-1.40.0-wmf.20/vendor/wikimedia/parsoid/src/Ext/Cite/References.php(583): Wikimedia\Parsoid\Ext\Cite\RefGroup->renderLine(Wikimedia\Parsoid\Ext\ParsoidExtensionAPI, Wikimedia\Parsoid\DOM\Element, stdClass)
#4 /srv/mediawiki/php-1.40.0-wmf.20/vendor/wikimedia/parsoid/src/Ext/Cite/References.php(632): Wikimedia\Parsoid\Ext\Cite\References::insertReferencesIntoDOM(Wikimedia\Parsoid\Ext\ParsoidExtensionAPI, Wikimedia\Parsoid\DOM\Element, Wikimedia\Parsoid\Ext\Cite\ReferencesData, boolean)
#5 /srv/mediawiki/php-1.40.0-wmf.20/vendor/wikimedia/parsoid/src/Ext/Cite/RefProcessor.php(25): Wikimedia\Parsoid\Ext\Cite\References::insertMissingReferencesIntoDOM(Wikimedia\Parsoid\Ext\ParsoidExtensionAPI, Wikimedia\Parsoid\Ext\Cite\ReferencesData, Wikimedia\Parsoid\DOM\Element)
#6 /srv/mediawiki/php-1.40.0-wmf.20/vendor/wikimedia/parsoid/src/Wt2Html/DOMPostProcessor.php(167): Wikimedia\Parsoid\Ext\Cite\RefProcessor->wtPostprocess(Wikimedia\Parsoid\Ext\ParsoidExtensionAPI, Wikimedia\Parsoid\DOM\Element, array, boolean)
#7 /srv/mediawiki/php-1.40.0-wmf.20/vendor/wikimedia/parsoid/src/Wt2Html/DOMPostProcessor.php(868): Wikimedia\Parsoid\Wt2Html\DOMPostProcessor->Wikimedia\Parsoid\Wt2Html\{closure}(Wikimedia\Parsoid\DOM\Element, array, boolean)
#8 /srv/mediawiki/php-1.40.0-wmf.20/vendor/wikimedia/parsoid/src/Wt2Html/DOMPostProcessor.php(909): Wikimedia\Parsoid\Wt2Html\DOMPostProcessor->doPostProcess(Wikimedia\Parsoid\DOM\Element)
#9 /srv/mediawiki/php-1.40.0-wmf.20/vendor/wikimedia/parsoid/src/Wt2Html/DOMPostProcessor.php(927): Wikimedia\Parsoid\Wt2Html\DOMPostProcessor->process(Wikimedia\Parsoid\DOM\Element)
#10 /srv/mediawiki/php-1.40.0-wmf.20/vendor/wikimedia/parsoid/src/Wt2Html/ParserPipeline.php(180): Wikimedia\Parsoid\Wt2Html\DOMPostProcessor->processChunkily(string, array)
#11 /srv/mediawiki/php-1.40.0-wmf.20/vendor/wikimedia/parsoid/src/Wt2Html/ParserPipelineFactory.php(299): Wikimedia\Parsoid\Wt2Html\ParserPipeline->parseChunkily(string, array)
#12 /srv/mediawiki/php-1.40.0-wmf.20/vendor/wikimedia/parsoid/src/Wikitext/ContentModelHandler.php(124): Wikimedia\Parsoid\Wt2Html\ParserPipelineFactory->parse(string)
#13 /srv/mediawiki/php-1.40.0-wmf.20/vendor/wikimedia/parsoid/src/Parsoid.php(173): Wikimedia\Parsoid\Wikitext\ContentModelHandler->toDOM(Wikimedia\Parsoid\Ext\ParsoidExtensionAPI)
#14 /srv/mediawiki/php-1.40.0-wmf.20/vendor/wikimedia/parsoid/src/Parsoid.php(256): Wikimedia\Parsoid\Parsoid->parseWikitext(MediaWiki\Parser\Parsoid\Config\PageConfig, Wikimedia\Parsoid\Config\StubMetadataCollector, array)
#15 /srv/mediawiki/php-1.40.0-wmf.20/includes/Rest/Handler/ParsoidHandler.php(762): Wikimedia\Parsoid\Parsoid->wikitext2lint(MediaWiki\Parser\Parsoid\Config\PageConfig, array)
#16 /srv/mediawiki/php-1.40.0-wmf.20/includes/Rest/Handler/ParsoidHandler.php(803): MediaWiki\Rest\Handler\ParsoidHandler->wtLint(MediaWiki\Parser\Parsoid\Config\PageConfig, array, NULL)
#17 /srv/mediawiki/php-1.40.0-wmf.20/includes/Rest/Handler/TransformHandler.php(107): MediaWiki\Rest\Handler\ParsoidHandler->wt2html(MediaWiki\Parser\Parsoid\Config\PageConfig, array, NULL)
#18 /srv/mediawiki/php-1.40.0-wmf.20/includes/Rest/Router.php(515): MediaWiki\Rest\Handler\TransformHandler->execute()
#19 /srv/mediawiki/php-1.40.0-wmf.20/includes/Rest/Router.php(421): MediaWiki\Rest\Router->executeHandler(MWParsoid\Rest\Handler\TransformHandler)
#20 /srv/mediawiki/php-1.40.0-wmf.20/includes/Rest/EntryPoint.php(195): MediaWiki\Rest\Router->execute(MediaWiki\Rest\RequestFromGlobals)
#21 /srv/mediawiki/php-1.40.0-wmf.20/includes/Rest/EntryPoint.php(135): MediaWiki\Rest\EntryPoint->execute()
#22 /srv/mediawiki/php-1.40.0-wmf.20/rest.php(31): MediaWiki\Rest\EntryPoint::main()
#23 /srv/mediawiki/w/rest.php(3): require(string)
#24 {main}
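The trace shows the notice coming from a fragment lookup by id (`getDOMFragment` with the key `mwf5`), but the task does not spell out the mechanism. As a purely illustrative sketch in Python, assume fragments live in a map keyed by ids of that form and that reading one consumes it. A node cloned without its fragment then leaves a dangling id. The `FragmentStore` class and its methods are hypothetical and do not mirror Parsoid's API.

```python
import copy


class FragmentStore:
    """Toy map from fragment ids (mwf0, mwf1, ...) to fragment content."""

    def __init__(self):
        self._fragments = {}
        self._next_id = 0

    def add(self, content):
        frag_id = f"mwf{self._next_id}"
        self._next_id += 1
        self._fragments[frag_id] = content
        return frag_id

    def take(self, frag_id):
        # Assumption: a fragment is consumed when it is read.
        return self._fragments.pop(frag_id)

    def clone_fragment(self, frag_id):
        # Give a cloned node its own copy under a fresh id.
        return self.add(copy.deepcopy(self._fragments[frag_id]))


store = FragmentStore()
node = {"tag": "sup", "fragment": store.add(["a"])}

# Naive clone: both nodes point at the same fragment id.
naive_clone = dict(node)
# Fragment-aware clone: the fragment is duplicated along with the node.
safe_clone = {"tag": "sup", "fragment": store.clone_fragment(node["fragment"])}

store.take(node["fragment"])
try:
    store.take(naive_clone["fragment"])
    dangling = False
except KeyError:
    dangling = True  # the toy analogue of "Undefined index"

safe_content = store.take(safe_clone["fragment"])
print(dangling, safe_content)
```

Under these assumptions, duplicating the fragment whenever the node is cloned keeps every id resolvable.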

Another example: https://en.wikisource.org/w/index.php?title=The_American_Revolution_(scriptural_style)&oldid=13013841 . The wikitext is the output of some OCR process and it has a few incomplete HTML tags (such as <I).

The question of extension tags suggests that we also consider what happens when the <ref> is replaced by a {{#tag:ref|...}}. That is, is it the extension tag *syntax* we are objecting to here, or the extension tag *semantics*? I'd argue that, semantics-wise, we have always supported {{....}} brace expansion inside attribute values.

Change 849097 merged by jenkins-bot:

[mediawiki/services/parsoid@master] Fix sealed fragment duplication on node cloning

https://gerrit.wikimedia.org/r/849097

Change 985169 had a related patch set uploaded (by Isabelle Hurbain-Palatin; author: Isabelle Hurbain-Palatin):

[mediawiki/vendor@master] Bump wikimedia/parsoid to 0.19.0-a10

https://gerrit.wikimedia.org/r/985169

Change 985169 merged by jenkins-bot:

[mediawiki/vendor@master] Bump wikimedia/parsoid to 0.19.0-a10

https://gerrit.wikimedia.org/r/985169