
DOMException: Invalid Character Error
Closed, Resolved | Public | PRODUCTION ERROR

Description

Error
labels.normalized_message
[{reqId}] {exception_url}   DOMException: Invalid Character Error
error.stack_trace
from /srv/mediawiki/php-1.41.0-wmf.30/vendor/wikimedia/remex-html/src/DOM/DOMBuilder.php(250)
#0 /srv/mediawiki/php-1.41.0-wmf.30/vendor/wikimedia/remex-html/src/DOM/DOMBuilder.php(250): DOMDocument->createAttribute(string)
#1 /srv/mediawiki/php-1.41.0-wmf.30/vendor/wikimedia/remex-html/src/DOM/DOMBuilder.php(324): Wikimedia\RemexHtml\DOM\DOMBuilder->createNode(Wikimedia\RemexHtml\TreeBuilder\Element)
#2 /srv/mediawiki/php-1.41.0-wmf.30/vendor/wikimedia/remex-html/src/TreeBuilder/TreeBuilder.php(277): Wikimedia\RemexHtml\DOM\DOMBuilder->insertElement(integer, Wikimedia\RemexHtml\TreeBuilder\Element, Wikimedia\RemexHtml\TreeBuilder\Element, boolean, integer, integer)
#3 /srv/mediawiki/php-1.41.0-wmf.30/vendor/wikimedia/remex-html/src/TreeBuilder/TreeBuilder.php(257): Wikimedia\RemexHtml\TreeBuilder\TreeBuilder->insertForeign(string, string, Wikimedia\RemexHtml\Tokenizer\LazyAttributes, boolean, integer, integer)
#4 /srv/mediawiki/php-1.41.0-wmf.30/vendor/wikimedia/remex-html/src/TreeBuilder/InBody.php(456): Wikimedia\RemexHtml\TreeBuilder\TreeBuilder->insertElement(string, Wikimedia\RemexHtml\Tokenizer\LazyAttributes, boolean, integer, integer)
#5 /srv/mediawiki/php-1.41.0-wmf.30/vendor/wikimedia/remex-html/src/TreeBuilder/InCell.php(40): Wikimedia\RemexHtml\TreeBuilder\InBody->startTag(string, Wikimedia\RemexHtml\Tokenizer\LazyAttributes, boolean, integer, integer)
#6 /srv/mediawiki/php-1.41.0-wmf.30/vendor/wikimedia/remex-html/src/TreeBuilder/Dispatcher.php(420): Wikimedia\RemexHtml\TreeBuilder\InCell->startTag(string, Wikimedia\RemexHtml\Tokenizer\LazyAttributes, boolean, integer, integer)
#7 /srv/mediawiki/php-1.41.0-wmf.30/vendor/wikimedia/remex-html/src/Tokenizer/Tokenizer.php(1479): Wikimedia\RemexHtml\TreeBuilder\Dispatcher->startTag(string, Wikimedia\RemexHtml\Tokenizer\LazyAttributes, boolean, integer, integer)
#8 /srv/mediawiki/php-1.41.0-wmf.30/vendor/wikimedia/remex-html/src/Tokenizer/Tokenizer.php(595): Wikimedia\RemexHtml\Tokenizer\Tokenizer->handleAttribsAndClose(integer, string, boolean, integer)
#9 /srv/mediawiki/php-1.41.0-wmf.30/vendor/wikimedia/remex-html/src/Tokenizer/Tokenizer.php(369): Wikimedia\RemexHtml\Tokenizer\Tokenizer->dataState(boolean)
#10 /srv/mediawiki/php-1.41.0-wmf.30/vendor/wikimedia/remex-html/src/Tokenizer/Tokenizer.php(178): Wikimedia\RemexHtml\Tokenizer\Tokenizer->executeInternal(boolean)
#11 /srv/mediawiki/php-1.41.0-wmf.30/vendor/wikimedia/parsoid/src/Utils/DOMUtils.php(61): Wikimedia\RemexHtml\Tokenizer\Tokenizer->execute(array)
#12 /srv/mediawiki/php-1.41.0-wmf.30/vendor/wikimedia/parsoid/src/Core/PageBundle.php(62): Wikimedia\Parsoid\Utils\DOMUtils::parseHTML(string)
#13 /srv/mediawiki/php-1.41.0-wmf.30/vendor/wikimedia/parsoid/src/Parsoid.php(471): Wikimedia\Parsoid\Core\PageBundle->toHtml()
#14 /srv/mediawiki/php-1.41.0-wmf.30/includes/parser/Parsoid/LanguageVariantConverter.php(166): Wikimedia\Parsoid\Parsoid->pb2pb(MediaWiki\Parser\Parsoid\Config\PageConfig, string, Wikimedia\Parsoid\Core\PageBundle, array)
#15 /srv/mediawiki/php-1.41.0-wmf.30/includes/Rest/Handler/ParsoidHandler.php(1200): MediaWiki\Parser\Parsoid\LanguageVariantConverter->convertPageBundleVariant(Wikimedia\Parsoid\Core\PageBundle, Wikimedia\Bcp47Code\Bcp47CodeValue, NULL)
#16 /srv/mediawiki/php-1.41.0-wmf.30/includes/Rest/Handler/ParsoidHandler.php(1071): MediaWiki\Rest\Handler\ParsoidHandler->languageConversion(MediaWiki\Parser\Parsoid\Config\PageConfig, array, array)
#17 /srv/mediawiki/php-1.41.0-wmf.30/includes/Rest/Handler/TransformHandler.php(155): MediaWiki\Rest\Handler\ParsoidHandler->pb2pb(array)
#18 /srv/mediawiki/php-1.41.0-wmf.30/includes/Rest/Router.php(517): MediaWiki\Rest\Handler\TransformHandler->execute()
#19 /srv/mediawiki/php-1.41.0-wmf.30/includes/Rest/Router.php(422): MediaWiki\Rest\Router->executeHandler(MWParsoid\Rest\Handler\TransformHandler)
#20 /srv/mediawiki/php-1.41.0-wmf.30/includes/Rest/EntryPoint.php(195): MediaWiki\Rest\Router->execute(MediaWiki\Rest\RequestFromGlobals)
#21 /srv/mediawiki/php-1.41.0-wmf.30/includes/Rest/EntryPoint.php(135): MediaWiki\Rest\EntryPoint->execute()
#22 /srv/mediawiki/php-1.41.0-wmf.30/rest.php(31): MediaWiki\Rest\EntryPoint::main()
#23 /srv/mediawiki/w/rest.php(3): require(string)
#24 {main}
Impact
Notes

2403 of these seen in logs starting 12:16:12 UTC, all for zhwiki, all for 1.41.0-wmf.30.

Details

Request URL
https://zh.wikipedia.org/w/rest.php/zh.wikipedia.org/v3/transform/pagebundle/to/pagebundle/%E7%84%A1%E7%90%86%E6%95%B8
Related Changes in Gerrit:

Event Timeline

Presumably this is the result of some patches being applied for T346657

We have about 140 langconv (pb2pb) reqs/s, but I am not sure how many are for zhwiki titles. These errors come in at about 1 every 7 seconds, with multiple errors per request, which means the absolute number of error titles is smaller. But multiple errors per title is confusing me, unless Remex is catching these errors, logging them, and recovering.

All error requests are for createAttribute (specifically this: $attrNode = $this->doc->createAttribute( $this->coerceName( $attr->localName ) ); ), but coerceName is supposed to deal with unsupported characters already, so that is also confusing me.

Lastly, this is definitely from the last train since the errors started then -- but it is not immediately obvious which change triggered this.

To be investigated.

Oh, this phab task is for 1.41.0-wmf.30 not 1.42.0-wmf.1 as I originally thought ... so the error manifested even before wmf.1 was rolled out!

And, this is for zhwiki, so this would have called core's zhwiki language conversion routines before passing the result on to Parsoid for metadata conversion. So, the DOM is being built from this converted HTML, and that is probably the source of it.

But, Parsoid disabled its native zhwiki converter in 1.41.0-wmf.29 .. so, the next question is what happened with wmf.30 (and close to the wmf.1 deploy).

https://logstash.wikimedia.org/goto/941bb37ae180c5e40788c21032676121 shows that this started close to the 1.42.0-wmf.1 deploy, but before that was deployed, and only started Oct 19th. Need to check whether something was backported / SWAT-deployed before the wmf.1 rollout last week.

Separately, independent of what caused this: once we roll out https://gerrit.wikimedia.org/r/c/mediawiki/core/+/967275 and its parent Parsoid patch on the next train, we will no longer parse the zhwiki HTML on the Parsoid side just to generate metadata -- so this will disappear then. But I am still curious what happened last week.

It is baffling .. the first error was about 12:10 UTC and https://wikitech.wikimedia.org/wiki/Server_Admin_Log#2023-10-19 shows nothing for that timeframe. The group2 rollout was a full 6 hours later, and there were ~3600 errors logged already by then.

Based on @cscott's hunch, I explored more and it turns out 16 pages account for all the errors seen so far. We wonder if these errors were triggered by a template edit (causing these pages to *now* go through Parsoid with the new code paths from 3 weeks back).

PHP 7.4 is calling xmlValidateName(), which is throwing the error. I'm guessing the root cause here is that there is some newly-added wikitext which has some character which (a) passes Remex's attribute character check but (b) fails libxml2's version of the same.
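The failure is easy to reproduce standalone with PHP's DOM extension (a minimal sketch; it assumes a PHP build whose libxml2 still uses the old name tables, as discussed later in this task):

```php
<?php
// Reproduce the failure outside MediaWiki: libxml2's xmlValidateName()
// rejects U+0132 (LATIN CAPITAL LIGATURE IJ), even though it is a valid
// HTML attribute-name character that Remex's XML-spec check accepts.
$doc = new DOMDocument();

$threw = false;
try {
    $doc->createAttribute("data-\u{0132}"); // "data-IJ"
} catch (DOMException $e) {
    $threw = true;
    echo $e->getMessage(), "\n"; // typically "Invalid Character Error"
}
var_dump($threw);

// A plain-ASCII name is accepted without complaint.
$ok = $doc->createAttribute('data-ok');
var_dump($ok->name);
```

The same exception surfaces from DOMBuilder::createNode() in the production stack trace above.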

Presumably this is the result of some patches being applied for T346657

So, Arlo's hunch is likely correct here even though it triggered almost 2 weeks after.

I validated Remex's HTMLData::$nameCharConvTable and HTMLData::$nameStartCharConvTable against https://www.w3.org/TR/2008/REC-xml-20081126/#NT-Name and they match up. So Remex is indeed correctly coercing to the /xml spec/.

libxml2, however, is validating against something completely different -- and presumably ancient. php-src/ext/dom/attr.c calls xmlValidateName() from libxml2, throwing the 'Invalid Character' error if that fails. xmlValidateName is defined in libxml2:tree.c, and after some initial fast path "in case the name is entirely ascii", falls through to try_complex to implement the full validation:

    if ((!IS_LETTER(c)) && (c != '_') && (c != ':'))
        return(1);
    cur += l;
    c = CUR_SCHAR(cur, l);
    while (IS_LETTER(c) || IS_DIGIT(c) || (c == '.') || (c == ':') ||
           (c == '-') || (c == '_') || IS_COMBINING(c) || IS_EXTENDER(c)) {
        cur += l;
        c = CUR_SCHAR(cur, l);
    }

Just starting with the "start character" test, which is IS_LETTER(c) || (c == '_') || (c == ':'), IS_LETTER is defined in parserInternals.h as IS_BASECHAR(c) || IS_IDEOGRAPHIC(c). IS_IDEOGRAPHIC is fairly simple, effectively [0x4E00-0x9FA5] || 0x3007 || [0x3021 - 0x3029], but IS_BASECHAR for characters above U+00FF is (from chvalid.c):

static const xmlChSRange xmlIsBaseChar_srng[] = { {0x100, 0x131},
    {0x134, 0x13e}, {0x141, 0x148}, {0x14a, 0x17e}, {0x180, 0x1c3},
    {0x1cd, 0x1f0}, {0x1f4, 0x1f5}, {0x1fa, 0x217}, {0x250, 0x2a8},
    {0x2bb, 0x2c1}, {0x386, 0x386}, {0x388, 0x38a}, {0x38c, 0x38c},
    {0x38e, 0x3a1}, {0x3a3, 0x3ce}, {0x3d0, 0x3d6}, {0x3da, 0x3da},
    {0x3dc, 0x3dc}, {0x3de, 0x3de}, {0x3e0, 0x3e0}, {0x3e2, 0x3f3},
    {0x401, 0x40c}, {0x40e, 0x44f}, {0x451, 0x45c}, {0x45e, 0x481},
    // etc

This has a huge number of holes compared to the XML name specification, which begins [0xf8 - 0x2ff][0x370 - 0x37d][0x37f - 0x1fff]... for this region. The first gap is at U+0132 and U+0133, aka [IJ ij], and indeed this causes a crash in both the --wt2html and --html2wt directions:

echo '<span data-ij="foo">bar</span>' | php bin/parse.php --wt2html # or --html2wt
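To see why U+0132 falls through, the two start-character tests above can be transcribed into PHP (a hypothetical mirror of the C macros, for illustration only; the base-character check shown covers just the first few ranges of xmlIsBaseChar_srng):

```php
<?php
// Hypothetical PHP transcription of libxml2's IS_IDEOGRAPHIC macro and
// the beginning of its IS_BASECHAR range table -- illustration only.
function isIdeographic(int $cp): bool {
    return ($cp >= 0x4E00 && $cp <= 0x9FA5)
        || $cp === 0x3007
        || ($cp >= 0x3021 && $cp <= 0x3029);
}

// First few ranges of xmlIsBaseChar_srng from chvalid.c (truncated).
const BASECHAR_RANGES = [[0x100, 0x131], [0x134, 0x13E], [0x141, 0x148]];

function isBaseCharAboveFF(int $cp): bool {
    foreach (BASECHAR_RANGES as [$lo, $hi]) {
        if ($cp >= $lo && $cp <= $hi) {
            return true;
        }
    }
    return false;
}

// U+0131 ends the first range; U+0132 (IJ) sits in the gap, so
// libxml2's IS_LETTER rejects it as a name start character.
var_dump(isBaseCharAboveFF(0x0131)); // true
var_dump(isBaseCharAboveFF(0x0132)); // false
var_dump(isIdeographic(0x0132));     // false
```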

Given that the libxml2 validity check is *so* broken, it's not entirely clear what to do -- other than hold our breath for PHP to eventually update it. The Remex coerceName function can probably be extended based on the actual tables from libxml2 to escape "more characters" and avoid the crash, but the resulting 'conv table' to give to mb_encode_numericentity is going to be quite long given the number of holes in the libxml2 IS_BASECHAR() definition.
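The general mb_encode_numericentity() technique looks like this (a sketch only: the conv map covers just the first libxml2 gap, U+0132-U+0133, and the plain numeric-entity output is not Remex's actual escape format):

```php
<?php
// Sketch of escaping name characters via mb_encode_numericentity().
// The map is quadruples [start, end, offset, mask]: codepoints in
// [start, end] are emitted as &#N; with N = (codepoint + offset) & mask.
// A real fix needs one such range per hole in libxml2's tables, which
// is why the resulting conv table would be quite long.
$convmap = [0x132, 0x133, 0, 0x10FFFF];

$escaped = mb_encode_numericentity("data-\u{0132}", $convmap, 'UTF-8');
echo $escaped, "\n"; // data-&#306;  (0x132 === 306)

// Characters outside the mapped ranges pass through untouched.
echo mb_encode_numericentity('data-ok', $convmap, 'UTF-8'), "\n"; // data-ok
```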

We are going to sweep this under the carpet at least for this scenario as noted in T349310#9277286 .. but yes, if wikis use attribute names that trip up against this, such pages will not render. For now, we hold our breath till PHP's new DOM implementation comes online and the WMF cluster upgrades to that version (which is probably a few years away). Given that this is the first time we ran against this, I am comfortable with that approach.

Change 968357 had a related patch set uploaded (by C. Scott Ananian; author: C. Scott Ananian):

[mediawiki/libs/RemexHtml@master] Coerce names harder, given libxml2 limits on names

https://gerrit.wikimedia.org/r/968357

This is the #2 most common error reported by logspam-watch this week.

The patch above should fix remex, although we'll need to do a library upgrade dance w/ mediawiki-vendor to get this deployed.

For future reference, the script I used to generate the set of characters accepted by libxml2 from its internal representation of this set is:


I don't expect libxml2 to ever change, but if it did somehow you could use this script as a starting point to regenerate its list of accepted name characters.

For now, we hold our breath till PHP's new DOM implementation comes online and the WMF cluster upgrades to that version (which is probably a few years away).

I've not generally found breath-holding to be a useful strategy when it comes to getting new features into PHP.

I tried building the new HTML5 PR. It uses libxml2 and suffers from the same problem.

php > $doc = DOM\HTMLDocument::createEmpty();
php > $doc->createElement('data-ij');

Warning: Uncaught DOMException: Invalid Character Error in php shell code:1
Stack trace:
#0 php shell code(1): DOM\Document->createElement('data-\xC4\xB3')
#1 {main}
  thrown in php shell code on line 1

It's possible to create such elements using the parser:

php > $doc = DOM\HTMLDocument::createFromString('<data-ij></data-ij>', LIBXML_NOERROR);
php > print $doc->saveHTML();
<html><head></head><body><data-ij></data-ij></body></html>

Although that lack of validation also existed in previous versions of PHP, even when loading XML documents.

I don't expect libxml2 to ever change, but if it did somehow you could use this script as a starting point to regenerate its list of accepted name characters.

Upstream libxml2 is being actively maintained. It has had lots of commits in the last month. Merge requests submitted to their gitlab are being reviewed. PHP supports the 2.11 branch of libxml2 which was released in April.

I did some benchmarks. mb_encode_numericentity() has no clever optimisations like a table cache or a tree. With n being the size of the plain ASCII input, on my laptop,

  • Before: (0.17 + 0.017 n) µs
  • After: (0.75 + 0.15 n) µs

It's not significant in the context of a full parsing pipeline.
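A rough way to reproduce such a measurement (numbers are machine-dependent, and the conv map here is a stand-in, not Remex's actual table):

```php
<?php
// Microbenchmark sketch: per-call cost of mb_encode_numericentity()
// on plain-ASCII input, as a function of input size n. Since the map
// is scanned per character, cost grows linearly in n.
$convmap = [0x132, 0x133, 0, 0x10FFFF];

foreach ([100, 1000, 10000] as $n) {
    $s = str_repeat('a', $n);
    $iters = 200;
    $t0 = hrtime(true);
    for ($i = 0; $i < $iters; $i++) {
        mb_encode_numericentity($s, $convmap, 'UTF-8');
    }
    $usPerCall = (hrtime(true) - $t0) / $iters / 1000;
    printf("n=%5d: %.2f us/call\n", $n, $usPerCall);
}
```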

Change 968357 merged by jenkins-bot:

[mediawiki/libs/RemexHtml@master] Coerce names harder, given libxml2 limits on names

https://gerrit.wikimedia.org/r/968357

ssastry claimed this task.

Looks like this error has gone away after the train rolled out with the Parsoid & core changes that prevent it from triggering. So, I am going to resolve this. But the Remex change still needs to be packaged and deployed; that will prevent similar bugs in the future.

Change 1011395 had a related patch set uploaded (by C. Scott Ananian; author: C. Scott Ananian):

[mediawiki/vendor@master] WIP: Update wikimedia/remex-html to 4.1.0

https://gerrit.wikimedia.org/r/1011395

Change 1011396 had a related patch set uploaded (by C. Scott Ananian; author: C. Scott Ananian):

[mediawiki/core@master] Update wikimedia/remex-html to 4.1.0

https://gerrit.wikimedia.org/r/1011396

Change 1011395 merged by jenkins-bot:

[mediawiki/vendor@master] Update wikimedia/remex-html to 4.1.0

https://gerrit.wikimedia.org/r/1011395

Change 1011396 merged by jenkins-bot:

[mediawiki/core@master] Update wikimedia/remex-html to 4.1.0

https://gerrit.wikimedia.org/r/1011396