Potential bugs in SiteConfig.php regular expressions used in WikitextSerializer
Closed, Resolved · Public · BUG REPORT

Description

See the fatal monitor from scandium for a number of instances of this from recent roundtrip test runs. This error always manifests as invalid offsets into the $pieces array in the loop (lines 1423, 1424, 1470 of WikitextSerializer.php).
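
For context, here is a minimal standalone sketch (not the actual WikitextSerializer code) of how this class of failure arises: with PREG_SPLIT_DELIM_CAPTURE, every capturing group in the delimiter regexp adds one entry per match to the preg_split() result, so a loop that walks $pieces with a fixed stride reads undefined offsets as soon as the regexp gains unexpected extra groups.

<?php
// Minimal sketch, not the actual serializer code: the shape of preg_split()
// output with PREG_SPLIT_DELIM_CAPTURE depends on how many capturing groups
// the delimiter regexp has.

$text = "<nowiki> </nowiki>x\n<nowiki> </nowiki>y";

// Three capturing groups, the stride a consumer loop might expect.
$threeGroups = '/^()(<nowiki>\s+<\/nowiki>)([^\n]*(?:\n|$))/Dim';
// Same delimiter, but with two extra (named) groups mixed in.
$fiveGroups = '/^((?P<extra1>)(?P<extra2>))(<nowiki>\s+<\/nowiki>)([^\n]*(?:\n|$))/Dim';

foreach ( [ $threeGroups, $fiveGroups ] as $re ) {
    $pieces = preg_split( $re, $text, -1, PREG_SPLIT_DELIM_CAPTURE );
    echo count( $pieces ), "\n"; // 9 vs. 13 for the very same input
}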

Here are some possible test pages:

  • eswiki:Selecci%C3%B3n%20de%20f%C3%BAtbol%20de%20Guatemala
  • enwiki:Moons of Pluto
  • enwiki:Von_Mises_yield_criterion

This may be reproducible by running the roundtrip-test.js script on scandium. You are unlikely to be able to reproduce this locally via the bin/parse.php script, since the code paths involved in computing the regexps are different and may not have the bug. Alternatively, you can do a Special:Export of the pages from the production wikis, Special:Import those onto your local wiki, and run tests locally.

Error message
/w/rest.php/es.wikipedia.org/v3/transform/pagebundle/to/wikitext/Selecci%C3%B3n%20de%20f%C3%BAtbol%20de%20Guatemala   ErrorException from line 1423 of /srv/deployment/parsoid/deploy/src/src/Html2Wt/WikitextSerializer.php: PHP Notice: Undefined offset: 50
#0 /srv/deployment/parsoid/deploy/src/src/Html2Wt/WikitextSerializer.php(1423): MWExceptionHandler::handleError(integer, string, string, integer, array)
#1 /srv/deployment/parsoid/deploy/src/src/Html2Wt/WikitextSerializer.php(1650): Parsoid\Html2Wt\WikitextSerializer->stripUnnecessaryIndentPreNowikis()
#2 /srv/deployment/parsoid/deploy/src/src/WikitextContentModelHandler.php(99): Parsoid\Html2Wt\WikitextSerializer->serializeDOM(DOMElement)
#3 /srv/deployment/parsoid/deploy/src/src/Parsoid.php(105): Parsoid\WikitextContentModelHandler->fromHTML(Parsoid\Config\Env, DOMDocument, NULL)
#4 /srv/deployment/parsoid/deploy/src/extension/src/Rest/Handler/ParsoidHandler.php(671): Parsoid\Parsoid->html2wikitext(MWParsoid\Config\PageConfig, Parsoid\PageBundle, array, NULL)
#5 /srv/deployment/parsoid/deploy/src/extension/src/Rest/Handler/TransformHandler.php(62): MWParsoid\Rest\Handler\ParsoidHandler->html2wt(Parsoid\Config\Env, array, string)
#6 /srv/mediawiki/php-1.34.0-wmf.21/includes/Rest/Router.php(270): MWParsoid\Rest\Handler\TransformHandler->execute()
#7 /srv/mediawiki/php-1.34.0-wmf.21/includes/Rest/Router.php(254): MediaWiki\Rest\Router->executeHandler(MWParsoid\Rest\Handler\TransformHandler)
#8 /srv/mediawiki/php-1.34.0-wmf.21/includes/Rest/EntryPoint.php(82): MediaWiki\Rest\Router->execute(MediaWiki\Rest\RequestFromGlobals)
#9 /srv/mediawiki/php-1.34.0-wmf.21/includes/Rest/EntryPoint.php(68): MediaWiki\Rest\EntryPoint->execute()
#10 /srv/mediawiki/php-1.34.0-wmf.21/rest.php(28): MediaWiki\Rest\EntryPoint::main()
#11 /srv/mediawiki/w/rest.php(3): require(string)
#12 {main}

Event Timeline

ssastry renamed this task from "parsoid emits: Html2Wt/WikitextSerializer.php:1423 PHP Notice: Undefined offset: 50" to "Potential bugs in SiteConfig.php regular expressions used in WikitextSerializer". Sep 6 2019, 1:44 PM
ssastry triaged this task as Medium priority.
ssastry updated the task description.
ssastry edited Stack Trace.
ssastry removed Request URL.
ssastry removed Request ID.
ssastry updated the task description.
ssastry changed the subtype of this task from "Production Error" to "Bug Report". Sep 6 2019, 1:46 PM
ssastry moved this task from Backlog to Bugs, Notices, Crashers on the Parsoid-PHP board.

That is definitely a Wikimedia-production-error log! Not sure why you removed the project tag and the stacktrace :]

@hashar, it is not. This is Parsoid-PHP which has not been deployed to the general production cluster. This is from scandium which is in the production cluster but is used for round trip testing.

See T232042. I am going to remove the production error tag.

Those notices are currently the main source of errors reported in logstash for _type:mediawiki, and all such occurrences are triaged under Wikimedia-production-error to get disposed of. The log does have prod-looking URLs such as http://es.wikipedia.org/w/rest.php/es.wikipedia.org/v3/transform/pagebundle/to/wikitext/Efecto%20fotoel%C3%A9ctrico

So I guess we need a configuration so that MediaWiki on scandium does NOT send logs to logstash, since that machine is just for testing.

Correct. I filed T232042 for that. Do you know offhand how to do this or where we should look? We do want it in logstash, just not in the same channel used for production errors.
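
Conceptually it is just per-channel log routing; a minimal core-MediaWiki-style sketch (the channel name and destination here are hypothetical, and the real change would presumably go through wmf-config's logging setup instead) might look like:

<?php
// Hypothetical sketch using core's $wgDebugLogGroups: send a dedicated
// channel to its own destination instead of the production-error stream.
$wgDebugLogGroups['parsoid-rt-testing'] = [
    'destination' => 'udp://localhost:8420/parsoid-rt-testing',
    'level' => \Psr\Log\LogLevel::INFO,
];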

So, yesterday I dumped the regexp on scandium for one of the test pages and here it is:

'/^((?:(?:#REDIRECT)[ \t\n\r\x0c]*(?::[ \t\n\r\x0c]*)?\[\[[^\]]+\]\])?(?:\[\[Category\:[^\]]*?\]\]|__(?:(?i:(?P<a_notoc>__NOTOC__)|(?P<a_nogallery>__NOGALLERY__)|(?P<a_forcetoc>__FORCETOC__)|(?P<a_toc>__TOC__)|(?P<a_noeditsection>__NOEDITSECTION__)|(?P<a_notitleconvert>__NOTITLECONVERT__)|(?P<b_notitleconvert>__NOTC__)|(?P<a_nocontentconvert>__NOCONTENTCONVERT__)|(?P<b_nocontentconvert>__NOCC__))|(?P<a_newsectionlink>__NEWSECTIONLINK__)|(?P<a_nonewsectionlink>__NONEWSECTIONLINK__)|(?P<a_hiddencat>__HIDDENCAT__)|(?P<a_expectunusedcategory>__EXPECTUNUSEDCATEGORY__)|(?P<a_index>__INDEX__)|(?P<a_noindex>__NOINDEX__)|(?P<a_staticredirect>__STATICREDIRECT__)|(?P<a_noglobal>__NOGLOBAL__)|(?P<a_disambiguation>__DISAMBIG__))__|\<\!\-\-\(\?\>\[\\s\\S\]\*\?\-\-\>\))*)(<nowiki>\s+<\/nowiki>)([^\n]*(?:\n|$))/Dim';

If you hardcode this regexp in WikitextSerializer.php at the appropriate line (1460-1470 range), you can repro the bug with echo '<p> x</p><p> y</p>' | php bin/parse.php --html2wt.

The problem is all the ?P<a_notoc> etc. named capture groups. So the regexp should get rid of all these groups and instead use __NOTOC__|__NOGALLERY__|..., probably by fetching the list of magic words and using it the way the src/Config/Api/SiteConfig.php code does, or some such thing.
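
To make that concrete, here is a rough sketch of the suggested approach (the $behaviorSwitchWords list and variable names below are hypothetical, not the actual SiteConfig API): build a plain alternation from the magic-word list instead of emitting one named capture group per word.

<?php
// Rough sketch of the suggested fix; the word list and helper shape are
// hypothetical, not the actual SiteConfig code.
$behaviorSwitchWords = [ '__NOTOC__', '__NOGALLERY__', '__FORCETOC__', '__TOC__' ];

$alternation = implode( '|', array_map(
    function ( $word ) {
        // The surrounding regexp already supplies the __...__ wrapper,
        // so strip it here to avoid doubling the underscores.
        return preg_quote( trim( $word, '_' ), '/' );
    },
    $behaviorSwitchWords
) );

// Drops into the larger regexp as: ...|__(?:NOTOC|NOGALLERY|FORCETOC|TOC)__|...
$fragment = '__(?:' . $alternation . ')__';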

This comment was removed by cscott.
[subbu@earth:~/work/wmf/parsoid] cat /tmp/bug.php
<?php

$re = '/^((?:(?:#REDIRECT)[ \t\n\r\x0c]*(?::[ \t\n\r\x0c]*)?\[\[[^\]]+\]\])?(?:\[\[Category\:[^\]]*?\]\]|__(?:(?i:(?P<a_notoc>__NOTOC__)|(?P<a_nogallery>__NOGALLERY__)|(?P<a_forcetoc>__FORCETOC__)|(?P<a_toc>__TOC__)|(?P<a_noeditsection>__NOEDITSECTION__)|(?P<a_notitleconvert>__NOTITLECONVERT__)|(?P<b_notitleconvert>__NOTC__)|(?P<a_nocontentconvert>__NOCONTENTCONVERT__)|(?P<b_nocontentconvert>__NOCC__))|(?P<a_newsectionlink>__NEWSECTIONLINK__)|(?P<a_nonewsectionlink>__NONEWSECTIONLINK__)|(?P<a_hiddencat>__HIDDENCAT__)|(?P<a_expectunusedcategory>__EXPECTUNUSEDCATEGORY__)|(?P<a_index>__INDEX__)|(?P<a_noindex>__NOINDEX__)|(?P<a_staticredirect>__STATICREDIRECT__)|(?P<a_noglobal>__NOGLOBAL__)|(?P<a_disambiguation>__DISAMBIG__))__|\<\!\-\-\(\?\>\[\\s\\S\]\*\?\-\-\>\))*)(<nowiki>\s+<\/nowiki>)([^\n]*(?:\n|$))/Dim';
$s = "<nowiki> </nowiki>x\n\n<nowiki> </nowiki>y";
$pieces = preg_split( $re, $s, -1, PREG_SPLIT_DELIM_CAPTURE);
error_log(json_encode($pieces));

[subbu@earth:~/work/wmf/parsoid] php /tmp/bug.php
["","","","","","","","","","","","","","","","","","","","","<nowiki> <\/nowiki>","x\n","\n","","","","","","","","","","","","","","","","","","","","<nowiki> <\/nowiki>","y",""]

The JS equivalent regexp is:

/^((?:(?:#REDIRECT|#redirect)[ \t\n\r\x0c]*(?::[ \t\n\r\x0c]*)?\[\[[^\]]+\]\])?(?:\[\[(?:Category)\:[^\]]*?\]\]|__(?:NOGLOBAL|DISAMBIG|NOCOLLABORATIONHUBTOC|nocollaborationhubtoc|NOTOC|notoc|NOGALLERY|nogallery|FORCETOC|forcetoc|TOC|toc|NOEDITSECTION|noeditsection|NOTITLECONVERT|notitleconvert|NOTC|notc|NOCONTENTCONVERT|nocontentconvert|NOCC|nocc|NEWSECTIONLINK|NONEWSECTIONLINK|HIDDENCAT|EXPECTUNUSEDCATEGORY|INDEX|NOINDEX|STATICREDIRECT)__|<!--(?:[^-]|-(?!->))*-->)*)(<nowiki>\s+<\/nowiki>)([^\n]*(?:\n|$))/im

Note that the PHP version, in addition to adding unnecessary ?P<a_notoc> groupings, is also doubling the leading and trailing underscores, so instead of matching __NOTOC__ the PHP version is looking for ____NOTOC____ etc.
The part of the regexp which matches the HTML comment is also over-escaped in PHP, so it's matching literal parens instead of grouping.
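
A quick standalone check of the over-escaping (the second pattern is what the fragment presumably should have been, i.e. the usual unescaped comment regexp):

<?php
// The first pattern is the comment fragment exactly as it appears in the
// dumped regexp above; the second is the unescaped form.
$escaped = '/\<\!\-\-\(\?\>\[\\s\\S\]\*\?\-\-\>\)/';
$intended = '/<!--(?>[\s\S]*?-->)/';

var_dump( preg_match( $escaped, '<!-- a comment -->' ) ); // int(0): no match
var_dump( preg_match( $intended, '<!-- a comment -->' ) ); // int(1): matches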

If I run 1,$ s/(?P<[^>]*>\([^)]*\))/\1/g in vim (independent of the __ bug), I get the expected number of split matches.
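
For the record, a PHP equivalent of that vim hack (a diagnostic workaround only, not the real fix), shown here on a short excerpt of the dumped regexp:

<?php
// Drop each (?P<name>...) wrapper, keeping only its contents, so the pattern
// gains no extra capture groups.
$excerpt = '(?i:(?P<a_notoc>__NOTOC__)|(?P<a_nogallery>__NOGALLERY__))';
$stripped = preg_replace( '/\(\?P<[^>]*>([^)]*)\)/', '$1', $excerpt );
echo $stripped, "\n"; // (?i:__NOTOC__|__NOGALLERY__)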

Change 538112 had a related patch set uploaded (by C. Scott Ananian; owner: C. Scott Ananian):
[mediawiki/services/parsoid@master] WIP: refactor solTransparentWikitext{,NoWs}Regexp() into base class

https://gerrit.wikimedia.org/r/538112

Change 538112 merged by jenkins-bot:
[mediawiki/services/parsoid@master] Refactor solTransparentWikitext{,NoWs}Regexp() into base class

https://gerrit.wikimedia.org/r/538112