Page MenuHomePhabricator

PHP Notice: Undefined offset in wikimedia/remex-html when rendering rest.php error page
Open, MediumPublicPRODUCTION ERROR

Description

Error
normalized_message
[{reqId}] {exception_url}   PHP Notice: Undefined offset: 1634118181
exception.trace
from /srv/mediawiki/php-1.38.0-wmf.3/vendor/wikimedia/remex-html/src/TreeBuilder/CachingStack.php(235)
#0 /srv/mediawiki/php-1.38.0-wmf.3/vendor/wikimedia/remex-html/src/TreeBuilder/CachingStack.php(235): MWExceptionHandler::handleError(integer, string, string, integer, array)
#1 /srv/mediawiki/php-1.38.0-wmf.3/vendor/wikimedia/remex-html/src/TreeBuilder/TreeBuilder.php(178): Wikimedia\RemexHtml\TreeBuilder\CachingStack->push(Wikimedia\RemexHtml\TreeBuilder\Element)
#2 /srv/mediawiki/php-1.38.0-wmf.3/vendor/wikimedia/remex-html/src/TreeBuilder/Dispatcher.php(345): Wikimedia\RemexHtml\TreeBuilder\TreeBuilder->startDocument(Wikimedia\RemexHtml\Tokenizer\Tokenizer, string, string)
#3 /srv/mediawiki/php-1.38.0-wmf.3/vendor/wikimedia/remex-html/src/Tokenizer/Tokenizer.php(172): Wikimedia\RemexHtml\TreeBuilder\Dispatcher->startDocument(Wikimedia\RemexHtml\Tokenizer\Tokenizer, string, string)
#4 /srv/mediawiki/php-1.38.0-wmf.3/includes/tidy/RemexDriver.php(84): Wikimedia\RemexHtml\Tokenizer\Tokenizer->execute(array)
#5 /srv/mediawiki/php-1.38.0-wmf.3/includes/parser/Parser.php(1703): MediaWiki\Tidy\RemexDriver->tidy(string, array)
#6 /srv/mediawiki/php-1.38.0-wmf.3/includes/parser/Parser.php(656): Parser->internalParseHalfParsed(string, boolean, boolean)
#7 /srv/mediawiki/php-1.38.0-wmf.3/includes/cache/MessageCache.php(1312): Parser->parse(string, Title, ParserOptions, boolean)
#8 /srv/mediawiki/php-1.38.0-wmf.3/includes/language/Message.php(1334): MessageCache->parse(string, Title, boolean, boolean, Language)
#9 /srv/mediawiki/php-1.38.0-wmf.3/includes/language/Message.php(931): Message->parseText(string)
#10 /srv/mediawiki/php-1.38.0-wmf.3/includes/language/Message.php(983): Message->format(string)
#11 /srv/mediawiki/php-1.38.0-wmf.3/includes/skins/Skin.php(698): Message->parse()
#12 /srv/mediawiki/php-1.38.0-wmf.3/includes/skins/SkinMustache.php(74): Skin->printSource()
#13 /srv/mediawiki/php-1.38.0-wmf.3/skins/Vector/includes/SkinVector.php(380): SkinMustache->getTemplateData()
#14 /srv/mediawiki/php-1.38.0-wmf.3/includes/skins/SkinMustache.php(56): SkinVector->getTemplateData()
#15 /srv/mediawiki/php-1.38.0-wmf.3/skins/Vector/includes/SkinVector.php(327): SkinMustache->generateHTML()
#16 /srv/mediawiki/php-1.38.0-wmf.3/includes/skins/SkinTemplate.php(144): SkinVector->generateHTML()
#17 /srv/mediawiki/php-1.38.0-wmf.3/includes/OutputPage.php(2669): SkinTemplate->outputPage()
#18 /srv/mediawiki/php-1.38.0-wmf.3/includes/exception/MWExceptionRenderer.php(158): OutputPage->output()
#19 /srv/mediawiki/php-1.38.0-wmf.3/includes/exception/MWExceptionRenderer.php(77): MWExceptionRenderer::reportHTML(Wikimedia\RequestTimeout\RequestTimeoutException)
#20 /srv/mediawiki/php-1.38.0-wmf.3/includes/exception/MWExceptionHandler.php(107): MWExceptionRenderer::output(Wikimedia\RequestTimeout\RequestTimeoutException, integer)
#21 /srv/mediawiki/php-1.38.0-wmf.3/includes/exception/MWExceptionHandler.php(202): MWExceptionHandler::report(Wikimedia\RequestTimeout\RequestTimeoutException)
#22 /srv/mediawiki/php-1.38.0-wmf.3/includes/exception/MWExceptionHandler.php(170): MWExceptionHandler::handleException(Wikimedia\RequestTimeout\RequestTimeoutException, string)
#23 [internal function]: MWExceptionHandler::handleUncaughtException(Wikimedia\RequestTimeout\RequestTimeoutException)
#24 {main}
Impact
Notes

Details

Request URL
https://fr.wikipedia.org/w/rest.php/fr.wikipedia.org/v3/page/pagebundle/Utilisateur%3AWrangel%2Fsuivi/181379799

Related Objects

StatusSubtypeAssignedTask
OpenPRODUCTION ERRORNone
OpenNone
ResolvedDzahn
Resolvedhashar
ResolvedJdforrester-WMF
ResolvedLadsgroup
ResolvedMoritzMuehlenhoff
Resolvedjijiki
ResolvedMoritzMuehlenhoff
ResolvedTrizek-WMF
ResolvedDzahn
Resolved Gilles
ResolvedDzahn
ResolvedRequestPapaul
Resolvedjijiki
DeclinedNone
ResolvedDzahn
ResolvedDzahn
ResolvedPapaul
ResolvedCmjohnson
ResolvedRequestCmjohnson
ResolvedRequestPapaul
ResolvedAndrew
ResolvedArielGlenn
ResolvedDzahn
ResolvedLegoktm
ResolvedPapaul
ResolvedDzahn
Declined Gilles
ResolvedVolans
ResolvedDzahn
ResolvedLegoktm
ResolvedPleaseStand
OpenNone
OpenJoe
OpenJoe
ResolvedArielGlenn
OpenNone
Opentstarling
OpenJdforrester-WMF
OpenJdforrester-WMF
ResolvedLegoktm
ResolvedLegoktm
OpenJdforrester-WMF
ResolvedDaimona
ResolvedDaimona
ResolvedJoe

Event Timeline

Arlolra triaged this task as Medium priority.Oct 21 2021, 10:15 PM
Arlolra moved this task from Needs Triage to Bugs & Crashers on the Parsoid board.
Arlolra added a project: Excimer.
Arlolra added a subscriber: Arlolra.

This is kind of interesting here because the crasher isn't in Parsoid but in trying to generate the output page for the exception handler in MediaWiki.

We have a "The maximum execution time of 60 seconds was exceeded" firing during the parse,

from /srv/mediawiki/php-1.38.0-wmf.3/vendor/wikimedia/request-timeout/src/Detail/ExcimerTimerWrapper.php(97)
#0 /srv/mediawiki/php-1.38.0-wmf.3/vendor/wikimedia/request-timeout/src/Detail/ExcimerTimerWrapper.php(72): Wikimedia\RequestTimeout\Detail\ExcimerTimerWrapper->onTimeout(integer)
#1 /srv/mediawiki/php-1.38.0-wmf.3/vendor/wikimedia/parsoid/src/Config/Env.php(424): Wikimedia\RequestTimeout\Detail\ExcimerTimerWrapper->Wikimedia\RequestTimeout\Detail\{closure}(integer)
#2 /srv/mediawiki/php-1.38.0-wmf.3/vendor/wikimedia/parsoid/src/Wt2Html/Grammar.php(61): Wikimedia\Parsoid\Config\Env->getSiteConfig()
#3 /srv/mediawiki/php-1.38.0-wmf.3/vendor/wikimedia/wikipeg/src/PEGParserBase.php(287): Wikimedia\Parsoid\Wt2Html\Grammar->initialize()
#4 /srv/mediawiki/php-1.38.0-wmf.3/vendor/wikimedia/parsoid/src/Wt2Html/Grammar.php(14788): Wikimedia\WikiPEG\PEGParserBase->initInternal(string, array)
#5 /srv/mediawiki/php-1.38.0-wmf.3/vendor/wikimedia/parsoid/src/Wt2Html/PegTokenizer.php(169): Wikimedia\Parsoid\Wt2Html\Grammar->parse(string, array)
#6 /srv/mediawiki/php-1.38.0-wmf.3/vendor/wikimedia/parsoid/src/Wt2Html/PegTokenizer.php(196): Wikimedia\Parsoid\Wt2Html\PegTokenizer->tokenizeSync(string, array)
#7 /srv/mediawiki/php-1.38.0-wmf.3/vendor/wikimedia/parsoid/src/Wt2Html/PegTokenizer.php(237): Wikimedia\Parsoid\Wt2Html\PegTokenizer->tokenizeAs(string, string, boolean)
...

https://logstash.wikimedia.org/app/discover#/doc/logstash-*/logstash-deploy-2021.10.13?id=iVMIeXwB1RMACsoSpzcj

and then when the skin tries to use Remex when displaying a message, it seems to be corrupted.

There's a similar weirdness here T254210#7208515

Krinkle added a subscriber: Krinkle.

If I understand correctly, you suspect that the execution timeout interrupted work where Parsoid was interacting with a Remex instance. And then later when MediaWiki renders the error page for this execution timeout, and rendering an interface message which is then tidied with Remex, fails because the same Remex instance is used which is corrupt. I think I got this wrong since it seems odd that the same Remex state would be shared across content parse and interface message, and across Parsoid and MW core.

Some ideas:

  • Consider use of a "critical section" (provided by MediaWikiServices, via wikimedia/request-timeout lib, powered by php-excimer in production). These allow you to defer timeout exceptions a little bit, especially around during quick operations that need to be kept atomic.
  • Or, a catch statement high-up in Parsoid that would perform some kind of emergency clean up before letting the exception propagate to the outside.
  • Or, ensure Parsoid has its own Remex instance, which would naturally mean that when a fatal error happens and the Parser/Parsoid invocation is discarded in favour of an error page, that anything used there is naturally not reachable or re-used (assuming no global state binding it back together).
  • Or, ensure core has its own instance.

Untagging Excimer since this isn't an issue with the php-excimer library itself. For help with critical sections, core message tidying, or rest.php error page handling; I'd recommend tagging Platform Engineering.

Krinkle renamed this task from PHP Notice: Undefined offset: 1634118181 to PHP Notice: Undefined offset in wikimedia/remex-html when rendering rest.php error page.Nov 4 2021, 9:04 PM

I think I got this wrong since it seems odd that the state Remex state would be shared across content and interface message, and across Parsoid and MW core.

Right, that's not what I'm saying. I'm saying I suspect this is memory corruption.

Change 736929 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[operations/mediawiki-config@master] Add shorttimeout option to X-Wikimedia-Debug

https://gerrit.wikimedia.org/r/736929

Change 736929 merged by jenkins-bot:

[operations/mediawiki-config@master] Add shorttimeout option to X-Wikimedia-Debug

https://gerrit.wikimedia.org/r/736929

Mentioned in SAL (#wikimedia-operations) [2021-11-04T23:51:02Z] <tstarling@deploy1002> Synchronized wmf-config/CommonSettings.php: XWD timeout testing T293568 (duration: 00m 54s)

I did 100 timeout requests against production with the shorttimeout option. Nothing obvious went wrong, all responses have length 17447, nothing in error.log, nothing in exception-json.log except the timeouts.

The notice message indicates that one of the scope arrays ($allScopes, $nonTableScopes, $listScopes, $buttonScopes or $selectOnly) became corrupted. These are private static properties which are meant to be constant. Nothing ever assigns to them. So it's highly unlikely that the problem is a lack of exception safety in RemexHtml.

One possibility is a PHP VM bug. These arrays are returned by getScopeTypesToStack() into a temporary variable which is used for foreach. Perhaps there is a critical section for interrupt handling during function return that causes the VM stack slot holding the return value to act like a dangling pointer.

It might be interesting to patch Excimer to work like a breakpoint instead of like a timeout. We could unconditionally set EG(vm_interrupt)=1 in the handler, causing the interrupt function to fire every time, then call back into userspace when the opline matches a constant value.

I filed T295105 in case it is helpful as another example. But, that could be merged into this.

It might be interesting to patch Excimer to work like a breakpoint instead of like a timeout. We could unconditionally set EG(vm_interrupt)=1 in the handler, causing the interrupt function to fire every time, then call back into userspace when the opline matches a constant value.

That doesn't work because it goes into an infinite tail call loop waiting for EG(vm_interrupt) to become 0 again before moving on to the next op.

That doesn't work because it goes into an infinite tail call loop

So I patched PHP 8.0 to not do that, then I was able to reproduce the warning "Undefined array key "p"" by throwing an exception from CachingStack::push() at an oparray offset of 53, which is the fetch of HTMLData::NS_HTML on line 247. After the exception is thrown, ZEND_HANDLE_EXCEPTION_SPEC_HANDLER() frees the result of the throwing opcode, which is the temporary variable T7, but since the fetch has not been executed, T7 still holds self::$predicateMap, which was loaded on line 240. The array is freed and self::$predicateMap becomes a dangling pointer.

nikic's overview of the PHP 7 VM says:

The case of the result operand is more tricky, because the answer here changed between PHP 7.1 and 7.2: In PHP 7.1 the instruction was responsible for freeing the result in case of an exception. In PHP 7.2 it is automatically freed (and the instruction is responsible for making sure the result is always populated).

Throwing an exception from zend_interrupt_function apparently prevents the result from being populated.

Minimal test case:

<?php

class C {
	public static $cond = 1;
	public static $a;
}

C::$a = [ C::$cond ]; // make countable zval

function go() {
	while ( true ) {
		$cond = C::$cond;
		// T1 = FETCH_STATIC_PROP_R string("a") string("C")
		C::$a;
		// FREE T1
		// JMPZ CV0($cond) 0000
		if ( $cond )
			// T1 = FETCH_STATIC_PROP_R string("a") string("C")
			// (not executed but T1 freed)
			C::$a;
	}
}

pcntl_async_signals( true );
pcntl_signal( SIGUSR1, function () {
	throw new Exception( 'ping' );
} );
while ( true ) {
	try {
		go();
	} catch ( Exception $e ) {}
}

Run with opcache.enable_cli=1 since the tmpvar sharing is introduced by the optimizer. With stock PHP 8.0.12 it segfaults after receiving about 4 SIGUSR1 signals.

I tried moving the opline assignment in ZEND_VM_JMP() to after the interrupt check, but in e.g. ZEND_JMPNZ, the opline has already been incremented by the time ZEND_VM_JMP() is called. It would be necessary to change how conditional jumps work somewhat.

I tried moving the interrupt check out of ZEND_VM_SET_OPCODE(), instead putting it at the top of every JMP* handler. The idea is that if an interrupt is executed, the handler will return early without incrementing opline, so the same handler will be re-entered. The complication with this idea is that smart branches do not enter the jump handlers, they make their own call to ZEND_VM_SET_OPCODE(). A few other things call ZEND_VM_SET_OPCODE() and would need to get special handling, for example ZEND_DO_ICALL.

The general idea is that userspace code should never be executed unless EX(opline) points to an opline which has been fully executed, or at least has initialised its return value.

So it turns out that Dmitry and Nikic were working on the same problem simultaneously, and a fix was just committed. I confirmed that https://github.com/php/php-src/commit/fa0b84a06b03a1c2a2bcadd647232a8a4a90aa05 can be cleanly applied to PHP 7.4, and it fixes the reduced test case. After applying the fix, I was able to send 600 timeout exceptions to Parsoid without it segfaulting or aborting, when compiled in debug mode with AddressSanitizer. Before the fix, it segfaulted after handling 5 timeouts.

@MoritzMuehlenhoff I would suggest applying that commit to PHP 7.4 in production, so that we will get the fix as PHP 7.4 is rolled out. I don't think it's urgent enough to backport to 7.2.

In PHP git master there is a remaining issue which caused AddressSanitizer to detect a use-after-free in debug_backtrace_get_args() after about 50 timeouts.

Talked with Moritz, I'll take care of cherry-picking the patch for 7.4.

Mentioned in SAL (#wikimedia-operations) [2021-11-15T21:21:57Z] <legoktm> uploaded php7.4_7.4.25-1+wmf2+buster1_amd64.changes to apt.wm.o with patch for T293568

Change 739005 had a related patch set uploaded (by Legoktm; author: Legoktm):

[operations/docker-images/production-images@master] Rebuild PHP 7.4 images for T293568 patch

https://gerrit.wikimedia.org/r/739005

Change 739005 merged by Legoktm:

[operations/docker-images/production-images@master] Rebuild PHP 7.4 images for T293568 patch

https://gerrit.wikimedia.org/r/739005

The 7.4 packages have been updated (7.4.25-1+wmf2+buster1 is the correct version), I rebuilt the Docker images and upgraded deployment-mediawiki12 in the beta cluster.

I guess this should be left open and blocked on T271736: Migrate WMF Production from PHP 7.2 to PHP 7.4?