Zest.php: Pagebundle routes timeout for some pages
Closed, Resolved · Public

Description

ssastry@scandium:/srv/deployment/parsoid/deploy/src$ time curl -x scandium.eqiad.wmnet:80 http://en.wikipedia.org/w/rest.php/en.wikipedia.org/v3/page/html/Ken%20Schrader/913657403 > /dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 1339k    0 1339k    0     0   168k      0 --:--:--  0:00:07 --:--:--  358k

real	0m7.962s
user	0m0.012s
sys	0m0.004s

ssastry@scandium:/srv/deployment/parsoid/deploy/src$ time curl -x scandium.eqiad.wmnet:80 http://en.wikipedia.org/w/rest.php/en.wikipedia.org/v3/transform/wikitext/to/pagebundle/Ken%20Schrader/913657403 -X POST < /dev/null
<!DOCTYPE html>
<html lang="en" dir="ltr">
<meta charset="utf-8">
<title>Wikimedia Error</title>
<style>/* inline styles elided */</style>
<div class="content" role="main">
<a href="https://www.wikimedia.org"><img src="https://www.wikimedia.org/static/images/wmf.png" srcset="https://www.wikimedia.org/static/images/wmf-2x.png 2x" alt=Wikimedia width=135 height=135></a>
<h1>Error</h1>
<p>Our servers are currently under maintenance or experiencing a technical problem. Please <a href="" title="Reload this page" onclick="location.reload(false); return false">try again</a> in a few&nbsp;minutes.</p><p>See the error message at the bottom of this page for more&nbsp;information.</p>
</div>
<div class="footer">
<p>If you report this error to the Wikimedia System Administrators, please include the details below.</p>
<p class="text-muted"><code>
  PHP fatal error: <br/> Maximum execution time of 180 seconds exceeded</code></p></div>
</html>

real	3m0.308s
user	0m0.016s
sys	0m0.000s

In logstash, we find:

w/rest.php/en.wikipedia.org/v3/transform/wikitext/to/pagebundle/Ken%20Schrader/913657403   PHP Fatal Error from line 216 of /srv/deployment/parsoid/deploy/vendor/wikimedia/zest-css/src/ZestInst.php: Maximum execution time of 180 seconds exceeded

Not sure if this is a problem with Zest or with how Zest is used, but pagebundle routes are failing where the regular routes aren't.

Event Timeline

ssastry triaged this task as High priority. · Sep 6 2019, 3:29 AM

Is there any way to get a fuller stack trace for the failure?

$ time php bin/parse.php --pageName 'Ken Schrader' --domain en.wikipedia.org < /dev/null
real	0m35.269s
user	0m7.081s
sys	0m0.332s

So it's probably something related to the pagebundle transformation, rather than to the parse itself (since the parse seems to execute in a reasonable time).

<cscott> intuition is that it's something about how we're pulling out data-parsoid or adding ids, i don't recall zest being used in that code but if it is it would fall under suspicion
<subbu> all the timeouts are in zest .. so, that points at zest.
<cscott> oh, there are other failures of this sort?  how many?
<subbu> before the disable-gc patch, there were a few spread around the place and they are all gone now .. i thought even the zest ones were related to gc .. but since they persisted and are the only ones remaining, i suspect it is zest related.
<subbu> let me point you to the logstash board, one sec.
<cscott> just rough order of magnitude is all i was looking for
<subbu> https://logstash.wikimedia.org/goto/af41f78bb2f15937d6638ded0c0adbce
<subbu> 206 instances
<subbu> of zest timeouts
<cscott> you're right that the frequency of this, w/ the limit always caused by zest, makes it much more likely we're not simply unlucky in hitting execution time limits
<cscott> yeah, 206 is pretty definitive that it's actually zest at fault. ;)
<subbu> there is 1 that is not zest and that is: "PHP Fatal error: Maximum execution time of 180 seconds exceeded in /srv/deployment/parsoid/deploy/src/src/Utils/DOMTraverser.php on line 66 "
<cscott> hm!
<subbu> but, that could just be a zest fallout . i.e. this one took long but not 180s and tipped that one over the edge.
<subbu> but, not sure. :)
<subbu> anyway, look at that logstash board and you can get list of titles, etc.
<cscott> yeah, but i bet it's someplace that is using zest and DOMTraverser in close proximity

ssastry moved this task from Backlog to Performance on the Parsoid-PHP board.

The only obvious use of Zest in the pagebundle path I can see is:

		$dpScriptElt = DOMCompat::getElementById( $doc, 'mw-pagebundle' );

from DOMDataUtils::extractPageBundle(), called from ContentUtils::extractDpAndSerialize(), called from Parsoid::wikitext2html(), called from ParsoidHandler::wt2html().

The more likely path:

			DOMDataUtils::injectPageBundle( $document, DOMDataUtils::getPageBundle( $document ) );

from DOMPostProcessor::addMetaData() doesn't seem to use Zest at all.

The only pessimal case in DOMCompat::getElementById would seem to be if *every* element matched #mw-pagebundle -- in this case the traditional implementation would exit as soon as the first was found, but our compatibility thunk would still have to enumerate all of them into an array before returning the head of that array.
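
For illustration, a thunk of that shape would look roughly like this (a sketch, not the actual DOMCompat source; it assumes only the static ZestInst::getElementsById signature quoted further below):

	// Hypothetical sketch of the compatibility thunk, *not* the real
	// DOMCompat code: it must enumerate every match before returning
	// the first one, where a native implementation could stop early.
	function getElementByIdThunk( \DOMDocument $doc, string $id ): ?\DOMElement {
		$matches = ZestInst::getElementsById( $doc, $id ); // builds the full array
		return $matches[0] ?? null; // ... only to use its head
	}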

But I bet it's more likely that there's some sort of DOM loop which is causing Zest::getElementsById not to return at all. Still looking...

No, I found it. From DOMDataUtils.php:

	public static function storeInPageBundle( DOMElement $node, Env $env, stdClass $data ): void {
		$uid = $node->getAttribute( 'id' ) ?? '';
		$document = $node->ownerDocument;
		$pb = self::getPageBundle( $document );
		$docDp = $pb->parsoid;
		$origId = $uid ?: null;
		if ( array_key_exists( $uid, $docDp->ids ) ) {
			$uid = null;
			// FIXME: Protect mw ids while tokenizing to avoid false positives.
			$env->log( 'info', 'Wikitext for this page has duplicate ids: ' . $origId );
		}
		if ( !$uid ) {
			do {
				$docDp->counter += 1;
				$uid = 'mw' . PHPUtils::counterToBase64( $docDp->counter );
			} while ( DOMCompat::getElementById( $document, $uid ) );
			self::addNormalizedAttribute( $node, 'id', $uid, $origId );
		}
		$docDp->ids[$uid] = $data->parsoid;
		if ( isset( $data->mw ) ) {
			$pb->mw->ids[$uid] = $data->mw;
		}
	}

Note the loop in the middle, which calls getElementById() for every id it assigns, in order to ensure it's unused.
But DOMCompat::getElementById() calls ZestInst::getElementsById(), and here's that code:

	public static function getElementsById( DOMNode $context, string $id ): array {
		$doc = ( $context instanceof \DOMDocument ) ?
			$context : $context->ownerDocument;
		// PHP doesn't provide a DOMElement-scoped version of
		// getElementById, so we can't call this directly on $context --
		// but that's okay because (1) IDs should be unique, and
		// (2) we verify the scope of the returned element below
		// anyway (to work around bugs with deleted-but-not-gc'ed
		// nodes).
		$r = $doc->getElementById( $id );
		// Note that $r could be null here because the
		// DOMDocument hasn't had an "id attribute" set, even if the id
		// exists in the document. See:
		// http://php.net/manual/en/domdocument.getelementbyid.php
		if ( $r !== null ) {
			// Verify that this node is actually rooted in the
			// document (or in the context), since the element
			// isn't removed from the index immediately when it
			// is deleted. (Also PHP's call is not scoped.)
			for ( $parent = $r; $parent; $parent = $parent->parentNode ) {
				if ( $parent === $context ) {
					return [ $r ];
				}
			}
			// It's possible a deleted-but-still-indexed element was
			// shadowing a later-added element, so we can't return
			// null here directly; fallback to a full search.
		}
		// Do an xpath search, which is still a full traversal of the tree
		// (sigh) but 25% faster than traversing it wholly in PHP.
		$xpath = new \DOMXPath( $doc );
		$query = './/*[@id=' . self::xpathQuote( $id ) . ']';
		return iterator_to_array( $xpath->query( $query, $context ) );
	}

So because \DOMDocument::getElementById is not guaranteed to return anything (if the caller didn't set the id attribute for the document), we need to do a full traversal in the case where PHP's native implementation returns null.
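
A standalone demonstration of that quirk with stock PHP DOM (nothing Parsoid-specific):

	// Stock PHP DOM: getElementById() consults an internal index of
	// "id attributes"; a plain setAttribute( 'id', ... ) never
	// populates that index.
	$doc = new \DOMDocument();
	$html = $doc->appendChild( $doc->createElement( 'html' ) );
	$p = $html->appendChild( $doc->createElement( 'p' ) );
	$p->setAttribute( 'id', 'foo' );

	var_dump( $doc->getElementById( 'foo' ) ); // NULL -- not indexed!

	// Explicitly mark the attribute as an ID, and the lookup works.
	$p->setIdAttribute( 'id', true );
	var_dump( $doc->getElementById( 'foo' )->tagName ); // string(1) "p"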

So that turns the above algorithm into an O(N^2) algorithm, because we do an O(N) tree search for each of the O(N) nodes of the tree.

The best fix would be for PHP's DOM implementation not to suck. But I can come up with a reasonable workaround...

So this is another case where T215000: Fill gaps in PHP DOM's functionality bites us again and we need T217867: Port domino (or another spec-compliant DOM library) to PHP for a proper fix. But I've got a reasonable workaround.
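
The patch itself is below; as a rough sketch of the direction (hypothetical variable names, not the actual diff), the idea is to answer the uniqueness check from a PHP-side set instead of calling getElementById() per generated id:

	// Sketch only, not the merged change: remember assigned ids so the
	// check in storeInPageBundle's loop is an O(1) lookup.
	$usedIds = []; // ids handed out so far in this pass
	do {
		$docDp->counter += 1;
		$uid = 'mw' . PHPUtils::counterToBase64( $docDp->counter );
		// Caveat: this is only sufficient if no *pre-existing* document
		// ids can collide with the generated ones -- the wrinkle that
		// comes up again further below.
	} while ( isset( $usedIds[$uid] ) );
	$usedIds[$uid] = true;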

Change 535290 had a related patch set uploaded (by C. Scott Ananian; owner: C. Scott Ananian):
[mediawiki/services/parsoid@master] Avoid a O(N^2) blowup when assigning pagebundle ids

https://gerrit.wikimedia.org/r/535290

More of the workaround is in T232390: Remex does not use \DOMElement::setIdAttribute('id') by default, but fundamentally this doesn't fix the issue: cases where getElementById() returns null still have to fall back on a full O(N) tree search (O(N^2) over the whole pass), because we can't guarantee that *every* way an element could have been created/imported into the document and/or an attribute set ended up calling DOMElement::setIdAttribute where needed. And of course Zest can't *tell* whether the user always set the id attribute, so there's an abstraction violation there, too...
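
For what it's worth, registering ids after the fact needs nothing beyond stock PHP DOM calls; a minimal sketch (illustrating the direction T232390 points at, not the actual Remex change):

	// Sketch: register every existing id attribute with libxml's index
	// so DOMDocument::getElementById() stops returning spurious nulls.
	// Stock DOMXPath/DOMElement calls only; hypothetical helper name.
	function registerIds( \DOMDocument $doc ): void {
		$xpath = new \DOMXPath( $doc );
		foreach ( $xpath->query( '//*[@id]' ) as $el ) {
			/** @var \DOMElement $el */
			$el->setIdAttribute( 'id', true );
		}
	}

But, as above, this only helps if it runs after *every* code path that can add or import elements, which is exactly the guarantee Zest can't see.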

Change 535290 merged by jenkins-bot:
[mediawiki/services/parsoid@master] Avoid a O(N^2) blowup when assigning pagebundle ids

https://gerrit.wikimedia.org/r/535290

Change 535742 had a related patch set uploaded (by Subramanya Sastry; owner: Subramanya Sastry):
[mediawiki/services/parsoid@master] Followup to 6fa28acd: Cleanup pass also needs the ids trick

https://gerrit.wikimedia.org/r/535742

Change 535742 merged by jenkins-bot:
[mediawiki/services/parsoid@master] Followup to 6fa28acd: Cleanup pass also needs the ids trick

https://gerrit.wikimedia.org/r/535742

Hmm ... better, but there is still something to investigate. If this is not zest.php / getElementById, we should close this and open a new ticket. But I am heading out, so I'm reporting the results for now:

ssastry@scandium:/srv/deployment/parsoid/deploy/src$ time curl -x scandium.eqiad.wmnet:80 http://en.wikipedia.org/w/rest.php/en.wikipedia.org/v3/transform/wikitext/to/pagebundle/Ken%20Schrader/913657403 -X POST < /dev/null > /dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 1431k    0 1431k    0     0  11228      0 --:--:--  0:02:10 --:--:--  340k

real	2m10.587s
user	0m0.036s
sys	0m0.020s

2m 11s compared to ~8s for non-pagebundle requests.

The solution in https://gerrit.wikimedia.org/r/535290 and followups to avoid invoking getElementById fails because the Cite extension introduces a number of id attributes, and so we fall back to the full search.

Specifically, <span id="mw-reference-text-cite_note-1" ..... So, if there is a way to refine the query selector to exclude mw-prefixed ids from the search, we may be able to get away with this. Otherwise, we'll have to change our prefix, OR just use a more robust solution like building the index with a dom walk before running this pass on the dom.
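
A minimal sketch of that last option (hypothetical helper name, stock PHP DOM calls only): one O(N) scan collects every existing id, and the assignment loop then only consults the set:

	// Sketch of the "build the index with a dom walk" option: one
	// upfront scan, then O(1) membership checks per generated id.
	// Hypothetical helper, not the merged patch.
	function collectDocumentIds( \DOMDocument $doc ): array {
		$ids = [];
		$xpath = new \DOMXPath( $doc );
		foreach ( $xpath->query( '//*[@id]' ) as $el ) {
			$ids[$el->getAttribute( 'id' )] = true;
		}
		return $ids;
	}

	// The loop in storeInPageBundle can then test isset( $ids[$uid] )
	// and add each newly assigned id back into $ids.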

Change 537199 had a related patch set uploaded (by Subramanya Sastry; owner: Subramanya Sastry):
[mediawiki/services/parsoid@master] Fix inefficiencies looking up id attributes in storeDataAttribs

https://gerrit.wikimedia.org/r/537199

Change 537199 merged by jenkins-bot:
[mediawiki/services/parsoid@master] Fix inefficiencies looking up id attributes in storeDataAttribs

https://gerrit.wikimedia.org/r/537199

This looks resolved based on the new test run on scandium. After about 5 minutes of the test run, I am already noticing that a number of previously crashing tests are completing quickly.