
OOM reported at SpecialPage.php:534 due to large output from Special:Book
Closed, Resolved · Public · Production Error

Description

This is currently the #1 most common error in fatalmonitor. It seems to be new as of 1.25wmf17 (?).

Logstash query: type:hhvm AND message:"request has exceeded memory limit"

Event Timeline

mmodell raised the priority of this task from to Needs Triage.
mmodell updated the task description. (Show Details)

I was seeing it with wmf16 as well. We just rotated the major branches today so the branch # has changed in the log.

@chad, was it high on the list before? I didn't notice it, and now it's suddenly #1 with 250+ instances of this error.

Maybe not quite as high? Things were really drowned out by T89505 and T89258 in the last week, though.

Sadly, without a stack trace and request URL this one is almost impossible to track down and squash.

I don't think it's actually running out of memory at that line, it probably ran out some time before it reached that line, see https://github.com/facebook/hhvm/issues/4862 . Enabling LogNativeStackOnOOM may help to isolate it.

It seems unlikely this line is the cause:

	public function getOutput() {
		return $this->getContext()->getOutput(); // <-- SpecialPage.php:534, the reported line
	}

RequestContext::getOutput() allocates an OutputPage, but I would expect to see something nastier near an out-of-memory error, such as a giant string concatenation or array copy. I don't really know much about what consumes memory in PHP, though.

Using the 5xx logs on oxygen I obtained a URL which reliably causes an error at SpecialPage.php:534 as described:

http://en.wikipedia.org/w/index.php?title=Special:Book&bookcmd=download&collection_id=3fd9b5e50dd79cbed462e304f6eb69c3a6e38a04&writer=rdf2latex

The response from HHVM has a header Content-Length: 205860154, i.e. 196MB, but the body is replaced with an error message. Varnish apparently doesn't like the incorrect Content-Length and replaces the 500 with its own 503 error message. We had OOMs under Zend in the output stage due to very large output, when the output buffer was duplicated, so that's probably what is happening here also.

I think the issue is a bug in HHVM's ob_get_status(). ExecutionContext::obGetStatus() has

if (level < m_protectedLevel) {
  // Levels below the protected level are the engine's own buffers,
  // yet they are reported as type 1 with the default handler name.
  status.set(s_type, 1);
  status.set(s_name, s_default_output_handler);
} else {
  // User-created buffers are reported as type 0 (internal) with the
  // user's handler name.
  status.set(s_type, 0);
  status.set(s_name, buffer.handler);
}

I think the condition is the wrong way around. If level < m_protectedLevel, it should return 0 (PHP_OUTPUT_HANDLER_INTERNAL). I confirmed that under HHVM, wfResetOutputBuffers() as called by Special:Export aborts in this branch with $status['name'] == 'wfOutputHandler':

if ( $status['type'] == 0 /* PHP_OUTPUT_HANDLER_INTERNAL */ ) {
	// Probably from zlib.output_compression or other
	// PHP-internal setting which can't be removed.
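
For reference, the semantics that check relies on can be probed with a tiny standalone script (hypothetical, not MediaWiki code). On an engine that behaves the way wfResetOutputBuffers() expects, a user-registered handler is reported with type 1 (PHP_OUTPUT_HANDLER_USER) and its callback name, while engine-owned buffers are type 0 (PHP_OUTPUT_HANDLER_INTERNAL):

<?php
// Hypothetical probe, not part of MediaWiki: print type/name for every
// output buffer level as ob_get_status() sees them.
function myOutputHandler( $buffer ) {
	return $buffer;
}

ob_start( 'myOutputHandler' );

foreach ( ob_get_status( true ) as $level => $status ) {
	printf( "level %d: type=%d name=%s\n",
		$level, $status['type'], $status['name'] );
}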

There's more to it than that, unfortunately. If I request a script via FastCGI which does only:

<?php
readfile('/dev/zero');

It rapidly fails with "Fatal error: String length exceeded 2^31-2: 2147483647 in /srv/mw/core/readfile.php on line 2", without sending any data. So it seems readfile() is inherently broken here: according to the documentation, it shouldn't need to read the file into memory unless ob_start() is used.

hphp_context_init() unconditionally calls obStart() with no arguments, which sets up an output buffer with a chunk size of zero. ExecutionContext::write() interprets zero as infinity and buffers all output unless either ob_implicit_flush(true) is called or the configuration option Server.ImplicitFlush (hhvm.server.implicit_flush) is true. If one of those is set, it writes in 1024-byte chunks. Contrast this to Zend PHP, which writes 8192-byte chunks. So HHVM in implicit_flush mode will work, although it will use much more system CPU time than Zend PHP.
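
Given that, one application-side workaround (a sketch only, not the fix that was eventually made in HHVM via T91468) is to avoid readfile() for huge responses and instead stream the file in fixed-size chunks, flushing after each one so nothing accumulates in HHVM's output buffer:

<?php
// Hypothetical helper, not deployed code: stream a large file in fixed-size
// chunks and flush after each one so the response never piles up in the
// output buffer.
function streamFileChunked( $path, $chunkSize = 8192 ) {
	$fh = fopen( $path, 'rb' );
	if ( $fh === false ) {
		return false;
	}
	while ( !feof( $fh ) ) {
		echo fread( $fh, $chunkSize );
		// Flush the userland output buffer (if any) and the SAPI buffer.
		if ( ob_get_level() > 0 ) {
			ob_flush();
		}
		flush();
	}
	fclose( $fh );
	return true;
}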

I don't know the reason for this compatibility break -- perhaps an attempt at micro-optimisation?

greg set Security to None.
tstarling renamed this task from "error: request has exceeded memory limit in /srv/mediawiki/php-1.25wmf17/includes/specialpage/SpecialPage.php on line 534" to "OOM reported at SpecialPage.php:534 due to large output from Special:Book". Mar 3 2015, 11:16 PM

We could increase the memory limit, as a workaround.

Could we predict ahead of time whether a PDF will be too big and give a relevant error message, instead of a 500 turned into a 503? (That's a separate question from raising the memory limit.)

@tstarling any near-term fix would be great, as this impacts a lot of users and we will have to temporarily disable PDF generation altogether if we can't get it under control soon. Flagging @bd808 as well -- it would be good if this could be prioritized as a critical issue.

A possible stopgap until there is a fix for T91468, suggested by @tstarling, would be to limit the size of the PDFs we try to return. This would allow MediaWiki to stop the request with a reasonable error message instead of trying to stream the PDF and aborting with a 503 when the internal buffer overflows.
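
A rough illustration of that stopgap, assuming the render backend serves the finished PDF over HTTP and reports a Content-Length; the helper name and the 50 MB cap below are made up for the sketch, not actual Collection extension code:

<?php
// Hypothetical helper: decide up front whether a rendered PDF is small
// enough to stream, so the caller can show a friendly error page instead
// of letting the request die with an OOM turned into a 503.
function pdfIsSmallEnough( $renderServiceUrl, $maxBytes = 52428800 /* 50 MB */ ) {
	// The second argument asks get_headers() for an associative array.
	$headers = get_headers( $renderServiceUrl, true );
	if ( $headers === false || !isset( $headers['Content-Length'] ) ) {
		// Size unknown; let the caller decide whether to risk streaming.
		return null;
	}
	return (int)$headers['Content-Length'] <= $maxBytes;
}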

Another idea, proposed by @ori, would be to isolate some servers just for PDFs and proxy to them from the main server pool. If the PDF server returned a 503 to the frontend, then a nice error message could be presented to the waiting user.

Can we measure how many successes/failures we're currently getting? If the failure rate is too high, I think we need to disable the feature altogether until the root cause is resolved, rather than putting in stopgap error pages.

A logstash query for type:hhvm AND message:"request has exceeded memory limit" shows 7292 occurrences for the 24 hour period from 2015-03-11T02:50 to 2015-03-12T02:50. The seven day figure is 60953 events. These are not all guaranteed to be PDF serving failures. We may have other content that triggers the HHVM output buffer to overflow.

Why don't we start by raising the memory limit, as suggested above, and see if we can make the incidence rate much lower than it is currently?

As of D32721, HHVM's default is 16 GB, which seems a little insane. (We're at 300MB currently.)

Although the memory limit was raised, I still see memory-exceeded errors at SpecialPage.php:534.

tstarling lowered the priority of this task from High to Medium. Mar 19 2015, 5:51 AM

About 99.99% of these errors come from what is apparently a single user, cycling slowly through *.dip0.t-ipconnect.de IP addresses (about two IP addresses per day), with UA always "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)".

In 5xx.tsv.log-20150318.gz, there were 10511 log entries for this T-Online user and 2 log entries for another user, i.e. only two distinct users were affected during that day.

So I'm lowering the priority, so that @Joe can roll out my fix for T91468 with the upgrade to HHVM 3.6 and we don't need to backport it to 3.3.

The commit message that auto-closed this task was "Increasing the memory limit won't fix T89918". Phab phail.

hashar changed the task status from Open to Stalled. Jun 1 2015, 7:58 PM
hashar updated the task description. (Show Details)
hashar subscribed.

This had 8000 hits over 7 days.

It is pending rollout of a new HHVM version for T91468.

hashar changed the task status from Stalled to Open. Aug 24 2015, 1:09 PM

@tstarling patched HHVM to support streaming output (T91468). This task is no longer stalled.

demon claimed this task.
mmodell changed the subtype of this task from "Task" to "Production Error". Aug 28 2019, 11:12 PM