
OOM reported at SpecialPage.php:534 due to large output from Special:Book
Closed, Resolved · Public · Production Error

Description

This is currently the #1 most common error in fatalmonitor. It seems to be new as of 1.25wmf17 (?).

Logstash query: type:hhvm AND message:"request has exceeded memory limit"

Event Timeline

mmodell raised the priority of this task from to Needs Triage.
mmodell updated the task description. (Show Details)

I was seeing it with wmf16 as well. We just rotated the major branches today so the branch # has changed in the log.

@chad, was it high on the list before? I didn't notice it, and now it's suddenly #1 with 250+ instances of this error.

Maybe not quite as high? Things were really drowned out by T89505 and T89258 in the last week, though.

Sadly, without a stack trace and request URL this one is almost impossible to track down and squash.

I don't think it's actually running out of memory at that line, it probably ran out some time before it reached that line, see https://github.com/facebook/hhvm/issues/4862 . Enabling LogNativeStackOnOOM may help to isolate it.

It seems unlikely this line is the cause:

	public function getOutput() {
		return $this->getContext()->getOutput(); // <-- SpecialPage.php:534, the reported line
	}

RequestContext::getOutput() allocates an OutputPage, but I would expect to see something nastier near an out-of-memory error, such as a giant string concatenation or array copy. I don't really know much about what consumes memory in PHP, though.

Using the 5xx logs on oxygen I obtained a URL which reliably causes an error at SpecialPage.php:534 as described:

http://en.wikipedia.org/w/index.php?title=Special:Book&bookcmd=download&collection_id=3fd9b5e50dd79cbed462e304f6eb69c3a6e38a04&writer=rdf2latex

The response from HHVM has a header Content-Length: 205860154, i.e. 196MB, but the body is replaced with an error message. Varnish apparently doesn't like the incorrect Content-Length and replaces the 500 with its own 503 error message. We had OOMs under Zend in the output stage due to very large output, when the output buffer was duplicated, so that's probably what is happening here also.

I think the issue is a bug in HHVM's ob_get_status(). ExecutionContext::obGetStatus() has

if (level < m_protectedLevel) {
  // Levels below the protected level are the engine's own buffers,
  // yet they are reported as type 1 with the default handler name.
  status.set(s_type, 1);
  status.set(s_name, s_default_output_handler);
} else {
  // User-created buffers are reported as type 0 (internal) with the
  // user's handler name.
  status.set(s_type, 0);
  status.set(s_name, buffer.handler);
}

I think the condition is the wrong way around. If level < m_protectedLevel, it should return 0 (PHP_OUTPUT_HANDLER_INTERNAL). I confirmed that under HHVM, wfResetOutputBuffers() as called by Special:Export aborts in this branch with $status['name'] == 'wfOutputHandler':

if ( $status['type'] == 0 /* PHP_OUTPUT_HANDLER_INTERNAL */ ) {
	// Probably from zlib.output_compression or other
	// PHP-internal setting which can't be removed.
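
For reference, the semantics that check relies on can be probed with a tiny standalone script (hypothetical, not MediaWiki code). On an engine that behaves the way wfResetOutputBuffers() expects, a user-registered handler is reported with type 1 (PHP_OUTPUT_HANDLER_USER) and its callback name, while engine-owned buffers are type 0 (PHP_OUTPUT_HANDLER_INTERNAL):

<?php
// Hypothetical probe, not part of MediaWiki: print type/name for every
// output buffer level as ob_get_status() sees them.
function myOutputHandler( $buffer ) {
	return $buffer;
}

ob_start( 'myOutputHandler' );

foreach ( ob_get_status( true ) as $level => $status ) {
	printf( "level %d: type=%d name=%s\n",
		$level, $status['type'], $status['name'] );
}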

There's more to it than that, unfortunately. If I request a script via FastCGI which does only:

<?php
readfile('/dev/zero');

It rapidly fails with "Fatal error: String length exceeded 2^31-2: 2147483647 in /srv/mw/core/readfile.php on line 2", without sending any data. So it seems readfile() is inherently broken here: according to the documentation, it shouldn't need to read the file into memory unless ob_start() is used.

hphp_context_init() unconditionally calls obStart() with no arguments, which sets up an output buffer with a chunk size of zero. ExecutionContext::write() interprets zero as infinity and buffers all output unless either ob_implicit_flush(true) is called or the configuration option Server.ImplicitFlush (hhvm.server.implicit_flush) is true. If one of those is set, it writes in 1024-byte chunks. Contrast this to Zend PHP, which writes 8192-byte chunks. So HHVM in implicit_flush mode will work, although it will use much more system CPU time than Zend PHP.
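
Given that, one application-side workaround (a sketch only, not the fix that was eventually made in HHVM via T91468) is to avoid readfile() for huge responses and instead stream the file in fixed-size chunks, flushing after each one so nothing accumulates in HHVM's output buffer:

<?php
// Hypothetical helper, not deployed code: stream a large file in fixed-size
// chunks and flush after each one so the response never piles up in the
// output buffer.
function streamFileChunked( $path, $chunkSize = 8192 ) {
	$fh = fopen( $path, 'rb' );
	if ( $fh === false ) {
		return false;
	}
	while ( !feof( $fh ) ) {
		echo fread( $fh, $chunkSize );
		// Flush the userland output buffer (if any) and the SAPI buffer.
		if ( ob_get_level() > 0 ) {
			ob_flush();
		}
		flush();
	}
	fclose( $fh );
	return true;
}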

I don't know the reason for this compatibility break -- perhaps an attempt at micro-optimisation?

greg set Security to None.
tstarling renamed this task from "error: request has exceeded memory limit in /srv/mediawiki/php-1.25wmf17/includes/specialpage/SpecialPage.php on line 534" to "OOM reported at SpecialPage.php:534 due to large output from Special:Book". Mar 3 2015, 11:16 PM

We could increase the memory limit, as a workaround.

Could we predict ahead of time whether a PDF will be too big and give a relevant error message, instead of a 500 turned into a 503? (That's a separate question from raising the memory limit.)

@tstarling any near-term fix would be great, as this impacts a lot of users and we will have to temporarily disable PDF generation altogether if we can't get it under control soon. Flagging @bd808 as well -- it would be good if this could be prioritized as a critical issue.

A possible stopgap until there is a fix for T91468, suggested by @tstarling, would be to limit the size of the PDFs we try to return. This would allow MediaWiki to stop the request with a reasonable error message instead of trying to stream the PDF and aborting with a 503 when the internal buffer overflows.
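
A rough illustration of that stopgap, assuming the render backend serves the finished PDF over HTTP and reports a Content-Length; the helper name and the 50 MB cap below are made up for the sketch, not actual Collection extension code:

<?php
// Hypothetical helper: decide up front whether a rendered PDF is small
// enough to stream, so the caller can show a friendly error page instead
// of letting the request die with an OOM turned into a 503.
function pdfIsSmallEnough( $renderServiceUrl, $maxBytes = 52428800 /* 50 MB */ ) {
	// The second argument asks get_headers() for an associative array.
	$headers = get_headers( $renderServiceUrl, true );
	if ( $headers === false || !isset( $headers['Content-Length'] ) ) {
		// Size unknown; let the caller decide whether to risk streaming.
		return null;
	}
	return (int)$headers['Content-Length'] <= $maxBytes;
}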

Another idea, proposed by @ori, would be to isolate some servers just for PDFs and proxy to them from the main server pool. If the PDF server returned a 503 to the frontend, then a nice error message could be presented to the waiting user.

Can we measure how many successes/failures we're currently getting? If the failure rate is too high, I think we need to disable the feature altogether until the root cause is resolved, rather than putting in stopgap error pages.

A logstash query for type:hhvm AND message:"request has exceeded memory limit" shows 7292 occurrences for the 24 hour period from 2015-03-11T02:50 to 2015-03-12T02:50. The seven day figure is 60953 events. These are not all guaranteed to be PDF serving failures. We may have other content that triggers the HHVM output buffer to overflow.

Why don't we start by raising the memory limit, as suggested above, and see if we can make the incidence rate much lower than it is currently?

As of D32721, HHVM's default is 16 GB, which seems a little insane. (We're at 300MB currently.)

Although the memory limit was raised, I still see memory-exceeded errors at SpecialPage.php:534.

tstarling lowered the priority of this task from High to Medium. Mar 19 2015, 5:51 AM

About 99.99% of these errors come from what is apparently a single user, cycling slowly through *.dip0.t-ipconnect.de IP addresses (about two IP addresses per day), with UA always "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)".

In 5xx.tsv.log-20150318.gz, there were 10511 log entries for this T-Online user and 2 log entries for another user, i.e. only two distinct users were affected during that day.

So I'm lowering the priority, so that @Joe can roll out my fix for T91468 with the upgrade to HHVM 3.6 and we don't need to backport it to 3.3.

The commit message that auto-closed this task was "Increasing the memory limit won't fix T89918". Phab phail.

hashar changed the task status from Open to Stalled. Jun 1 2015, 7:58 PM
hashar updated the task description. (Show Details)
hashar subscribed.

This had 8000 hits over 7 days.

It is pending rollout of a new HHVM version for T91468.

hashar changed the task status from Stalled to Open. Aug 24 2015, 1:09 PM

@tstarling patched HHVM to support streaming output (T91468). This task is no longer stalled.

demon claimed this task.
mmodell changed the subtype of this task from "Task" to "Production Error". Aug 28 2019, 11:12 PM