The memory leak rate we see in production is not significantly reduced by hacking the PCRE cache. Theories needed.
We don't need complete absence of memory leaks, we just need to be able to restart HHVM once every few days instead of hourly.
tstarling, Oct 23 2014, 4:12 AM

Attached files:
- F550: pasted_file (Oct 29 2014, 4:53 AM)
- F537: loadhtml-heap.pdf (Oct 28 2014, 5:26 PM)
- F492: xmlNewText_backtrace.txt (Oct 26 2014, 6:44 PM)
- F409: hhvm_leak.gnumeric (Oct 23 2014, 11:52 PM)
- F393: hhvm_leak.gnumeric (Oct 23 2014, 6:11 AM)
With the URL
http://en.wikipedia.org/w/index.php?title=Module:Hatnote/doc&oldid=609632474
and a local hack to disable the #invoke hook, I confirmed that luasandbox is not at fault: about 1833 bytes/req leaked with Lua enabled, 2887 bytes/req leaked with Lua disabled.
Data collection one-liner:

```shell
while true; do
    curl -s http://localhost:9002/jemalloc-stats \
        | perl -ne 'if (/<allocated>([^<]*)/) { print localtime() . ", " . time() . ", $1\n"; }'
    sleep 10
done
```
ab command line:

```shell
ab -X localhost:80 -c150 -n 50000 'http://en.wikipedia.org/w/index.php?title=Module:Hatnote/doc&oldid=609632474&disablelua=1'
```
Giuseppe suggests https://github.com/facebook/hhvm/issues/2973; we could try running without the JIT.
Previous results were apparently garbage -- I did a longer run and it shows a plateau after about 40k requests. Previous runs stopped at 50k requests so it was easy to miss the plateau in a noisy signal. Stopping the load injection before each memory usage measurement gave very little noise and a plateau after ~16k requests.
I rebuilt libjemalloc1 with --enable-prof:
```shell
$ apt-get source libjemalloc1
$ cd jemalloc-3.5.1
$ cat >> debian/rules
override_dh_auto_configure:
	dh_auto_configure -- --enable-prof
$ dch --distribution trusty-wikimedia --local wmf "Rebuilt with --enable-prof"
$ sudo dpkg-buildpackage -b -uc
```
Then I installed the resultant package on mw1114 and mw1189, the HHVM API servers. (It's on the API servers that the leak is most evident.)
I then enabled jemalloc's heap profiling capabilities via /etc/malloc.conf. The way to do it is to make that file a symlink whose target is an option string rather than an actual file. Strange!
```shell
$ ln -sf "prof:true,prof_gdump:true,lg_prof_sample:21,prof_prefix:/tmp/heaps/" /etc/malloc.conf
```
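As a sanity check, the option string really is just the symlink's target, so it can be read back with readlink. A minimal sketch in a scratch directory (touching /etc itself needs root; the path /tmp/mallocconf-demo is made up for the demo):

```shell
# Demonstrate the /etc/malloc.conf trick in a scratch directory:
# the symlink's *target* is the option string jemalloc parses at startup.
mkdir -p /tmp/mallocconf-demo
cd /tmp/mallocconf-demo
ln -sf "prof:true,prof_gdump:true,lg_prof_sample:21,prof_prefix:/tmp/heaps/" malloc.conf
# readlink prints the target, i.e. the options that would take effect.
readlink malloc.conf
```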
prof_gdump ends up being less useful than I thought. Because virtual memory usage is steadily increasing, there is a new high water mark every few seconds, so I end up with a lot of noise.
The next thing I try is to use the admin server API to request a heap profile every ten minutes.
```shell
#!/bin/bash
# Request a heap profile from HHVM once every ten minutes.
# Profiles are written to /tmp/heaps.
sudo -u apache mkdir -p /tmp/heaps
curl "localhost:9002/jemalloc-prof-activate"
for (( i=1; ; i++ )); do
    curl "localhost:9002/jemalloc-prof-dump?file=/tmp/heaps/${i}.heap"
    sleep 600
done
```
My plan is to use pprof's --base arg to subtract a heap profile from its successor. The basic invocation looks like this:
```shell
$ google-pprof `which hhvm` --show_bytes --base=1.heap 2.heap
```
This puts you in interactive mode. You can run 'top' to see the top 10 routines, etc.
I hack together a simple shell script to help:
```shell
#!/bin/bash
for (( i=1; ; i++ )); do
    base="$((i)).heap"
    prof="$((i+1)).heap"
    [ -e "$prof" ] || break
    echo "-------------------- $base => $prof --------------------"
    google-pprof `which hhvm` --show_bytes --text --base="$base" "$prof" 2>/dev/null | head -6
done
```
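The consecutive-pair loop logic can be checked on its own by stubbing out the google-pprof invocation. A sketch with dummy heap files (the directory /tmp/heapdemo and the echo stand-in are made up for the demo):

```shell
# Stub check of the consecutive-pair loop: create dummy heap files,
# then walk (i, i+1) pairs until the successor file is missing.
mkdir -p /tmp/heapdemo && cd /tmp/heapdemo
touch 1.heap 2.heap 3.heap
i=1
while [ -e "$((i+1)).heap" ]; do
    echo "${i}.heap => $((i+1)).heap"   # stand-in for the real google-pprof call
    i=$((i+1))
done
```

With three dummy files this prints two diff pairs (1=>2, 2=>3) and then stops, which is the behaviour the real script relies on when a dump is still being written.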
Here's the output for an hour's worth of heap profiles:
```
-------------------- 1.heap => 2.heap --------------------
Total: 573794313 B
 402653184  70.2%  70.2%  402653184  70.2% HPHP::AsyncFuncImpl::start
  48235840   8.4%  78.6%   56624736   9.9% xmlNewText
  20972160   3.7%  82.2%   94374384  16.4% xmlSAX2StartElement
  18913696   3.3%  85.5%   18913696   3.3% 00007fe80e4343b8
  16777520   2.9%  88.5%   16777520   2.9% xmlStrndup
-------------------- 2.heap => 3.heap --------------------
Total: 483619254 B
 402653184  83.3%  83.3%  402653184  83.3% HPHP::AsyncFuncImpl::start
  50333056  10.4%  93.7%   56624648  11.7% xmlNewText
  25166592   5.2%  98.9%   25166592   5.2% xmlNewNodeEatName
  23069376   4.8% 103.6%  106957784  22.1% xmlSAX2StartElement
   8388760   1.7% 105.4%    8388760   1.7% xmlStrndup
-------------------- 3.heap => 4.heap --------------------
Total: 139819215 B
  44041360  31.5%  31.5%   50332872  36.0% xmlNewText
  43129392  30.8%  62.3%   43129392  30.8% HPHP::MemoryManager::newSlab
  12583296   9.0%  71.3%   81790932  58.5% xmlSAX2StartElement
  12583296   9.0%  80.3%   12583296   9.0% xmlNewNodeEatName
  12582979   9.0%  89.3%   12582979   9.0% xmlStrndup
-------------------- 4.heap => 5.heap --------------------
Total: 1258200573 B
1207959552  96.0%  96.0% 1207959552  96.0% HPHP::AsyncFuncImpl::start
  33555344   2.7%  98.7%   39846844   3.2% xmlNewText
  12609571   1.0%  99.7%   12609571   1.0% 00007fe80e4343b8
  12605141   1.0% 100.7%   12605141   1.0% pcre_compile2
  12583296   1.0% 101.7%   12583296   1.0% xmlNewNodeEatName
-------------------- 5.heap => 6.heap --------------------
Total: 129656509 B
  35652544  27.5%  27.5%   37749720  29.1% xmlNewText
  33176456  25.6%  53.1%   33176456  25.6% HPHP::MemoryManager::newSlab
  23069376  17.8%  70.9%   23069376  17.8% xmlNewNodeEatName
  14694959  11.3%  82.2%   14694959  11.3% 00007fe80e4343b8
  14680512  11.3%  93.5%   81791088  63.1% xmlSAX2StartElement
```
Working with XML may be among the most common memory-intensive things the API servers do, so the regular appearance of xmlNewText in these lists could be innocent, but it looks suspicious to me.
I'm out of time for now, but the next thing I plan to do is set a breakpoint on xmlNewText in gdb and capture some backtraces.
The backtraces that I've captured finger c_DOMDocument::t_loadhtml (in [[https://github.com/facebook/hhvm/blob/0a3cfb87b8a353fc7e1d15374f4adc413e37aba9/hphp/runtime/ext/ext_domdocument.cpp#L3381-3384|hphp/runtime/ext/ext_domdocument.cpp:3381]]) and its IDL proxy DOMDocument::loadHTML (in [[https://github.com/facebook/hhvm/blob/0a3cfb87b8a353fc7e1d15374f4adc413e37aba9/hphp/system/idl/domdocument.idl.json#L2639-2655|hphp/system/idl/domdocument.idl.json:2639]]) as most suspicious. The frames above them are generic HHVM execution context things and the frames below are in libxml.
Running ApiHelp::fixHelpLinks() on some sample HTML (from http://en.wikipedia.org/wiki/Main_Page) leaked quite nicely. Further isolated to this:
```php
for ( $i = 0; $i < 1000; $i++ ) {
    $doc = new DOMDocument;
    $doc->loadHTML( $html );
    $xpath = new DOMXPath( $doc );
    $nodes = $xpath->query( '//a' );
}
```
The leak does not occur if $xpath->query() is omitted.
Awesome, thanks for the investigation. I can take a look at fixing this unless one of you already is.
<TimStarling> for ( $i = 0; $i < 1000; $i++ ) { $doc = new DOMDocument; $doc->loadHTML($html); $node = $doc->documentElement->firstChild; }
<TimStarling> no need for xpath, you just need a DOMNode reference to be alive when $doc is overwritten
swtaarrs: awesome, go for it.
Further gdb work showed that fetching a node from a document, as in the test case above, created a circular reference which prevented the destructor from being called, but the memory was freed on sweep. Also, fetching a few ApiHelp URLs with ab did not appear to permanently increase memory usage.
Grep indicates that the extension callers of DOMDocument::loadHTML() are CirrusSearch\BuildDocument\PageTextBuilder, TextExtracts, MobileFrontend, DonationInterface, Flow\Parsoid\Utils, CommonsMetadata and SemanticForms SF_AutoeditAPI.
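For reference, a search like that can be recreated with a recursive grep for the method name. A sketch over a stub extension tree (the directory /tmp/extdemo and its one stub file are made up; the real search ran over the deployed MediaWiki extensions):

```shell
# Hypothetical recreation of the caller search over a stub tree.
mkdir -p /tmp/extdemo/TextExtracts
cat > /tmp/extdemo/TextExtracts/Extract.php <<'EOF'
<?php
$doc = new DOMDocument();
$doc->loadHTML( $html );
EOF
# -r: recurse, -l: list matching files only, -n would add line numbers.
grep -rl "loadHTML" /tmp/extdemo
```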
Test case:
```php
<?php
require_once '/srv/mediawiki/multiversion/MWVersion.php';
require_once getMediaWiki( 'maintenance/commandLine.inc', 'enwiki' );

$title = Title::newMainPage();
$page = WikiPage::factory( $title );
$content = $page->getContent();
$parserOutput = $content->getParserOutput( $title );
$doc = new \Elastica\Document;
$textBuilder = new \CirrusSearch\BuildDocument\PageTextBuilder( $doc, $content, $parserOutput );
for ( $i = 0; $i < 100; $i++ ) {
    $textBuilder->build();
}
$usage = getrusage();
printf( "%.2f KiB\n", $usage['ru_maxrss'] / 1024 );
```
Invoke with --count N on the app servers.
ru_maxrss is measured in KiB, so after dividing by 1024 you actually get MiB, not KiB.
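The unit mix-up is just arithmetic: on Linux, getrusage() reports ru_maxrss in KiB, so one division by 1024 lands you in MiB. A quick check with an example value (524288 KiB is hypothetical, chosen because it is exactly 512 MiB):

```shell
# ru_maxrss comes back in KiB on Linux; /1024 converts KiB -> MiB.
ru_maxrss_kib=524288
echo "$(( ru_maxrss_kib / 1024 )) MiB"   # prints "512 MiB"
```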
I tried running the test case under valgrind -- it reports the memory in question as "still reachable" at exit.
Here's some stuff I wrote up before seeing @ori's most recent test case:
Can you live with that cyclic reference for now? It sounds like the global leak is a much bigger problem, so I've been focusing on trying to find that. I noticed that none of the repro cases we were using seemed to reference xmlNewText(), which is the main culprit in your heap dumps. The constructor for c_DOMText looks like the main caller for that, which interestingly doesn't get called if you create a DOMText object by traversing the dom from a DOMDocument root. It does get called if the object is created manually though, and running this program with --count=100000 shows a pretty clear and enormous leak:
```php
<?php
for ( $i = 0; $i < 100000; ++$i ) {
    new DOMText( 'whoops' );
}
var_dump( getrusage()['ru_maxrss'] / 1024 );
```
I can't tell if that happens in the repro from @ori, but it could be the source of your leak. I'm trying out some stuff on mw1114 to confirm.
I can't find any evidence of that path getting hit; it looks like all the leaked memory is still coming from the call to htmlParseDocument in dom_load_html (I've been looking at call graphs from pprof's --pdf option).
Here's one of those graphs, from running the repro case for a while:
90% of the heap is from DOMDocument::loadHTML.

Minimal test case:
```php
<?php
for ( $i = 0; $i < 100; $i++ ) {
    $doc = new DOMDocument();
    $doc->loadHTML( '<html><body></body></html>' );
    $el = $doc->getElementsByTagName( 'body' )->item( 0 );
    $el->parentNode->removeChild( $el );
}
var_dump( getrusage()['ru_maxrss'] / 1024 );
```
Hackish fix for confirmation purposes:
```
diff --git a/hphp/runtime/ext/ext_domdocument.cpp b/hphp/runtime/ext/ext_domdocument.cpp
index 0f4f8f1..402c113 100644
--- a/hphp/runtime/ext/ext_domdocument.cpp
+++ b/hphp/runtime/ext/ext_domdocument.cpp
@@ -2299,6 +2299,7 @@ Variant c_DOMNode::t_removechild(const Object& node) {
   while (children) {
     if (children == child) {
       xmlUnlinkNode(child);
+      appendOrphan(*doc()->m_orphans, child);
       return create_node_object(child, doc(), true);
     }
     children = children->next;
```
The test ran for the whole day and any leak, if present, is now minor. I'd say the fix is working.
Need better workaround since the patch is segfaulting. FB may potentially fix the upstream bug, but it will take weeks or months.
I rebased Tim's fix onto current master and the small repro is segfaulting now. I took a deeper look at the code and think I might have a more robust fix. I'll update as soon as I know more.
Rebased Brett's patch to HHVM-3.3.0 and rebuilt our packages with it applied.
Brett's patch: https://gist.github.com/swtaarrs/a9d0808e63c8d2100051
Rebased patch: https://gist.github.com/atdt/57ad7d6e215ad191b9db
New package version: 3.3.0+dfsg1-1+wm2
After testing package on mw1017 and beta, I upgraded all HHVM app servers.
Crashes are still happening with the new patch, which I kinda expected. We poked at a core file for a while and it looks like it's crashing on a double free of an <a> element with these parents: <body><div id="mainpage"><table id="mp-tfa"><tr><td><p><a href="/wiki/প্রামাণ্যচিত্র">প্রামাণ্যচিত্র</a>. The document also contains headers for "featured article" and "did you know", strongly suggesting it has something to do with this code: https://github.com/wikimedia/mediawiki-extensions-MobileFrontend/blob/master/includes/MobileFormatter.php.
Repro case:
```php
<?php
$n = 32;
while ( $n-- ) {
    $doc = new DOMDocument();
    $doc->loadHTML( '<html><body><div id="test"><img /></div></body></html>' );
    $element = $doc->getElementsByTagName( 'img' )->item( 0 );
    $element->parentNode->replaceChild( $doc->createElement( 'img' ), $element );
    $content = $doc->createElement( 'div' );
    $content->appendChild( $doc->getElementById( 'test' ) );
}
```
It doesn't crash for $n < 32, for whatever reason.
When I run the latest repro in an asan build, it reliably yells about a double free regardless of $n. This diff applied on top of the previous patch should fix the crash, and also another leak: https://gist.github.com/swtaarrs/2577519dbcf6e6e062a2
Combined patch for 3.3.0:
```
--- a/hphp/runtime/ext/ext_domdocument.cpp
+++ b/hphp/runtime/ext/ext_domdocument.cpp
@@ -2296,6 +2296,7 @@
   while (children) {
     if (children == child) {
       xmlUnlinkNode(child);
+      appendOrphan(*doc()->m_orphans, child);
       return create_node_object(child, doc(), true);
     }
     children = children->next;
@@ -2357,7 +2358,11 @@
     xmlReplaceNode(oldchild, newchild);
     dom_reconcile_ns(nodep->doc, newchild);
   }
-  return create_node_object(oldchild, doc(), false);
+  if (domnewchildnode->doc().get()) {
+    removeOrphanIfNeeded(*domnewchildnode->doc()->m_orphans,
+                         domnewchildnode->m_node);
+  }
+  return create_node_object(oldchild, doc(),
+                            oldchild->parent == nullptr);
 }
 php_dom_throw_error(NOT_FOUND_ERR, doc()->m_stricterror);
```