
HHVM memory leaks result in OOMs & 500 spikes
Closed, ResolvedPublic

Description

It looks like HHVM has a small memory leak, which results in appservers OOMing approximately every two weeks.

The OOMs also tend to happen in clusters, probably because appservers usually get (re)started at around the same time.

In turn, this causes spikes of 500s, as happened today and as can be seen from the icinga-wm log:

18:04 < icinga-wm> RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0]
19:32 < icinga-wm> PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.69% of data above the critical threshold [500.0]
19:45 < icinga-wm> RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0]
20:48 < icinga-wm> PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.14% of data above the critical threshold [500.0]
21:27 < icinga-wm> PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.14% of data above the critical threshold [500.0]
21:53 < icinga-wm> PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 6.67% of data above the critical threshold [500.0]
22:10 < icinga-wm> PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.14% of data above the critical threshold [500.0]
23:07 < icinga-wm> PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.69% of data above the critical threshold [500.0]
00:02 < icinga-wm> PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.69% of data above the critical threshold [500.0]
00:30 < icinga-wm> PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.69% of data above the critical threshold [500.0]
01:13 < icinga-wm> RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0]
02:30 < icinga-wm> PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 30.77% of data above the critical threshold [500.0]
02:56 < icinga-wm> RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0]
03:29 < icinga-wm> PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.69% of data above the critical threshold [500.0]
03:39 < icinga-wm> RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0]
05:33 < icinga-wm> PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 14.29% of data above the critical threshold [500.0]
05:48 < icinga-wm> RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0]
13:58 < icinga-wm> PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.14% of data above the critical threshold [500.0]
14:11 < icinga-wm> RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0]
15:31 < icinga-wm> PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 8.33% of data above the critical threshold [500.0]
15:40 < icinga-wm> RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0]
16:01 < icinga-wm> PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.14% of data above the critical threshold [500.0]
16:12 < icinga-wm> RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0]
17:43 < icinga-wm> PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.14% of data above the critical threshold [500.0]

@Joe currently thinks that this is not the same bug as T103886. From my (limited) understanding of that bug, it doesn't seem the same either: memory usage is increasing slowly but steadily, and it doesn't look like it is correlated with deploy times.

This may be the same as T99525, which @ori had been debugging but which was downgraded from UBN and eventually closed (possibly erroneously) as a duplicate of T103886.

Since this causes mini-outages every so often, I'm prioritizing it as UBN.

Event Timeline

faidon created this task. Jul 4 2015, 3:04 PM
faidon raised the priority of this task to Unbreak Now!.
faidon updated the task description.
faidon added projects: acl*sre-team, HHVM.
faidon added subscribers: faidon, Joe, ori.
Restricted Application added subscribers: Matanya, Aklapper. Jul 4 2015, 3:04 PM
Joe added a comment. Jul 4 2015, 3:47 PM

For reference, the problem doesn't seem to present itself on the API appservers: on mw1117, HHVM has been running since June 25 but is only using 20% of the memory.

Joe added a comment. Jul 5 2015, 10:17 AM

I am collecting data on mw1059 (and mw1115 for comparison) to see how the memory profile of both servers changes over time. With about one day of data we should have a very clear picture of what is going on.

Joe added a comment. Jul 6 2015, 7:42 AM

Re-doing the same analysis we did for T99525, I found that little to nothing has changed:

  • On a standard appserver I see:
-------------------- /tmp/heaps/1.heap => /tmp/heaps/24.heap --------------------
Total: 272693332 B
270442119  99.2%  99.2% 270442119  99.2% HPHP::StringData::MakeUncounted
 4723130   1.7% 100.9%  4723130   1.7% pcre_compile2
 3146064   1.2% 102.1%  6291984   2.3% pcre_study
 3145920   1.2% 103.2%  3145920   1.2% _pcre_jit_compile
 2621640   1.0% 104.2%  2621640   1.0% HPHP::jit::SrcDB::insert

while on an API appserver I see

-------------------- /tmp/heaps/1.heap => /tmp/heaps/24.heap --------------------
Total: -116719395 B
1342177280 -1149.9% -1149.9% -134217728 115.0% HPHP::AsyncFuncImpl::start
80464115 -68.9% -1218.9% -5519796   4.7% HPHP::StringData::MakeUncounted
40589307 -34.8% -1253.6% 19751314 -16.9% HPHP::MemoryManager::newSlab
29890192 -25.6% -1279.2% 29890192 -25.6% mallocx
25690128 -22.0% -1301.2%        0   0.0% HPHP::jit::CodeGenFixups::process_only

So it appears that HPHP::StringData::MakeUncounted is what holds most of the unfreed memory, and this only happens on the appservers. As stated by @ori on the preceding ticket, this method seems to be tied to the use of APC, and for some reason the offending usage seems to be absent on the API servers.

I'll dig deeper from here.

Joe claimed this task. Jul 6 2015, 7:47 AM
Joe set Security to None.
Joe added a comment. Jul 6 2015, 7:54 AM

This surely looks like something not working well within HHVM. In the code for StringData::MakeUncounted I see the following comment:

// create either a static or an uncounted string.                                                                                                                                                                 
// Diffrence between static and uncounted is in the lifetime                                                                                                                                                      
// of the string. Static are alive for the lifetime of the process.                                                                                                                                               
// Uncounted are not ref counted but will be deleted at some point.

and the main difference seems to be the use of either low_malloc (for Static strings) or malloc (for Uncounted strings); see https://github.com/wikimedia/operations-debs-hhvm/blob/master/hphp/runtime/base/string-data.cpp#L111.

So what seems to happen is that, for some reason we don't really understand, those strings don't get cleaned up and they keep occupying HHVM's memory indefinitely.
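
As a toy PHP illustration of that lifetime (not from this investigation; the key name and size are invented), data stored in APC lives in HHVM's shared memory outside any single request, and only expiry or explicit deletion frees it:

<?php
// Toy illustration only: a value stored in APC is kept by HHVM outside of
// request memory (as "uncounted" data) until the key expires or is deleted.
// With no TTL, it never expires on its own.
apc_store('example:big-string', str_repeat('x', 1024 * 1024));   // ~1 MB, no TTL

$value = apc_fetch('example:big-string');   // reading it in a request...
unset($value);                              // ...and dropping the local variable
                                            // does not free the shared copy.

apc_delete('example:big-string');           // only expiry or deletion frees it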

Joe added a comment. Jul 6 2015, 8:43 AM

I just noticed we don't set hhvm.server.apc.expire_on_sets, which defaults to false... this is most probably the culprit here.

lcawte added a subscriber: lcawte. Jul 6 2015, 8:48 AM

Change 223000 had a related patch set uploaded (by Giuseppe Lavagetto):
hhvm: enable apc items expiration on the canary appservers

https://gerrit.wikimedia.org/r/223000

Joe added a comment. Jul 6 2015, 3:24 PM

I tried an APC dump on mw1060 (a server that shows significant memory use) and it created a 5.3 GB file, while on the API appservers the dump is less than 200 MB even after running for a long time. So APC purging is surely our most pressing problem. Unluckily, it seems that activating the setting (as I did manually on mw1059) doesn't really affect our APC storage, but I guess it's too early to tell.

Anyway, we should go ahead with the proposed change.

Change 223000 abandoned by Giuseppe Lavagetto:
hhvm: enable apc items expiration on the canary appservers

Reason:
I was sure this wasn't already applied, since the indefinite APC growth we're seeing is so clear.

https://gerrit.wikimedia.org/r/223000

Joe added a comment. Jul 7 2015, 7:09 AM

So, the change was in fact already applied previously. However, some testing showed me that when a key has expired, it appears in the APC dump as follows:

<key-name> ####

with no value. I've searched for those in the APC dump on mw1059:

mw1059:~# cat  /tmp/apc_dump | perl -ne 'print $_ if /\#+\s+$/;' | wc -l
119
mw1059:~# wc -l /tmp/apc_dump
577374 /tmp/apc_dump

so eviction of expired keys surely works. What obviously doesn't work is that HHVM has no LRU mechanism for APC entries without a TTL, so we're actively starving its memory resources. I also suspect that this makes APC much less efficient once enough keys are present there.

Looking at the common keys in the dump I see a recurrence of keys like

global:resourceloader:filter:minify-js:7:d60819601c91f9b520220525da1e68d2 
global:resourceloader:filter:minify-css:7:fe8a9329f861cea0418795201bdbbee8

which clearly either have no TTL (and thus can never be evicted) or have a very long TTL.

Joe added a subscriber: MaxSem. Edited Jul 7 2015, 7:18 AM

As @MaxSem noticed, the key is set here:

https://github.com/wikimedia/mediawiki/blob/9771b003756bfe3825bf7427efca6393ed96597b/includes/resourceloader/ResourceLoader.php#L227

without a TTL. Also, we explicitly use APC on HHVM.

IMHO, a band-aid solution would be to set a reasonable TTL on these keys, and a long-term one would be to add an APC cache size setting to HHVM and then use LRU eviction there. I have no idea what "reasonable" would mean in this context.
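
For illustration, a minimal PHP sketch of the band-aid (hypothetical: the real change would go through MediaWiki's cache layer in ResourceLoader.php rather than calling apc_store() directly, and the key layout and TTL value here are only examples):

<?php
// Illustrative sketch only, not the actual patch.
$code     = 'function f(){return 1;}';   // placeholder for the code being minified
$minified = 'function f(){return 1;}';   // placeholder for the minifier output
$key      = 'global:resourceloader:filter:minify-js:7:' . md5($code);

// Current behaviour: no TTL, so the entry can never expire and is never evicted.
// apc_store($key, $minified);

// Band-aid: store with a finite TTL so the purge mechanism can reclaim it.
$ttl = 24 * 3600;                        // e.g. one day; what is "reasonable" is still open
apc_store($key, $minified, $ttl);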

Joe added a comment. Jul 7 2015, 3:53 PM

I also created https://github.com/facebook/hhvm/issues/5601 to track what is, in my opinion, a defect in HHVM anyway.

Change 223343 had a related patch set uploaded (by Krinkle):
resourceloader: Set a TTL for minification cache entries

https://gerrit.wikimedia.org/r/223343

Change 223343 merged by jenkins-bot:
resourceloader: Set a TTL for minification cache entries

https://gerrit.wikimedia.org/r/223343

Change 223373 had a related patch set uploaded (by Krinkle):
resourceloader: Set a TTL for minification cache entries

https://gerrit.wikimedia.org/r/223373

Change 223375 had a related patch set uploaded (by Krinkle):
resourceloader: Set a TTL for minification cache entries

https://gerrit.wikimedia.org/r/223375

Change 223373 merged by jenkins-bot:
resourceloader: Set a TTL for minification cache entries

https://gerrit.wikimedia.org/r/223373

Change 223375 merged by jenkins-bot:
resourceloader: Set a TTL for minification cache entries

https://gerrit.wikimedia.org/r/223375

Joe added a subscriber: Krinkle. Jul 10 2015, 6:19 AM

So, @Krinkle's patchset did have an effect in slowing down the growth of memory usage, but it didn't stop it.

The reason is that apparently HHVM is not purging the expired items fast enough. At the moment, on a random appserver, the number of expired but unpurged keys amounts to half of the total keys:

oblivian@mw1026:~$ hhvmadm /dump-apc
Done
oblivian@mw1026:~$ cat  /tmp/apc_dump | perl -ne 'print "$_" if /(.*?)\#{4}\s*$/;' | wc -l
43155
oblivian@mw1026:~$ cat  /tmp/apc_dump | perl -ne 'print "$_" if /(.*?)\#{4}/;' | wc -l
84963

I'll experiment with this a little and give the code a closer read, but this looks like yet another HHVM bug (or some config lever we didn't fully understand and that must be tweaked).

Joe added a comment. Jul 10 2015, 10:50 AM

I re-did the test just now and I'm seeing that the ratio got worse, moving from ~51% to 54% over the last few hours, so it's clear we don't purge (enough? any?) expired keys.
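
For reference, the same expired/total ratio can be computed with a small script along these lines (a hypothetical helper, not part of this task; it assumes the "<key-name> ####" marker format shown earlier and a dump at /tmp/apc_dump produced by hhvmadm /dump-apc):

<?php
// Hypothetical helper: count expired vs. total entries in an APC dump,
// treating lines ending in a run of '#' characters (and no value) as expired.
function apcDumpStats(string $path): array {
    $total = 0;
    $expired = 0;
    $fh = fopen($path, 'r');
    while (($line = fgets($fh)) !== false) {
        $total++;
        if (preg_match('/####\s*$/', $line)) {   // expired entry: key plus '####', no value
            $expired++;
        }
    }
    fclose($fh);
    return [
        'total'   => $total,
        'expired' => $expired,
        'ratio'   => $total > 0 ? $expired / $total : 0.0,
    ];
}

print_r(apcDumpStats('/tmp/apc_dump'));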

Joe added a comment. Jul 10 2015, 1:26 PM

I tried to reproduce what I'm seeing in production in a testing environment, and I can't seem to trigger that behaviour:

Keys that expire get purged at every multiple of hhvm.server.apc.purge_frequency calls to apc_store(). I even tried setting a very large number of keys to see if this had any effect, and it didn't.
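
A reproduction along these lines would look roughly like the sketch below (key names, sizes and counts are made up; the purge-frequency value is only an example and should match whatever hhvm.server.apc.purge_frequency is set to in the test config):

<?php
// Rough sketch of a reproduction attempt, relying on the behaviour described
// above: expired entries get purged every $purgeFrequency calls to apc_store().
$purgeFrequency = 4096;                       // example value; match the test config

// 1. Store a batch of keys with a very short TTL.
for ($i = 0; $i < 10000; $i++) {
    apc_store("test:short:$i", str_repeat('x', 1024), 1);   // 1-second TTL
}
sleep(2);                                     // let them all expire

// 2. Keep writing until we have crossed at least one purge boundary.
for ($i = 0; $i <= $purgeFrequency; $i++) {
    apc_store("test:trigger:$i", 'y', 60);
}

// 3. Dump APC again (hhvmadm /dump-apc) and count the "<key> ####" markers as above.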

I am a bit in the dark at this point: we're clearly seeing some misbehaviour, but I can't seem to track it down to a specific cause.

Change 224088 had a related patch set uploaded (by Giuseppe Lavagetto):
hhvm: set apc expiration correctly

https://gerrit.wikimedia.org/r/224088

Change 224088 merged by Giuseppe Lavagetto:
hhvm: set apc expiration correctly

https://gerrit.wikimedia.org/r/224088

Joe added a comment. Jul 10 2015, 4:26 PM

Turns out I couldn't reproduce it because, well, we had set the wrong HHVM ini key :P

Joe added a comment. Jul 11 2015, 5:22 PM

Today, for the first time, I've seen expired keys actually getting expunged. It seems promising, but I guess we'll have a definitive answer tomorrow; I expect memory usage to stabilize around a steady-state value.

Joe added a comment. Jul 12 2015, 4:57 AM

As can be seen from the graph, it seems my prediction was confirmed, so the bug is clearly resolved.

I would be tempted to set a global upper limit on the TTL of APC keys, thus not resolving the ticket yet, but I'll downgrade it to normal priority for now.

Joe lowered the priority of this task from Unbreak Now! to Medium. Jul 12 2015, 4:57 AM

Change 225858 had a related patch set uploaded (by Giuseppe Lavagetto):
hhvm: expire APC keys after 7 days

https://gerrit.wikimedia.org/r/225858

Change 225858 merged by Giuseppe Lavagetto:
hhvm: expire APC keys after 2 days

https://gerrit.wikimedia.org/r/225858

Joe closed this task as Resolved. Jul 23 2015, 7:14 AM