HHVM is leaking memory on the API appservers
Open, High, Public

Description

We are currently seeing a persistent and fairly severe memory leak, on the API appserver cluster only. I am unsure when this leak started, but it has certainly become more serious since we removed a few servers from the cluster on April 1st:

If an HHVM server doesn't crash, its memory usage grows by approximately 2 GB/day. Moreover, as memory usage grows, the older appservers crash and need a soft reboot, while the newer ones spike in CPU usage.
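For anyone who wants to check the growth figure on a given host, here is a minimal sketch of how a rate like "2 GB/day" could be estimated from periodic resident-memory samples. The function name and the sample data are hypothetical and purely illustrative; how the samples get collected (Ganglia, /proc, etc.) is left out.

```python
from typing import Sequence, Tuple

SECONDS_PER_DAY = 86_400
BYTES_PER_GIB = 1024 ** 3


def growth_gib_per_day(samples: Sequence[Tuple[float, float]]) -> float:
    """Least-squares slope of resident memory over time, in GiB/day.

    samples: (unix_timestamp_seconds, rss_bytes) pairs for one process.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must span more than one timestamp")
    cov = sum((t - mean_t) * (m - mean_m) for t, m in samples)
    slope_bytes_per_second = cov / var_t
    return slope_bytes_per_second * SECONDS_PER_DAY / BYTES_PER_GIB


# Synthetic data (not measurements): hourly samples for one day,
# starting at 8 GiB and growing linearly by 2 GiB over 24 hours.
hourly = [
    (h * 3600, 8 * BYTES_PER_GIB + h * 2 * BYTES_PER_GIB / 24)
    for h in range(25)
]
print(round(growth_gib_per_day(hourly), 2))
```

A fitted slope is less sensitive to short-lived spikes than comparing only the first and last sample, which matters if a server restarts or sheds memory mid-window.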

IIRC, memory profiling is problematic on HHVM 3.12, but I am going to try to gather some data anyway today.

Joe created this task. Apr 26 2016, 12:18 PM
Restricted Application added a subscriber: Aklapper. Apr 26 2016, 12:18 PM
elukey added a subscriber: elukey. Apr 26 2016, 12:19 PM
Joe triaged this task as High priority. Apr 26 2016, 1:36 PM
Joe added a comment. Apr 26 2016, 1:44 PM

The admin module of HHVM has quite a few diagnostics on the status of memory usage in HHVM. This is what I found on one of the affected machines, where HHVM was occupying 10 GB of resident memory while allocating 27 GB of virtual memory:

  • APC is using a significant yet small amount of available memory
  • the PCRE cache is less than 100 MB
  • the TC (translation cache) has 870 MB dedicated to it and it's heavily underused
  • Static Strings take up around 100 MB

and none of these is an outlier large enough to justify this memory consumption.
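To make the gap explicit, the figures above can be added up as in this rough sketch. The helper name is hypothetical, and the APC figure is left out because it is not quantified in this comment; the PCRE and TC numbers are upper bounds (the TC is underused), so the real unexplained share is probably larger.

```python
from typing import Mapping


def unexplained_mb(resident_mb: float, known_mb: Mapping[str, float]) -> float:
    """Resident memory not accounted for by the listed consumers, in MB."""
    return resident_mb - sum(known_mb.values())


# Figures taken from the list above; APC is omitted (size not stated).
known = {
    "pcre_cache": 100,         # "less than 100 MB", used as an upper bound
    "translation_cache": 870,  # dedicated size, actual use is lower
    "static_strings": 100,     # "around 100 MB"
}
resident = 10 * 1024  # 10 GB resident, expressed in MB

remainder = unexplained_mb(resident, known)
print(f"{remainder:.0f} MB ({remainder / resident:.0%}) not explained, before APC")
```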

Joe added a comment. Edited Apr 26 2016, 2:22 PM

I can confirm that with heap profiling activated, HHVM crashes too often for the collected heap diffs to be meaningful.

  • APC is using a significant yet small amount of available memory

Do we need to reduce the amount of things we're putting into APC? Or increase how much APC space there is?

Danny_B moved this task from Backlog to Defect on the HHVM board. May 29 2016, 11:24 PM

Is this the same as the recent behavior seen on various API appservers? https://ganglia.wikimedia.org/latest/?c=API%20application%20servers%20eqiad&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2

Some of the older appservers (with 32 threads) have an average CPU usage of >80%, with the sharp increase starting around September 7. Memory leaks are also present here.

Mentioned in SAL (#wikimedia-operations) [2016-09-17T16:40:48Z] <_joe_> rolling restart of HHVM on part of the API cluster in eqiad, T133674