
HHVM is leaking memory on the API appservers
Closed, Declined (Public)

Description

We are currently seeing a continuing and fairly severe memory leak, on the API appserver cluster only. I am unsure when this leak started, but it certainly got more serious after we removed a few servers from the cluster on April 1st:

If an HHVM server doesn't crash, its memory usage grows by approximately 2 GB/day. Moreover, as memory usage grows, the older appservers crash and need a soft reboot, while the newer ones spike in CPU usage.
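As an illustration of how a per-day growth figure like this can be derived, here is a minimal sketch that fits a straight line through periodic memory samples. It assumes the samples are (timestamp in seconds, resident memory in bytes) pairs collected by some external means; the function name and the sample data are hypothetical and not taken from the cluster's monitoring.

```python
def growth_gb_per_day(samples):
    """Least-squares slope of memory over time, converted to GB/day.

    samples: list of (unix_timestamp_seconds, rss_bytes) pairs.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must span more than one timestamp")
    cov = sum((t - mean_t) * (m - mean_m) for t, m in samples)
    bytes_per_second = cov / var_t
    # 86400 seconds per day, 1e9 bytes per GB (decimal).
    return bytes_per_second * 86400 / 1e9


# Hypothetical hourly samples: 4 GB baseline growing by 2 GB per day.
example = [(h * 3600, 4e9 + 2e9 * h / 24) for h in range(48)]
rate = growth_gb_per_day(example)
print(f"estimated growth: {rate:.2f} GB/day")
```

A linear fit is used here rather than a simple first/last difference so that a single noisy sample does not dominate the estimate; it only makes sense over a window in which the process was not restarted.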

IIRC, memory profiling is problematic on HHVM 3.12, but I am going to try to gather some data anyway today.

Event Timeline

Joe triaged this task as High priority. Apr 26 2016, 1:36 PM

The admin module of HHVM has quite a few diagnostics on the status of memory usage in HHVM. Here is what I found on one of the affected machines, where HHVM was occupying 10 GB of memory while allocating 27 GB of virtual memory:

  • APC is using a significant yet small amount of available memory
  • the PCRE cache is less than 100 MB
  • the TC cache has 870 MB dedicated to it and it's heavily underused
  • Static Strings take up around 100 MB

So there is no real outlier among these that would justify this memory consumption.
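To make the gap concrete, the figures above can be tallied roughly as follows. This is only a sketch: the values are the upper bounds quoted in the list, APC's size is not given and is therefore left out, 1 GB is assumed to be 1024 MB, and the variable names are illustrative.

```python
RSS_MB = 10 * 1024  # roughly 10 GB of occupied memory

# Upper bounds for the components that were actually measured.
known_mb = {
    "pcre_cache": 100,      # reported as less than 100 MB
    "tc_cache": 870,        # dedicated size; actual use is lower
    "static_strings": 100,  # reported as around 100 MB
}

accounted_mb = sum(known_mb.values())
unexplained_mb = RSS_MB - accounted_mb
unexplained_fraction = unexplained_mb / RSS_MB

print(f"accounted for: {accounted_mb} MB")
print(f"APC plus unexplained: {unexplained_mb} MB ({unexplained_fraction:.0%})")
```

Even when every measured component is counted at its upper bound, about 9 GB of the occupied memory is left to APC plus whatever is leaking.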

I can confirm that with heap profiling activated, HHVM crashes too often for the collected heap diffs to be meaningful.

  • APC is using a significant yet small amount of available memory

Do we need to reduce the amount of things we're putting into APC? Or increase how much APC space there is..?

Is this the same as the recent behavior seen on various API appservers? https://ganglia.wikimedia.org/latest/?c=API%20application%20servers%20eqiad&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2

Some of the older appservers (with 32 threads) have an average CPU usage of >80%, with the sharp increase starting around September 7. Memory leaks are also present here.

Mentioned in SAL (#wikimedia-operations) [2016-09-17T16:40:48Z] <_joe_> rolling restart of HHVM on part of the API cluster in eqiad, T133674

Krinkle added a subscriber: Krinkle.

Declining per T192166 / T176370.