
HHVM 3.6 leaks memory
Closed, Duplicate · Public

Description

HHVM 3.6 leaks memory: after about 3–4 days of running, it occupies most of the available memory on the smaller appservers. This doesn't seem to affect the API cluster, though.

Options we have are:

  • We dedicate resources and effort to solving this
  • We roll back to 3.3.1

Event Timeline

Joe created this task.May 18 2015, 2:21 PM
Joe raised the priority of this task from to Unbreak Now!.
Joe updated the task description.
Joe added projects: acl*sre-team, HHVM.
Joe added a subscriber: Joe.
Restricted Application added a subscriber: Aklapper.May 18 2015, 2:21 PM
Joe set Security to None.May 18 2015, 2:22 PM
Joe added subscribers: ori, tstarling.
ori claimed this task.May 18 2015, 5:48 PM
ori added a comment.May 18 2015, 6:22 PM

Investigation log. Set up mw1107 to write a heap dump every ten minutes:

#!/bin/bash
# Request a heap profile from HHVM once every ten minutes.
# Profiles are written to /tmp/heaps.

# Create a directory for heap dumps
sudo -u www-data mkdir -m0755 -p /tmp/heaps

# Enable jemalloc heap profiling (jemalloc parses its options from the
# target string of the /etc/malloc.conf symlink)
sudo ln -sf "prof:true" /etc/malloc.conf

# Restart HHVM
service hhvm restart

# Allow HHVM to warm up
sleep 60

# Activate jemalloc heap profiling
{ curl -s "localhost:9002/jemalloc-prof-activate" | grep -q OK ; } && {
    curl -s "localhost:9002/jemalloc-prof-status" | grep -q ACTIVE ; } || {
        echo >&2 "Could not activate jemalloc profiling."
        exit 1
}

# Get a heap dump every 10 minutes
for (( i=1; ; i++ )); do
    curl "localhost:9002/jemalloc-prof-dump?file=/tmp/heaps/${i}.heap"
    sleep 600
done

The next thing I'll do is diff the heaps, using the same technique I described in https://phabricator.wikimedia.org/T820#15479

ori added a comment.May 18 2015, 8:09 PM

OK, I have about 90 minutes' worth of heaps, taken at 10-minute intervals. I'll run this script to diff them:

#!/bin/bash
for (( i=1; ; i++ )); do
    base="${i}.heap" prof="$((i+1)).heap"
    [ -e "$prof" ] || break
    echo "-------------------- $base => $prof --------------------"
    google-pprof "$(which hhvm)" --show_bytes --text --base="$base" "$prof" 2>/dev/null | head -6
done

Here's the output:

-------------------- 1.heap => 2.heap --------------------
Total: 150695798 B
134217728  89.1%  89.1% 134217728  89.1% HPHP::AsyncFuncImpl::start
30838224  20.5% 109.5% 30838224  20.5% HPHP::StringData::MakeUncounted
 5298396   3.5% 113.0%  5298396   3.5% HPHP::jit::SrcRec::emitFallbackJump
 2097376   1.4% 114.4%  2097376   1.4% mallocx
 2097376   1.4% 115.8%  4194656   2.8% pcre_study
-------------------- 2.heap => 3.heap --------------------
Total: 14722178 B
35113019 238.5% 238.5% 35113019 238.5% HPHP::StringData::MakeUncounted
 4238746  28.8% 267.3%  4238746  28.8% HPHP::jit::SrcRec::emitFallbackJump
 4194560  28.5% 295.8%  4194560  28.5% _pcre_jit_compile
 1049408   7.1% 302.9%  1049408   7.1% HPHP::MixedArray::NonSmartCopy
 1048768   7.1% 310.0%  1573136  10.7% HPHP::FuncEmitter::create
-------------------- 3.heap => 4.heap --------------------
Total: 55438578 B
27432546  49.5%  49.5% 27432546  49.5% HPHP::StringData::MakeUncounted
10681396  19.3%  68.7% 10681396  19.3% HPHP::MemoryManager::newSlab
 6293672  11.4%  80.1%  6293672  11.4% pcre_compile2
 4200123   7.6%  87.7%  4200123   7.6% 00007f2443195028
 4194432   7.6%  95.2% 11542108  20.8% HPHP::PCRECache::insert
-------------------- 4.heap => 5.heap --------------------
Total: 70328527 B
29433586  41.9%  41.9% 29433586  41.9% HPHP::StringData::MakeUncounted
27771631  39.5%  81.3% 27771631  39.5% HPHP::MemoryManager::newSlab
 4194303   6.0%  87.3%  4194303   6.0% HPHP::jit::SrcDB::insert
 2621520   3.7%  91.0%  7336199  10.4% HPHP::PCRECache::insert
 2618073   3.7%  94.8%  2618073   3.7% 00007f2443195028
-------------------- 5.heap => 6.heap --------------------
Total: 49557841 B
32044189  64.7%  64.7% 32044189  64.7% HPHP::MemoryManager::newSlab
27429131  55.3% 120.0% 27429131  55.3% HPHP::StringData::MakeUncounted
 1065045   2.1% 122.2%  1065045   2.1% libmemcached_string_hash
 1062978   2.1% 124.3%  1062978   2.1% my_malloc
  534594   1.1% 125.4%   534594   1.1% __server_create_with
-------------------- 6.heap => 7.heap --------------------
Total: 347493495 B
335544320  96.6%  96.6% 335544320  96.6% HPHP::AsyncFuncImpl::start
24503318   7.1% 103.6% 24503318   7.1% HPHP::StringData::MakeUncounted
 4719583   1.4% 105.0%  4719583   1.4% pcre_compile2
 2664949   0.8% 105.7%  2664949   0.8% HPHP::ExecutionContext::requestInit
 2621720   0.8% 106.5%  2621720   0.8% pcre_study
-------------------- 7.heap => 8.heap --------------------
Total: -23987682 B
21333350 -88.9% -88.9% 21333350 -88.9% HPHP::StringData::MakeUncounted
 3146064 -13.1% -102.0%  5767664 -24.0% pcre_study
 2621600 -10.9% -113.0%  2621600 -10.9% _pcre_jit_compile
 2059432  -8.6% -121.6%  2059432  -8.6% HPHP::jit::ProfData::addTransNonProf
 1048608  -4.4% -125.9% -8393971  35.0% HPHP::PCRECache::insert
-------------------- 8.heap => 9.heap --------------------
Total: 120492045 B
91860010  76.2%  76.2% 91860010  76.2% HPHP::MemoryManager::newSlab
15475786  12.8%  89.1% 15475786  12.8% HPHP::StringData::MakeUncounted
 3675130   3.1%  92.1%  3675130   3.1% pcre_compile2
 3670856   3.0%  95.2%  3670856   3.0% 00007f2443195028
 2115671   1.8%  96.9%  2115671   1.8% HPHP::jit::SrcRec::emitFallbackJump

HPHP::StringData::MakeUncounted looks suspicious, so the next thing I'll do is use gdb to set a breakpoint on this method and capture some backtraces.
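As a complement to the pairwise diffs, the same --base technique can compare the newest dump against the first to show cumulative growth over the whole run. A hypothetical helper, assuming the dump naming scheme from the collection script (the function name and directory argument are made up here; the pprof invocation itself is left commented since it needs the hhvm binary):

```shell
#!/bin/bash
# Hypothetical helper: diff the first heap dump in a directory against the
# newest one, with the same google-pprof flags used for the pairwise diffs.
diff_first_last() {
    local dir="$1" first last
    first="1.heap"
    # Numeric sort so 10.heap comes after 2.heap; newest dump is last.
    last="$(cd "$dir" && ls -1 [0-9]*.heap | sort -n | tail -1)"
    echo "-------------------- $first => $last --------------------"
    # google-pprof "$(which hhvm)" --show_bytes --text \
    #     --base="$dir/$first" "$dir/$last" | head -6
}
# Usage: diff_first_last /tmp/heaps
```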

ori added a comment.May 18 2015, 8:23 PM

I ran gdb via sudo gdb -p "$(pidof -s hhvm)" and set a breakpoint via break HPHP::StringData::MakeUncounted. The stack traces look like this:

#0  HPHP::StringData::MakeUncounted
#1  HPHP::APCHandle::Create
#2  HPHP::ConcurrentTableSharedStore::storeImpl
#3  HPHP::ConcurrentTableSharedStore::set
#4  HPHP::f_apc_store

So, this looks APC-related. Next I'll try to figure out if this is actually a leak or just us misusing APC somehow.
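The gdb workflow above can also be scripted so the backtraces are captured non-interactively. A minimal sketch, assuming gdb is installed and HHVM was built with symbols; the command-file path /tmp/mkuncounted.gdb is a hypothetical name, and the attach step is left commented since it needs a live hhvm process:

```shell
#!/bin/bash
# Sketch: drive gdb in batch mode to print a short backtrace each time
# HPHP::StringData::MakeUncounted is hit, then let the process keep running.
cat > /tmp/mkuncounted.gdb <<'EOF'
break HPHP::StringData::MakeUncounted
commands
  bt 5
  continue
end
continue
EOF
# Attach to the live process (interrupt with ^C once enough traces are logged):
# sudo gdb -p "$(pidof -s hhvm)" -batch -x /tmp/mkuncounted.gdb
```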

In T99525#1294789, @ori wrote:

Next I'll try to figure out if this is actually a leak or just us misusing APC somehow.

@ori: Did you have any luck with that, after merging rOPUPacacf97e2df962fef83487a461f3559fa07e4d6f ?

(No updates on this "Unbreak now!" priority task for three weeks, hence asking.)

ori lowered the priority of this task from Unbreak Now! to Normal.Jun 15 2015, 8:12 AM
In T99525#1294789, @ori wrote:

Next I'll try to figure out if this is actually a leak or just us misusing APC somehow.

@ori: Did you have any luck with that, after merging rOPUPacacf97e2df962fef83487a461f3559fa07e4d6f ?
(No updates on this "Unbreak now!" priority task for three weeks, hence asking.)

It hasn't gone away completely, but it occurs infrequently enough now that the priority can be downgraded. Thanks for the poke.

faidon added a subscriber: faidon.Jun 15 2015, 3:18 PM

Didn't this cause a flood of alerts just this past Thursday, requiring manual action to fix? Or am I confusing two different issues?

hashar moved this task from Backlog to Defect on the HHVM board.Jun 24 2015, 8:23 AM