
HHVM lock-ups
Closed, Resolved (Public)

Description

Upstream HHVM bug: #1402 deadlock with tbb

HHVM has been locking up in production, typically right after a big deployment which touches lots of files. Typically only one or two app servers are affected. I don't have exact numbers, but I believe this has happened at least a dozen times.

quickstack shows many threads stuck in __lll_lock_wait, many with HPHP::StatCache::refresh in the stack.

Attaching a backtrace from mw1141.

Event Timeline

ori raised the priority of this task to Needs Triage.
ori updated the task description.
ori added a project: HHVM.
ori added subscribers: ori, Joe, tstarling.
Aklapper added a project: Deployments.
Aklapper set Security to None.

With everything waiting on a lock I'd suspect some kind of deadlock, but a lot of those thread backtraces don't make any sense. The one for Thread 4 in particular claims to be in code that only runs when you're building a repo ahead of time. I also don't know what's up with the threads that claim to be calling __lll_lock_wait from vector::emplace_back.

I'll poke around in the StatCache locking code a little bit to see if there's anything obviously fishy going on.

Quoting the description from @ori:

HHVM has been locking up in production, typically right after a big deployment which touches lots of files. Typically only one or two app servers are affected. I don't have exact numbers, but I believe this has happened at least a dozen times.

Is this still happening, @ori @Joe @someone?

Yes, it's still happening.

I spent some time digging into this just from a quickstack trace and looking at the HHVM source on GitHub, but I haven't had a chance to correlate it with a live locked-up HHVM. I haven't figured out anything definite yet; I'm just recording the threads of hints I'm looking at so that I or someone else can maybe dig deeper into some of this later, or find a correlation elsewhere...

Thread 4 (LWP 9458):
#01  0x00007fa43baa63f7 in __GI_sched_yield () from /build/buildd/eglibc-2.19/posix/../sysdeps/unix/syscall-template.S:81
#02  0x00000000009422d5 in tbb::interface5::concurrent_hash_map<std::string, HPHP::AtomicSmartPtr<HPHP::StatCache::Node>, HPHP::stringHashCompare, tbb::tbb_allocator<std::pair<std::string, HPHP::AtomicSmartPtr<HPHP::StatCache::Node> > > >::node::node () from /usr/bin/hhvm
#03  0x00000000010933ab in HPHP::invalidatePath () from /usr/bin/hhvm
#04  0x00000000009440a4 in HPHP::StatCache::Node::touchLocked<true> () from /usr/bin/hhvm
#05  0x000000000094454c in HPHP::StatCache::Node::expirePaths () from /usr/bin/hhvm
#06  0x0000000000948644 in HPHP::StatCache::handleEvent () from /usr/bin/hhvm
#07  0x00000000009494ea in HPHP::StatCache::requestInit () from /usr/bin/hhvm
#08  0x00000000008e437c in HPHP::hphp_session_init () from /usr/bin/hhvm
#09  0x0000000000b458bc in HPHP::HttpRequestHandler::handleRequest () from /usr/bin/hhvm
#10  0x0000000000b607aa in HPHP::ServerWorker<std::shared_ptr<HPHP::FastCGIJob>, HPHP::FastCGITransportTraits>::doJobImpl () from /usr/bin/hhvm
#11  0x0000000000b60d1e in HPHP::ServerWorker<std::shared_ptr<HPHP::FastCGIJob>, HPHP::FastCGITransportTraits>::doJob () from /usr/bin/hhvm
#12  0x0000000000b5f366 in HPHP::JobQueueWorker<std::shared_ptr<HPHP::FastCGIJob>, HPHP::Server*, true, false, HPHP::JobQueueDropVMStack>::start () from /usr/bin/hhvm
#13  0x0000000001b5e017 in HPHP::AsyncFuncImpl::ThreadFunc () from /usr/bin/hhvm
#14  0x00000000008a27d7 in HPHP::start_routine_wrapper () from /usr/bin/hhvm
#15  0x00007fa43c5c8182 in start_thread () from /build/buildd/eglibc-2.19/nptl/pthread_create.c:312

Thread 3 (LWP 9462):
#01  0x00007fa43c5cef2c in __lll_lock_wait () from /build/buildd/eglibc-2.19/nptl/../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:133
#02  0x0000000000c30b68 in HPHP::Treadmill::startRequest () from /usr/bin/hhvm
#03  0x00000000008e437c in HPHP::hphp_session_init () from /usr/bin/hhvm
#04  0x0000000000b458bc in HPHP::HttpRequestHandler::handleRequest () from /usr/bin/hhvm
#05  0x0000000000b607aa in HPHP::ServerWorker<std::shared_ptr<HPHP::FastCGIJob>, HPHP::FastCGITransportTraits>::doJobImpl () from /usr/bin/hhvm
#06  0x0000000000b60d1e in HPHP::ServerWorker<std::shared_ptr<HPHP::FastCGIJob>, HPHP::FastCGITransportTraits>::doJob () from /usr/bin/hhvm
#07  0x0000000000b5f366 in HPHP::JobQueueWorker<std::shared_ptr<HPHP::FastCGIJob>, HPHP::Server*, true, false, HPHP::JobQueueDropVMStack>::start () from /usr/bin/hhvm
#08  0x0000000001b5e017 in HPHP::AsyncFuncImpl::ThreadFunc () from /usr/bin/hhvm
#09  0x00000000008a27d7 in HPHP::start_routine_wrapper () from /usr/bin/hhvm
#10  0x00007fa43c5c8182 in start_thread () from /build/buildd/eglibc-2.19/nptl/pthread_create.c:312

The HPHP::Treadmill::startRequest in thread 3 above ends up taking a GenCountGuard g;. Over in thread 4 (which is one of three threads stuck directly in StatCache code somewhere), the critical bit of info here is that HPHP::StatCache::Node::touchLocked calls back down into HPHP::invalidatePath, which hits the code in Treadmill::enqueue (not obvious in the stack trace, but it's in the code), which through the magic of C++ probably reaches Treadmill::enqueueInternal, which also takes a GenCountGuard g;.

It's possible that somehow a single thread is deadlocking with itself by taking GenCountGuard at some outer scope and then again within a deeper scope, and that the mutex isn't recursive. Maybe. It does seem like a bad-smelling pattern that, so deep within the indirection via StatCache, we end up taking a lock that normally seems to be taken at a much higher scope level. If one thread self-deadlocked on a non-recursive mutex like that, it could in turn hold up all the other threads that eventually wait on the same lock.
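
To make the suspected pattern concrete, here's a minimal standalone sketch of that theory (hypothetical names throughout: ScopedGuard, outerWork and innerWork are made up, not HHVM code) showing a scoped guard over a non-recursive pthread mutex being re-taken by the same thread deeper in the call chain:

#include <pthread.h>

// Hypothetical illustration of the self-deadlock theory above.
static pthread_mutex_t s_lock = PTHREAD_MUTEX_INITIALIZER;  // default type: non-recursive

struct ScopedGuard {              // analogous in spirit to GenCountGuard
  ScopedGuard()  { pthread_mutex_lock(&s_lock); }
  ~ScopedGuard() { pthread_mutex_unlock(&s_lock); }
};

void innerWork() {
  ScopedGuard g;                  // same thread, same non-recursive mutex:
                                  // pthread_mutex_lock() blocks forever here
}

void outerWork() {
  ScopedGuard g;                  // outer scope takes the lock first
  innerWork();                    // deep call chain re-enters the guard
}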

If it's really a two-lock/two-thread deadlock, my guess at this point is that the second lock (along with the GenCountGuard above) might be the m_lock at the top of StatCache::refresh, but I've forgotten where I picked up the related thread of execution that led into that at this point...

Another parting thought: if we have an easy way to make small source changes and run them live, it might be worth switching the initialization of s_genLock (which is at the core of GenCountGuard) to PTHREAD_ERRORCHECK_MUTEX_INITIALIZER_NP at https://github.com/facebook/hhvm/blob/7cab5658d9346a535b82c0204676b175e1d2761e/hphp/runtime/vm/treadmill.cpp#L55 , and also checking the return value of pthread_mutex_lock at line 108 of the same file and dying with some horrible, unique error on EDEADLK there. That way we might gain some evidence as to whether the self-deadlock theory holds any water.
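
Roughly the kind of instrumentation I have in mind, as a sketch rather than a verbatim patch (the helper name and error message are made up; s_genLock is the name from treadmill.cpp, and the error-checking initializer is a glibc extension):

#ifndef _GNU_SOURCE
#define _GNU_SOURCE   // PTHREAD_ERRORCHECK_MUTEX_INITIALIZER_NP is a glibc extension
#endif
#include <cerrno>
#include <cstdio>
#include <cstdlib>
#include <pthread.h>

// Sketch of the proposed diagnostic: an error-checking mutex makes a second
// lock by the same thread return EDEADLK instead of hanging forever.
static pthread_mutex_t s_genLock = PTHREAD_ERRORCHECK_MUTEX_INITIALIZER_NP;

static void lockGenLockOrDie() {   // hypothetical helper, not HHVM code
  int rc = pthread_mutex_lock(&s_genLock);
  if (rc == EDEADLK) {
    // Unique, greppable message so the self-deadlock theory can be
    // confirmed (or ruled out) from the logs of a locked-up host.
    fprintf(stderr, "TREADMILL GENLOCK SELF-DEADLOCK (EDEADLK)\n");
    abort();
  }
}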

Looked at a live one on mw1169 today. I'm back to thinking this is all within StatCache somewhere, not the GenCountGuard stuff.

In this example we've got many threads stuck on pthread mutex 0x2982578. There's one somewhat-special thread stuck on pthread mutex 0x7ff32d702f98 whose traces are a bit different. But ignoring that for now, let's focus on the two mutexes...

Both of these mutexes are owned by LWP 1061:

(gdb) p ((pthread_mutex_t*)0x2982578)->__data.__owner
$2 = 1061
(gdb) p ((pthread_mutex_t*)0x7ff32d702f98)->__data.__owner
$3 = 1061

LWP 1061 is also special/different in the stack traces: it is the only one not stuck explicitly on a mutex lock, and is instead stuck in some kind of loop over sched_yield...

quickstack:

Thread 128 (LWP 1061):
#01  0x00007ff37b0703f7 in __GI_sched_yield () from /build/buildd/eglibc-2.19/posix/../sysdeps/unix/syscall-template.S:81
#02  0x00000000009422d5 in tbb::interface5::concurrent_hash_map<std::string, HPHP::AtomicSmartPtr<HPHP::StatCache::Node>, HPHP::stringHashCompare, tbb::tbb_allocator<std::pair<std::string, HPHP::AtomicSmartPtr<HPHP::StatCache::Node> > > >::node::node () from /usr/bin/hhvm
#03  0x00000000010933ab in HPHP::invalidatePath () from /usr/bin/hhvm
#04  0x00000000009440a4 in HPHP::StatCache::Node::touchLocked<true> () from /usr/bin/hhvm
#05  0x000000000094454c in HPHP::StatCache::Node::expirePaths () from /usr/bin/hhvm
#06  0x0000000000948644 in HPHP::StatCache::handleEvent () from /usr/bin/hhvm
#07  0x00000000009494ea in HPHP::StatCache::requestInit () from /usr/bin/hhvm
#08  0x00000000008e437c in HPHP::hphp_session_init () from /usr/bin/hhvm
#09  0x0000000000b458bc in HPHP::HttpRequestHandler::handleRequest () from /usr/bin/hhvm
#10  0x0000000000b607aa in HPHP::ServerWorker<std::shared_ptr<HPHP::FastCGIJob>, HPHP::FastCGITransportTraits>::doJobImpl () from /usr/bin/hhvm
#11  0x0000000000b60d1e in HPHP::ServerWorker<std::shared_ptr<HPHP::FastCGIJob>, HPHP::FastCGITransportTraits>::doJob () from /usr/bin/hhvm
#12  0x0000000000b5f366 in HPHP::JobQueueWorker<std::shared_ptr<HPHP::FastCGIJob>, HPHP::Server*, true, false, HPHP::JobQueueDropVMStack>::start () from /usr/bin/hhvm
#13  0x0000000001b5e017 in HPHP::AsyncFuncImpl::ThreadFunc () from /usr/bin/hhvm
#14  0x00000000008a27d7 in HPHP::start_routine_wrapper () from /usr/bin/hhvm
#15  0x00007ff37bb92182 in start_thread () from /build/buildd/eglibc-2.19/nptl/pthread_create.c:312

Which is this in gdb's backtrace terms (gdb seems to enumerate threads differently, but both are LWP (PID) 1061):

(gdb) thread 127
[Switching to thread 127 (Thread 0x7ff33c7ff700 (LWP 1061))]
#0  0x00007ff37b0703f7 in sched_yield () at ../sysdeps/unix/syscall-template.S:81
81	../sysdeps/unix/syscall-template.S: No such file or directory.
(gdb) bt
#0  0x00007ff37b0703f7 in sched_yield () at ../sysdeps/unix/syscall-template.S:81
#1  0x00000000009422d5 in tbb::interface5::concurrent_hash_map<std::string, HPHP::AtomicSmartPtr<HPHP::StatCache::Node>, HPHP::stringHashCompare, tbb::tbb_allocator<std::pair<std::string, HPHP::AtomicSmartPtr<HPHP::StatCache::Node> > > >::lookup (this=this@entry=0x2982100 <HPHP::StatCache::s_sc>, key=..., result=result@entry=0x7ff33c7fb070, write=write@entry=true, t=0x0, op_insert=false)
    at /usr/include/tbb/concurrent_hash_map.h:1039
#2  0x00000000009440a4 in find (key=..., result=..., this=0x2982100 <HPHP::StatCache::s_sc>) at /usr/include/tbb/concurrent_hash_map.h:868
#3  removePath (node=0x7ff32d702f80, path=..., this=0x2982100 <HPHP::StatCache::s_sc>) at /usr/src/oblivian/debs/hhvm/hphp/runtime/base/stat-cache.cpp:630
#4  HPHP::StatCache::Node::touchLocked<true> (this=0x7ff32d702f80, invalidate=true) at /usr/src/oblivian/debs/hhvm/hphp/runtime/base/stat-cache.cpp:197
#5  0x000000000094454c in HPHP::StatCache::Node::expirePaths (this=0x7ff32d702f80, invalidate=invalidate@entry=true) at /usr/src/oblivian/debs/hhvm/hphp/runtime/base/stat-cache.cpp:244
#6  0x0000000000948644 in HPHP::StatCache::handleEvent (this=this@entry=0x2982100 <HPHP::StatCache::s_sc>, event=event@entry=0x29826b4 <HPHP::StatCache::s_sc+1460>)
    at /usr/src/oblivian/debs/hhvm/hphp/runtime/base/stat-cache.cpp:596
#7  0x00000000009494ea in refresh (this=0x2982100 <HPHP::StatCache::s_sc>) at /usr/src/oblivian/debs/hhvm/hphp/runtime/base/stat-cache.cpp:667
#8  HPHP::StatCache::requestInit () at /usr/src/oblivian/debs/hhvm/hphp/runtime/base/stat-cache.cpp:901
#9  0x00000000008e437c in HPHP::hphp_session_init () at /usr/src/oblivian/debs/hhvm/hphp/runtime/base/program-functions.cpp:1871
#10 0x0000000000b458bc in HPHP::HttpRequestHandler::handleRequest (this=0x7ff33ca3fa10, transport=<optimized out>) at /usr/src/oblivian/debs/hhvm/hphp/runtime/server/http-request-handler.cpp:375
#11 0x0000000000b607aa in HPHP::ServerWorker<std::shared_ptr<HPHP::FastCGIJob>, HPHP::FastCGITransportTraits>::doJobImpl (this=0x7ff3560211c0, job=..., abort=<optimized out>)
    at /usr/src/oblivian/debs/hhvm/hphp/runtime/server/server-worker.h:107
#12 0x0000000000b60d1e in HPHP::ServerWorker<std::shared_ptr<HPHP::FastCGIJob>, HPHP::FastCGITransportTraits>::doJob (this=<optimized out>, job=...)
    at /usr/src/oblivian/debs/hhvm/hphp/runtime/server/server-worker.h:57
#13 0x0000000000b5f366 in HPHP::JobQueueWorker<std::shared_ptr<HPHP::FastCGIJob>, HPHP::Server*, true, false, HPHP::JobQueueDropVMStack>::start (this=0x7ff3560211c0)
    at /usr/src/oblivian/debs/hhvm/hphp/util/job-queue.h:463
#14 0x0000000001b5e017 in threadFuncImpl (this=0x7ff373765200) at /usr/src/oblivian/debs/hhvm/hphp/util/async-func.cpp:131
#15 HPHP::AsyncFuncImpl::ThreadFunc (obj=0x7ff373765200) at /usr/src/oblivian/debs/hhvm/hphp/util/async-func.cpp:51
#16 0x00000000008a27d7 in HPHP::start_routine_wrapper (arg=0x7ff35603b3a0) at /usr/src/oblivian/debs/hhvm/hphp/runtime/base/thread-hooks.cpp:93
#17 0x00007ff37bb92182 in start_thread (arg=0x7ff33c7ff700) at pthread_create.c:312
#18 0x00007ff37b09f47d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Digging into this a bit further in the tbb source code, we see a block of code like this where the yield is happening:

        if( !result ) goto check_growth;
        // TODO: the following seems as generic/regular operation
        // acquire the item
        if( !result->try_acquire( n->mutex, write ) ) {
            for( tbb::internal::atomic_backoff backoff(true);; ) {
                if( result->try_acquire( n->mutex, write ) ) break;
                if( !backoff.bounded_pause() ) {
                    // the wait takes really long, restart the operation
                    b.release();
                    __TBB_ASSERT( !op_insert || !return_value, "Can't acquire new item in locked bucket?" );
                    __TBB_Yield();
                    m = (hashcode_t) itt_load_word_with_acquire( my_mask );
                    goto restart;
                }
            }
        }

The tricky thing is ascertaining what n->mutex really is through all the layers of templating and inheritance and such. I have yet to get gdb to actually print that for me with any syntax variation I've tried...

(gdb) p n
$17 = (tbb::interface5::concurrent_hash_map<std::basic_string<char, std::char_traits<char>, std::allocator<char> >, HPHP::AtomicSmartPtr<HPHP::StatCache::Node>, HPHP::stringHashCompare, tbb::tbb_allocator<std::pair<std::basic_string<char, std::char_traits<char>, std::allocator<char> >, HPHP::AtomicSmartPtr<HPHP::StatCache::Node> > > >::node *) 0x7ff37210fce0
(gdb) p *n
$18 = {<tbb::interface5::internal::hash_map_node_base> = {<tbb::internal::no_copy> = {<tbb::internal::no_assign> = {<No data fields>}, <No data fields>}, next = 0x0, mutex = {static is_rw_mutex = true,
      static is_recursive_mutex = false, static is_fair_mutex = false, static WRITER = 1, static WRITER_PENDING = 2, static READERS = -4, static ONE_READER = 4, static BUSY = -3, state = 4}}, item = {first = {
      static npos = <optimized out>, _M_dataplus = {<std::allocator<char>> = {<__gnu_cxx::new_allocator<char>> = {<No data fields>}, <No data fields>},
        _M_p = 0x7ff32d673668 "/srv/mediawiki/wmf-config/InitialiseSettings.php"}}, second = {m_px = 0x7ff32d702f80}}}
(gdb) p n.mutex
A syntax error in expression, near `'.
(gdb) p (*n).mutex
A syntax error in expression, near `'.

... but my hunch is that it ends up being the m_lock member of HPHP::StatCache::Node (the value type stored in the hash map being accessed), which is one of the two mutexes we looked at earlier, as you'll see from the address below. The hunch is based on correlating the _M_p path to InitialiseSettings.php above with the _M_p path shown below in the Node object:

(gdb) frame 5
#5  0x000000000094454c in HPHP::StatCache::Node::expirePaths (this=0x7ff32d702f80, invalidate=invalidate@entry=true) at /usr/src/oblivian/debs/hhvm/hphp/runtime/base/stat-cache.cpp:244
244	in /usr/src/oblivian/debs/hhvm/hphp/runtime/base/stat-cache.cpp
(gdb) p &this.m_lock.m_mutex
$8 = (pthread_mutex_t *) 0x7ff32d702f98
(gdb) p this.m_lock.m_mutex.__data.__owner
$12 = 1061
(gdb) p this.m_path
$20 = {static npos = <optimized out>, _M_dataplus = {<std::allocator<char>> = {<__gnu_cxx::new_allocator<char>> = {<No data fields>}, <No data fields>},
    _M_p = 0x7ff2ba2c04d8 "/srv/mediawiki/wmf-config/InitialiseSettings.php"}}

In some sense, when the problem is viewed at this narrow scope, I think it's because this lock is non-recursive that the thread ends up looping forever, trying to acquire it and yielding down in the hash map, while it already holds it from a higher-level scope.

So, a completely-ignorant workaround that might have horrific side effects would be to switch that lock (StatCache::Node::m_lock) to a recursive one. It's defined here: https://github.com/facebook/hhvm/blob/2aa2b5c63eff05a5c119cc83e35fd3b28f5fe906/hphp/runtime/base/stat-cache.cpp#L169 , and it looks like you'd just change the first parameter there from false to true. If it were recursive, the hash map acquiring it while the same thread already holds it wouldn't be an issue, so the acquisition would succeed there instead of looping and yielding waiting on itself.

It's hard to say since I don't have a good grasp of the overall design/picture here, I'm just staring at low-level details and trying to infer what's going on. But, I think it's worth trying, in any case.
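
For what it's worth, here's a tiny standalone demonstration of the behaviour that workaround relies on (plain pthreads, not the actual HPHP::Mutex class, whose constructor parameters I'm only guessing at from the linked line): a recursive mutex lets the same thread lock it again, so the inner acquisition would succeed instead of spinning and yielding forever.

#include <pthread.h>

int main() {
  // Standalone sketch, not HHVM code: build a recursive pthread mutex.
  pthread_mutexattr_t attr;
  pthread_mutexattr_init(&attr);
  pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);

  pthread_mutex_t m;
  pthread_mutex_init(&m, &attr);

  pthread_mutex_lock(&m);    // outer scope (think expirePaths holding m_lock)
  pthread_mutex_lock(&m);    // same thread, deeper scope: succeeds, count is now 2
  pthread_mutex_unlock(&m);
  pthread_mutex_unlock(&m);  // must unlock as many times as it was locked

  pthread_mutex_destroy(&m);
  pthread_mutexattr_destroy(&attr);
  return 0;
}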

Did this get fixed upstream? AFAIK we're not experiencing HHVM lockups in production now, even on big deploys, and there was work around StatCache.

I'm not aware of any fixes for this specific issue. I had the original author of StatCache take a look at @BBlack's comments and he said it shouldn't be possible without some kind of memory corruption, which could've been fixed (or perturbed out of existence).

No, this specific type of lock-up has gone away with HHVM 3.6.x.

Closing the ticket.

Still happens occasionally.

Mentioned in SAL [2016-02-20T00:32:53Z] <ori> Restarted HHVM on mw1248 (locked up, T89912).

Mentioned in SAL (#wikimedia-operations) [2016-09-23T00:38:22Z] <Krenair> mw1224 apache stuck, not restarting for now in case someone wants to investigate later. possibly T89912?

19:10 < mutante> !log mw1206, mw1224 - restarted hhvm and apache

19:10 < icinga-wm> RECOVERY - HHVM rendering on mw1206 is OK: HTTP OK: HTTP/1.1 200 OK - 75236 bytes in 0.174 second response time
19:11 < icinga-wm> RECOVERY - HHVM rendering on mw1224 is OK: HTTP OK: HTTP/1.1 200 OK - 75228 bytes in 0.167 second response time

Recurrences still happen from time to time; the last one was on mw1204 today.

hashar added a subscriber: hashar.

HHVM 3.18 has a similar deadlock that happens after just a few hours of live traffic (T161684). Most probably the same deadlock.

MoritzMuehlenhoff claimed this task.

I'm quite confident that this is the same bug as T161684 (which also turned out to be stat_cache related after some investigation), which has now been fixed upstream. I'm marking this task as fixed already (and will mark T161684 as fixed when the HHVM 3.18.2 rollout is completed).