
HHVM 3.12 has a race-condition when starting up
Closed, Resolved · Public

Description

When hhvm crashes, upstart tries to start it again; we've seen a few cases where hhvm then fails to start correctly. Grabbing a couple of quickstack dumps showed HHVM stuck in a deadlock at startup:

Thread 2 (LWP 26568):
#01  0x00007f49f8fbe680 in sem_wait () from /build/eglibc-3GlaMS/eglibc-2.19/nptl/../nptl/sysdeps/unix/sysv/linux/x86_64/sem_wait.S:84
#02  0x00000000016a4e58 in HPHP::s_waitThread () from /usr/bin/hhvm
#03  0x0000000000ce0ff7 in HPHP::start_routine_wrapper () from /usr/bin/hhvm
#04  0x00007f49f8fb8182 in start_thread () from /build/eglibc-3GlaMS/eglibc-2.19/nptl/pthread_create.c:312

Thread 1 (LWP 26560):
#01  0x00007f49f8fbef2c in __lll_lock_wait () from /build/eglibc-3GlaMS/eglibc-2.19/nptl/../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:133
#02  0x00000000012bdd00 in HPHP::jit::sortBlocks () from /usr/bin/hhvm
#03  0x0000000001297e43 in HPHP::jit::X64Assembler::instrIM8 () from /usr/bin/hhvm
#04  0x0000000000c7af05 in std::vector<unsigned long, std::allocator<unsigned long> >::_M_default_append () from /usr/bin/hhvm
#05  0x0000000001195030 in HPHP::jit::simplify () from /usr/bin/hhvm
#06  0x000000000109cb48 in boost::dynamic_bitset<unsigned long, std::allocator<unsigned long> >::find_next () from /usr/bin/hhvm
#07  0x000000000109fb9a in HPHP::jit::allocateRegisters () from /usr/bin/hhvm
#08  0x0000000001282241 in HPHP::jit::optimizeX64 () from /usr/bin/hhvm
#09  0x0000000001181cc9 in HPHP::jit::Vauto::~Vauto () from /usr/bin/hhvm
#10  0x00000000010205a9 in HPHP::jit::Vauto::Vauto () from /usr/bin/hhvm
#11  0x000000000102061a in HPHP::jit::Vauto::Vauto () from /usr/bin/hhvm
#12  0x00000000010226f6 in HPHP::jit::UniqueStubs::emitAll () from /usr/bin/hhvm
#13  0x000000000104f987 in HPHP::jit::MCGenerator::initUniqueStubs () from /usr/bin/hhvm
#14  0x0000000000bf1ce5 in HPHP::ProcessInit () from /usr/bin/hhvm
#15  0x0000000000ca7fac in HPHP::hphp_process_init () from /usr/bin/hhvm
#16  0x0000000000e43aa5 in HPHP::HttpServer::HttpServer () from /usr/bin/hhvm
#17  0x0000000000cb4733 in virtual thunk to boost::exception_detail::clone_impl<boost::exception_detail::error_info_injector<boost::program_options::invalid_option_value> >::rethrow() const () from /usr/bin/hhvm
#18  0x0000000000cb58af in HPHP::execute_program () from /usr/bin/hhvm
#19  0x0000000000bf19e0 in main () from /usr/bin/hhvm
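For reference, roughly the same per-thread backtrace can be captured without quickstack by attaching gdb in batch mode (a sketch; <hhvm-pid> is a placeholder for the stuck hhvm process):

gdb -p <hhvm-pid> -batch -ex 'thread apply all bt'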

While it would also be important to investigate the reason for the crashes, this startup deadlock is the most worrying part tbh.


Event Timeline

Rendering HHVMs keep falling over like crazy. This is all in reaction to Icinga alerts on IRC, basically:

13:20 bblack: mw1188: restarted hhvm (before alert hit IRC, was already pending in icinga)
13:19 bblack: mw1240: restarted hhvm
13:17 bblack: mw1122: restarted hhvm, killed stale procs from Jan24 looking like: /usr/bin/hhvm --php -c /etc/hhvm/fcgi.ini -r echo ini_get("hhvm.jit_warmup_requests")?:11;
13:15 bblack: restarting hhvm on mw1112
13:12 bblack: restarting hhvm on mw1248, mw1151
13:06 bblack: restarting hhvm on mw1251 ...
13:02 bblack: restarting hhvm on mw1258 ...
13:01 bblack: restarting hhvm on mw1243
12:59 bblack: restarting hhvm on mw1091 mw1249 mw1252 mw1242 ...
12:56 bblack: restarting hhvm on mw1217
12:01 _joe_: restarted hhvm on mw1122, stuck at startup

A wild guess: our hhvm-warmup job might have something to do with this. On mw1122 I found a hanging hhvm process created by the warmup job:

/usr/bin/hhvm --php -c /etc/hhvm/fcgi.ini -r echo ini_get("hhvm.jit_warmup_requests")?:11;

so it *might* be that this has to do with a deadlock when two separate processes access the JIT.
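If that guess is right, a quick way to spot a wedged warmup invocation on a host is to list hhvm CLI processes with their full command lines (a sketch; the match pattern is just based on the command above):

pgrep -a -f 'hhvm --php'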

Mentioned in SAL [2016-03-10T18:13:13Z] <ori@tin> Synchronized php-1.27.0-wmf.15/includes/GlobalFunctions.php: wfShellExec() debug logging for T129467 (duration: 00m 28s)

Mentioned in SAL [2016-03-10T18:19:00Z] <ori@tin> Synchronized php-1.27.0-wmf.15/includes/GlobalFunctions.php: wfShellExec() debug logging for T129467 (take 2) (duration: 00m 26s)


I was able to reproduce this by just invoking PHP repeatedly with -r. The backtrace pointed to jemalloc profiling code as the culprit. I checked whether /etc/malloc.conf was present, and indeed it was a symlink to 'prof:true'. Once I removed the symlink, I could no longer reproduce the lock-ups.

In short: we can avoid running into this issue in prod by avoiding jemalloc memory profiling.
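For the record, jemalloc reads its options from the target of the /etc/malloc.conf symlink, so checking and disabling profiling on a host is roughly:

readlink /etc/malloc.conf    # prints "prof:true" when profiling is enabled
sudo rm /etc/malloc.conf     # then restart hhvm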

The hosts that had /etc/malloc.conf -> prof:true were mw1122.eqiad.wmnet, mw1169.eqiad.wmnet, mw1015.eqiad.wmnet, and mw1107.eqiad.wmnet, which are exactly the ones that were repeatedly failing. I removed the symlink and restarted HHVM on each of those hosts.
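A rough one-liner to clean this up across those hosts (a sketch, assuming working ssh/sudo and that the upstart job is simply named hhvm):

for h in mw1122 mw1169 mw1015 mw1107; do ssh $h.eqiad.wmnet 'sudo rm -f /etc/malloc.conf && sudo service hhvm restart'; done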