
HHVM 3.12 has a race-condition when starting up
Closed, Resolved · Public

Description

When hhvm crashes, upstart tries to start it again; we've seen a few cases where hhvm then fails to start correctly. Grabbing a couple of quickstack dumps showed HHVM stuck in a deadlock at startup:

Thread 2 (LWP 26568):
#01  0x00007f49f8fbe680 in sem_wait () from /build/eglibc-3GlaMS/eglibc-2.19/nptl/../nptl/sysdeps/unix/sysv/linux/x86_64/sem_wait.S:84
#02  0x00000000016a4e58 in HPHP::s_waitThread () from /usr/bin/hhvm
#03  0x0000000000ce0ff7 in HPHP::start_routine_wrapper () from /usr/bin/hhvm
#04  0x00007f49f8fb8182 in start_thread () from /build/eglibc-3GlaMS/eglibc-2.19/nptl/pthread_create.c:312

Thread 1 (LWP 26560):
#01  0x00007f49f8fbef2c in __lll_lock_wait () from /build/eglibc-3GlaMS/eglibc-2.19/nptl/../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:133
#02  0x00000000012bdd00 in HPHP::jit::sortBlocks () from /usr/bin/hhvm
#03  0x0000000001297e43 in HPHP::jit::X64Assembler::instrIM8 () from /usr/bin/hhvm
#04  0x0000000000c7af05 in std::vector<unsigned long, std::allocator<unsigned long> >::_M_default_append () from /usr/bin/hhvm
#05  0x0000000001195030 in HPHP::jit::simplify () from /usr/bin/hhvm
#06  0x000000000109cb48 in boost::dynamic_bitset<unsigned long, std::allocator<unsigned long> >::find_next () from /usr/bin/hhvm
#07  0x000000000109fb9a in HPHP::jit::allocateRegisters () from /usr/bin/hhvm
#08  0x0000000001282241 in HPHP::jit::optimizeX64 () from /usr/bin/hhvm
#09  0x0000000001181cc9 in HPHP::jit::Vauto::~Vauto () from /usr/bin/hhvm
#10  0x00000000010205a9 in HPHP::jit::Vauto::Vauto () from /usr/bin/hhvm
#11  0x000000000102061a in HPHP::jit::Vauto::Vauto () from /usr/bin/hhvm
#12  0x00000000010226f6 in HPHP::jit::UniqueStubs::emitAll () from /usr/bin/hhvm
#13  0x000000000104f987 in HPHP::jit::MCGenerator::initUniqueStubs () from /usr/bin/hhvm
#14  0x0000000000bf1ce5 in HPHP::ProcessInit () from /usr/bin/hhvm
#15  0x0000000000ca7fac in HPHP::hphp_process_init () from /usr/bin/hhvm
#16  0x0000000000e43aa5 in HPHP::HttpServer::HttpServer () from /usr/bin/hhvm
#17  0x0000000000cb4733 in virtual thunk to boost::exception_detail::clone_impl<boost::exception_detail::error_info_injector<boost::program_options::invalid_option_value> >::rethrow() const () from /usr/bin/hhvm
#18  0x0000000000cb58af in HPHP::execute_program () from /usr/bin/hhvm
#19  0x0000000000bf19e0 in main () from /usr/bin/hhvm
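For reference, roughly the same per-thread backtrace can be captured without quickstack by attaching gdb in batch mode (a sketch; <hhvm-pid> is a placeholder for the stuck hhvm process):

gdb -p <hhvm-pid> -batch -ex 'thread apply all bt'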

While it would also be important to investigate the reason for the crashes, this startup deadlock is the most worrying part tbh.


Event Timeline

Rendering HHVMs keep falling over like crazy. This is all in reaction to Icinga alerts on IRC, basically:

13:20 bblack: mw1188: restarted hhvm (before alert hit IRC, was already pending in icinga)
13:19 bblack: mw1240: restarted hhvm
13:17 bblack: mw1122: restarted hhvm, killed stale procs from Jan24 looking like: /usr/bin/hhvm --php -c /etc/hhvm/fcgi.ini -r echo ini_get("hhvm.jit_warmup_requests")?:11;
13:15 bblack: restarting hhvm on mw1112
13:12 bblack: restarting hhvm on mw1248, mw1151
13:06 bblack: restarting hhvm on mw1251 ...
13:02 bblack: restarting hhvm on mw1258 ...
13:01 bblack: restarting hhvm on mw1243
12:59 bblack: restarting hhvm on mw1091 mw1249 mw1252 mw1242 ...
12:56 bblack: restarting hhvm on mw1217
12:01 _joe_: restarted hhvm on mw1122, stuck at startup

A wild guess: our hhvm-warmup job might have something to do with this. On mw1122 I found a hanging hhvm process created by the warmup job:

/usr/bin/hhvm --php -c /etc/hhvm/fcgi.ini -r echo ini_get("hhvm.jit_warmup_requests")?:11;

so it *might* be that this has to do with a deadlock when two separate processes access the JIT.
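If that guess is right, a quick way to spot a wedged warmup invocation on a host is to list hhvm CLI processes with their full command lines (a sketch; the match pattern is just based on the command above):

pgrep -a -f 'hhvm --php'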

Mentioned in SAL [2016-03-10T18:13:13Z] <ori@tin> Synchronized php-1.27.0-wmf.15/includes/GlobalFunctions.php: wfShellExec() debug logging for T129467 (duration: 00m 28s)

Mentioned in SAL [2016-03-10T18:19:00Z] <ori@tin> Synchronized php-1.27.0-wmf.15/includes/GlobalFunctions.php: wfShellExec() debug logging for T129467 (take 2) (duration: 00m 26s)


I was able to reproduce this by just invoking PHP repeatedly with -r. The backtrace pointed to jemalloc profiling code as the culprit. I checked whether /etc/malloc.conf was present, and indeed it was a symlink to 'prof:true'. Once I removed the symlink, I could no longer reproduce the lock-ups.

In short: we can avoid running into this issue in prod by avoiding jemalloc memory profiling.
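For the record, jemalloc reads its options from the target of the /etc/malloc.conf symlink, so checking and disabling profiling on a host is roughly:

readlink /etc/malloc.conf    # prints "prof:true" when profiling is enabled
sudo rm /etc/malloc.conf     # then restart hhvm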

The hosts that had /etc/malloc.conf -> prof:true were mw1122.eqiad.wmnet, mw1169.eqiad.wmnet, mw1015.eqiad.wmnet, and mw1107.eqiad.wmnet, which are exactly the ones that were repeatedly failing. I removed the symlink and restarted HHVM on each of those hosts.
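A rough one-liner to clean this up across those hosts (a sketch, assuming working ssh/sudo and that the upstart job is simply named hhvm):

for h in mw1122 mw1169 mw1015 mw1107; do ssh $h.eqiad.wmnet 'sudo rm -f /etc/malloc.conf && sudo service hhvm restart'; done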