Page MenuHomePhabricator

HHVM 3.18 crashes in realloc() as exposed by luasandbox
Closed, ResolvedPublic

Description

I've upgraded a job runner to HHVM 3.18, but it segfaulted five times since then, apparently all related to the luasandbox:

Example backtrace:

(gdb) bt
#0  HPHP::BigHeap::freeBig (this=0x7fde75bfbfc8, ptr=0x7fdc9853ce90) at /tmp/buildd/hhvm-3.18.2+dfsg/hphp/runtime/base/memory-manager.cpp:1240
#1  0x00007fdf363e00ad in ?? () from /usr/lib/x86_64-linux-gnu/hhvm/extensions/20150212/luasandbox.so
#2  0x00007fdf361b6fde in ?? () from /usr/lib/x86_64-linux-gnu/liblua5.1-c++.so.0
#3  0x00007fdf361bb388 in ?? () from /usr/lib/x86_64-linux-gnu/liblua5.1-c++.so.0
#4  0x00007fdf361b4b55 in ?? () from /usr/lib/x86_64-linux-gnu/liblua5.1-c++.so.0
#5  0x00007fdf361b5911 in ?? () from /usr/lib/x86_64-linux-gnu/liblua5.1-c++.so.0
#6  0x00007fdf361ba61e in ?? () from /usr/lib/x86_64-linux-gnu/liblua5.1-c++.so.0
#7  0x00007fdf363e6c18 in ?? () from /usr/lib/x86_64-linux-gnu/hhvm/extensions/20150212/luasandbox.so
#8  0x0000000001d37681 in HPHP::ZendObjectStore::freeObject (this=0x7fde394e45b0, handle=101)
    at /tmp/buildd/hhvm-3.18.2+dfsg/hphp/runtime/ext_zend_compat/hhvm/zend-object-store.cpp:76
#9  0x00000000012df5d1 in HPHP::Native::nativeDataInstanceDtor (obj=0x7fdd0eb7a860, cls=<optimized out>)
    at /tmp/buildd/hhvm-3.18.2+dfsg/hphp/runtime/vm/native-data.cpp:138
#10 0x0000000001d099e2 in tvDecRefHelper (datum=<optimized out>, type=<optimized out>) at /tmp/buildd/hhvm-3.18.2+dfsg/hphp/runtime/base/tv-helpers.h:175
#11 tvDecRef (tv=<optimized out>, tv=<optimized out>) at /tmp/buildd/hhvm-3.18.2+dfsg/hphp/runtime/base/tv-helpers.h:206
#12 tvRefcountedDecRef (tv=<optimized out>) at /tmp/buildd/hhvm-3.18.2+dfsg/hphp/runtime/base/tv-helpers.h:212
#13 ~Variant (this=0x7fdccc5e5188, __in_chrg=<optimized out>) at /tmp/buildd/hhvm-3.18.2+dfsg/hphp/runtime/base/type-variant.h:335
#14 ~RefData (this=0x7fdccc5e5180, __in_chrg=<optimized out>) at /tmp/buildd/hhvm-3.18.2+dfsg/hphp/runtime/base/type-variant.h:1509
#15 release (this=0x7fdccc5e5180) at /tmp/buildd/hhvm-3.18.2+dfsg/hphp/runtime/base/ref-data.h:100
#16 decRefAndRelease (this=0x7fdccc5e5180) at /tmp/buildd/hhvm-3.18.2+dfsg/hphp/runtime/base/ref-data.h:108
#17 decRefRef (ref=0x7fdccc5e5180) at /tmp/buildd/hhvm-3.18.2+dfsg/hphp/runtime/base/ref-data.h:269
#18 _zval_ptr_dtor (zval_ptr=<optimized out>) at /tmp/buildd/hhvm-3.18.2+dfsg/hphp/runtime/ext_zend_compat/php-src/Zend/zend_variables.cpp:55
#19 0x00007fdf363e58af in ?? () from /usr/lib/x86_64-linux-gnu/hhvm/extensions/20150212/luasandbox.so
#20 0x0000000001d37681 in HPHP::ZendObjectStore::freeObject (this=0x7fde394e45b0, handle=67)
    at /tmp/buildd/hhvm-3.18.2+dfsg/hphp/runtime/ext_zend_compat/hhvm/zend-object-store.cpp:76
#21 0x00000000012df5d1 in HPHP::Native::nativeDataInstanceDtor (obj=0x7fddad0729e0, cls=<optimized out>)
    at /tmp/buildd/hhvm-3.18.2+dfsg/hphp/runtime/vm/native-data.cpp:138
#22 0x0000000002129e88 in tvDecRefHelper (datum=<optimized out>, type=<optimized out>) at /tmp/buildd/hhvm-3.18.2+dfsg/hphp/runtime/base/tv-helpers.h:175
#23 tvDecRef (tv=<optimized out>, tv=<optimized out>) at /tmp/buildd/hhvm-3.18.2+dfsg/hphp/runtime/base/tv-helpers.h:206
#24 tvRefcountedDecRef (tv=0x7fde20e76648) at /tmp/buildd/hhvm-3.18.2+dfsg/hphp/runtime/base/tv-helpers.h:212
#25 HPHP::ObjectData::releaseNoObjDestructCheck (this=0x7fde20e76600) at /tmp/buildd/hhvm-3.18.2+dfsg/hphp/runtime/base/object-data.cpp:171
#26 0x000000000212945a in tvDecRefHelper (datum=<optimized out>, type=<optimized out>) at /tmp/buildd/hhvm-3.18.2+dfsg/hphp/runtime/base/tv-helpers.h:175
#27 tvDecRef (tv=<optimized out>, tv=<optimized out>) at /tmp/buildd/hhvm-3.18.2+dfsg/hphp/runtime/base/tv-helpers.h:206
#28 tvRefcountedDecRef (tv=0x7fdd86187628) at /tmp/buildd/hhvm-3.18.2+dfsg/hphp/runtime/base/tv-helpers.h:212
#29 HPHP::MixedArray::Release (in=0x7fdd86186a00) at /tmp/buildd/hhvm-3.18.2+dfsg/hphp/runtime/base/mixed-array.cpp:489
#30 0x0000000002130882 in tvDecRefHelper (datum=<optimized out>, type=<optimized out>) at /tmp/buildd/hhvm-3.18.2+dfsg/hphp/runtime/base/tv-helpers.h:175
#31 tvRefcountedDecRefHelper (datum=<optimized out>, type=<optimized out>) at /tmp/buildd/hhvm-3.18.2+dfsg/hphp/runtime/base/tv-helpers.h:182
#32 tvSet (inTo=..., fr=...) at /tmp/buildd/hhvm-3.18.2+dfsg/hphp/runtime/base/tv-helpers.h:393
#33 HPHP::ObjectData::setProp (this=0x7fde7187f360, ctx=<optimized out>, key=0x7fdf2c211880, val=val@entry=0x7fde75bfa8c0,
    bindingAssignment=bindingAssignment@entry=false) at /tmp/buildd/hhvm-3.18.2+dfsg/hphp/runtime/base/object-data.cpp:1361
#34 0x00000000016b6ee5 in SetPropObj<(HPHP::KeyType)2> (val=0x7fde75bfa8c0, key=<optimized out>, instance=<optimized out>, ctx=<optimized out>)
    at /tmp/buildd/hhvm-3.18.2+dfsg/hphp/runtime/vm/member-operations.h:2850
#35 HPHP::jit::MInstrHelpers::setPropCOS (ctx=<optimized out>, base=<optimized out>, key=<optimized out>, val={ HPHP::KindOfNull })
    at /tmp/buildd/hhvm-3.18.2+dfsg/hphp/runtime/vm/jit/minstr-helpers.h:291
#36 0x00000000043497ba in ?? ()
#37 0x00007fde71aff1b0 in ?? ()
#38 0x00000000047d465c in ?? ()
#39 0x00007fde71aff1e0 in ?? ()
#40 0x00000000047d44cc in ?? ()
#41 0x00007fde71aff220 in ?? ()
#42 0x00000000040910a6 in ?? ()
#43 0x00007fde71aff350 in ?? ()
#44 0x0000000004884c9e in ?? ()
#45 0x00007fde71aff3a0 in ?? ()
#46 0x0000000004d9343c in ?? ()
#47 0x00007fde71aff410 in ?? ()
#48 0x0000000004d931f1 in ?? ()
#49 0x00007fde71aff5d0 in ?? ()
#50 0x0000000004d14df2 in ?? ()
#51 0x00007fde71aff7b0 in ?? ()
#52 0x0000000004091376 in ?? ()
#53 0x00007fde71aff8f0 in ?? ()
#54 0x0000000004ce3a9f in ?? ()
#55 0x00007fde71aff940 in ?? ()
#56 0x0000000004ce16fc in ?? ()
#57 0x00007fde71affaf0 in ?? ()
#58 0x00000000048e4054 in ?? ()
#59 0x00007fde71affbe0 in ?? ()
#60 0x0000000004629227 in ?? ()
#61 0x000000000400004f in ?? ()
#62 0x00007fdd84c2ebf0 in ?? ()
#63 0x00007fde75bfa9f0 in ?? ()
#64 0x0000000001772a84 in HPHP::jit::enterTCImpl (start=<optimized out>, stashedAR=<optimized out>)
    at /tmp/buildd/hhvm-3.18.2+dfsg/hphp/runtime/vm/jit/unique-stubs.cpp:1421
Backtrace stopped: previous frame inner to this frame (corrupt stack?)

Event Timeline

Here's the backtrace after installing the lua debug symbols:
https://phabricator.wikimedia.org/P5423

It's unfortunate that our hhvm-luasandbox package strips debug symbols, we should probably add these to a separate package.

It's unfortunate that our hhvm-luasandbox package strips debug symbols, we should probably add these to a separate package.

According to https://wiki.debian.org/DebugPackage, that'll happen automatically with the version of debhelper in stretch. But jessie doesn't have a new enough version of debhelper.

The 'debian' branch of the mediawiki/php/luasandbox repo could use an update to pick up recent fixes, but at first glance I don't think any of them would affect this task.

I'll build an unstripped hhvm-luasandbox package tomorrow morning (can be done manually, we in fact don't have the automatic dbgsym packages from stretch)

Based on the surrounding functions in the stack trace, I'd guess frame #1 is most likely alloc.c line 108.

For frame #8, my best guess is luasandbox.c line 423, with the tail call at alloc.c line 53 getting optimized out.

Frame #20 is hard to guess. Maybe luasandbox.c line 485, based mostly on where luasandbox calls zval_ptr_dtor on a zval for a LuaSandbox object from something that seems like it might be called from HPHP::ZendObjectStore::freeObject.

None of that helps me figure out what might be going wrong though.

Since we're close to the weekend, I've downgraded the canary job runner (where this happened) to 3.12. There's another HHVM bug (T162586) which has now been narrowed down to a reproducer and it's quite likely that we'll be able to build fixed HHVM packages early next week. When that happens, I'll adapt hhvm-luasandbox to provide a separate -dbg package so that we have a proper backtrace if it still happens (it might be that the luasandbox crash is the fallout of the memory corruption from T162586).

This sort of thing is much easier if there is a reproducible test case. Maybe we could parse a few different articles using benchmarkParse.php to try to trigger a crash. Failing that, maybe we can try to extract the page name from a core dump, and get a test case that way. It's unlikely the stack trace alone will be enough, regardless of how many debug symbols you install, considering it is apparently a double free.

Once we have a test case, it should be pretty easy to see what is happening, by tracing the refcount increment/decrement operations. For example, you can set a hardware breakpoint on the refcount in question, and use gdb's "commands" to show a short backtrace each time the refcount changes.

The crashes are not related/fallout of the segfault fixed in T162586. I checked our servers which have been upgraded to the version and we've seen three crashes related to luasandbox with the new version, two of which seem to be some kind of double free and one which is different: https://people.wikimedia.org/~jmm/luasandbox/

I've also rebuilt the luasandbox source package to add a new hhvm-luasandbox-dbg, will install that on the migrated servers in a bit.

MoritzMuehlenhoff renamed this task from HHVM 3.18 crash on job runner / luasandbox to HHVM 3.18 crashes in luasandbox.May 17 2017, 1:19 PM
MoritzMuehlenhoff changed the visibility from "Public (No Login Required)" to "WMF-NDA (Project)".May 19 2017, 12:05 PM

I've caught three additional backtraces on hosts which have been upgraded to use new build of hhvm-luasandbox-dbg: They're available at https://phabricator.wikimedia.org/P5459

They all contain "Module:zh/data/nan-pron", which per Google seems related to Chinese Wiktionary. I've been unable to extract the processed URL from the core dump so far, but the segfault on mw1181 occured on 2017-05-19 05:12 and I've copied the URLs processed five minutes preceeding the crash to https://phabricator.wikimedia.org/P5459

Does that help in reproducing the crash? The core file in question has been copied to my home on mw1181 (to prevent the tidy cron from cleaning it up in a few days).

It's difficult to interpret this core dump, because any access to TLS variables fails with "Cannot find executable file `/usr/lib/debug/.build-id/a6/497566c73070f1db15f856584d02dffb69b4ec.debug' in dynamic linker's load module list". Apparently this is because of https://bugzilla.redhat.com/show_bug.cgi?id=1080656 , which was fixed in 2014 shortly after the branch point for the gdb version present in jessie. @Joe suggests building a gdb package with the relevant patch included.

The patch from https://bugzilla.redhat.com/show_bug.cgi?id=1080656 fixes TLS access for statically linked binaries, this won't make a difference for our HHVM build? Instead I've upgraded gdb on mw1181 to 7.11.1 from jessie-backports. If that's more useful, we can install this via puppet on all the mw* servers.

The patch from https://bugzilla.redhat.com/show_bug.cgi?id=1080656 fixes TLS access for statically linked binaries, this won't make a difference for our HHVM build? Instead I've upgraded gdb on mw1181 to 7.11.1 from jessie-backports. If that's more useful, we can install this via puppet on all the mw* servers.

That fixed it, thanks. It was described as only affecting statically linked binaries, but I don't know if that was true. May as well install it everywhere if that's easy -- often the stack is corrupted by a crash, and HHVM stores hardly anything in true globals, but from thread-local storage you can access the whole request state, including the VM's stack.

Ok, great. I'll puppetise a change to use the backports gdb version on all mw* servers.

gdb on the mw* servers has been upgraded to 7.11.2

There's another Luasandbox related core file on mw1191 (/var/tmp/core/core.mw1191.hhvm.12080.1495939942), which is also related to Chinese Wiktionary (title=Module:yue-pron/hom&action=edit)

On my local setup I reliably have segmentation faults. The host is Debian jessie with:

hhvm3.18.2+dfsg-1+wmf4
hhvm-dbg3.18.2+dfsg-1+wmf4
hhvm-dev3.18.2+dfsg-1+wmf4
hhvm-luasandbox2.0.12~jessie3
hhvm-luasandbox-dbg2.0.12~jessie3
hhvm-tidy0.1.3~jessie2

mediawiki/core and Scribunto @master

$wgScribuntoDefaultEngine = 'luasandbox';

/etc/hhvm/php.ini
hhvm.debug.core_dump_report = true
hhvm.debug.core_dump_report_directory = /tmp
hhvm.resource_limit.core_file_size = 8589934592

hhvm.log.level = Warning
hhvm.log.always_log_unhandled_exceptions = true
hhvm.log.runtime_error_reporting_level = 8191
hhvm.mysql.typed_results = false

hhvm.dynamic_extension_path = /usr/lib/x86_64-linux-gnu/hhvm/extensions/20150212
hhvm.dynamic_extensions[tidy.so] = tidy.so
hhvm.dynamic_extensions[luasandbox.so] = luasandbox.so

hhvm.enable_zend_compat = true
hhvm.jit = false
hhvm.jit_pseudomain = false

hhvm.log.native_stack_trace = true
hhvm.log.use_syslog = false

hhvm.perf_pid_map = false

hhvm.repo.central.path =
hhvm.repo.eval.mode = central
hhvm.repo.journal = memory
hhvm.repo.local.mode = --
/etc/security/limits.d/hashar-core.conf
hashar  soft core 8388608

I run the test suite using:

ulimit -c 8388608; hhvm tests/phpunit/phpunit.php --verbose --testsuite extensions

A few gdb traces with the related stacktrace.<pid>.log files: P5502 P5503 P5504

They are all three near identical but different from this task summary.

If I disable luasandbox.so, the related tests are not run and the suite pass just fine.

To limit the number of tests to run, one can pass to PHPUnit --group LuaSandbox.

With hhvm.jit = true there is some bad memory errors via jemalloc P5505 P5506 P5507 all segfaulting at je_arena_dalloc_bin_locked.

On my local setup I reliably have segmentation faults. The host is Debian jessie with:

hhvm3.18.2+dfsg-1+wmf4
hhvm-dbg3.18.2+dfsg-1+wmf4
hhvm-dev3.18.2+dfsg-1+wmf4
hhvm-luasandbox2.0.12~jessie3
hhvm-luasandbox-dbg2.0.12~jessie3
hhvm-tidy0.1.3~jessie2

Using mediawiki-vagrant, I get the same versions of hhvm and hhvm-luasandbox and can reliably reproduce the segfaults as well.

After a lot of experimentation, I determined that the segfaults go away if [[https://phabricator.wikimedia.org/diffusion/ELUA/browse/master/tests/engines/LuaCommon/LuaInterpreterTest.php;0d41404fd8dcd86a0d3db34772e84e43f8299c21$130-146|Scribunto_LuaInterpreterTest::testTestMemoryLimit()]] is commented out. So it seems to be something about that test in particular that's breaking things.

Playing with it some more, I made some more observations:

  • If I change the test to do 100 strings with 100000 repeats each (instead of 10 with 1000000 repeats), the segfaults go away.
  • If I raise the memory limit enough that the test fails, the segfaults still remain.
  • If I change alloc.c line 73 in the luasandbox PHP extension so that the memory limit is effectively ignored, the segfaults still remain.
  • If I change luasandbox to use the C library malloc, realloc, and free instead of PHP's emalloc, erealloc, and efree, the segfaults go away.

All this points to a bug of some sort in HHVM's implementation of emalloc, erealloc, and/or efree.

Glancing through the HHVM sources, I see it has two different methods of allocating memory, one for "small" chunks and one for "big" chunks. The definition of "small" and "big" depends on HPHP::kMaxSmallSize. I had a hunch that it gets confused if you allocate a "small" chunk, reallocate it "big", then free it, and/or vice versa.

On that hunch, I adjusted luasandbox to use emalloc/memcpy/efree instead of erealloc. The segfaults went away. Then I adjusted it to use do that only if the reallocation crosses the HPHP::kMaxSmallSize boundary, and the segfaults still went away.

if((osize <= HPHP::kMaxSmallSize) != (nsize <= HPHP::kMaxSmallSize)){
        nptr = emalloc(nsize);
        memcpy(nptr, ptr, nsize < osize ? nsize : osize);
        efree(ptr);
} else {
        nptr = erealloc(ptr, nsize);
}

A little more experimentation establishes that the segfault only manifests when the erealloc takes it from "big" to "small", not in the other direction. This does make sense on further examination, since HHVM internally does a malloc/memcpy/free when reallocing from "small"-sized chunks.

I'd say let's report this to HHVM upstream next, after all it all points to a generic bug in HHVM memory handling. We already ran into an earlier crash which was reproducible with mediawiki-vagrant and the HHVM devs tracked it down/fixed it using that setup (https://github.com/facebook/hhvm/issues/7779)

Sounds like a good idea to me.

They'll probably want to verify and add tests, but the TL;DR: my suspicion is that it's breaking because an allocation that was allocated with the "big" allocator's malloc (either directly or via the malloc/memcpy/free in an earlier realloc) is getting passed to the "small" allocator's free, which somehow corrupts the state of one or both of the allocators.

hashar added a project: Upstream.
hashar moved this task from Backlog to Reported Upstream on the Upstream board.

This has been acknowledged by one of the HHVM developers: "I haven't had a chance to look at this yet but it's probably a legit memory corruption bug."

If we don't see any movement on the HHVM end in the next days, there's still the possibility to work around this in hhvm-luasandbox by replacing erealloc() calls with emalloc/memcpy/efree.

MoritzMuehlenhoff renamed this task from HHVM 3.18 crashes in luasandbox to HHVM 3.18 crashes in realloc() as exposed by luasandbox.Jun 13 2017, 8:25 AM
MoritzMuehlenhoff closed this task as Resolved.
MoritzMuehlenhoff changed the visibility from "WMF-NDA (Project)" to "Public (No Login Required)".

I've built new HHVM packages (3.18.2+wmf5) which include the upstream fix from https://github.com/facebook/hhvm/commit/22e7acdb22b3d14420f82c7f59c843bd4e6cafcd. This addresses the bug, the luasandbox crash hasn't recoccured. There's another crash in the luasandbox (but far less frequently), for which I'll open a new task.