Build / migrate to HHVM 3.18
Open, Needs TriagePublic

Description

HHVM 3.18 is the next LTS and now available in jessie-wikimedia/experimental

There were some stat_cache related bugs, which are fixed now. 3.18.2 is running on the mediawiki canaries, but the wider rollout is held back until after the DC switchover.

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptFeb 15 2017, 12:36 PM
elukey added a subscriber: elukey.Feb 15 2017, 12:57 PM

HHVM 3.18 now requires webscalesqlclient also for the standard mysql extension, so I've changed the tarball generation to no longer prune webscalesqlclient. DENABLE_ASYNC_MYSQL is now broken in general as well, if not set to ON it misses to link libsquangle as well.

HHVM 3.18 also needs a build on librb5-dev now.

This also needed an updated build of lz4 since the extension in 3.18 requires the LZ4_compress_default() function only introduced in lz4 r130.

With these additional changes, 3.18 now builds, but fails the hphp/test/quick/json_bigint.php test introduced in 626ecb4c38e8acbb52e43d043aeb26069c873bb3

I've temporarily backed it out to produce a build for further debugging, this might be related to our build using json-c (or a difference in versions)

3.18.1 packages are now built. Next I'll rebuild 3.18.1 in Debian unstable to narrow down whether the hphp/test/quick/json_bigint.ph test failure is related to the outdated json-c in jessie compared to what's in stretch/sid.

When building on Debian stretch with json-c 0.12.1 the same error occurs (and also an additonal test failure), will open a bug for that.

The test failure is benign; it tests a new feature introduced into the simple JSON parser unconditionally of whether json-c is used or not. Since Debian uses json-c for license reasons, this doesn't affect us and backing out the test case is the correct interim fix (until upstream has fixed the test suite to only apply it for builds without DUSE_JSONC=ON).

HHVM extensions need to be rebuilt for the new 3.18 ABI.

hhvm-wikidiff2 and hhvm-luasandbox rebuilt without changes against the new 3.18 API. hhvm-tidy needed to be patched. Initially the build failed with

/home/jmm/rebuild/hhvm-tidy-0.1.3~jessie2/ext_tidy.cpp: In member function ‘virtual void HPHP::TidyExtension::moduleInit()’:
/home/jmm/rebuild/hhvm-tidy-0.1.3~jessie2/ext_tidy.cpp:86:34: error: invalid conversion from ‘void* (*)(size_t, HPHP::type_scan::Index) {aka void* (*)(long unsigned int,
short unsigned int)}’ to ‘TidyMalloc {aka void* (*)(long unsigned int)}’ [-fpermissive]
     tidySetMallocCall(req::malloc);
                                  ^
In file included from /home/jmm/rebuild/hhvm-tidy-0.1.3~jessie2/ext_tidy.cpp:20:0:
/usr/include/tidy/tidy.h:245:28: note: initializing argument 1 of ‘Bool tidySetMallocCall(TidyMalloc)’
 TIDY_EXPORT Bool TIDY_CALL tidySetMallocCall( TidyMalloc fmalloc );
                            ^
/home/jmm/rebuild/hhvm-tidy-0.1.3~jessie2/ext_tidy.cpp:87:36: error: invalid conversion from ‘void* (*)(void*, size_t, HPHP::type_scan::Index) {aka void* (*)(void*, long
unsigned int, short unsigned int)}’ to ‘TidyRealloc {aka void* (*)(void*, long unsigned int)}’ [-fpermissive]
     tidySetReallocCall(req::realloc);
                                    ^
In file included from /home/jmm/rebuild/hhvm-tidy-0.1.3~jessie2/ext_tidy.cpp:20:0:
/usr/include/tidy/tidy.h:247:28: note: initializing argument 1 of ‘Bool tidySetReallocCall(TidyRealloc)’
 TIDY_EXPORT Bool TIDY_CALL tidySetReallocCall( TidyRealloc frealloc );
                            ^
CMakeFiles/tidy.dir/build.make:57: recipe for target 'CMakeFiles/tidy.dir/ext_tidy.cpp.o' failed

This is due to internal changes in HHVM required by the garbage collection that got implemented. In 3.12 the req::malloc API looks like this:

void* malloc(size_t nbytes);
void* calloc(size_t count, size_t bytes);
void* realloc(void* ptr, size_t nbytes);

While in 3.18 it provides additional type information for the GC:

void* malloc(size_t nbytes,
             type_scan::Index tyindex = type_scan::kIndexUnknown);

void* calloc(size_t count, size_t bytes,
             type_scan::Index tyindex = type_scan::kIndexUnknown);

void* realloc(void* ptr,
              size_t nbytes,
              type_scan::Index tyindex = type_scan::kIndexUnknown);

Fortunately HHVM provides new functions for memory allocation with unknown index type, I switched the extension to them with the patch below: This fixes the build.

--- hhvm-tidy-0.1.3~jessie1/ext_tidy.cpp        2016-04-05 07:19:49.000000000 +0000
+++ hhvm-tidy-0.1.3~jessie2/ext_tidy.cpp        2017-03-07 13:00:14.308416908 +0000
@@ -83,8 +83,8 @@
   virtual void moduleInit() {
     HHVM_FE(tidy_repair_string);
     loadSystemlib();
-    tidySetMallocCall(req::malloc);
-    tidySetReallocCall(req::realloc);
+    tidySetMallocCall(req::malloc_noptrs);
+    tidySetReallocCall(req::realloc_noptrs);
     tidySetFreeCall(req::free);
   }
 } s_tidy_extension;

Mentioned in SAL (#wikimedia-operations) [2017-03-07T14:32:31Z] <moritzm> uploaded HHVM 3.18 builds of hhvm-tidy, hhvm-luasandbox and hhvm-wikidiff2 to the experimental section of apt.wikimedia.org (Bug: T158176)

Just relaying that translatewiki.net has been affected with https://github.com/facebook/hhvm/issues/7567 until I disabled stat_cache. As far as I know WMF has stat_cache enabled.

Reedy added a subscriber: Reedy.Mar 8 2017, 10:20 PM

Looks like we're getting a few fatals on beta cluster too...

Reedy added a comment.Mar 8 2017, 10:23 PM

Looks pretty much the same what @Nikerabbit is seeing

root@deployment-mediawiki04:/var/log/hhvm# cat /var/log/hhvm/stacktrace.16002.log
Host: deployment-mediawiki04
ProcessID: 16002
ThreadID: 139719568520960
ThreadPID: 16116
Name: unknown program
Type: Aborted
Runtime: hhvm
Version: 3.18.1+dfsg-1
DebuggerCount: 0

Assertion Failure: /tmp/buildd/hhvm-3.18.1+dfsg/hphp/runtime/base/program-functions.cpp:2321: void HPHP::hphp_memory_cleanup(): assertion `g_context.isNull() || !g_context->hasRequestEventHandlers()' failed.

# 0  000000000219e496
# 1  0000000000f3dda2
# 2  00007f135a3b7890
# 3  00007f1354a2c067
# 4  00007f1354a2d448
# 5  0000000000d494a8
# 6  00000000021323b6
# 7  0000000000fd64e5
# 8  000000000219837d
# 9  000000000117f24d
# 10 000000000117f939
# 11 0000000000d51f2f
# 12 0000000000d520fb
# 13 0000000000f81926
# 14 00007f135a3b0064
# 15 00007f1354adf62d

PHP Stacktrace:

root@deployment-mediawiki04:/var/log/hhvm#

Change 341916 had a related patch set uploaded (by Reedy):
[operations/puppet] hhvm.server.stat_cache = false

https://gerrit.wikimedia.org/r/341916

Reedy added a comment.Mar 8 2017, 10:30 PM

^ FWIW, not necessarily suggesting we merge that for production, but going to cherry pick onto the deployment puppetmaster for the time being to stop beta breaking

hashar added a subscriber: hashar.Mar 9 2017, 5:31 PM

Unless things have changed, we had to enable StatCache or the performance just crawled down. Ori made it very early you can see for the details on T75706

Reedy added a comment.Mar 9 2017, 5:33 PM

Unless things have changed, we had to enable StatCache or the performance just crawled down. Ori made it very early you can see for the details on T75706

Was this before the automated restart stuff? Do we still have that?

hashar added a comment.Mar 9 2017, 5:39 PM

The assertion failure definitely happens on the beta cluster. I haven't found a good way to reproduce. Looking at the wfDebugLog I managed to find some URL that probably triggered some, but trying the same URL apparently does not trigger it.

Some errors are logged in https://logstash-beta.wmflabs.org/app/kibana but the last one is from March 8th 23:00 ??

Yesterday log on beta for hphp_memory_cleanup _type:hhvm

Link to last 24 hours: https://logstash-beta.wmflabs.org/goto/2ccd8afa9af2422479fa44611b07b29a

hashar added a comment.Mar 9 2017, 5:43 PM

The beta log spam stopped because StatCache has been disabled via cherry pick of https://gerrit.wikimedia.org/r/#/c/341916/ . That hardly made any change to the instances CPU usage then they are not having much traffic.

Maybe we can reenable it and enable core dump on beta? Might help figure out what is happening (unlikely but still).

hashar edited the task description. (Show Details)

The test failure is benign; it tests a new feature introduced into the simple JSON parser unconditionally of whether json-c is used or not. Since Debian uses json-c for license reasons, this doesn't affect us and backing out the test case is the correct interim fix (until upstream has fixed the test suite to only apply it for builds without DUSE_JSONC=ON).

This is fixed upstream with https://github.com/facebook/hhvm/commit/f7a1dae3d90179c10c03e749875aa8a6a5015605, I've cherrypicked the patch to the Debian hhvm git repository (on which our internal packages are based)

\o/ "mofarrell commented an hour ago: A fix is on its way."

FIxed in hhvm master with https://github.com/facebook/hhvm/commit/5207b59eb88e2f9820efb74442245a4f5aa9eb17

I'll rebuild our 3.18.1 package with that patch.

Reedy added a comment.Mar 14 2017, 2:40 PM

FIxed in hhvm master with https://github.com/facebook/hhvm/commit/5207b59eb88e2f9820efb74442245a4f5aa9eb17

I'll rebuild our 3.18.1 package with that patch.

(Y)

Mentioned in SAL (#wikimedia-operations) [2017-03-15T15:47:48Z] <moritzm> uploaded new HHVM 3.18 package with backported patch for stat_cache regression (T158176)

I guess we can upgrade hhvm on beta cluster and drop the revert patch from the beta cluster puppet master ( https://gerrit.wikimedia.org/r/#/c/341916/ ).

Reedy added a comment.Mar 15 2017, 4:48 PM

I guess we can upgrade hhvm on beta cluster and drop the revert patch from the beta cluster puppet master ( https://gerrit.wikimedia.org/r/#/c/341916/ ).

Both already done :)

MoritzMuehlenhoff closed this task as "Resolved".Mar 16 2017, 8:38 AM

The crash hasn't happened again over the course of the last ~ 16 hours in Beta (and previously we've hit it about 10 times per hour on mediawiki04.deployment-prep.eqiad,wmnet). I've reported success to the HHVM developers and also asked them to consider this for inclusion into the 3.18.2 release.

MoritzMuehlenhoff reopened this task as "Open".Mar 16 2017, 8:40 AM

Change 341916 abandoned by Reedy:
hhvm.server.stat_cache = false

https://gerrit.wikimedia.org/r/341916

The dead lock could be narrowed down to incorrect locking in stat_cache and was eventually fixed by upstream. I've built a new package which is now running on mw1261 (with stat_cache enabled) to validate whether the fix is sufficient.