Build / migrate to HHVM 3.18
Closed, ResolvedPublic

Description

HHVM 3.18 is the next LTS and now available in jessie-wikimedia/experimental

There were some stat_cache related bugs, which are fixed now. 3.18.2 is running on the mediawiki canaries, but the wider rollout is held back until after the DC switchover.

Related Objects

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptFeb 15 2017, 12:36 PM
elukey added a subscriber: elukey.Feb 15 2017, 12:57 PM

HHVM 3.18 now requires webscalesqlclient also for the standard mysql extension, so I've changed the tarball generation to no longer prune webscalesqlclient. DENABLE_ASYNC_MYSQL is now broken in general as well, if not set to ON it misses to link libsquangle as well.

HHVM 3.18 also needs a build on librb5-dev now.

This also needed an updated build of lz4 since the extension in 3.18 requires the LZ4_compress_default() function only introduced in lz4 r130.

With these additional changes, 3.18 now builds, but fails the hphp/test/quick/json_bigint.php test introduced in 626ecb4c38e8acbb52e43d043aeb26069c873bb3

I've temporarily backed it out to produce a build for further debugging, this might be related to our build using json-c (or a difference in versions)

3.18.1 packages are now built. Next I'll rebuild 3.18.1 in Debian unstable to narrow down whether the hphp/test/quick/json_bigint.ph test failure is related to the outdated json-c in jessie compared to what's in stretch/sid.

When building on Debian stretch with json-c 0.12.1 the same error occurs (and also an additonal test failure), will open a bug for that.

The test failure is benign; it tests a new feature introduced into the simple JSON parser unconditionally of whether json-c is used or not. Since Debian uses json-c for license reasons, this doesn't affect us and backing out the test case is the correct interim fix (until upstream has fixed the test suite to only apply it for builds without DUSE_JSONC=ON).

HHVM extensions need to be rebuilt for the new 3.18 ABI.

hhvm-wikidiff2 and hhvm-luasandbox rebuilt without changes against the new 3.18 API. hhvm-tidy needed to be patched. Initially the build failed with

/home/jmm/rebuild/hhvm-tidy-0.1.3~jessie2/ext_tidy.cpp: In member function ‘virtual void HPHP::TidyExtension::moduleInit()’:
/home/jmm/rebuild/hhvm-tidy-0.1.3~jessie2/ext_tidy.cpp:86:34: error: invalid conversion from ‘void* (*)(size_t, HPHP::type_scan::Index) {aka void* (*)(long unsigned int,
short unsigned int)}’ to ‘TidyMalloc {aka void* (*)(long unsigned int)}’ [-fpermissive]
     tidySetMallocCall(req::malloc);
                                  ^
In file included from /home/jmm/rebuild/hhvm-tidy-0.1.3~jessie2/ext_tidy.cpp:20:0:
/usr/include/tidy/tidy.h:245:28: note: initializing argument 1 of ‘Bool tidySetMallocCall(TidyMalloc)’
 TIDY_EXPORT Bool TIDY_CALL tidySetMallocCall( TidyMalloc fmalloc );
                            ^
/home/jmm/rebuild/hhvm-tidy-0.1.3~jessie2/ext_tidy.cpp:87:36: error: invalid conversion from ‘void* (*)(void*, size_t, HPHP::type_scan::Index) {aka void* (*)(void*, long
unsigned int, short unsigned int)}’ to ‘TidyRealloc {aka void* (*)(void*, long unsigned int)}’ [-fpermissive]
     tidySetReallocCall(req::realloc);
                                    ^
In file included from /home/jmm/rebuild/hhvm-tidy-0.1.3~jessie2/ext_tidy.cpp:20:0:
/usr/include/tidy/tidy.h:247:28: note: initializing argument 1 of ‘Bool tidySetReallocCall(TidyRealloc)’
 TIDY_EXPORT Bool TIDY_CALL tidySetReallocCall( TidyRealloc frealloc );
                            ^
CMakeFiles/tidy.dir/build.make:57: recipe for target 'CMakeFiles/tidy.dir/ext_tidy.cpp.o' failed

This is due to internal changes in HHVM required by the garbage collection that got implemented. In 3.12 the req::malloc API looks like this:

void* malloc(size_t nbytes);
void* calloc(size_t count, size_t bytes);
void* realloc(void* ptr, size_t nbytes);

While in 3.18 it provides additional type information for the GC:

void* malloc(size_t nbytes,
             type_scan::Index tyindex = type_scan::kIndexUnknown);

void* calloc(size_t count, size_t bytes,
             type_scan::Index tyindex = type_scan::kIndexUnknown);

void* realloc(void* ptr,
              size_t nbytes,
              type_scan::Index tyindex = type_scan::kIndexUnknown);

Fortunately HHVM provides new functions for memory allocation with unknown index type, I switched the extension to them with the patch below: This fixes the build.

--- hhvm-tidy-0.1.3~jessie1/ext_tidy.cpp        2016-04-05 07:19:49.000000000 +0000
+++ hhvm-tidy-0.1.3~jessie2/ext_tidy.cpp        2017-03-07 13:00:14.308416908 +0000
@@ -83,8 +83,8 @@
   virtual void moduleInit() {
     HHVM_FE(tidy_repair_string);
     loadSystemlib();
-    tidySetMallocCall(req::malloc);
-    tidySetReallocCall(req::realloc);
+    tidySetMallocCall(req::malloc_noptrs);
+    tidySetReallocCall(req::realloc_noptrs);
     tidySetFreeCall(req::free);
   }
 } s_tidy_extension;

Mentioned in SAL (#wikimedia-operations) [2017-03-07T14:32:31Z] <moritzm> uploaded HHVM 3.18 builds of hhvm-tidy, hhvm-luasandbox and hhvm-wikidiff2 to the experimental section of apt.wikimedia.org (Bug: T158176)

Just relaying that translatewiki.net has been affected with https://github.com/facebook/hhvm/issues/7567 until I disabled stat_cache. As far as I know WMF has stat_cache enabled.

Reedy added a subscriber: Reedy.Mar 8 2017, 10:20 PM

Looks like we're getting a few fatals on beta cluster too...

Reedy added a comment.Mar 8 2017, 10:23 PM

Looks pretty much the same what @Nikerabbit is seeing

root@deployment-mediawiki04:/var/log/hhvm# cat /var/log/hhvm/stacktrace.16002.log
Host: deployment-mediawiki04
ProcessID: 16002
ThreadID: 139719568520960
ThreadPID: 16116
Name: unknown program
Type: Aborted
Runtime: hhvm
Version: 3.18.1+dfsg-1
DebuggerCount: 0

Assertion Failure: /tmp/buildd/hhvm-3.18.1+dfsg/hphp/runtime/base/program-functions.cpp:2321: void HPHP::hphp_memory_cleanup(): assertion `g_context.isNull() || !g_context->hasRequestEventHandlers()' failed.

# 0  000000000219e496
# 1  0000000000f3dda2
# 2  00007f135a3b7890
# 3  00007f1354a2c067
# 4  00007f1354a2d448
# 5  0000000000d494a8
# 6  00000000021323b6
# 7  0000000000fd64e5
# 8  000000000219837d
# 9  000000000117f24d
# 10 000000000117f939
# 11 0000000000d51f2f
# 12 0000000000d520fb
# 13 0000000000f81926
# 14 00007f135a3b0064
# 15 00007f1354adf62d

PHP Stacktrace:

root@deployment-mediawiki04:/var/log/hhvm#

Change 341916 had a related patch set uploaded (by Reedy):
[operations/puppet] hhvm.server.stat_cache = false

https://gerrit.wikimedia.org/r/341916

Reedy added a comment.Mar 8 2017, 10:30 PM

^ FWIW, not necessarily suggesting we merge that for production, but going to cherry pick onto the deployment puppetmaster for the time being to stop beta breaking

Unless things have changed, we had to enable StatCache or the performance just crawled down. Ori made it very early you can see for the details on T75706

Reedy added a comment.Mar 9 2017, 5:33 PM

Unless things have changed, we had to enable StatCache or the performance just crawled down. Ori made it very early you can see for the details on T75706

Was this before the automated restart stuff? Do we still have that?

The assertion failure definitely happens on the beta cluster. I haven't found a good way to reproduce. Looking at the wfDebugLog I managed to find some URL that probably triggered some, but trying the same URL apparently does not trigger it.

Some errors are logged in https://logstash-beta.wmflabs.org/app/kibana but the last one is from March 8th 23:00 ??

Yesterday log on beta for hphp_memory_cleanup _type:hhvm

Link to last 24 hours: https://logstash-beta.wmflabs.org/goto/2ccd8afa9af2422479fa44611b07b29a

The beta log spam stopped because StatCache has been disabled via cherry pick of https://gerrit.wikimedia.org/r/#/c/341916/ . That hardly made any change to the instances CPU usage then they are not having much traffic.

Maybe we can reenable it and enable core dump on beta? Might help figure out what is happening (unlikely but still).

hashar updated the task description. (Show Details)

The test failure is benign; it tests a new feature introduced into the simple JSON parser unconditionally of whether json-c is used or not. Since Debian uses json-c for license reasons, this doesn't affect us and backing out the test case is the correct interim fix (until upstream has fixed the test suite to only apply it for builds without DUSE_JSONC=ON).

This is fixed upstream with https://github.com/facebook/hhvm/commit/f7a1dae3d90179c10c03e749875aa8a6a5015605, I've cherrypicked the patch to the Debian hhvm git repository (on which our internal packages are based)

\o/ "mofarrell commented an hour ago: A fix is on its way."

FIxed in hhvm master with https://github.com/facebook/hhvm/commit/5207b59eb88e2f9820efb74442245a4f5aa9eb17

I'll rebuild our 3.18.1 package with that patch.

Reedy added a comment.Mar 14 2017, 2:40 PM

FIxed in hhvm master with https://github.com/facebook/hhvm/commit/5207b59eb88e2f9820efb74442245a4f5aa9eb17

I'll rebuild our 3.18.1 package with that patch.

(Y)

Mentioned in SAL (#wikimedia-operations) [2017-03-15T15:47:48Z] <moritzm> uploaded new HHVM 3.18 package with backported patch for stat_cache regression (T158176)

I guess we can upgrade hhvm on beta cluster and drop the revert patch from the beta cluster puppet master ( https://gerrit.wikimedia.org/r/#/c/341916/ ).

Reedy added a comment.Mar 15 2017, 4:48 PM

I guess we can upgrade hhvm on beta cluster and drop the revert patch from the beta cluster puppet master ( https://gerrit.wikimedia.org/r/#/c/341916/ ).

Both already done :)

MoritzMuehlenhoff closed this task as Resolved.Mar 16 2017, 8:38 AM

The crash hasn't happened again over the course of the last ~ 16 hours in Beta (and previously we've hit it about 10 times per hour on mediawiki04.deployment-prep.eqiad,wmnet). I've reported success to the HHVM developers and also asked them to consider this for inclusion into the 3.18.2 release.

MoritzMuehlenhoff reopened this task as Open.Mar 16 2017, 8:40 AM

Change 341916 abandoned by Reedy:
hhvm.server.stat_cache = false

https://gerrit.wikimedia.org/r/341916

The dead lock could be narrowed down to incorrect locking in stat_cache and was eventually fixed by upstream. I've built a new package which is now running on mw1261 (with stat_cache enabled) to validate whether the fix is sufficient.

Status update: The app servers mw1170-mw1184, mw1261-mw1265 are now running HHVM 3.18. It's working mostly fine there, we're seeing them crash every 2-3 days for T162586, but systemd restarts that reliably and I'm optimistic that we'll have a fix next week. The Treadmill deadlock is gone entirely.

In addition an image scaler (mw1293) and an API server (mw1189) have been upgraded and they're stable as well.

Also, one job runner (mw1161) was upgraded, it segfaulted five times yesterday and once today. Crashes are different (T165043 and T165051), at least for the crash in luasandbox it seems useful to first upgrade to a fix for T162685. This might just as well be related. I've downgraded the job runner back to 3.12 until a new 3.18 package is available next week.

In addition the new HHVM package has caused test failures in CI tests (T165074). The test will be unbroken by the fixed packages for T162685 as well, until then I've copied the 3.12 packages to experimental and I'm working with Antoine to have the CI tests run 3.12 again until a fix for T162685 has landed and has been verified to make the CI tests pass again.

Status update: With HHVM 3.18.2+wmf5 we now have a stable HHVM package. It has been rolled out on all appservers, all image scalers, most API servers and about one quarter of the job runners in eqiad (codfw is complete).

MoritzMuehlenhoff created subtask Restricted Task.Jun 13 2017, 8:49 AM

The migration to HHVM 3.18 is almost complete; the two remaining systems (terbium/wasat) will be updated tomorrow.

MoritzMuehlenhoff closed this task as Resolved.Jun 15 2017, 8:38 AM

All our jessie-based application servers (app servers, API servers, job runners and image scalers, script runners) are now upgraded to HHVM 3.18.2 (plus a few cherrypicked backports from master, I'll ask the HHVM developers to merge those into the next 3.18 LTS release so that people running mediawiki with the debs provided by HHVM have these fixed as well).

We found a number of bugs in HHVM (some of them were already present in 3.12), but they've all been fixed in the latest package.

Compared to before, HHVM is more stable than with 3.12; the deadlocks in HPHP::Treadmill are gone completely and over the course of the last 72 hours we only had a single segfault across all our 300 production servers (half of them being inactive in codfw, of course).