Page MenuHomePhabricator

CI tests failing with segfault
Closed, ResolvedPublic

Description

For a few hours now, all (or at least, many) jenkins runs have been failing for the Wikibase repo. Example: https://integration.wikimedia.org/ci/job/mwext-testextension-hhvm-composer-jessie/2272/console

It's unclear when or why it's failing.

+ php -dzend.enable_gc=0 phpunit.php --log-junit /home/jenkins/workspace/mwext-testextension-hhvm-composer-jessie/log/junit-phpunit-allexts.xml --testsuite extensions
17:10:36 [Thu May 11 17:10:35 2017] [hphp] [5082:7f6468075200:0:000001] [f6ac36:fe24f3:fe50be:b56d6a:7f645fa68b45:b54119] Unable to set ResourceLimit.CoreFileSize to 8589934592: Operation not permitted (1)
17:10:36 Using HHVM 3.18.2 (5.6.99-hhvm)
17:10:50 PHPUnit 4.8.35 by Sebastian Bergmann and contributors.
17:10:50 
17:10:52 ...........................................................    59 / 13853 (  0%)
17:10:52 ...........................................................   118 / 13853 (  0%)

...everything looks good, until...

17:16:17 ........................................................... 13747 / 13853 ( 99%)
17:16:19 ........................................................... 13806 / 13853 ( 99%)
17:16:19 ...............................................
17:16:19 
17:16:19 Time: 5.72 minutes, Memory: 757.69MB
17:16:19 
17:16:19 OK, but incomplete, skipped, or risky tests!
17:16:19 Tests: 13853, Assertions: 109388, Skipped: 26, Risky: 39.
17:16:26 [Thu May 11 17:16:25 2017] [hphp] [5083:7f6468075200:0:000002] [] Lost parent, LightProcess exiting
17:16:26 [Thu May 11 17:16:25 2017] [hphp] [5087:7f6468075200:0:000002] [] Lost parent, LightProcess exiting
17:16:26 [Thu May 11 17:16:25 2017] [hphp] [5085:7f6468075200:0:000002] [] Lost parent, LightProcess exiting
17:16:26 [Thu May 11 17:16:25 2017] [hphp] [5086:7f6468075200:0:000002] [] Lost parent, LightProcess exiting
17:16:26 [Thu May 11 17:16:25 2017] [hphp] [5084:7f6468075200:0:000002] [] Lost parent, LightProcess exiting
17:16:26 /srv/deployment/integration/slave-scripts/bin/mw-run-phpunit-allexts.sh: line 23:  5082 Segmentation fault      (core dumped) php -dzend.enable_gc=0 phpunit.php --log-junit "$JUNIT_DEST" --testsuite extensions
17:16:26 Build step 'Execute shell' marked build as failure
17:16:26 Recording test results
17:16:31 Archiving artifacts
17:16:34 [PostBuildScript] - Execution post build scripts.
17:16:34 [PostBuildScript] Build is not success : do not execute script
17:16:34 Finished: FAILURE

@Reedy mentioned that this may be because of the new HHVM version. This is happening with HHVM 3.18.2 (5.6.99-hhvm). Might be a duplicate of T165043 or T165051.

Event Timeline

Not just Wikibase. Errors in VE-MW and MobileFrontend. :-( Can we revert the test for now to see if that fixes it?

Can someone create a simple repo case? Or at least a backtrace?

Can someone create a simple repo case? Or at least a backtrace?

Do we have a jessie hhvm host we can trivially run tests on? I can't ssh onto ci-trusty-wikimedia-* as I get Permission denied. integration-slave-jessie-1001 seems to only have hhvm-dev (which feels like a different bug)

This is at least High, as it's stopping merges into master in most repos.

hashar raised the priority of this task from High to Unbreak Now!.May 11 2017, 8:40 PM

That is caused by the upgrade of HHVM T158176: Build / migrate to HHVM 3.18. 3.18 has been uploaded to apt.wikimedia.org under jessie-wikimedia/main and hence the CI instances are now using it

$ apt-cache policy hhvm
hhvm:
  Installed: 3.18.2+dfsg-1+wmf2
  Candidate: 3.18.2+dfsg-1+wmf2
  Version table:
 *** 3.18.2+dfsg-1+wmf2 0
       1001 http://apt.wikimedia.org/wikimedia/ jessie-wikimedia/main amd64 Packages
        100 /var/lib/dpkg/status

3.18 has been promoted in apt.wikimedia.org on 2017-05-10. From SAL:

2017-05-10 15:18 <moritzm> uploaded HHVM 3.18.2 and HHVM extensions to apt.wikimedia.org/main (previously only in experimental)

The Nodepool images are automatically refreshed at 14:14UTC every day. Hence they caught with HHVM 3.18 today on May 11th around 14:14.

On labnodepool1001.eqiad.wmnet in /var/log/nodepool/image.log, logs starting at 2017-05-11 14:19 UTC:

2017-05-11 14:19:44,076 INFO nodepool.image.build.wmflabs-eqiad.snapshot-ci-jessie:

    hhvm (3.12.14+dfsg-1+wmf1 => 3.18.2+dfsg-1+wmf2)
   hhvm (3.12.14+dfsg-1+wmf1 => 3.18.2+dfsg-1+wmf2)
   hhvm-dev (3.12.14+dfsg-1+wmf1 => 3.18.2+dfsg-1+wmf2)
   hhvm-luasandbox (2.0.12~jessie1 => 2.0.12~jessie2)
   hhvm-tidy (0.1.3~jessie1 => 0.1.3~jessie2)
   hhvm-wikidiff2 (1.4.1 => 1.4.1+wmf1)
Get:23 http://apt.wikimedia.org/wikimedia/ jessie-wikimedia/main hhvm-tidy amd64 0.1.3~jessie2 [405 kB]
Get:24 http://apt.wikimedia.org/wikimedia/ jessie-wikimedia/main hhvm-wikidiff2 amd64 1.4.1+wmf1 [38.5 kB]
Get:25 http://apt.wikimedia.org/wikimedia/ jessie-wikimedia/main hhvm-luasandbox amd64 2.0.12~jessie2 [42.7 kB]
Get:26 http://apt.wikimedia.org/wikimedia/ jessie-wikimedia/main hhvm amd64 3.18.2+dfsg-1+wmf2 [12.4 MB]
Get:54 http://apt.wikimedia.org/wikimedia/ jessie-wikimedia/main hhvm-dev amd64 3.18.2+dfsg-1+wmf2 [5,005 kB]
Preparing to unpack .../hhvm-tidy_0.1.3~jessie2_amd64.deb ...
Unpacking hhvm-tidy (0.1.3~jessie2) over (0.1.3~jessie1) ...
Preparing to unpack .../hhvm-wikidiff2_1.4.1+wmf1_amd64.deb ...
Unpacking hhvm-wikidiff2 (1.4.1+wmf1) over (1.4.1) ...
Preparing to unpack .../hhvm-luasandbox_2.0.12~jessie2_amd64.deb ...
Unpacking hhvm-luasandbox (2.0.12~jessie2) over (2.0.12~jessie1) ...
Preparing to unpack .../hhvm_3.18.2+dfsg-1+wmf2_amd64.deb ...
Unpacking hhvm (3.18.2+dfsg-1+wmf2) over (3.12.14+dfsg-1+wmf1) ...
Preparing to unpack .../hhvm-dev_3.18.2+dfsg-1+wmf2_amd64.deb ...
Unpacking hhvm-dev (3.18.2+dfsg-1+wmf2) over (3.12.14+dfsg-1+wmf1) ...
Setting up hhvm (3.18.2+dfsg-1+wmf2) ...
Setting up hhvm-tidy (0.1.3~jessie2) ...
Setting up hhvm-wikidiff2 (1.4.1+wmf1) ...
Setting up hhvm-luasandbox (2.0.12~jessie2) ...
Setting up hhvm-dev (3.18.2+dfsg-1+wmf2) ...

The snapshots we have:

IDProviderImageHostnameVersionImage IDServer IDStateAge (hours)
1537wmflabs-eqiadsnapshot-ci-jessiesnapshot-ci-jessie-14945120421494512042f8c28074-e1aa-4703-b09f-b23fa4d4fc08806124d0-8b6c-4064-b3e9-9f561105acabready6.27
1535wmflabs-eqiadsnapshot-ci-jessiesnapshot-ci-jessie-1494425642149442564221671a4d-1d8a-4f2d-aa8f-a929bde2f3d81cbf9a1a-e634-42d2-ae39-8254d7c0e991ready30.23

Mentioned in SAL (#wikimedia-releng) [2017-05-11T20:43:40Z] <hashar> nodepool: delete today jessie image snapshot. It comes with HHVM 3.18 which segfault with MediaWiki/PHPUnit. Rolled back to snapshot-ci-jessie-1494425642 from 30 hours ago. T165074

@MoritzMuehlenhoff we should probably downgrade the HHVM version from Beta and CI and work on repro'ing elsewhere. This is preventing merges and is UBN.

Jessie instances are now being booted from snapshot-ci-jessie-1494425642 which should have the previous HHVM version.

What is left to do is to bring back HHVM 3.12 to jessie-wikimedia/main. Else tomorrow 3.18 will end up being installed again and breaks CI.

Ok, so a clean vagrant vm (with 4GB ram!), will segfault by running phpunit with no extensions

From gdb attached...

Continuing.
[New Thread 0x7f1b053ff700 (LWP 2337)]
[Thread 0x7f1b053ff700 (LWP 2337) exited]
[New Thread 0x7f1b053ff700 (LWP 2338)]
[Thread 0x7f1b053ff700 (LWP 2338) exited]

Program received signal SIGSEGV, Segmentation fault.
HPHP::UserFile::close (this=0x7f1afd461090) at /tmp/buildd/hhvm-3.18.2+dfsg/hphp/runtime/base/user-file.cpp:212
212	/tmp/buildd/hhvm-3.18.2+dfsg/hphp/runtime/base/user-file.cpp: No such file or directory.
(gdb) bt
#0  HPHP::UserFile::close (this=0x7f1afd461090) at /tmp/buildd/hhvm-3.18.2+dfsg/hphp/runtime/base/user-file.cpp:212
#1  0x0000000001a74ec2 in HPHP::XMLReader::close (this=0x7f1afd411ec0) at /tmp/buildd/hhvm-3.18.2+dfsg/hphp/runtime/ext/xmlreader/ext_xmlreader.cpp:95
#2  0x0000000002133130 in HPHP::MemoryManager::sweep (this=0x7f1b25a8c840, this@entry=<error reading variable: Asked for position 0 of stack, stack only has 0 elements on it.>)
    at /tmp/buildd/hhvm-3.18.2+dfsg/hphp/runtime/base/memory-manager.cpp:471
#3  0x00000000021324dd in HPHP::hphp_memory_cleanup () at /tmp/buildd/hhvm-3.18.2+dfsg/hphp/runtime/base/program-functions.cpp:2322
#4  0x0000000000fd6605 in HPHP::hphp_session_exit () at /tmp/buildd/hhvm-3.18.2+dfsg/hphp/runtime/base/program-functions.cpp:2370
#5  0x0000000000fda5df in HPHP::execute_command_line_end (xhprof=0, coverage=coverage@entry=true, program=0x7f1b0ee48d58 "tests/phpunit/phpunit.php")
    at /tmp/buildd/hhvm-3.18.2+dfsg/hphp/runtime/base/program-functions.cpp:721
#6  0x0000000000fe2d95 in HPHP::execute_program_impl (argc=argc@entry=4, argv=argv@entry=0x7fff36a84068)
    at /tmp/buildd/hhvm-3.18.2+dfsg/hphp/runtime/base/program-functions.cpp:1830
#7  0x0000000000fe50be in HPHP::execute_program (argc=argc@entry=4, argv=argv@entry=0x7fff36a84068) at /tmp/buildd/hhvm-3.18.2+dfsg/hphp/runtime/base/program-functions.cpp:1148
#8  0x0000000000b56d6a in main (argc=<optimized out>, argv=0x7fff36a84068) at /tmp/buildd/hhvm-3.18.2+dfsg/hphp/hhvm/main.cpp:83

We can't easily downgrade the HHVM package in the main repo, it's otherwise working fine in production and running on about 20 servers for more exposure. If that prevents merges I suggest to disable the failing test temporarily. Or alternatively we can add a new archive component with 3.12 tomorrow and configure the CI setup to use that for a while.

We should really use the new HHVM in testing first before going to production. If the tests are broken it means fix the tests/the underlying issue, not work around it by disabling the test and continuing to use the (now uncertain if that test actually found a bug) in production.

CI is fine now, we've downgraded there. But we should be in sync on this going forward.

hashar lowered the priority of this task from Unbreak Now! to High.

CI instances have been rollbacked to the last known snapshot which uses HHVM 3.12.14. I have confirmed it on a job triggered after the rollback. So CI is fine right now.

We need to rollback HHVM in jessie-wikimedia/main to prevent CI from catching 3.18 tomorrow at 14:14. Will check that with @MoritzMuehlenhoff .

Mentioned in SAL (#wikimedia-operations) [2017-05-11T20:57:55Z] <hashar> CI Phpunit jobs were segfaulting due to an upgrade of HHVM to 3.18. Got rolled back to 3.12 - T165074

The new HHVM version has been extensively tested on five canary servers in production for 5-6 week now. As per Reedy's backtrace from above it has in fact identified the crash we already knew https://phabricator.wikimedia.org/T162586 (reported upstream at https://github.com/facebook/hhvm/issues/7779 but so far there wasn't a better reproducer than "expose 3 days of Wikimedia production traffic" for it :-) So the test suite has proven to be really useful.

Going forward from here, we should extend CI to also cover the "experimental" component from our repo (at least for partial CI runs). That would have identified that much quicker.

Change 353533 had a related patch set uploaded (by Hashar; owner: Hashar):
[operations/puppet@production] contint: pin HHVM packages to use experimental component

https://gerrit.wikimedia.org/r/353533

Change 353533 merged by Muehlenhoff:
[operations/puppet@production] contint: pin HHVM packages to use experimental component

https://gerrit.wikimedia.org/r/353533

@MoritzMuehlenhoff has pushed HHVM 3.12 to 'experimental' and I wrote a few rules to pin that version. The Nodepool instances are thus on 3.12:

jenkins@ci-jessie-wikimedia-660007:~$ apt-cache policy hhvm
hhvm:
  Installed: 3.12.14+dfsg-1+wmf1
  Candidate: 3.12.14+dfsg-1+wmf1
  Package pin: 3.12.14+dfsg-1+wmf1
  Version table:
     3.18.2+dfsg-1+wmf2 1002
       1001 http://apt.wikimedia.org/wikimedia/ jessie-wikimedia/main amd64 Packages
 *** 3.12.14+dfsg-1+wmf1 1002
          1 http://apt.wikimedia.org/wikimedia/ jessie-wikimedia/experimental amd64 Packages
        100 /var/lib/dpkg/status

So CI is definitely working and save (unless one magically change the version in "experimental"). The segfault fix is tracked via T162586

I've built new HHVM packages with a patch as proposed by upstream in https://github.com/facebook/hhvm/issues/7779, that avoids the segfault and makes the phpunit tests pass with 3.18 again. I'll upload the new packages later on, after that we can revert the HHVM 3.12 hack again.

Mentioned in SAL (#wikimedia-operations) [2017-05-15T13:27:45Z] <moritzm> uploaded HHVM 3.18.2+dfsg-1+wmf3 to apt.wikimedia.org (addresses segfault in XML reader (T162586, T165074)

Just saying this also happens in travis instances causing Wikibase travis tests to fail. https://travis-ci.org/wikimedia/mediawiki-extensions-Wikibase/jobs/232431659 if there is anything specific needs to be done (like using 3.17 instead of 3.18) please tell me to apply to travis config.

@Ladsgroup I'm not sure how that Travis setup is configured, but if you make it install HHVM 3.18.2+dfsg-1+wmf3, that should be fixed.

Change 358341 had a related patch set uploaded (by Hashar; owner: Hashar):
[operations/puppet@production] contint: install HHVM from main

https://gerrit.wikimedia.org/r/358341

Change 358341 merged by Muehlenhoff:
[operations/puppet@production] contint: install HHVM from main

https://gerrit.wikimedia.org/r/358341

Mentioned in SAL (#wikimedia-releng) [2017-06-12T12:30:28Z] <hashar> nodepool: refreshing Jessie snapshot to upgrade HHVM from 3.12 to 3.18 - T167493 T165074